TED2011

Deb Roy: The birth of a word

MIT researcher Deb Roy wanted to understand how his infant son learned language -- so he wired up his house with videocameras to catch every moment (with exceptions) of his son's life, then parsed 90,000 hours of home video to watch "gaaaa" slowly turn into "water." Astonishing, data-rich research with deep implications for how we learn.

- Cognitive scientist
Deb Roy studies how children learn language, and designs machines that learn to communicate in human-like ways. On sabbatical from MIT Media Lab, he's working with the AI company Bluefin Labs.

Imagine if you could record your life --
00:15
everything you said, everything you did,
00:19
available in a perfect memory store at your fingertips,
00:22
so you could go back
00:25
and find memorable moments and relive them,
00:27
or sift through traces of time
00:30
and discover patterns in your own life
00:33
that previously had gone undiscovered.
00:35
Well that's exactly the journey
00:38
that my family began
00:40
five and a half years ago.
00:42
This is my wife and collaborator, Rupal.
00:44
And on this day, at this moment,
00:47
we walked into the house with our first child,
00:49
our beautiful baby boy.
00:51
And we walked into a house
00:53
with a very special home video recording system.
00:56
(Video) Man: Okay.
01:07
Deb Roy: This moment
01:10
and thousands of other moments special for us
01:11
were captured in our home
01:14
because in every room in the house,
01:16
if you looked up, you'd see a camera and a microphone,
01:18
and if you looked down,
01:21
you'd get this bird's-eye view of the room.
01:23
Here's our living room,
01:25
the baby bedroom,
01:28
kitchen, dining room
01:31
and the rest of the house.
01:33
And all of these fed into a disc array
01:35
that was designed for continuous capture.
01:38
So here we are flying through a day in our home
01:41
as we move from sunlit morning
01:44
through incandescent evening
01:47
and, finally, lights out for the day.
01:49
Over the course of three years,
01:53
we recorded eight to 10 hours a day,
01:56
amassing roughly a quarter-million hours
01:58
of multi-track audio and video.
02:01
So you're looking at a piece of what is by far
02:04
the largest home video collection ever made.
02:06
(Laughter)
02:08
And as for what this data represents
02:11
for our family at a personal level,
02:13
the impact has already been immense,
02:17
and we're still learning its value.
02:19
Countless moments
02:22
that are unsolicited and natural, not posed,
02:24
are captured there,
02:27
and we're starting to learn how to discover and find them.
02:29
But there's also a scientific reason that drove this project,
02:32
which was to use this natural longitudinal data
02:35
to understand the process
02:39
of how a child learns language --
02:41
that child being my son.
02:43
And so with many privacy provisions put in place
02:45
to protect everyone who was recorded in the data,
02:49
we made elements of the data available
02:52
to my trusted research team at MIT
02:55
so we could start teasing apart patterns
02:58
in this massive data set,
03:01
trying to understand the influence of social environments
03:04
on language acquisition.
03:07
So we're looking here
03:09
at one of the first things we started to do.
03:11
This is my wife and me cooking breakfast in the kitchen,
03:13
and as we move through space and through time,
03:17
a very everyday pattern of life in the kitchen.
03:20
In order to convert
03:23
this opaque, 90,000 hours of video
03:25
into something that we could start to see,
03:28
we use motion analysis to pull out,
03:30
as we move through space and through time,
03:32
what we call space-time worms.
03:34
And this has become part of our toolkit
03:37
for being able to look and see
03:40
where the activities are in the data,
03:43
and with it, trace the pattern of, in particular,
03:45
where my son moved throughout the home,
03:48
so that we could focus our transcription efforts on
03:50
all of the speech environment around my son --
03:53
all of the words that he heard from me, my wife and our nanny,
03:56
and over time, the words he began to produce.
03:59
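
A minimal sketch of this kind of motion-trace extraction, assuming grayscale video frames and a simple difference threshold; the function name, frame format, and parameters are illustrative, not the actual MIT pipeline:

```python
# Illustrative sketch only: extract a coarse "space-time worm" (a per-frame
# centroid of motion) from a sequence of grayscale frames.
import numpy as np

def motion_trace(frames, fps=15.0, threshold=25):
    """Return a list of (time_s, x, y) motion centroids, one per frame pair."""
    worm = []
    prev = None
    for i, frame in enumerate(frames):              # frame: 2D uint8 array (H x W)
        if prev is not None:
            diff = np.abs(frame.astype(int) - prev.astype(int))
            ys, xs = np.nonzero(diff > threshold)   # pixels that changed
            if xs.size:                             # skip motionless frames
                worm.append((i / fps, xs.mean(), ys.mean()))
        prev = frame
    return worm

# Toy usage: two 4x4 frames with a bright pixel that "moves".
f0 = np.zeros((4, 4), dtype=np.uint8); f0[0, 0] = 255
f1 = np.zeros((4, 4), dtype=np.uint8); f1[2, 3] = 255
print(motion_trace([f0, f1]))
```
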
So with that technology and that data
04:02
and the ability to, with machine assistance,
04:05
transcribe speech,
04:07
we've now transcribed
04:09
well over seven million words of our home transcripts.
04:11
And with that, let me take you now
04:14
for a first tour into the data.
04:16
So you've all, I'm sure,
04:19
seen time-lapse videos
04:21
where a flower will blossom as you accelerate time.
04:23
I'd like you to now experience
04:26
the blossoming of a speech form.
04:28
My son, soon after his first birthday,
04:30
would say "gaga" to mean water.
04:32
And over the course of the next half-year,
04:35
he slowly learned to approximate
04:38
the proper adult form, "water."
04:40
So we're going to cruise through half a year
04:43
in about 40 seconds.
04:45
No video here,
04:47
so you can focus on the sound, the acoustics,
04:49
of a new kind of trajectory:
04:52
gaga to water.
04:54
(Audio) Baby: Gagagagagaga
04:56
Gaga gaga gaga
05:08
guga guga guga
05:12
wada gaga gaga guga gaga
05:17
wader guga guga
05:22
water water water
05:26
water water water
05:29
water water
05:35
water.
05:39
DR: He sure nailed it, didn't he?
05:41
(Applause)
05:43
So he didn't just learn water.
05:50
Over the course of the 24 months,
05:52
the first two years that we really focused on,
05:54
this is a map of every word he learned in chronological order.
05:57
And because we have full transcripts,
06:01
we've identified each of the 503 words
06:04
that he learned to produce by his second birthday.
06:06
He was an early talker.
06:08
And so we started to analyze why.
06:10
Why were certain words born before others?
06:13
This is one of the first results
06:16
that came out of our study a little over a year ago
06:18
that really surprised us.
06:20
The way to interpret this apparently simple graph
06:22
is that the vertical axis is an indication
06:25
of how complex caregiver utterances are
06:27
based on the length of utterances.
06:30
And the horizontal axis is time.
06:32
And all of the data,
06:35
we aligned based on the following idea:
06:37
Every time my son would learn a word,
06:40
we would trace back and look at all of the language he heard
06:43
that contained that word.
06:46
And we would plot the relative length of the utterances.
06:48
And what we found was this curious phenomenon,
06:52
that caregiver speech would systematically dip to a minimum,
06:55
making language as simple as possible,
06:58
and then slowly ascend back up in complexity.
07:01
And the amazing thing was
07:04
that bounce, that dip,
07:06
lined up almost precisely
07:08
with when each word was born --
07:10
word after word, systematically.
07:12
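
A toy sketch of that alignment idea, assuming a list of dated caregiver utterances and a table of word-birth dates; this illustrates the analysis described, not the study's actual code:

```python
# Illustrative sketch (not the actual study code): align caregiver utterance
# lengths to the moment each word was "born" and average across words.
from collections import defaultdict

def length_vs_birth(utterances, word_births, bucket_days=30):
    """utterances: list of (day, text); word_births: {word: day first produced}.
    Returns {bucket: mean utterance length in words}; bucket 0 is the birth month."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for word, birth_day in word_births.items():
        for day, text in utterances:
            tokens = text.lower().split()
            if word in tokens:                       # caregiver used the word
                bucket = int((day - birth_day) // bucket_days)
                sums[bucket] += len(tokens)          # utterance length proxy
                counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Toy usage: "water" born on day 400; utterances get shorter near the birth.
utts = [(300, "do you want some nice cold water to drink"),
        (395, "water"),
        (405, "more water please"),
        (500, "should we pour the water into the big blue cup")]
print(length_vs_birth(utts, {"water": 400}))
```
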
So it appears that all three primary caregivers --
07:14
myself, my wife and our nanny --
07:16
were systematically and, I would think, subconsciously
07:19
restructuring our language
07:22
to meet him at the birth of a word
07:24
and bring him gently into more complex language.
07:27
And the implications of this -- there are many,
07:31
but one I just want to point out,
07:33
is that there must be amazing feedback loops.
07:35
Of course, my son is learning
07:38
from his linguistic environment,
07:40
but the environment is learning from him.
07:42
That environment, the people in it, are in these tight feedback loops
07:45
and creating a kind of scaffolding
07:48
that has not been noticed until now.
07:50
But that's looking at the speech context.
07:54
What about the visual context?
07:56
We're not looking at --
07:58
think of this as a dollhouse cutaway of our house.
08:00
We've taken those circular fish-eye lens cameras,
08:02
and we've done some optical correction,
08:05
and then we can bring it into three-dimensional life.
08:07
So welcome to my home.
08:11
This is a moment,
08:13
one moment captured across multiple cameras.
08:15
The reason we did this is to create the ultimate memory machine,
08:18
where you can go back and interactively fly around
08:21
and then breathe video-life into this system.
08:24
What I'm going to do
08:27
is give you an accelerated view of 30 minutes,
08:29
again, of just life in the living room.
08:32
That's me and my son on the floor.
08:34
And there's video analytics
08:37
that are tracking our movements.
08:39
My son is leaving red ink. I am leaving green ink.
08:41
We're now on the couch,
08:44
looking out through the window at cars passing by.
08:46
And finally, my son playing in a walking toy by himself.
08:49
Now we freeze the action, 30 minutes,
08:52
we turn time into the vertical axis,
08:55
and we open up for a view
08:57
of these interaction traces we've just left behind.
08:59
And we see these amazing structures --
09:02
these little knots of two colors of thread
09:05
we call "social hot spots."
09:08
The spiral thread
09:10
we call a "solo hot spot."
09:12
And we think that these affect the way language is learned.
09:14
What we'd like to do
09:17
is start understanding
09:19
the interaction between these patterns
09:21
and the language that my son is exposed to
09:23
to see if we can predict
09:25
how the structure of when words are heard
09:27
affects when they're learned --
09:29
so in other words, the relationship
09:31
between words and what they're about in the world.
09:33
So here's how we're approaching this.
09:37
In this video,
09:39
again, my son is being traced out.
09:41
He's leaving red ink behind.
09:43
And there's our nanny by the door.
09:45
(Video) Nanny: You want water? (Baby: Aaaa.)
09:47
Nanny: All right. (Baby: Aaaa.)
09:50
DR: She offers water,
09:53
and off go the two worms
09:55
over to the kitchen to get water.
09:57
And what we've done is use the word "water"
09:59
to tag that moment, that bit of activity.
10:01
And now we take the power of data
10:03
and take every time my son
10:05
ever heard the word water
10:08
and the context he saw it in,
10:10
and we use it to penetrate through the video
10:12
and find every activity trace
10:15
that co-occurred with an instance of water.
10:18
And what this data leaves in its wake
10:21
is a landscape.
10:23
We call these wordscapes.
10:25
This is the wordscape for the word water,
10:27
and you can see most of the action is in the kitchen.
10:29
That's where those big peaks are over to the left.
10:31
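
A sketch of how such a wordscape could be computed, assuming a child-location trace and a log of when each word was heard; the grid, floor-plan dimensions, and time window are made-up parameters for illustration:

```python
# Illustrative sketch of a "wordscape": a spatial histogram of where the child
# was whenever a given word was heard.
from collections import Counter

def wordscape(word, word_events, location_trace, grid=(10, 10),
              house=(10.0, 8.0), window_s=5.0):
    """word_events: list of (time_s, word); location_trace: list of (time_s, x, y).
    Returns a Counter mapping (col, row) grid cells to co-occurrence counts."""
    counts = Counter()
    times = [t for t, w in word_events if w == word]
    for t, x, y in location_trace:
        if any(abs(t - wt) <= window_s for wt in times):    # near a word event
            col = min(int(x / house[0] * grid[0]), grid[0] - 1)
            row = min(int(y / house[1] * grid[1]), grid[1] - 1)
            counts[(col, row)] += 1
    return counts

# Toy usage: "water" heard at t=100s while the child is near the kitchen sink.
events = [(100.0, "water"), (250.0, "bye")]
trace = [(99.0, 2.0, 3.0), (101.0, 2.2, 3.1), (250.0, 9.5, 0.5)]
print(wordscape("water", events, trace))
```
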
And just for contrast, we can do this with any word.
10:34
We can take the word "bye"
10:37
as in "good bye."
10:39
And we're now zoomed in over the entrance to the house.
10:41
And we look, and we find, as you would expect,
10:43
a contrast in the landscape
10:46
where the word "bye" occurs much more in a structured way.
10:48
So we're using these structures
10:51
to start predicting
10:53
the order of language acquisition,
10:55
and that's ongoing work now.
10:58
In my lab, which we're peering into now, at MIT --
11:00
this is at the Media Lab.
11:03
This has become my favorite way
11:05
of videographing just about any space.
11:07
Three of the key people in this project,
11:09
Philip DeCamp, Rony Kubat and Brandon Roy are pictured here.
11:11
Philip has been a close collaborator
11:14
on all the visualizations you're seeing.
11:16
And Michael Fleischman
11:18
was another Ph.D. student in my lab
11:21
who worked with me on this home video analysis,
11:23
and he made the following observation:
11:26
that "just the way that we're analyzing
11:29
how language connects to events
11:31
which provide common ground for language,
11:34
that same idea we can take out of your home, Deb,
11:36
and we can apply it to the world of public media."
11:40
And so our effort took an unexpected turn.
11:43
Think of mass media
11:46
as providing common ground
11:48
and you have the recipe
11:50
for taking this idea to a whole new place.
11:52
We've started analyzing television content
11:55
using the same principles --
11:58
analyzing event structure of a TV signal --
12:00
episodes of shows,
12:03
commercials,
12:05
all of the components that make up the event structure.
12:07
And we're now, with satellite dishes, pulling and analyzing
12:10
a good part of all the TV being watched in the United States.
12:13
And you don't have to now go and instrument living rooms with microphones
12:16
to get people's conversations;
12:19
you just tune into publicly available social media feeds.
12:21
So we're pulling in
12:24
about three billion comments a month,
12:26
and then the magic happens.
12:28
You have the event structure,
12:30
the common ground that the words are about,
12:32
coming out of the television feeds;
12:34
you've got the conversations
12:37
that are about those topics;
12:39
and through semantic analysis --
12:41
and this is actually real data you're looking at
12:44
from our data processing --
12:46
each yellow line is showing a link being made
12:48
between a comment in the wild
12:51
and a piece of event structure coming out of the television signal.
12:54
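
Bluefin's actual linking methods are unpublished; as a purely illustrative stand-in, the sketch below links a comment to a TV event when the comment appears shortly after the event and shares enough words with its description:

```python
# Illustrative stand-in only, not Bluefin's semantic analysis: link comments
# to TV events by time proximity and simple word overlap.
def link_comments(events, comments, max_lag_s=1800, min_overlap=2):
    """events: list of (start_s, description); comments: list of (time_s, text).
    Returns (comment_index, event_index) pairs, one link per matching comment."""
    links = []
    for ci, (ct, ctext) in enumerate(comments):
        cwords = set(ctext.lower().split())
        for ei, (et, edesc) in enumerate(events):
            recent = 0 <= ct - et <= max_lag_s              # aired recently
            overlap = len(cwords & set(edesc.lower().split()))
            if recent and overlap >= min_overlap:
                links.append((ci, ei))
                break                                        # keep first match
    return links

# Toy usage: one comment reacting to a State of the Union broadcast.
events = [(0, "state of the union address president obama")]
comments = [(600, "watching the state of the union right now"),
            (600, "my cat is asleep")]
print(link_comments(events, comments))   # -> [(0, 0)]
```
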
And the same idea now
12:57
can be built up.
12:59
And we get this wordscape,
13:01
except now words are not assembled in my living room.
13:03
Instead, the context, the common ground activities,
13:06
are the content on television that's driving the conversations.
13:10
And what we're seeing here, these skyscrapers now,
13:13
are commentary
13:16
that is linked to content on television.
13:18
Same concept,
13:20
but looking at communication dynamics
13:22
in a very different sphere.
13:24
And so fundamentally, rather than, for example,
13:26
measuring content based on how many people are watching,
13:28
this gives us the basic data
13:31
for looking at engagement properties of content.
13:33
And just like we can look at feedback cycles
13:36
and dynamics in a family,
13:39
we can now open up the same concepts
13:42
and look at much larger groups of people.
13:45
This is a subset of data from our database --
13:48
just 50,000 out of several million --
13:51
and the social graph that connects them
13:54
through publicly available sources.
13:56
And if you put them on one plane,
13:59
a second plane is where the content lives.
14:01
So we have the programs
14:04
and the sporting events
14:07
and the commercials,
14:09
and all of the link structures that tie them together
14:11
make a content graph.
14:13
And then the important third dimension.
14:15
Each of the links that you're seeing rendered here
14:19
is an actual connection made
14:21
between something someone said
14:23
and a piece of content.
14:26
And there are, again, now tens of millions of these links
14:28
that give us the connective tissue of social graphs
14:31
and how they relate to content.
14:34
And we can now start to probe the structure
14:37
in interesting ways.
14:39
So if we, for example, trace the path
14:41
of one piece of content
14:44
that drives someone to comment on it,
14:46
and then we follow where that comment goes,
14:48
and then look at the entire social graph that becomes activated
14:51
and then trace back to see the relationship
14:54
between that social graph and content,
14:57
a very interesting structure becomes visible.
14:59
We call this a co-viewing clique,
15:01
a virtual living room if you will.
15:03
And there are fascinating dynamics at play.
15:06
It's not one way.
15:08
A piece of content, an event, causes someone to talk.
15:10
They talk to other people.
15:13
That drives tune-in behavior back into mass media,
15:15
and you have these cycles
15:18
that drive the overall behavior.
15:20
Another example -- very different --
15:22
another actual person in our database --
15:24
and we're finding at least hundreds, if not thousands, of these.
15:27
We've given this person a name.
15:30
This is a pro-amateur, or pro-am media critic
15:32
who has this high fan-out rate.
15:35
So a lot of people are following this person -- very influential --
15:38
and they have a propensity to talk about what's on TV.
15:41
So this person is a key link
15:43
in connecting mass media and social media together.
15:46
One last example from this data:
15:49
Sometimes it's actually a piece of content that is special.
15:52
So if we go and look at this piece of content,
15:55
President Obama's State of the Union address
15:59
from just a few weeks ago,
16:02
and look at what we find in this same data set,
16:04
at the same scale,
16:07
the engagement properties of this piece of content
16:10
are truly remarkable.
16:12
A nation exploding in conversation
16:14
in real time
16:16
in response to what's on the broadcast.
16:18
And of course, through all of these lines
16:21
are flowing unstructured language.
16:23
We can X-ray
16:25
and get a real-time pulse of a nation,
16:27
real-time sense
16:29
of the social reactions in the different circuits in the social graph
16:31
being activated by content.
16:34
So, to summarize, the idea is this:
16:37
As our world becomes increasingly instrumented
16:40
and we have the capabilities
16:43
to collect and connect the dots
16:45
between what people are saying
16:47
and the context they're saying it in,
16:49
what's emerging is an ability
16:51
to see new social structures and dynamics
16:53
that have previously not been seen.
16:56
It's like building a microscope or telescope
16:58
and revealing new structures
17:00
about our own behavior around communication.
17:02
And I think the implications here are profound,
17:05
whether it's for science,
17:08
for commerce, for government,
17:10
or perhaps most of all,
17:12
for us as individuals.
17:14
And so just to return to my son,
17:17
when I was preparing this talk, he was looking over my shoulder,
17:20
and I showed him the clips I was going to show to you today,
17:23
and I asked him for permission -- granted.
17:25
And then I went on to reflect,
17:28
"Isn't it amazing,
17:30
this entire database, all these recordings,
17:33
I'm going to hand off to you and to your sister" --
17:36
who arrived two years later --
17:38
"and you guys are going to be able to go back and re-experience moments
17:41
that you could never, with your biological memory,
17:44
possibly remember the way you can now?"
17:47
And he was quiet for a moment.
17:49
And I thought, "What am I thinking?
17:51
He's five years old. He's not going to understand this."
17:53
And just as I was having that thought, he looked up at me and said,
17:55
"So that when I grow up,
17:58
I can show this to my kids?"
18:00
And I thought, "Wow, this is powerful stuff."
18:02
So I want to leave you
18:05
with one last memorable moment
18:07
from our family.
18:09
This is the first time our son
18:12
took more than two steps at once --
18:14
captured on film.
18:16
And I really want you to focus on something
18:18
as I take you through.
18:21
It's a cluttered environment; it's natural life.
18:23
My mother's in the kitchen, cooking,
18:25
and, of all places, in the hallway,
18:27
I realize he's about to do it, about to take more than two steps.
18:29
And so you hear me encouraging him,
18:32
realizing what's happening,
18:34
and then the magic happens.
18:36
Listen very carefully.
18:38
About three steps in,
18:40
he realizes something magic is happening,
18:42
and the most amazing feedback loop of all kicks in,
18:44
and he takes a breath in,
18:47
and he whispers "wow"
18:49
and instinctively I echo back the same.
18:51
And so let's fly back in time
18:56
to that memorable moment.
18:59
(Video) DR: Hey.
19:05
Come here.
19:07
Can you do it?
19:09
Oh, boy.
19:13
Can you do it?
19:15
Baby: Yeah.
19:18
DR: Ma, he's walking.
19:20
(Laughter)
19:24
(Applause)
19:26
DR: Thank you.
19:28
(Applause)
19:30


About the Speaker:

Deb Roy - Cognitive scientist
Deb Roy studies how children learn language, and designs machines that learn to communicate in human-like ways. On sabbatical from MIT Media Lab, he's working with the AI company Bluefin Labs.

Why you should listen

Deb Roy directs the Cognitive Machines group at the MIT Media Lab, where he studies how children learn language, and designs machines that learn to communicate in human-like ways. To enable this work, he has pioneered new data-driven methods for analyzing and modeling human linguistic and social behavior. He has authored numerous scientific papers on artificial intelligence, cognitive modeling, human-machine interaction, data mining, and information visualization.

Deb Roy co-founded and serves as CEO of Bluefin Labs, a venture-backed technology company. Built upon deep machine learning principles developed in his research over the past 15 years, Bluefin has created a technology platform that analyzes social media commentary to measure real-time audience response to TV ads and shows.


Roy adds some relevant papers:

Deb Roy. (2009). New Horizons in the Study of Child Language Acquisition. Proceedings of Interspeech 2009. Brighton, England. bit.ly/fSP4Qh

Brandon C. Roy, Michael C. Frank and Deb Roy. (2009). Exploring word learning in a high-density longitudinal corpus. Proceedings of the 31st Annual Meeting of the Cognitive Science Society. Amsterdam, Netherlands. bit.ly/e1qxej

Plenty more papers on our research including technology and methodology can be found here, together with other research from my lab at MIT: bit.ly/h3paSQ

The work that I mentioned on relationships between television content and the social graph is being done at Bluefin Labs (www.bluefinlabs.com). Details of this work have not been published. The social structures we are finding (and that I highlighted in my TED talk) are indeed new. The social media communication channels that are leading to their formation did not even exist a few years ago, and Bluefin's technology platform for discovering these kinds of structures is the first of its kind. We'll certainly have more to say about all this as we continue to dig into this fascinating new kind of data, and as new social structures continue to evolve!
