TED2015

Abe Davis: New video technology that reveals an object's hidden properties

March 17, 2015

Subtle motion happens around us all the time, including tiny vibrations caused by sound. New technology shows that we can pick up on these vibrations and actually re-create sound and conversations just from a video of a seemingly still object. But now Abe Davis takes it one step further: Watch him demo software that lets anyone interact with these hidden properties, just from a simple video.

Abe Davis - Computer scientist
Computer vision expert Abe Davis pioneers methods to extract audio from silent digital videos, even footage shot on ordinary consumer cameras.

Most of us think of motion
as a very visual thing.
00:13
If I walk across this stage
or gesture with my hands while I speak,
00:17
that motion is something that you can see.
00:22
But there's a world of important motion
that's too subtle for the human eye,
00:26
and over the past few years,
00:31
we've started to find that cameras
00:33
can often see this motion
even when humans can't.
00:35
So let me show you what I mean.
00:40
On the left here, you see video
of a person's wrist,
00:42
and on the right, you see video
of a sleeping infant,
00:46
but if I didn't tell you
that these were videos,
00:49
you might assume that you were looking
at two regular images,
00:52
because in both cases,
00:56
these videos appear to be
almost completely still.
00:57
But there's actually a lot
of subtle motion going on here,
01:01
and if you were to touch
the wrist on the left,
01:05
you would feel a pulse,
01:08
and if you were to hold
the infant on the right,
01:10
you would feel the rise
and fall of her chest
01:12
as she took each breath.
01:15
And these motions carry
a lot of significance,
01:17
but they're usually
too subtle for us to see,
01:21
so instead, we have to observe them
01:24
through direct contact, through touch.
01:26
But a few years ago,
01:30
my colleagues at MIT developed
what they call a motion microscope,
01:32
which is software that finds
these subtle motions in video
01:36
and amplifies them so that they
become large enough for us to see.
01:40
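To make that mechanism concrete, here is a minimal sketch of the Eulerian idea behind this kind of motion magnification: band-pass each pixel's intensity over time at the frequencies of interest, then add an amplified copy back into the video. The actual MIT systems work on spatial pyramids and local phase rather than raw pixels, so treat this as the core intuition only; the function name and parameter values are illustrative, not theirs.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def magnify_motion(frames, fps, f_lo, f_hi, alpha):
    """frames: (T, H, W) float video in [0, 1]; f_lo/f_hi in Hz; alpha: gain."""
    b, a = butter(2, [f_lo / (fps / 2), f_hi / (fps / 2)], btype="band")
    # Band-pass each pixel's intensity time series (time is axis 0).
    bandpassed = filtfilt(b, a, frames, axis=0)
    # For sub-pixel motion, intensity change approximates motion to first
    # order, so amplifying the filtered signal amplifies the apparent motion.
    return np.clip(frames + alpha * bandpassed, 0.0, 1.0)

# Example: magnify a ~1 Hz pulse in 30 fps video of a wrist.
# magnified = magnify_motion(video, fps=30, f_lo=0.8, f_hi=1.5, alpha=50)
```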
And so, if we use their software
on the left video,
01:45
it lets us see the pulse in this wrist,
01:48
and if we were to count that pulse,
01:51
we could even figure out
this person's heart rate.
01:53
And if we used the same software
on the right video,
01:56
it lets us see each breath
that this infant takes,
01:59
and we can use this as a contact-free way
to monitor her breathing.
02:03
And so this technology is really powerful
because it takes these phenomena
02:08
that we normally have
to experience through touch
02:14
and it lets us capture them visually
and non-invasively.
02:16
So a couple years ago, I started working
with the folks that created that software,
02:20
and we decided to pursue a crazy idea.
02:25
We thought, it's cool
that we can use software
02:28
to visualize tiny motions like this,
02:31
and you can almost think of it
as a way to extend our sense of touch.
02:34
But what if we could do the same thing
with our ability to hear?
02:38
What if we could use video
to capture the vibrations of sound,
02:44
which are just another kind of motion,
02:48
and turn everything that we see
into a microphone?
02:51
Now, this is a bit of a strange idea,
02:56
so let me try to put it
in perspective for you.
02:58
Traditional microphones
work by converting the motion
03:01
of an internal diaphragm
into an electrical signal,
03:04
and that diaphragm is designed
to move readily with sound
03:08
so that its motion can be recorded
and interpreted as audio.
03:12
But sound causes all objects to vibrate.
03:17
Those vibrations are just usually
too subtle and too fast for us to see.
03:21
So what if we record them
with a high-speed camera
03:26
and then use software
to extract tiny motions
03:30
from our high-speed video,
03:34
and analyze those motions to figure out
what sounds created them?
03:36
This would let us turn visible objects
into visual microphones from a distance.
03:41
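As a rough illustration of that idea, the sketch below converts video into a one-dimensional audio signal by estimating a global sub-pixel shift for every frame and treating those shifts as samples at the camera's frame rate. The published visual microphone (Davis et al., SIGGRAPH 2014) instead averages local phase variations across a complex steerable pyramid; the phase-correlation shortcut and the function name here are simplifying assumptions.

```python
import numpy as np
from skimage.registration import phase_cross_correlation

def video_to_audio(frames, fps):
    """frames: (T, H, W) grayscale video; returns (signal, sample_rate)."""
    ref = frames[0]
    shifts = []
    for frame in frames:
        # Sub-pixel shift of this frame relative to the reference frame.
        shift, _, _ = phase_cross_correlation(ref, frame, upsample_factor=100)
        shifts.append(shift[0])  # one axis of motion stands in for the sound
    signal = np.asarray(shifts, dtype=float)
    signal -= signal.mean()                  # remove DC offset and drift
    signal /= (np.abs(signal).max() + 1e-9)  # normalize to [-1, 1]
    return signal, fps                       # sample rate = camera frame rate
```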
And so we tried this out,
03:48
and here's one of our experiments,
03:51
where we took this potted plant
that you see on the right
03:53
and we filmed it with a high-speed camera
03:55
while a nearby loudspeaker
played this sound.
03:58
(Music: "Mary Had a Little Lamb")
04:02
And so here's the video that we recorded,
04:11
and we recorded it at thousands
of frames per second,
04:14
but even if you look very closely,
04:18
all you'll see are some leaves
04:20
that are pretty much
just sitting there doing nothing,
04:22
because our sound only moved those leaves
by about a micrometer.
04:25
That's one ten-thousandth of a centimeter,
04:30
which spans somewhere between
a hundredth and a thousandth
04:35
of a pixel in this image.
04:39
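(To see where that range comes from: if each pixel in this shot covers somewhere between a tenth of a millimeter and a millimeter of the plant, a plausible scale for a scene like this, then a one-micrometer motion works out to between a thousandth and a hundredth of a pixel.)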
So you can squint all you want,
04:41
but motion that small is pretty much
perceptually invisible.
04:44
But it turns out that something
can be perceptually invisible
04:49
and still be numerically significant,
04:53
because with the right algorithms,
04:56
we can take this silent,
seemingly still video
04:58
and we can recover this sound.
05:02
(Music: "Mary Had a Little Lamb")
05:04
(Applause)
05:11
So how is this possible?
05:21
How can we get so much information
out of so little motion?
05:23
Well, let's say that those leaves
move by just a single micrometer,
05:28
and let's say that that shifts our image
by just a thousandth of a pixel.
05:33
That may not seem like much,
05:39
but a single frame of video
05:41
may have hundreds of thousands
of pixels in it,
05:43
and so if we combine all
of the tiny motions that we see
05:46
from across that entire image,
05:50
then suddenly a thousandth of a pixel
05:52
can start to add up
to something pretty significant.
05:55
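The gain from pooling pixels is ordinary noise averaging: combining N independent noisy estimates shrinks the noise by a factor of roughly the square root of N, so a few hundred thousand pixels can reveal a shift hundreds of times smaller than any single pixel could. A quick numerical check, with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
true_shift = 0.001                  # pixels: the signal we want to recover
n_pixels = 300_000                  # rough pixel count of one video frame
noise = 0.1                         # per-pixel estimation noise, in pixels

# Each pixel gives a noisy estimate of the same tiny shift.
per_pixel = true_shift + noise * rng.standard_normal(n_pixels)
# Averaging drives the noise down to ~0.1/sqrt(300000) ~ 0.00018 pixels,
# small enough that the 0.001-pixel shift stands out.
print(per_pixel.mean())
```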
On a personal note, we were pretty psyched
when we figured this out.
05:58
(Laughter)
06:02
But even with the right algorithm,
06:04
we were still missing
a pretty important piece of the puzzle.
06:07
You see, there are a lot of factors
that affect when and how well
06:11
this technique will work.
06:15
There's the object and how far away it is;
06:17
there's the camera
and the lens that you use;
06:20
how much light is shining on the object
and how loud your sound is.
06:22
And even with the right algorithm,
06:27
we had to be very careful
with our early experiments,
06:31
because if we got
any of these factors wrong,
06:34
there was no way to tell
what the problem was.
06:36
We would just get noise back.
06:39
And so a lot of our early
experiments looked like this.
06:41
And so here I am,
06:45
and on the bottom left, you can kind of
see our high-speed camera,
06:47
which is pointed at a bag of chips,
06:51
and the whole thing is lit
by these bright lamps.
06:53
And like I said, we had to be
very careful in these early experiments,
06:56
so this is how it went down.
07:01
(Video) Abe Davis: Three, two, one, go.
07:03
Mary had a little lamb!
Little lamb! Little lamb!
07:07
(Laughter)
07:12
AD: So this experiment
looks completely ridiculous.
07:17
(Laughter)
07:19
I mean, I'm screaming at a bag of chips --
07:21
(Laughter) --
07:24
and we're blasting it with so much light,
07:25
we literally melted the first bag
we tried this on. (Laughter)
07:27
But ridiculous as this experiment looks,
07:32
it was actually really important,
07:35
because we were able
to recover this sound.
07:37
(Audio) Mary had a little lamb!
Little lamb! Little lamb!
07:40
(Applause)
07:45
AD: And this was really significant,
07:49
because it was the first time
we recovered intelligible human speech
07:51
from silent video of an object.
07:55
And so it gave us this point of reference,
07:57
and gradually we could start
to modify the experiment,
07:59
using different objects
or moving the object further away,
08:03
using less light or quieter sounds.
08:07
And we analyzed all of these experiments
08:11
until we really understood
the limits of our technique,
08:14
because once we understood those limits,
08:18
we could figure out how to push them.
08:20
And that led to experiments like this one,
08:22
where again, I'm going to speak
to a bag of chips,
08:25
but this time we've moved our camera
about 15 feet away,
08:28
outside, behind a soundproof window,
08:33
and the whole thing is lit
by only natural sunlight.
08:36
And so here's the video that we captured.
08:40
And this is what things sounded like
from inside, next to the bag of chips.
08:44
(Audio) Mary had a little lamb
whose fleece was white as snow,
08:48
and everywhere that Mary went,
that lamb was sure to go.
08:53
AD: And here's what we were able
to recover from our silent video
08:59
captured outside behind that window.
09:03
(Audio) Mary had a little lamb
whose fleece was white as snow,
09:05
and everywhere that Mary went,
that lamb was sure to go.
09:10
(Applause)
09:15
AD: And there are other ways
that we can push these limits as well.
09:22
So here's a quieter experiment
09:25
where we filmed some earphones
plugged into a laptop computer,
09:27
and in this case, our goal was to recover
the music that was playing on that laptop
09:31
from just silent video
09:35
of these two little plastic earphones,
09:38
and we were able to do this so well
09:40
that I could even Shazam our results.
09:42
(Laughter)
09:45
(Music: "Under Pressure" by Queen)
09:49
(Applause)
10:01
And we can also push things
by changing the hardware that we use.
10:06
Because the experiments
I've shown you so far
10:10
were done with a camera,
a high-speed camera,
10:13
that can record video
about 100 times faster
10:15
than most cell phones,
10:18
but we've also found a way
to use this technique
10:20
with more regular cameras,
10:23
and we do that by taking advantage
of what's called a rolling shutter.
10:25
You see, most cameras
record images one row at a time,
10:29
and so if an object moves
during the recording of a single image,
10:34
there's a slight time delay
between each row,
10:40
and this causes slight artifacts
10:42
that get coded into each frame of a video.
10:46
And so what we found
is that by analyzing these artifacts,
10:49
we can actually recover sound
using a modified version of our algorithm.
10:53
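To see why a rolling shutter helps, look at when each row is captured: a frame is not one sample of the scene but a whole stack of samples spaced by the sensor's line delay, so even an ordinary camera can sample motion tens of thousands of times per second. The sketch below lays those row times out on one axis; the line-delay value is an assumed, typical figure, not one from the talk.

```python
import numpy as np

def rolling_shutter_timeline(n_frames, n_rows, fps, line_delay_s):
    """Capture time of every row in every frame, as one flat array."""
    frame_t = np.arange(n_frames)[:, None] / fps        # start of each frame
    row_t = np.arange(n_rows)[None, :] * line_delay_s   # offset of each row
    return (frame_t + row_t).ravel()

# A 60 fps, 1080-row sensor with a ~15 microsecond line delay yields about
# 65,000 row-times per second (with small gaps between frame readouts),
# far finer temporal sampling than 60 whole-frame samples would suggest.
t = rolling_shutter_timeline(n_frames=60, n_rows=1080, fps=60, line_delay_s=15e-6)
print(len(t), t[:3])
```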
So here's an experiment we did
10:57
where we filmed a bag of candy
10:59
while a nearby loudspeaker played
11:01
the same "Mary Had a Little Lamb"
music from before,
11:03
but this time, we used just a regular
store-bought camera,
11:06
and so in a second, I'll play for you
the sound that we recovered,
11:10
and it's going to sound
distorted this time,
11:13
but listen and see if you can still
recognize the music.
11:15
(Audio: "Mary Had a Little Lamb")
11:19
And so, again, that sounds distorted,
11:37
but what's really amazing here
is that we were able to do this
11:40
with something
that you could literally run out
11:45
and pick up at a Best Buy.
11:47
So at this point,
11:50
a lot of people see this work,
11:52
and they immediately think
about surveillance.
11:54
And to be fair,
11:57
it's not hard to imagine how you might use
this technology to spy on someone.
12:00
But keep in mind that there's already
a lot of very mature technology
12:04
out there for surveillance.
12:08
In fact, people have been using lasers
12:09
to eavesdrop on objects
from a distance for decades.
12:11
But what's really new here,
12:15
what's really different,
12:17
is that now we have a way
to picture the vibrations of an object,
12:19
which gives us a new lens
through which to look at the world,
12:23
and we can use that lens
12:26
to learn not just about forces like sound
that cause an object to vibrate,
12:28
but also about the object itself.
12:33
And so I want to take a step back
12:36
and think about how that might change
the ways that we use video,
12:38
because we usually use video
to look at things,
12:42
and I've just shown you how we can use it
12:46
to listen to things.
12:48
But there's another important way
that we learn about the world:
12:50
that's by interacting with it.
12:54
We push and pull and poke and prod things.
12:56
We shake things and see what happens.
12:59
And that's something that video
still won't let us do,
13:03
at least not traditionally.
13:07
So I want to show you some new work,
13:09
and this is based on an idea I had
just a few months ago,
13:11
so this is actually the first time
I've shown it to a public audience.
13:14
And the basic idea is that we're going
to use the vibrations in a video
13:17
to capture objects in a way
that will let us interact with them
13:22
and see how they react to us.
13:27
So here's an object,
13:30
and in this case, it's a wire figure
in the shape of a human,
13:32
and we're going to film that object
with just a regular camera.
13:36
So there's nothing special
about this camera.
13:39
In fact, I've actually done this
with my cell phone before.
13:41
But we do want to see the object vibrate,
13:44
so to make that happen,
13:46
we're just going to bang a little bit
on the surface where it's resting
13:48
while we record this video.
13:51
So that's it: just five seconds
of regular video,
13:59
while we bang on this surface,
14:02
and we're going to use
the vibrations in that video
14:05
to learn about the structural
and material properties of our object,
14:08
and we're going to use that information
to create something new and interactive.
14:13
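Here is a rough sketch of the simulation idea behind what follows: the vibrations in the video reveal a set of modes, each with a frequency, a damping factor, and an image-space shape, and the interactive object is animated by driving each mode as a damped harmonic oscillator under the user's force. All numbers and names below are placeholders, not values recovered from a real video.

```python
import numpy as np

def simulate_mode(freq_hz, damping, force, dt=1/60, steps=120):
    """Response q(t) of one damped vibration mode to a force impulse at t=0."""
    w = 2 * np.pi * freq_hz
    q, qd = 0.0, 0.0
    out = []
    for k in range(steps):
        f = force if k == 0 else 0.0
        # Damped harmonic oscillator: q'' + 2*damping*w*q' + w^2*q = f
        qdd = f - 2 * damping * w * qd - w * w * q
        qd += qdd * dt          # semi-implicit Euler integration step
        q += qd * dt
        out.append(q)
    return np.array(out)

# The rendered deformation at each instant would then be the sum over modes
# of q_i(t) times that mode's image-space displacement field (its "shape"
# as observed in the video).
q = simulate_mode(freq_hz=5.0, damping=0.05, force=1.0)
```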
And so here's what we've created.
14:24
And it looks like a regular image,
14:27
but this isn't an image,
and it's not a video,
14:29
because now I can take my mouse
14:32
and I can start interacting
with the object.
14:35
And so what you see here
14:44
is a simulation of how this object
14:47
would respond to new forces
that we've never seen before,
14:49
and we created it from just
five seconds of regular video.
14:53
(Applause)
14:59
And so this is a really powerful
way to look at the world,
15:09
because it lets us predict
how objects will respond
15:12
to new situations,
15:15
and you could imagine, for instance,
looking at an old bridge
15:17
and wondering what would happen,
how would that bridge hold up
15:20
if I were to drive my car across it.
15:24
And that's a question
that you probably want to answer
15:27
before you start driving
across that bridge.
15:29
And of course, there are going to be
limitations to this technique,
15:33
just like there were
with the visual microphone,
15:37
but we found that it works
in a lot of situations
15:39
that you might not expect,
15:42
especially if you give it longer videos.
15:44
So for example,
here's a video that I captured
15:47
of a bush outside of my apartment,
15:49
and I didn't do anything to this bush,
15:52
but by capturing a minute-long video,
15:55
a gentle breeze caused enough vibrations
15:57
that we could learn enough about this bush
to create this simulation.
16:01
(Applause)
16:07
And so you could imagine giving this
to a film director,
16:13
and letting him control, say,
16:16
the strength and direction of wind
in a shot after it's been recorded.
16:17
Or, in this case, we pointed our camera
at a hanging curtain,
16:24
and you can't even see
any motion in this video,
16:29
but by recording a two-minute-long video,
16:33
natural air currents in this room
16:36
created enough subtle,
imperceptible motions and vibrations
16:38
that we could learn enough
to create this simulation.
16:43
And ironically,
16:48
we're kind of used to having
this kind of interactivity
16:50
when it comes to virtual objects,
16:53
when it comes to video games
and 3D models,
16:56
but to be able to capture this information
from real objects in the real world
16:59
using just simple, regular video,
17:03
is something new that has
a lot of potential.
17:06
So here are the amazing people
who worked with me on these projects.
17:10
(Applause)
17:15
And what I've shown you today
is only the beginning.
17:24
We've just started to scratch the surface
17:27
of what you can do
with this kind of imaging,
17:29
because it gives us a new way
17:32
to capture our surroundings
with common, accessible technology.
17:35
And so looking to the future,
17:39
it's going to be
really exciting to explore
17:41
what this can tell us about the world.
17:43
Thank you.
17:46
(Applause)
17:47


Abe Davis - Computer scientist

Why you should listen

MIT PhD student, computer vision wizard, and rap artist Abe Davis has co-created the world’s most improbable audio instrument. In 2014, Davis and his collaborators debuted the “visual microphone,” an algorithm that samples the sympathetic vibrations of ordinary objects (such as a potato chip bag) from ordinary high-speed video footage and transduces them into intelligible audio tracks.

Davis is also the author of Caperture, a 3D-imaging app designed to create and share 3D images on any compatible smartphone.



Data provided by TED.
