ABOUT THE SPEAKER
Abe Davis - Computer scientist
Computer vision expert Abe Davis pioneers methods to extract audio from silent digital videos, even footage shot on ordinary consumer cameras.

Why you should listen

MIT PhD student, computer vision wizard and rap artist Abe Davis has co-created the world’s most improbable audio instrument. In 2014, Davis and his collaborators debuted the “visual microphone,” an algorithm that samples the sympathetic vibrations of ordinary objects (such as a potato chip bag) from ordinary high-speed video footage and transduces them into intelligible audio tracks.

Davis is also the author of Caperture, a 3D-imaging app designed to create and share 3D images on any compatible smartphone.

More profile about the speaker
Abe Davis | Speaker | TED.com
TED2015

Abe Davis: New video technology that reveals an object's hidden properties

1,482,525 views

Subtle motion happens around us all the time, including tiny vibrations caused by sound. New technology shows that we can pick up on these vibrations and actually re-create sound and conversations just from a video of a seemingly still object. But now Abe Davis takes it one step further: Watch him demo software that lets anyone interact with these hidden properties, just from a simple video.

00:13
Most of us think of motion as a very visual thing. If I walk across this stage or gesture with my hands while I speak, that motion is something that you can see. But there's a world of important motion that's too subtle for the human eye, and over the past few years, we've started to find that cameras can often see this motion even when humans can't.
00:40
So let me show you what I mean. On the left here, you see video of a person's wrist, and on the right, you see video of a sleeping infant, but if I didn't tell you that these were videos, you might assume that you were looking at two regular images, because in both cases, these videos appear to be almost completely still.
01:02
But there's actually a lot of subtle motion going on here, and if you were to touch the wrist on the left, you would feel a pulse, and if you were to hold the infant on the right, you would feel the rise and fall of her chest as she took each breath. And these motions carry a lot of significance, but they're usually too subtle for us to see, so instead, we have to observe them through direct contact, through touch.
01:30
But a few years ago, my colleagues at MIT developed what they call a motion microscope, which is software that finds these subtle motions in video and amplifies them so that they become large enough for us to see. And so, if we use their software on the left video, it lets us see the pulse in this wrist, and if we were to count that pulse, we could even figure out this person's heart rate.
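Counting a pulse to get a heart rate can be sketched in a few lines. This is an illustration only, not the MIT motion-microscope software: it assumes a one-dimensional trace has already been extracted from the magnified wrist video (a synthetic 1.2 Hz sine wave stands in for it here), and the function name `estimate_bpm` is hypothetical.

```python
import math


def estimate_bpm(signal, fps):
    """Estimate beats per minute by counting upward zero crossings.

    `signal` is a list of samples (one per video frame) and `fps` is the
    frame rate. Each upward crossing of the mean is treated as one beat.
    """
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    beats = sum(
        1 for prev, cur in zip(centered, centered[1:]) if prev < 0 <= cur
    )
    duration_s = len(signal) / fps
    return 60.0 * beats / duration_s


# Synthetic stand-in for a wrist trace: a 1.2 Hz pulse, 10 s at 30 fps.
fps = 30
seconds = 10
trace = [
    math.sin(2 * math.pi * 1.2 * (n / fps) - 0.5)
    for n in range(fps * seconds)
]
bpm = estimate_bpm(trace, fps)
print(f"estimated heart rate: {bpm:.1f} bpm")
```

A real trace would be noisy, so a practical version would likely smooth or band-pass the signal before counting crossings.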
01:57
And if we used the same software on the right video, it lets us see each breath that this infant takes, and we can use this as a contact-free way to monitor her breathing.
02:08
And so this technology is really powerful because it takes these phenomena that we normally have to experience through touch and it lets us capture them visually and non-invasively.
02:21
So a couple years ago, I started working with the folks that created that software, and we decided to pursue a crazy idea. We thought, it's cool that we can use software to visualize tiny motions like this, and you can almost think of it as a way to extend our sense of touch. But what if we could do the same thing with our ability to hear?
02:44
What if we could use video to capture the vibrations of sound, which are just another kind of motion, and turn everything that we see into a microphone?
02:56
Now, this is a bit of a strange idea, so let me try to put it in perspective for you. Traditional microphones work by converting the motion of an internal diaphragm into an electrical signal, and that diaphragm is designed to move readily with sound so that its motion can be recorded and interpreted as audio. But sound causes all objects to vibrate. Those vibrations are just usually too subtle and too fast for us to see. So what if we record them with a high-speed camera and then use software to extract tiny motions from our high-speed video, and analyze those motions to figure out what sounds created them?
03:41
This would let us turn visible objects into visual microphones from a distance.
03:49
And so we tried this out, and here's one of our experiments, where we took this potted plant that you see on the right and we filmed it with a high-speed camera while a nearby loudspeaker played this sound.
04:02
(Music: "Mary Had a Little Lamb")
04:11
And so here's the video that we recorded, and we recorded it at thousands of frames per second, but even if you look very closely, all you'll see are some leaves that are pretty much just sitting there doing nothing, because our sound only moved those leaves by about a micrometer. That's one ten-thousandth of a centimeter, which spans somewhere between a hundredth and a thousandth of a pixel in this image.
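The numbers quoted here can be checked with a little arithmetic. The sketch below only restates the talk's figures; the helper name `pixel_footprint_um` is hypothetical, and the implied pixel size is an inference from those figures, not something stated in the talk.

```python
MICROMETERS_PER_CENTIMETER = 10_000


def pixel_footprint_um(displacement_um, fraction_of_pixel):
    """Width of scene covered by one pixel, in micrometers, given that a
    motion of `displacement_um` spans `fraction_of_pixel` of a pixel."""
    return displacement_um / fraction_of_pixel


displacement_um = 1.0  # "about a micrometer"
displacement_cm = displacement_um / MICROMETERS_PER_CENTIMETER

# "somewhere between a hundredth and a thousandth of a pixel"
fine = pixel_footprint_um(displacement_um, 1 / 100)
coarse = pixel_footprint_um(displacement_um, 1 / 1000)

print(f"1 micrometer = {displacement_cm:g} cm")
print(f"implied pixel footprint: {fine / 1000:g} mm to {coarse / 1000:g} mm")
```

In other words, if one micrometer of motion covers a hundredth to a thousandth of a pixel, each pixel is looking at roughly a tenth of a millimeter to a full millimeter of leaf.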
04:41
So you can squint all you want, but motion that small is pretty much perceptually invisible.
04:49
But it turns out that something can be perceptually invisible and still be numerically significant, because with the right algorithms, we can take this silent, seemingly still video and we can recover this sound.
05:04
(Music: "Mary Had a Little Lamb")
05:12
(Applause)
05:22
So how is this possible? How can we get so much information out of so little motion? Well, let's say that those leaves move by just a single micrometer, and let's say that that shifts our image by just a thousandth of a pixel.
05:39
That may not seem like much, but a single frame of video may have hundreds of thousands of pixels in it, and so if we combine all of the tiny motions that we see from across that entire image, then suddenly a thousandth of a pixel can start to add up to something pretty significant.
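One way to see why combining many pixels helps is a toy averaging experiment. This is a sketch under loud assumptions, not the visual microphone algorithm: it assumes every pixel gives an independent estimate of the same shift corrupted by Gaussian noise, and the noise level, pixel count, and function names are all hypothetical.

```python
import math
import random
import statistics


def averaged_estimate(true_shift, noise_std, n_pixels, rng):
    """Average `n_pixels` noisy per-pixel estimates of the same shift."""
    return statistics.fmean(
        true_shift + rng.gauss(0.0, noise_std) for _ in range(n_pixels)
    )


def expected_standard_error(noise_std, n_pixels):
    """Standard error of the mean for independent noise: sigma / sqrt(N)."""
    return noise_std / math.sqrt(n_pixels)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_shift = 0.001       # a thousandth of a pixel
noise_std = 0.1          # assumed: each pixel is 100x noisier than the signal

single = averaged_estimate(true_shift, noise_std, 1, rng)
combined = averaged_estimate(true_shift, noise_std, 200_000, rng)

print(f"one pixel:       {single:+.5f}")
print(f"200,000 pixels:  {combined:+.5f}")
print(f"expected error:  {expected_standard_error(noise_std, 200_000):.5f}")
```

Under these assumptions the error of the average shrinks with the square root of the number of pixels, so a shift that is buried in noise at any single pixel becomes measurable across the whole frame.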
05:58
On a personal note, we were pretty psyched when we figured this out.
06:02
(Laughter)
06:04
But even with the right algorithm, we were still missing a pretty important piece of the puzzle. You see, there are a lot of factors that affect when and how well this technique will work. There's the object and how far away it is; there's the camera and the lens that you use; how much light is shining on the object and how loud your sound is.
06:27
And even with the right algorithm, we had to be very careful with our early experiments, because if we got any of these factors wrong, there was no way to tell what the problem was. We would just get noise back.
06:42
And so a lot of our early experiments looked like this. And so here I am, and on the bottom left, you can kind of see our high-speed camera, which is pointed at a bag of chips, and the whole thing is lit by these bright lamps. And like I said, we had to be very careful in these early experiments, so this is how it went down.
07:03
(Video) Abe Davis: Three, two, one, go. Mary had a little lamb! Little lamb! Little lamb!
07:12
(Laughter)
07:17
AD: So this experiment looks completely ridiculous.
07:20
(Laughter)
07:21
I mean, I'm screaming at a bag of chips --
07:24
(Laughter) --
07:25
and we're blasting it with so much light, we literally melted the first bag we tried this on. (Laughter)
07:32
But ridiculous as this experiment looks, it was actually really important, because we were able to recover this sound.
07:40
(Audio) Mary had a little lamb! Little lamb! Little lamb!
07:45
(Applause)
07:49
AD: And this was really significant, because it was the first time we recovered intelligible human speech from silent video of an object. And so it gave us this point of reference, and gradually we could start to modify the experiment, using different objects or moving the object further away, using less light or quieter sounds.
08:11
And we analyzed all of these experiments until we really understood the limits of our technique, because once we understood those limits, we could figure out how to push them.
08:22
And that led to experiments like this one, where again, I'm going to speak to a bag of chips, but this time we've moved our camera about 15 feet away, outside, behind a soundproof window, and the whole thing is lit by only natural sunlight.
08:40
And so here's the video that we captured. And this is what things sounded like from inside, next to the bag of chips.
08:49
(Audio) Mary had a little lamb whose fleece was white as snow, and everywhere that Mary went, that lamb was sure to go.
08:59
AD: And here's what we were able to recover from our silent video captured outside behind that window.
09:06
(Audio) Mary had a little lamb whose fleece was white as snow, and everywhere that Mary went, that lamb was sure to go.
09:15
(Applause)
09:22
AD: And there are other ways that we can push these limits as well. So here's a quieter experiment where we filmed some earphones plugged into a laptop computer, and in this case, our goal was to recover the music that was playing on that laptop from just silent video of these two little plastic earphones, and we were able to do this so well that I could even Shazam our results.
09:45
(Laughter)
09:49
(Music: "Under Pressure" by Queen)
10:01
(Applause)
10:06
And we can also push things by changing the hardware that we use. Because the experiments I've shown you so far were done with a camera, a high-speed camera, that can record video about 100 times faster than most cell phones, but we've also found a way to use this technique with more regular cameras, and we do that by taking advantage of what's called a rolling shutter.
10:29
You see, most cameras record images one row at a time, and so if an object moves during the recording of a single image, there's a slight time delay between each row, and this causes slight artifacts that get coded into each frame of a video. And so what we found is that by analyzing these artifacts, we can actually recover sound using a modified version of our algorithm.
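The row-by-row timing of a rolling shutter can be sketched as follows. This is not the modified algorithm mentioned in the talk, only the timing idea behind it: the frame rate, row count, and readout time are hypothetical values, and real sensors differ.

```python
def row_capture_times(frame_index, fps, n_rows, readout_s):
    """Return the capture time in seconds of every row in one frame.

    Assumes rows are read top to bottom at a constant rate, and that the
    whole frame takes `readout_s` seconds to read out.
    """
    frame_start = frame_index / fps
    line_delay = readout_s / n_rows
    return [frame_start + row * line_delay for row in range(n_rows)]


fps = 60            # hypothetical consumer frame rate
n_rows = 1080       # hypothetical sensor height
readout_s = 0.012   # hypothetical: 12 ms to read all rows

times = row_capture_times(0, fps, n_rows, readout_s)
row_rate_hz = n_rows / readout_s
gap_s = 1 / fps - readout_s  # time per frame when no row is being read

print(f"frame rate:            {fps} Hz")
print(f"row sampling rate:     {row_rate_hz:.0f} Hz during readout")
print(f"unsampled gap / frame: {gap_s * 1000:.2f} ms")
```

With these assumed numbers, each row is a separate moment in time, so within a frame the scene is sampled far faster than the frame rate, although with a gap between frames where nothing is recorded. That gap is one plausible reason the recovered audio would sound distorted.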
10:58
So here's an experiment we did where we filmed a bag of candy while a nearby loudspeaker played the same "Mary Had a Little Lamb" music from before, but this time, we used just a regular store-bought camera, and so in a second, I'll play for you the sound that we recovered, and it's going to sound distorted this time, but listen and see if you can still recognize the music.
11:19
(Audio: "Mary Had a Little Lamb")
11:37
And so, again, that sounds distorted, but what's really amazing here is that we were able to do this with something that you could literally run out and pick up at a Best Buy.
11:51
So at this point, a lot of people see this work, and they immediately think about surveillance. And to be fair, it's not hard to imagine how you might use this technology to spy on someone. But keep in mind that there's already a lot of very mature technology out there for surveillance. In fact, people have been using lasers to eavesdrop on objects from a distance for decades.
12:15
But what's really new here, what's really different, is that now we have a way to picture the vibrations of an object, which gives us a new lens through which to look at the world, and we can use that lens to learn not just about forces like sound that cause an object to vibrate, but also about the object itself.
12:36
And so I want to take a step back and think about how that might change the ways that we use video, because we usually use video to look at things, and I've just shown you how we can use it to listen to things. But there's another important way that we learn about the world: that's by interacting with it. We push and pull and poke and prod things. We shake things and see what happens. And that's something that video still won't let us do, at least not traditionally.
13:09
So I want to show you some new work, and this is based on an idea I had just a few months ago, so this is actually the first time I've shown it to a public audience. And the basic idea is that we're going to use the vibrations in a video to capture objects in a way that will let us interact with them and see how they react to us.
13:31
So here's an object, and in this case, it's a wire figure in the shape of a human, and we're going to film that object with just a regular camera. So there's nothing special about this camera. In fact, I've actually done this with my cell phone before. But we do want to see the object vibrate, so to make that happen, we're just going to bang a little bit on the surface where it's resting while we record this video.
13:59
So that's it: just five seconds of regular video, while we bang on this surface, and we're going to use the vibrations in that video to learn about the structural and material properties of our object, and we're going to use that information to create something new and interactive.
14:24
And so here's what we've created. And it looks like a regular image, but this isn't an image, and it's not a video, because now I can take my mouse and I can start interacting with the object.
14:44
And so what you see here is a simulation of how this object would respond to new forces that we've never seen before, and we created it from just five seconds of regular video.
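The talk does not spell out how the simulation is built, so the following is only a loose illustration of the general idea of predicting a response to a new force once vibration properties are known. It assumes the object behaves like a single damped oscillator whose frequency and damping have already been measured; the values and the names `simulate_mode` and `poke` are hypothetical.

```python
import math


def simulate_mode(freq_hz, damping_ratio, force, dt, steps):
    """Integrate x'' + 2*zeta*omega*x' + omega^2*x = force(t)
    with semi-implicit Euler, starting from rest.

    Returns the displacement at each time step.
    """
    omega = 2 * math.pi * freq_hz
    x, v = 0.0, 0.0
    displacements = []
    for n in range(steps):
        a = force(n * dt) - 2 * damping_ratio * omega * v - omega ** 2 * x
        v += a * dt   # update velocity first
        x += v * dt   # then position, using the new velocity
        displacements.append(x)
    return displacements


def poke(t):
    """A new force the video never showed: a brief push, then release."""
    return 1.0 if t < 0.05 else 0.0


# Assumed properties, standing in for values measured from video.
response = simulate_mode(
    freq_hz=3.0, damping_ratio=0.05, force=poke, dt=0.001, steps=5000
)
early_peak = max(abs(x) for x in response[:1000])
late_peak = max(abs(x) for x in response[4000:])
print(f"peak in first second: {early_peak:.5f}")
print(f"peak in last second:  {late_peak:.5f}")
```

A real object would have many vibration modes, each with its own shape across the image, but the same principle applies: once the modes are known, the response to a force that was never filmed can be computed.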
14:59
(Applause)
15:09
And so this is a really powerful way to look at the world, because it lets us predict how objects will respond to new situations, and you could imagine, for instance, looking at an old bridge and wondering what would happen, how would that bridge hold up if I were to drive my car across it. And that's a question that you probably want to answer before you start driving across that bridge.
15:33
And of course, there are going to be limitations to this technique, just like there were with the visual microphone, but we found that it works in a lot of situations that you might not expect, especially if you give it longer videos. So for example, here's a video that I captured of a bush outside of my apartment, and I didn't do anything to this bush, but by capturing a minute-long video, a gentle breeze caused enough vibrations that we could learn enough about this bush to create this simulation.
16:07
(Applause)
16:13
And so you could imagine giving this to a film director, and letting him control, say, the strength and direction of wind in a shot after it's been recorded. Or, in this case, we pointed our camera at a hanging curtain, and you can't even see any motion in this video, but by recording a two-minute-long video, natural air currents in this room created enough subtle, imperceptible motions and vibrations that we could learn enough to create this simulation.
16:48
And ironically, we're kind of used to having this kind of interactivity when it comes to virtual objects, when it comes to video games and 3D models, but to be able to capture this information from real objects in the real world using just simple, regular video, is something new that has a lot of potential.
17:10
So here are the amazing people who worked with me on these projects.
17:16
(Applause)
17:24
And what I've shown you today is only the beginning. We've just started to scratch the surface of what you can do with this kind of imaging, because it gives us a new way to capture our surroundings with common, accessible technology. And so looking to the future, it's going to be really exciting to explore what this can tell us about the world. Thank you.
17:47
(Applause)
