ABOUT THE SPEAKER
Joseph Redmon - Computer scientist
Joseph Redmon works on the YOLO algorithm, which combines the simple face detection of your phone camera with a cloud-based AI -- in real time.

Why you should listen

Computer scientist Joseph Redmon is working on the YOLO (You Only Look Once) algorithm, which has a simple goal: to deliver image recognition and object detection at a speed that would have seemed science-fictional only a few years ago. The algorithm looks like the simple face detection of a camera app but has the level of complexity of systems like Google's Deep Mind Cloud Vision, using convolutional deep neural networks to crunch object detection in real time. It's the kind of technology that will be embedded on all smartphones in the next few years.

Redmon is also internet-famous for his resume.

Joseph Redmon | Speaker | TED.com
TED2017

Joseph Redmon: How computers learn to recognize objects instantly

2,471,805 views

Ten years ago, researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible. Today, computer vision systems do it with greater than 99 percent accuracy. How? Joseph Redmon works on the YOLO (You Only Look Once) system, an open-source method of object detection that can identify objects in images and video -- from zebras to stop signs -- with lightning-quick speed. In a remarkable live demo, Redmon shows off this important step forward for applications like self-driving cars, robotics and even cancer detection.


00:12
Ten years ago,
00:14
computer vision researchers thought that getting a computer
00:16
to tell the difference between a cat and a dog
00:19
would be almost impossible,
00:21
even with the significant advance in the state of artificial intelligence.
00:25
Now we can do it at a level greater than 99 percent accuracy.
00:29
This is called image classification --
00:31
give it an image, put a label to that image --
00:34
and computers know thousands of other categories as well.
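
As a rough illustration of "give it an image, put a label to that image": a classifier typically ends by turning one score per category into probabilities and reporting the most likely label. The sketch below assumes that setup; the scores, the label list, and the function names are hypothetical and are not taken from Darknet.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, labels):
    """Return the most likely label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical scores for one image; a real network would compute these.
LABELS = ["cat", "dog", "zebra"]
label, prob = classify([1.0, 3.0, 0.5], LABELS)
print(label, round(prob, 3))
```

The output is a single label for the whole image, which is the limitation the talk turns to next.
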
00:38
I'm a graduate student at the University of Washington,
00:41
and I work on a project called Darknet,
00:43
which is a neural network framework
00:45
for training and testing computer vision models.
00:48
So let's just see what Darknet thinks
00:51
of this image that we have.
00:54
When we run our classifier
00:56
on this image,
00:58
we see we don't just get a prediction of dog or cat,
01:00
we actually get specific breed predictions.
01:02
That's the level of granularity we have now.
01:05
And it's correct.
01:06
My dog is in fact a malamute.
01:09
So we've made amazing strides in image classification,
01:13
but what happens when we run our classifier
01:15
on an image that looks like this?
01:19
Well ...
01:24
We see that the classifier comes back with a pretty similar prediction.
01:28
And it's correct, there is a malamute in the image,
01:31
but just given this label, we don't actually know that much
01:35
about what's going on in the image.
01:37
We need something more powerful.
01:39
I work on a problem called object detection,
01:41
where we look at an image and try to find all of the objects,
01:44
put bounding boxes around them
01:46
and say what those objects are.
01:48
So here's what happens when we run a detector on this image.
01:53
Now, with this kind of result,
01:55
we can do a lot more with our computer vision algorithms.
01:58
We see that it knows that there's a cat and a dog.
02:01
It knows their relative locations,
02:03
their size.
02:04
It may even know some extra information.
02:06
There's a book sitting in the background.
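
To make the difference from a single label concrete, here is a minimal sketch of what a detector's output might look like as data: one record per object, with a label, a confidence, and a bounding box. The `Detection` class, the pixel coordinates, and the confidence values are all invented for illustration; they are not the demo's actual output.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    x: float  # left edge of the box, in pixels
    y: float  # top edge of the box, in pixels
    w: float  # box width
    h: float  # box height

    @property
    def area(self):
        return self.w * self.h

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)

def is_left_of(a, b):
    """True if box a's center lies to the left of box b's center."""
    return a.center[0] < b.center[0]

# Hypothetical detections for the cat-and-dog image.
dog = Detection("dog", 0.92, 40, 120, 300, 260)
cat = Detection("cat", 0.88, 380, 200, 180, 150)
book = Detection("book", 0.41, 500, 30, 90, 60)

largest = max([dog, cat, book], key=lambda d: d.area)
print(largest.label, is_left_of(dog, cat))
```

With boxes available, relative location and size fall out of simple geometry, which is the kind of information a downstream system can act on.
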
02:09
And if you want to build a system on top of computer vision,
02:12
say a self-driving vehicle or a robotic system,
02:16
this is the kind of information that you want.
02:18
You want something so that you can interact with the physical world.
02:22
Now, when I started working on object detection,
02:25
it took 20 seconds to process a single image.
02:28
And to get a feel for why speed is so important in this domain,
02:33
here's an example of an object detector
02:35
that takes two seconds to process an image.
02:38
So this is 10 times faster
02:40
than the 20-seconds-per-image detector,
02:44
and you can see that by the time it makes predictions,
02:47
the entire state of the world has changed,
02:49
and this wouldn't be very useful
02:52
for an application.
02:53
If we speed this up by another factor of 10,
02:56
this is a detector running at five frames per second.
02:59
This is a lot better,
03:00
but for example,
03:02
if there's any significant movement,
03:05
I wouldn't want a system like this driving my car.
03:09
This is our detection system running in real time on my laptop.
03:13
So it smoothly tracks me as I move around the frame,
03:16
and it's robust to a wide variety of changes in size,
03:21
pose,
03:23
forward, backward.
03:25
This is great.
03:26
This is what we really need
03:28
if we're going to build systems on top of computer vision.
03:31
(Applause)
03:36
So in just a few years,
03:38
we've gone from 20 seconds per image
03:41
to 20 milliseconds per image, a thousand times faster.
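
The figures quoted in the talk can be checked with a few lines of arithmetic. The sketch below converts each per-image latency into frames per second; the stage names are labels chosen here for readability, and the 200-millisecond figure is inferred from the "five frames per second" remark.

```python
def frames_per_second(ms_per_image):
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000 / ms_per_image

# Latencies mentioned in the talk, in milliseconds per image.
stages_ms = {
    "early detector": 20_000,
    "10x faster": 2_000,
    "100x faster": 200,
    "real time": 20,
}

for name, ms in stages_ms.items():
    print(f"{name}: {ms} ms/image = {frames_per_second(ms)} fps")

speedup = stages_ms["early detector"] / stages_ms["real time"]
print("overall speedup:", speedup)
```

At 20 milliseconds per image the detector runs at 50 frames per second, comfortably above typical video frame rates.
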
03:44
How did we get there?
03:46
Well, in the past, object detection systems
03:49
would take an image like this
03:51
and split it into a bunch of regions
03:53
and then run a classifier on each of these regions,
03:56
and high scores for that classifier
03:59
would be considered detections in the image.
04:02
But this involved running a classifier thousands of times over an image,
04:06
thousands of neural network evaluations to produce detection.
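
To see how "thousands of times" can arise, the sketch below counts classifier calls for a plain sliding-window scheme. This is a simple stand-in for the region-based systems described above, not a reconstruction of any particular one, and the image size, window sizes, and stride are assumptions picked for illustration.

```python
def sliding_windows(img_w, img_h, win, stride):
    """Yield (x, y, w, h) for every square window that fits in the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, win, win)

def count_classifier_calls(img_w, img_h, window_sizes, stride):
    """Each window would need one classifier evaluation."""
    return sum(
        1
        for win in window_sizes
        for _ in sliding_windows(img_w, img_h, win, stride)
    )

# Hypothetical settings: a 640x480 image, three window sizes, stride 16.
calls = count_classifier_calls(640, 480, [64, 128, 256], 16)
print(calls)
```

Even these modest settings require over two thousand evaluations for one image, so the per-image cost scales with the number of regions rather than staying fixed.
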
04:11
Instead, we trained a single network to do all of detection for us.
04:15
It produces all of the bounding boxes and class probabilities simultaneously.
With our system, instead of looking
at an image thousands of times
87
248680
3496
04:24
to produce detection,
88
252200
1456
04:25
you only look once,
89
253680
1256
04:26
and that's why we call it
the YOLO method of object detection.
90
254960
2920
04:31
So with this speed,
we're not just limited to images;
91
259360
3976
04:35
we can process video in real time.
92
263360
2416
04:37
And now, instead of just seeing
that cat and dog,
93
265800
3096
04:40
we can see them move around
and interact with each other.
94
268920
2960
04:46
This is a detector that we trained
95
274560
2056
04:48
on 80 different classes
96
276640
4376
04:53
in Microsoft's COCO dataset.
97
281040
3256
04:56
It has all sorts of things
like spoon and fork, bowl,
98
284320
3336
04:59
common objects like that.
99
287680
1800
05:02
It has a variety of more exotic things:
05:05
animals, cars, zebras, giraffes.
05:08
And now we're going to do something fun.
05:10
We're just going to go out into the audience
05:12
and see what kind of things we can detect.
05:14
Does anyone want a stuffed animal?
05:18
There are some teddy bears out there.
05:22
And we can turn down our threshold for detection a little bit,
05:26
so we can find more of you guys out in the audience.
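
Turning down the threshold can be pictured as a simple filter over scored detections: anything below the cutoff is discarded, so a lower cutoff keeps more boxes at the cost of more false positives. The labels, confidences, and threshold values in this sketch are made up for illustration.

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d[1] >= threshold]

# Hypothetical (label, confidence) pairs from one video frame.
detections = [
    ("person", 0.91),
    ("teddy bear", 0.62),
    ("person", 0.34),
    ("backpack", 0.27),
]

strict = filter_detections(detections, 0.5)
relaxed = filter_detections(detections, 0.25)
print(len(strict), len(relaxed))
```

With the lower threshold, the second person and the backpack appear, which matches what happens on stage when more audience members are picked up.
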
05:31
Let's see if we can get these stop signs.
05:33
We find some backpacks.
05:37
Let's just zoom in a little bit.
05:42
And this is great.
05:43
And all of the processing is happening in real time
05:46
on the laptop.
05:49
And it's important to remember
05:50
that this is a general purpose object detection system,
05:53
so we can train this for any image domain.
06:00
The same code that we use
06:02
to find stop signs or pedestrians,
06:05
bicycles in a self-driving vehicle,
06:07
can be used to find cancer cells
06:10
in a tissue biopsy.
06:13
And there are researchers around the globe already using this technology
06:18
for advances in things like medicine, robotics.
06:21
This morning, I read a paper
06:23
where they were taking a census of animals in Nairobi National Park
06:27
with YOLO as part of this detection system.
06:30
And that's because Darknet is open source
06:33
and in the public domain, free for anyone to use.
06:37
(Applause)
06:43
But we wanted to make detection even more accessible and usable,
06:48
so through a combination of model optimization,
06:52
network binarization and approximation,
06:54
we actually have object detection running on a phone.
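
The talk names network binarization without describing it. One published approach, assumed here purely for illustration, replaces each real-valued weight with its sign and keeps a single scaling factor equal to the mean absolute weight, so most of the multiplications reduce to sign flips. The function names and sample values below are hypothetical, and this may differ from what actually ran on the phone.

```python
def binarize(weights):
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def dot_binarized(alpha, signs, inputs):
    """Dot product using only sign flips plus one final multiplication."""
    return alpha * sum(s * x for s, x in zip(signs, inputs))

# Hypothetical weights from one small filter.
weights = [0.5, -1.5, 2.0, -1.0]
alpha, signs = binarize(weights)
print(alpha, signs)
print(dot_binarized(alpha, signs, [1.0, 2.0, 3.0, 4.0]))
```

Each weight then needs one bit of storage instead of a full floating-point value, which is the kind of saving that makes phone-sized models plausible, at some cost in accuracy.
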
07:04
(Applause)
07:10
And I'm really excited because now we have a pretty powerful solution
07:16
to this low-level computer vision problem,
07:18
and anyone can take it and build something with it.
07:22
So now the rest is up to all of you
07:25
and people around the world with access to this software,
07:28
and I can't wait to see what people will build with this technology.
07:32
Thank you.
07:33
(Applause)
