ABOUT THE SPEAKER

Joseph Redmon - Computer scientist
Joseph Redmon works on the YOLO algorithm, which combines the simple face detection of your phone camera with a cloud-based AI -- in real time.

Why you should listen

Computer scientist Joseph Redmon is working on the YOLO (You Only Look Once) algorithm, which has a simple goal: to deliver image recognition and object detection at a speed that would have seemed like science fiction only a few years ago. The algorithm looks like the simple face detection of a camera app but has the level of complexity of systems like Google's Deep Mind Cloud Vision, using deep convolutional neural networks to perform object detection in real time. It's the kind of technology that will be embedded on all smartphones in the next few years.

Redmon is also internet-famous for his resume.

TED2017

Joseph Redmon: How computers learn to recognize objects instantly

Filmed: 2017-04-24

Ten years ago, researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible. Today, computer vision systems do it with greater than 99 percent accuracy. How? Joseph Redmon works on the YOLO (You Only Look Once) system, an open-source method of object detection that can identify objects in images and video -- from zebras to stop signs -- with lightning-quick speed. In a remarkable live demo, Redmon shows off this important step forward for applications like self-driving cars, robotics and even cancer detection.

00:12

Ten years ago,

00:14

computer vision researchers
thought that getting a computer

00:16

to tell the difference
between a cat and a dog

00:19

would be almost impossible,

00:21

even with the significant advance
in the state of artificial intelligence.

00:25

Now we can do it at a level
greater than 99 percent accuracy.

00:29

This is called image classification --

00:31

give it an image,
put a label to that image --

00:34

and computers know
thousands of other categories as well.

00:38

I'm a graduate student
at the University of Washington,

00:41

and I work on a project called Darknet,

00:43

which is a neural network framework

00:45

for training and testing
computer vision models.

00:48

So let's just see what Darknet thinks

00:51

of this image that we have.

00:54

When we run our classifier

00:56

on this image,

00:58

we see we don't just get
a prediction of dog or cat,

01:00

we actually get
specific breed predictions.

01:02

That's the level
of granularity we have now.

01:05

And it's correct.

01:06

My dog is in fact a malamute.

01:09

So we've made amazing strides
in image classification,

01:13

but what happens
when we run our classifier

01:15

on an image that looks like this?

01:19

Well ...

01:24

We see that the classifier comes back
with a pretty similar prediction.

01:28

And it's correct,
there is a malamute in the image,

01:31

but just given this label,
we don't actually know that much

01:35

about what's going on in the image.

01:37

We need something more powerful.

01:39

I work on a problem
called object detection,

01:41

where we look at an image
and try to find all of the objects,

01:44

put bounding boxes around them

01:46

and say what those objects are.
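
As a rough illustration of what a detector returns (compared with a classifier's single label), the sketch below models one detection as a label, a confidence score, and a bounding box. The `Detection` class, its field names, and the sample values are hypothetical and are not taken from Darknet.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Detection:
    """One detected object: a class label, a confidence score, and a box in pixels."""
    label: str
    score: float   # confidence in [0, 1]
    x: float       # left edge of the box
    y: float       # top edge of the box
    width: float
    height: float

    def area(self) -> float:
        """Size of the box in square pixels."""
        return self.width * self.height

    def center(self) -> Tuple[float, float]:
        """Center point of the box, useful for relative locations."""
        return (self.x + self.width / 2, self.y + self.height / 2)


# Made-up output for a photo containing a dog and a cat (illustrative values only).
detections = [
    Detection("dog", 0.92, x=40, y=60, width=200, height=300),
    Detection("cat", 0.88, x=300, y=180, width=120, height=150),
]

for d in detections:
    print(d.label, d.center(), d.area())
```

With boxes rather than a single label, a downstream system can reason about where each object is and how large it appears.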

01:48

So here's what happens
when we run a detector on this image.

01:53

Now, with this kind of result,

01:55

we can do a lot more
with our computer vision algorithms.

01:58

We see that it knows
that there's a cat and a dog.

02:01

It knows their relative locations,

02:03

their size.

02:04

It may even know some extra information.

02:06

There's a book sitting in the background.

02:09

And if you want to build a system
on top of computer vision,

02:12

say a self-driving vehicle
or a robotic system,

02:16

this is the kind
of information that you want.

02:18

You want something so that
you can interact with the physical world.

02:22

Now, when I started working
on object detection,

02:25

it took 20 seconds
to process a single image.

02:28

And to get a feel for why
speed is so important in this domain,

02:33

here's an example of an object detector

02:35

that takes two seconds
to process an image.

02:38

So this is 10 times faster

02:40

than the 20-seconds-per-image detector,

02:44

and you can see that by the time
it makes predictions,

02:47

the entire state of the world has changed,

02:49

and this wouldn't be very useful

02:52

for an application.

02:53

If we speed this up
by another factor of 10,

02:56

this is a detector running
at five frames per second.

02:59

This is a lot better,

03:00

but for example,

03:02

if there's any significant movement,

03:05

I wouldn't want a system
like this driving my car.

03:09

This is our detection system
running in real time on my laptop.

03:13

So it smoothly tracks me
as I move around the frame,

03:16

and it's robust to a wide variety
of changes in size,

03:21

pose,

03:23

forward, backward.

03:25

This is great.

03:26

This is what we really need

03:28

if we're going to build systems
on top of computer vision.

03:31

(Applause)

03:36

So in just a few years,

03:38

we've gone from 20 seconds per image

03:41

to 20 milliseconds per image,
a thousand times faster.
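
The speed figures quoted in the talk can be checked with a little arithmetic. The sketch below converts each per-image time into frames per second and computes the overall speedup; the stage names are labels chosen here for illustration.

```python
def frames_per_second(ms_per_image: float) -> float:
    """Convert processing time per image (milliseconds) to frames per second."""
    return 1000.0 / ms_per_image


# Per-image processing times mentioned in the talk, in milliseconds.
stages = [
    ("early detector", 20_000),   # 20 seconds per image
    ("10x faster", 2_000),        # 2 seconds per image
    ("another 10x", 200),         # five frames per second
    ("real time", 20),            # 20 milliseconds per image
]

for name, ms in stages:
    print(f"{name}: {ms} ms per image = {frames_per_second(ms):g} frames per second")

# Overall speedup from the first stage to the last.
speedup = stages[0][1] / stages[-1][1]
print(f"speedup: {speedup:g}x")
```

At 20 milliseconds per image the detector keeps up with roughly 50 frames per second, which is faster than typical video frame rates.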

03:44

How did we get there?

03:46

Well, in the past,
object detection systems

03:49

would take an image like this

03:51

and split it into a bunch of regions

03:53

and then run a classifier
on each of these regions,

03:56

and high scores for that classifier

03:59

would be considered
detections in the image.

04:02

But this involved running a classifier
thousands of times over an image,

04:06

thousands of neural network evaluations
to produce detection.
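
To get a feel for why region-by-region classification is expensive, the sketch below counts how many regions a plain sliding window would produce. The talk does not say how regions were generated, so the sliding window, the image size, the window sizes, and the stride are all assumptions chosen for illustration.

```python
def count_windows(image_w: int, image_h: int, window: int, stride: int) -> int:
    """Number of square windows of a given size that fit when sliding by `stride`."""
    if window > image_w or window > image_h:
        return 0
    across = (image_w - window) // stride + 1
    down = (image_h - window) // stride + 1
    return across * down


# Assumed setup: a 640x480 image, three window sizes, a 16-pixel stride.
IMAGE_W, IMAGE_H, STRIDE = 640, 480, 16
WINDOW_SIZES = [64, 128, 256]

total_regions = sum(count_windows(IMAGE_W, IMAGE_H, w, STRIDE) for w in WINDOW_SIZES)

# Each region needs its own classifier evaluation.
print(f"classifier evaluations for one image: {total_regions}")
```

Even this modest setup requires over two thousand classifier runs for a single image.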

04:11

Instead, we trained a single network
to do all of detection for us.

04:15

It produces all of the bounding boxes
and class probabilities simultaneously.

04:20

With our system, instead of looking
at an image thousands of times

04:24

to produce detection,

04:25

you only look once,

04:26

and that's why we call it
the YOLO method of object detection.
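
As a sketch of what "look once" means in practice, the code below computes the size of the single output that one network pass produces when the image is divided into a grid and every cell predicts a few boxes plus class probabilities. The talk gives no layout details; the values S=7, B=2, C=20 are the configuration described in the original YOLO paper, and the function names are hypothetical.

```python
def boxes_per_image(grid: int, boxes_per_cell: int) -> int:
    """Total candidate boxes predicted in one pass over a grid x grid layout."""
    return grid * grid * boxes_per_cell


def output_size(grid: int, boxes_per_cell: int, num_classes: int) -> int:
    """Length of the single output vector produced by one forward pass.

    Each box carries 5 numbers (x, y, width, height, confidence), and each
    grid cell carries one probability per class.
    """
    per_cell = boxes_per_cell * 5 + num_classes
    return grid * grid * per_cell


# Assumed configuration, taken from the original YOLO paper rather than the talk.
S, B, C = 7, 2, 20

print("network evaluations per image:", 1)
print("candidate boxes per image:", boxes_per_image(S, B))
print("output values per image:", output_size(S, B, C))
```

All boxes and class probabilities come out of the same forward pass, so the cost per image is one network evaluation regardless of how many objects are present.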

04:31

So with this speed,
we're not just limited to images;

04:35

we can process video in real time.

04:37

And now, instead of just seeing
that cat and dog,

04:40

we can see them move around
and interact with each other.

04:46

This is a detector that we trained

04:48

on 80 different classes

04:53

in Microsoft's COCO dataset.

04:56

It has all sorts of things
like spoon and fork, bowl,

04:59

common objects like that.

05:02

It has a variety of more exotic things:

05:05

animals, cars, zebras, giraffes.

05:08

And now we're going to do something fun.

05:10

We're just going to go
out into the audience

05:12

and see what kind of things we can detect.

05:14

Does anyone want a stuffed animal?

05:18

There are some teddy bears out there.

05:22

And we can turn down
our threshold for detection a little bit,

05:26

so we can find more of you guys
out in the audience.
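
Turning down the detection threshold amounts to keeping lower-confidence predictions. The sketch below shows the idea on made-up scores; the function name, the labels, and the numbers are hypothetical and do not come from the demo.

```python
from typing import List, Tuple

Prediction = Tuple[str, float]  # (class label, confidence score)


def filter_detections(predictions: List[Prediction], threshold: float) -> List[Prediction]:
    """Keep only predictions whose confidence is at or above the threshold."""
    return [(label, score) for label, score in predictions if score >= threshold]


# Made-up raw predictions for one frame (illustrative values only).
raw = [
    ("person", 0.91),
    ("teddy bear", 0.64),
    ("stop sign", 0.55),
    ("backpack", 0.38),
    ("person", 0.27),
]

strict = filter_detections(raw, threshold=0.5)
loose = filter_detections(raw, threshold=0.25)

print("threshold 0.50:", [label for label, _ in strict])
print("threshold 0.25:", [label for label, _ in loose])
```

A lower threshold finds more objects at the cost of admitting more uncertain, and possibly wrong, detections.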

05:31

Let's see if we can get these stop signs.

05:33

We find some backpacks.

05:37

Let's just zoom in a little bit.

05:42

And this is great.

05:43

And all of the processing
is happening in real time

05:46

on the laptop.

05:49

And it's important to remember

05:50

that this is a general purpose
object detection system,

05:53

so we can train this for any image domain.

06:00

The same code that we use

06:02

to find stop signs or pedestrians,

06:05

bicycles in a self-driving vehicle,

06:07

can be used to find cancer cells

06:10

in a tissue biopsy.

06:13

And there are researchers around the globe
already using this technology

06:18

for advances in things
like medicine, robotics.

06:21

This morning, I read a paper

06:23

where they were taking a census
of animals in Nairobi National Park

06:27

with YOLO as part
of this detection system.

06:30

And that's because Darknet is open source

06:33

and in the public domain,
free for anyone to use.

06:37

(Applause)

06:43

But we wanted to make detection
even more accessible and usable,

06:48

so through a combination
of model optimization,

06:52

network binarization and approximation,

06:54

we actually have object detection
running on a phone.
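
The talk does not say which binarization scheme is used. As one common form, the sketch below approximates a list of weights by a single shared scale times the sign of each weight, so each weight can be stored in one bit plus one scale for the group. The function name and the sample weights are hypothetical.

```python
from typing import List, Tuple


def binarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate weights as alpha * sign(w), with alpha the mean absolute weight."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


# Made-up full-precision weights (illustrative values only).
weights = [0.5, -0.25, 0.75, -0.5]

alpha, signs = binarize(weights)
approx = [alpha * s for s in signs]

print("scale:", alpha)
print("signs:", signs)
print("approximation:", approx)
```

The approximation loses precision, but it shrinks storage and lets multiplications be replaced by sign flips, which is the kind of trade-off that makes running a network on a phone plausible.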

07:04

(Applause)

07:10

And I'm really excited because
now we have a pretty powerful solution

07:16

to this low-level computer vision problem,

137

424040

2296

07:18

and anyone can take it
and build something with it.

07:22

So now the rest is up to all of you

07:25

and people around the world
with access to this software,

07:28

and I can't wait to see what people
will build with this technology.

07:32

Thank you.

07:33

(Applause)

THE ORIGINAL VIDEO ON TED.COM

Joseph Redmon: How computers learn to recognize objects instantly | TED Talk | TED.com