ABOUT THE SPEAKERS

Jean-Baptiste Michel - Data researcher
Jean-Baptiste Michel looks at how we can use large volumes of data to better understand our world.

Why you should listen

Jean-Baptiste Michel holds joint academic appointments at Harvard (FQEB Fellow) and Google (Visiting Faculty). His research focusses on using large volumes of data as tools that help better understand the world around us -- from the way diseases progress in patients over years, to the way cultures change in human societies over centuries. With his colleague Erez Lieberman Aiden, Jean-Baptiste is a Founding Director of Harvard's Cultural Observatory, where their research team pioneers the use of quantitative methods for the study of human culture, language and history. His research was featured on the covers of Science and Nature, on the front pages of the New York Times and the Boston Globe, in The Economist, Wired and many other venues. The online tool he helped create -- ngrams.googlelabs.com -- was used millions of times to browse cultural trends. Jean-Baptiste is an Engineer from Ecole Polytechnique (Paris), and holds an MS in Applied Mathematics and a PhD in Systems Biology from Harvard.

Erez Lieberman Aiden - Researcher
Erez Lieberman Aiden pursues a broad range of research interests, spanning genomics, linguistics, mathematics ...

Why you should listen

Erez Lieberman Aiden is a fellow at the Harvard Society of Fellows and Visiting Faculty at Google. His research spans many disciplines and has won numerous awards, including recognition for one of the top 20 "Biotech Breakthroughs that will Change Medicine", by Popular Mechanics; the Lemelson-MIT prize for the best student inventor at MIT; the American Physical Society's Award for the Best Doctoral Dissertation in Biological Physics; and membership in Technology Review's 2009 TR35, recognizing the top 35 innovators under 35. His last three papers -- two with JB Michel -- have all appeared on the cover of Nature and Science.

TEDxBoston 2011

Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

Filmed: 2011-07-24

Readability: 3.9

2,049,453 views

Have you played with Google Labs' Ngram Viewer? It's an addictive tool that lets you search for words and ideas in a database of 5 million books spanning centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show how it works, and some of the surprising things we can learn from 500 billion words.

00:15

Erez Lieberman Aiden: Everyone knows

00:17

that a picture is worth a thousand words.

00:22

But we at Harvard

00:24

were wondering if this was really true.

00:27

(Laughter)

00:29

So we assembled a team of experts,

00:33

spanning Harvard, MIT,

00:35

The American Heritage Dictionary, The Encyclopedia Britannica

00:38

and even our proud sponsors,

00:40

the Google.

00:43

And we cogitated about this

00:45

for about four years.

00:47

And we came to a startling conclusion.

00:52

Ladies and gentlemen, a picture is not worth a thousand words.

00:55

In fact, we found some pictures

00:57

that are worth 500 billion words.

01:02

Jean-Baptiste Michel: So how did we get to this conclusion?

01:04

So Erez and I were thinking about ways

01:06

to get a big picture of human culture

01:08

and human history: change over time.

01:11

So many books actually have been written over the years.

01:13

So we were thinking, well the best way to learn from them

01:15

is to read all of these millions of books.

01:17

Now of course, if there's a scale for how awesome that is,

01:20

that has to rank extremely, extremely high.

01:23

Now the problem is there's an X-axis for that,

01:25

which is the practical axis.

01:27

This is very, very low.

01:29

(Applause)

01:32

Now people tend to use an alternative approach,

01:35

which is to take a few sources and read them very carefully.

01:37

This is extremely practical, but not so awesome.

01:39

What you really want to do

01:42

is to get to the awesome yet practical part of this space.

01:45

So it turns out there was a company across the river called Google

01:48

who had started a digitization project a few years back

01:50

that might just enable this approach.

01:52

They have digitized millions of books.

01:54

So what that means is, one could use computational methods

01:57

to read all of the books in a click of a button.

01:59

That's very practical and extremely awesome.

02:03

ELA: Let me tell you a little bit about where books come from.

02:05

Since time immemorial, there have been authors.

02:08

These authors have been striving to write books.

02:11

And this became considerably easier

02:13

with the development of the printing press some centuries ago.

02:15

Since then, the authors have won

02:18

on 129 million distinct occasions,

02:20

publishing books.

02:22

Now if those books are not lost to history,

02:24

then they are somewhere in a library,

02:26

and many of those books have been getting retrieved from the libraries

02:29

and digitized by Google,

02:31

which has scanned 15 million books to date.

02:33

Now when Google digitizes a book, they put it into a really nice format.

02:36

Now we've got the data, plus we have metadata.

02:38

We have information about things like where was it published,

02:41

who was the author, when was it published.

02:43

And what we do is go through all of those records

02:46

and exclude everything that's not the highest quality data.

02:50

What we're left with

02:52

is a collection of five million books,

02:55

500 billion words,

02:58

a string of characters a thousand times longer

03:00

than the human genome --

03:03

a text which, when written out,

03:05

would stretch from here to the Moon and back

03:07

10 times over --

03:09

a veritable shard of our cultural genome.

03:13

Of course what we did

03:15

when faced with such outrageous hyperbole ...

03:18

(Laughter)

03:20

was what any self-respecting researchers

03:23

would have done.

03:26

We took a page out of XKCD,

03:28

and we said, "Stand back.

03:30

We're going to try science."

03:32

(Laughter)

03:34

JM: Now of course, we were thinking,

03:36

well let's just first put the data out there

03:38

for people to do science to it.

03:40

Now we're thinking, what data can we release?

03:42

Well of course, you want to take the books

03:44

and release the full text of these five million books.

03:46

Now Google, and Jon Orwant in particular,

03:48

told us a little equation that we should learn.

03:50

So you have five million, that is, five million authors

03:53

and five million plaintiffs is a massive lawsuit.

03:56

So, although that would be really, really awesome,

03:58

again, that's extremely, extremely impractical.

04:01

(Laughter)

04:03

Now again, we kind of caved in,

04:05

and we did the very practical approach, which was a bit less awesome.

04:08

We said, well instead of releasing the full text,

04:10

we're going to release statistics about the books.

04:12

So take for instance "A gleam of happiness."

04:14

It's four words; we call that a four-gram.

04:16

We're going to tell you how many times a particular four-gram

04:18

appeared in books in 1801, 1802, 1803,

04:20

all the way up to 2008.

04:22

That gives us a time series

04:24

of how frequently this particular sentence was used over time.

04:26

We do that for all the words and phrases that appear in those books,

04:29

and that gives us a big table of two billion lines

04:32

that tell us about the way culture has been changing.
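
A minimal sketch of the counting step described above: split each book into n-grams and tally how often each one appears per publication year. The corpus here is invented for illustration, and the function names `ngrams` and `count_ngrams_by_year` are hypothetical; the talk does not say how the real table was built.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams_by_year(books, n):
    """Build {ngram: Counter({year: count})} from (year, text) pairs."""
    table = defaultdict(Counter)
    for year, text in books:
        # Naive tokenization (lowercase, split on whitespace): an assumption.
        for gram in ngrams(text.lower().split(), n):
            table[gram][year] += 1
    return table


# Made-up miniature corpus, for illustration only.
books = [
    (1801, "a gleam of happiness lit the room"),
    (1802, "she felt a gleam of happiness and then a gleam of happiness again"),
    (1803, "no happiness here"),
]

table = count_ngrams_by_year(books, 4)
row = table[("a", "gleam", "of", "happiness")]
print(sorted(row.items()))
```

Each key of `table` plays the role of one line in the big table: a single n-gram with its counts year by year.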

04:34

ELA: So those two billion lines,

04:36

we call them two billion n-grams.

04:38

What do they tell us?

04:40

Well the individual n-grams measure cultural trends.

04:42

Let me give you an example.

04:44

Let's suppose that I am thriving,

04:46

then tomorrow I want to tell you about how well I did.

04:48

And so I might say, "Yesterday, I throve."

04:51

Alternatively, I could say, "Yesterday, I thrived."

04:54

Well which one should I use?

04:57

How to know?

04:59

As of about six months ago,

05:01

the state of the art in this field

05:03

is that you would, for instance,

05:05

go up to the following psychologist with fabulous hair,

05:07

and you'd say,

05:09

"Steve, you're an expert on the irregular verbs.

05:12

What should I do?"

05:14

And he'd tell you, "Well most people say thrived,

05:16

but some people say throve."

05:19

And you also knew, more or less,

05:21

that if you were to go back in time 200 years

05:24

and ask the following statesman with equally fabulous hair,

05:27

(Laughter)

05:30

"Tom, what should I say?"

05:32

He'd say, "Well, in my day, most people throve,

05:34

but some thrived."

05:37

So now what I'm just going to show you is raw data.

05:39

Two rows from this table of two billion entries.

05:43

What you're seeing is year by year frequency

05:45

of "thrived" and "throve" over time.
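
To make the year-by-year comparison concrete, here is a small sketch that turns raw counts for two competing forms into relative frequencies and reports which form leads in each year. Dividing by the total number of words printed that year is an assumption about how "frequency" is normalized, and every number below is invented rather than taken from the real data set.

```python
def relative_frequency(counts, totals):
    """Divide each year's count by that year's total word count."""
    return {year: counts.get(year, 0) / totals[year] for year in sorted(totals)}


def preferred_form(freq_a, freq_b, label_a, label_b):
    """For each year, report which of two forms was more frequent."""
    result = {}
    for year in freq_a:
        if freq_a[year] > freq_b[year]:
            result[year] = label_a
        elif freq_b[year] > freq_a[year]:
            result[year] = label_b
        else:
            result[year] = "tie"
    return result


# Invented counts, for illustration only.
totals = {1800: 1000, 1900: 2000, 2000: 4000}   # all words printed that year
throve = {1800: 8, 1900: 6, 2000: 2}
thrived = {1800: 2, 1900: 6, 2000: 20}

prefs = preferred_form(
    relative_frequency(throve, totals),
    relative_frequency(thrived, totals),
    "throve",
    "thrived",
)
print(prefs)
```

Normalizing matters because far more books are printed in later years, so raw counts alone would make almost every word look as if it were on the rise.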

05:49

Now this is just two

05:51

out of two billion rows.

05:54

So the entire data set

05:56

is a billion times more awesome than this slide.

05:59

(Laughter)

06:01

(Applause)

06:05

JM: Now there are many other pictures that are worth 500 billion words.

06:07

For instance, this one.

06:09

If you just take influenza,

06:11

you will see peaks at the time where you knew

06:13

big flu epidemics were killing people around the globe.

06:16

ELA: If you were not yet convinced,

06:19

sea levels are rising,

06:21

so is atmospheric CO2 and global temperature.

06:24

JM: You might also want to have a look at this particular n-gram,

06:27

and that's to tell Nietzsche that God is not dead,

06:30

although you might agree that he might need a better publicist.

06:33

(Laughter)

06:35

ELA: You can get at some pretty abstract concepts with this sort of thing.

06:38

For instance, let me tell you the history

06:40

of the year 1950.

06:42

Pretty much for the vast majority of history,

06:44

no one gave a damn about 1950.

06:46

In 1700, in 1800, in 1900,

06:48

no one cared.

06:52

Through the 30s and 40s,

06:54

no one cared.

06:56

Suddenly, in the mid-40s,

06:58

there started to be a buzz.

07:00

People realized that 1950 was going to happen,

07:02

and it could be big.

07:04

(Laughter)

07:07

But nothing got people interested in 1950

07:10

like the year 1950.

07:13

(Laughter)

07:16

People were walking around obsessed.

07:18

They couldn't stop talking

07:20

about all the things they did in 1950,

07:23

all the things they were planning to do in 1950,

07:26

all the dreams of what they wanted to accomplish in 1950.

07:31

In fact, 1950 was so fascinating

07:33

that for years thereafter,

07:35

people just kept talking about all the amazing things that happened,

07:38

in '51, '52, '53.

07:40

Finally in 1954,

07:42

someone woke up and realized

07:44

that 1950 had gotten somewhat passé.

07:48

(Laughter)

07:50

And just like that, the bubble burst.

07:52

(Laughter)

07:54

And the story of 1950

07:56

is the story of every year that we have on record,

07:58

with a little twist, because now we've got these nice charts.

08:01

And because we have these nice charts, we can measure things.

08:04

We can say, "Well how fast does the bubble burst?"

08:06

And it turns out that we can measure that very precisely.

08:09

Equations were derived, graphs were produced,

08:12

and the net result

08:14

is that we find that the bubble bursts faster and faster

08:17

with each passing year.

08:19

We are losing interest in the past more rapidly.
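
One plausible way to put a number on "how fast the bubble bursts" is a half-life: the number of years after the peak until mentions of a year fall to half their peak level. The talk does not give the equations that were actually derived, so treat this as a sketch under that assumption; the function name `half_life` and the series below are invented for illustration.

```python
def half_life(series):
    """Years from the peak until the value first drops to half the peak or less.

    series maps year -> frequency. Returns None if it never halves.
    """
    years = sorted(series)
    peak_year = max(years, key=lambda y: series[y])
    half = series[peak_year] / 2
    for year in years:
        if year > peak_year and series[year] <= half:
            return year - peak_year
    return None


# Invented frequencies of the string "1950" in books, by publication year.
mentions_of_1950 = {
    1948: 1, 1949: 3, 1950: 10, 1951: 9,
    1952: 8, 1953: 7, 1954: 5, 1955: 4,
}

print(half_life(mentions_of_1950))
```

Computing this value for many different years and plotting it against the year would show whether the half-life shrinks over time, which is the claim made here.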

08:24

JM: Now a little piece of career advice.

08:26

So for those of you who seek to be famous,

08:28

we can learn from the 25 most famous political figures,

08:30

authors, actors and so on.

08:32

So if you want to become famous early on, you should be an actor,

08:35

because then fame starts rising by the end of your 20s --

08:37

you're still young, it's really great.

08:39

Now if you can wait a little bit, you should be an author,

08:41

because then you rise to very great heights,

08:43

like Mark Twain, for instance: extremely famous.

08:45

But if you want to reach the very top,

08:47

you should delay gratification

08:49

and, of course, become a politician.

08:51

So here you will become famous by the end of your 50s,

08:53

and become very, very famous afterward.

08:55

So scientists also tend to get famous when they're much older.

08:58

Like for instance, biologists and physicists

09:00

tend to be almost as famous as actors.

09:02

One mistake you should not do is become a mathematician.

09:05

(Laughter)

09:07

If you do that,

09:09

you might think, "Oh great. I'm going to do my best work when I'm in my 20s."

09:12

But guess what, nobody will really care.

09:14

(Laughter)

09:17

ELA: There are more sobering notes

09:19

among the n-grams.

09:21

For instance, here's the trajectory of Marc Chagall,

09:23

an artist born in 1887.

09:25

And this looks like the normal trajectory of a famous person.

09:28

He gets more and more and more famous,

09:32

except if you look in German.

09:34

If you look in German, you see something completely bizarre,

09:36

something you pretty much never see,

09:38

which is he becomes extremely famous

09:40

and then all of a sudden plummets,

09:42

going through a nadir between 1933 and 1945,

09:45

before rebounding afterward.

09:48

And of course, what we're seeing

09:50

is the fact Marc Chagall was a Jewish artist

09:53

in Nazi Germany.

09:55

Now these signals

09:57

are actually so strong

09:59

that we don't need to know that someone was censored.

10:02

We can actually figure it out

10:04

using really basic signal processing.

10:06

Here's a simple way to do it.

10:08

Well, a reasonable expectation

10:10

is that somebody's fame in a given period of time

10:12

should be roughly the average of their fame before

10:14

and their fame after.

10:16

So that's sort of what we expect.

10:18

And we compare that to the fame that we observe.

10:21

And we just divide one by the other

10:23

to produce something we call a suppression index.

10:25

If the suppression index is very, very, very small,

10:28

then you very well might be being suppressed.

10:30

If it's very large, maybe you're benefiting from propaganda.
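
The suppression index described above can be sketched in a few lines: expected fame is the average of fame before and fame after the period, and the index is observed fame divided by that expectation. The window length, the function name `suppression_index`, and the fame values are assumptions made for illustration; they are not Chagall's actual data.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def suppression_index(fame, start, end, window):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of mean fame over the `window` years
    before `start` and mean fame over the `window` years after `end`.
    """
    observed = mean([fame[y] for y in range(start, end + 1)])
    before = mean([fame[y] for y in range(start - window, start)])
    after = mean([fame[y] for y in range(end + 1, end + 1 + window)])
    expected = (before + after) / 2
    return observed / expected


# Invented fame values (arbitrary units), for illustration only.
fame = {}
fame.update({year: 4.0 for year in range(1930, 1933)})   # before
fame.update({year: 0.5 for year in range(1933, 1946)})   # during
fame.update({year: 6.0 for year in range(1946, 1949)})   # after

index = suppression_index(fame, 1933, 1945, window=3)
print(index)
```

With these made-up numbers the expected fame is 5.0 and the observed fame is 0.5, so the index is far below one, the pattern the talk associates with suppression. An index far above one would point the other way, toward propaganda.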

10:34

JM: Now you can actually look at

10:36

the distribution of suppression indexes over whole populations.

10:39

So for instance, here --

10:41

this suppression index is for 5,000 people

10:43

picked in English books where there's no known suppression --

10:45

it would be like this, basically tightly centered on one.

10:47

What you expect is basically what you observe.

10:49

This is distribution as seen in Germany --

10:51

very different, it's shifted to the left.

10:53

People were talked about half as much as they should have been.

10:56

But much more importantly, the distribution is much wider.

10:58

There are many people who end up on the far left on this distribution

11:01

who are talked about 10 times less than they should have been.

11:04

But then also many people on the far right

11:06

who seem to benefit from propaganda.

11:08

This picture is the hallmark of censorship in the book record.
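
The two features of that picture, a shift to the left and a wider spread, can be summarized with two numbers per population: the median suppression index and the spread of the indexes. Measuring spread as the standard deviation of the base-10 logarithm is an assumption made here so that "10 times less" and "10 times more" count as equally far from one; the sample values are invented.

```python
import math
import statistics


def summarize(indexes):
    """Return (median index, spread) for a population of suppression indexes.

    Spread is the population standard deviation of log10(index).
    """
    logs = [math.log10(x) for x in indexes]
    return statistics.median(indexes), statistics.pstdev(logs)


# Invented suppression indexes, for illustration only.
control = [0.9, 1.0, 1.0, 1.1, 1.05, 0.95]      # no known suppression
censored = [0.1, 0.3, 0.5, 0.5, 0.8, 3.0]       # suppression plus propaganda

c_med, c_spread = summarize(control)
s_med, s_spread = summarize(censored)
print(c_med, round(c_spread, 3))
print(s_med, round(s_spread, 3))
```

A median near one with a small spread matches the control case described above, while a lower median together with a much larger spread is the signature attributed to censorship.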

11:11

ELA: So culturomics

11:13

is what we call this method.

11:15

It's kind of like genomics.

11:17

Except genomics is a lens on biology

11:19

through the window of the sequence of bases in the human genome.

11:22

Culturomics is similar.

275

667000

2000

Cultorômica é parecido.

11:24

It's the application of massive-scale data collection and analysis

276

669000

3000

É aplicação da análise da enorme quantidade de informações coletadas

11:27

to the study of human culture.

277

672000

2000

para estudo da cultura humana.

11:29

Here, instead of through the lens of a genome,

278

674000

2000

Ao invés de olharmos através das lentes de um genoma,

11:31

through the lens of digitized pieces of the historical record.

279

676000

3000

olhamos através de pedaços digitalizados do registro histórico.

11:34

The great thing about culturomics

280

679000

2000

O bom da culturômica

11:36

is that everyone can do it.

281

681000

2000

é que todos podem participar.

11:38

Why can everyone do it?

282

683000

2000

Por que todos podem?

11:40

Everyone can do it because three guys,

283

685000

2000

Todos podem porque três caras,

11:42

Jon Orwant, Matt Gray and Will Brockman over at Google,

284

687000

3000

Jon Orwant, Matt Gray e Will Brockman no Google,

11:45

saw the prototype of the Ngram Viewer,

285

690000

2000

viram o protótipo do Visualizador de N-Gramas,

11:47

and they said, "This is so fun.

286

692000

2000

e disseram, "Isso é bem divertido.

11:49

We have to make this available for people."

287

694000

3000

Temos que disponibilizar para as pessoas."

11:52

So in two weeks flat -- the two weeks before our paper came out --

288

697000

2000

Em exatamente 2 semanas - antes de nosso artigo ser publicado --

11:54

they coded up a version of the Ngram Viewer for the general public.

289

699000

3000

eles programaram uma versão do Visualizador para o público em geral.

11:57

And so you too can type in any word or phrase that you're interested in

290

702000

3000

Assim vocês podem digitar qualquer palavra ou frase que se interessarem

12:00

and see its n-gram immediately --

291

705000

2000

e imediatamente podem ver o n-grama --

12:02

also browse examples of all the various books

292

707000

2000

e também listar exemplos de todos os muitos livros

12:04

in which your n-gram appears.

293

709000

2000

nos quais o seu n-grama aparece.

12:06

JM: Now this was used over a million times on the first day,

294

711000

2000

JM: Já foi utilizado mais de um milhão de vezes no primeiro dia,

12:08

and this is really the best of all the queries.

295

713000

2000

e é de fato a melhor de todas as procuras.

12:10

So people want to be their best, put their best foot forward.

296

715000

3000

As pessoas querem ser as melhores, se destacar.

12:13

But it turns out in the 18th century, people didn't really care about that at all.

297

718000

3000

Mas acontece que no século 18, as pessoas não ligavam pra isso.

12:16

They didn't want to be their best, they wanted to be their beft.

298

721000

3000

Elas não queriam ser as 'the best', elas queriam ser 'beft'.

12:19

So what happened is, of course, this is just a mistake.

299

724000

3000

O que aconteceu, é claro, foi apenas um equívoco.

12:22

It's not that they strove for mediocrity,

300

727000

2000

Não é um esforço pela mediocridade,

12:24

it's just that the S used to be written differently, kind of like an F.

301

729000

3000

apenas o 'S' costumava ser escrito diferente, quase um 'F'.

12:27

Now of course, Google didn't pick this up at the time,

302

732000

3000

Lógico, o Google não pegou isso na ocasião,

12:30

so we reported this in the Science article that we wrote.

303

735000

3000

assim nós relatamos no artigo científico que escrevemos.

12:33

But it turns out this is just a reminder

304

738000

2000

Mas se tornou um lembrete

12:35

that, although this is a lot of fun,

305

740000

2000

de que, mesmo sendo muito divertido,

12:37

when you interpret these graphs, you have to be very careful,

306

742000

2000

quando se interpreta estes gráficos, temos que ter cuidado,

12:39

and you have to adopt the base standards in the sciences.

307

744000

3000

e vocês tem que adotar os métodos básicos da ciência.

12:42

ELA: People have been using this for all kinds of fun purposes.

308

747000

3000

ELA: Pessoas o tem utilizado para todo tipo de propósito.

12:45

(Laughter)

309

750000

7000

(Risos)

12:52

Actually, we're not going to have to talk,

310

757000

2000

Na verdade, não precisaremos falar,

12:54

we're just going to show you all the slides and remain silent.

311

759000

3000

vamos apenas mostrar todos os slides e ficar em silêncio.

12:57

This person was interested in the history of frustration.

312

762000

3000

Esta pessoa estava interessada na história da frustração.

13:00

There's various types of frustration.

313

765000

3000

Existem vários tipos de frustração.

13:03

If you stub your toe, that's a one A "argh."

314

768000

3000

Se você esfolar o dedo do pé, É um "ai" com um 'A'.

13:06

If the planet Earth is annihilated by the Vogons

315

771000

2000

Se a Terra é aniquilada pelos Vogons

13:08

to make room for an interstellar bypass,

316

773000

2000

pra dar lugar à um atalho interestelar,

13:10

that's an eight A "aaaaaaaargh."

317

775000

2000

é um "aaaaaaaai" com 8 'A's.

13:12

This person studies all the "arghs,"

318

777000

2000

Esta pessoa estudou todos os "ais",

13:14

from one through eight A's.

319

779000

2000

de 1 até 8 'A's.

13:16

And it turns out

320

781000

2000

E acontece

13:18

that the less-frequent "arghs"

321

783000

2000

que os "ais" menos frequentes

13:20

are, of course, the ones that correspond to things that are more frustrating --

322

785000

3000

são os que correspondem às coisas mais frustrantes --

13:23

except, oddly, in the early 80s.

323

788000

3000

exceto, curiosamente, no começo dos anos 80.

13:26

We think that might have something to do with Reagan.

324

791000

2000

Achamos que deve ter algo a ver com o Reagan.

13:28

(Laughter)

325

793000

2000

(Risos)

13:30

JM: There are many usages of this data,

326

795000

3000

JM: Existem muitos usos para estas informações,

13:33

but the bottom line is that the historical record is being digitized.

327

798000

3000

mas o principal é que o registro histórico está sendo digitalizado.

13:36

Google has started to digitize 15 million books.

328

801000

2000

Google começou a digitalizar 15 milhões de livros.

13:38

That's 12 percent of all the books that have ever been published.

329

803000

2000

É 12% de todos os livros já publicados.

13:40

It's a sizable chunk of human culture.

330

805000

3000

É um pedaço considerável da cultura humana.

13:43

There's much more in culture: there's manuscripts, there's newspapers,

331

808000

3000

Há muito mais na cultura: existem manuscritos, jornais,

13:46

there's things that are not text, like art and paintings.

332

811000

2000

coisas que não são texto, como arte e pinturas.

13:48

These all happen to be on our computers,

333

813000

2000

Acontece que estes estão em nossos computadores,

13:50

on computers across the world.

334

815000

2000

em computadores ao redor do mundo.

13:52

And when that happens, that will transform the way we have

335

817000

3000

E quando isso acontece, vai transformar a maneira

13:55

to understand our past, our present and human culture.

336

820000

2000

de compreender nosso passado, o presente e a cultura humana.

13:57

Thank you very much.

337

822000

2000

Muito obrigado.

13:59

(Applause)

338

824000

3000

(Aplausos)

Translated by Lisangelo Berti
Reviewed by Wanderley Jesus

THE ORIGINAL VIDEO ON TED.COM

What we learned from 5 million books | TED Talk | TED.com