ABOUT THE SPEAKERS

Jean-Baptiste Michel - Data researcher
Jean-Baptiste Michel looks at how we can use large volumes of data to better understand our world.

Why you should listen

Jean-Baptiste Michel holds joint academic appointments at Harvard (FQEB Fellow) and Google (Visiting Faculty). His research focusses on using large volumes of data as tools that help better understand the world around us -- from the way diseases progress in patients over years, to the way cultures change in human societies over centuries. With his colleague Erez Lieberman Aiden, Jean-Baptiste is a Founding Director of Harvard's Cultural Observatory, where their research team pioneers the use of quantitative methods for the study of human culture, language and history. His research was featured on the covers of Science and Nature, on the front pages of the New York Times and the Boston Globe, in The Economist, Wired and many other venues. The online tool he helped create -- ngrams.googlelabs.com -- was used millions of times to browse cultural trends. Jean-Baptiste is an Engineer from Ecole Polytechnique (Paris), and holds an MS in Applied Mathematics and a PhD in Systems Biology from Harvard.

Erez Lieberman Aiden - Researcher
Erez Lieberman Aiden pursues a broad range of research interests, spanning genomics, linguistics, mathematics ...

Why you should listen

Erez Lieberman Aiden is a fellow at the Harvard Society of Fellows and Visiting Faculty at Google. His research spans many disciplines and has won numerous awards, including recognition for one of the top 20 "Biotech Breakthroughs that will Change Medicine", by Popular Mechanics; the Lemelson-MIT prize for the best student inventor at MIT; the American Physical Society's Award for the Best Doctoral Dissertation in Biological Physics; and membership in Technology Review's 2009 TR35, recognizing the top 35 innovators under 35. His last three papers -- two with JB Michel -- have all appeared on the cover of Nature and Science.

TEDxBoston 2011

Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

Filmed: 2011-07-24

Readability: 3.9

2,049,453 views

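Have you tried Google Labs' Ngram Viewer? It is a highly addictive word-frequency tool for books, drawing on a database of five million books from across the centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show us how the search tool works and what 500 billion words can reveal.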

00:15

Erez Lieberman Aiden: Everyone knows

00:17

that a picture is worth a thousand words.

00:22

But we at Harvard

00:24

were wondering if this was really true.

00:27

(Laughter)

00:29

So we assembled a team of experts,

00:33

spanning Harvard, MIT,

00:35

The American Heritage Dictionary, The Encyclopedia Britannica

00:38

and even our proud sponsors,

00:40

the Google.

00:43

And we cogitated about this

00:45

for about four years.

00:47

And we came to a startling conclusion.

00:52

Ladies and gentlemen, a picture is not worth a thousand words.

00:55

In fact, we found some pictures

00:57

that are worth 500 billion words.

01:02

Jean-Baptiste Michel: So how did we get to this conclusion?

01:04

So Erez and I were thinking about ways

01:06

to get a big picture of human culture

01:08

and human history: change over time.

01:11

So many books actually have been written over the years.

01:13

So we were thinking, well the best way to learn from them

01:15

is to read all of these millions of books.

01:17

Now of course, if there's a scale for how awesome that is,

01:20

that has to rank extremely, extremely high.

01:23

Now the problem is there's an X-axis for that,

01:25

which is the practical axis.

01:27

This is very, very low.

01:29

(Applause)

01:32

Now people tend to use an alternative approach,

01:35

which is to take a few sources and read them very carefully.

01:37

This is extremely practical, but not so awesome.

01:39

What you really want to do

01:42

is to get to the awesome yet practical part of this space.

01:45

So it turns out there was a company across the river called Google

01:48

who had started a digitization project a few years back

01:50

that might just enable this approach.

01:52

They have digitized millions of books.

01:54

So what that means is, one could use computational methods

01:57

to read all of the books in a click of a button.

01:59

That's very practical and extremely awesome.

02:03

ELA: Let me tell you a little bit about where books come from.

02:05

Since time immemorial, there have been authors.

02:08

These authors have been striving to write books.

02:11

And this became considerably easier

02:13

with the development of the printing press some centuries ago.

02:15

Since then, the authors have won

02:18

on 129 million distinct occasions,

02:20

publishing books.

02:22

Now if those books are not lost to history,

02:24

then they are somewhere in a library,

02:26

and many of those books have been getting retrieved from the libraries

02:29

and digitized by Google,

02:31

which has scanned 15 million books to date.

02:33

Now when Google digitizes a book, they put it into a really nice format.

02:36

Now we've got the data, plus we have metadata.

02:38

We have information about things like where was it published,

02:41

who was the author, when was it published.

02:43

And what we do is go through all of those records

02:46

and exclude everything that's not the highest quality data.

02:50

What we're left with

02:52

is a collection of five million books,

02:55

500 billion words,

02:58

a string of characters a thousand times longer

03:00

than the human genome --

03:03

a text which, when written out,

03:05

would stretch from here to the Moon and back

03:07

10 times over --

03:09

a veritable shard of our cultural genome.

03:13

Of course what we did

03:15

when faced with such outrageous hyperbole ...

03:18

(Laughter)

03:20

was what any self-respecting researchers

03:23

would have done.

03:26

We took a page out of XKCD,

03:28

and we said, "Stand back.

03:30

We're going to try science."

03:32

(Laughter)

03:34

JM: Now of course, we were thinking,

03:36

well let's just first put the data out there

03:38

for people to do science to it.

03:40

Now we're thinking, what data can we release?

03:42

Well of course, you want to take the books

03:44

and release the full text of these five million books.

03:46

Now Google, and Jon Orwant in particular,

03:48

told us a little equation that we should learn.

03:50

So you have five million, that is, five million authors

03:53

and five million plaintiffs is a massive lawsuit.

03:56

So, although that would be really, really awesome,

03:58

again, that's extremely, extremely impractical.

04:01

(Laughter)

04:03

Now again, we kind of caved in,

04:05

and we did the very practical approach, which was a bit less awesome.

04:08

We said, well instead of releasing the full text,

04:10

we're going to release statistics about the books.

04:12

So take for instance "A gleam of happiness."

04:14

It's four words; we call that a four-gram.

04:16

We're going to tell you how many times a particular four-gram

04:18

appeared in books in 1801, 1802, 1803,

04:20

all the way up to 2008.

04:22

That gives us a time series

04:24

of how frequently this particular sentence was used over time.

04:26

We do that for all the words and phrases that appear in those books,

04:29

and that gives us a big table of two billion lines

04:32

that tell us about the way culture has been changing.
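
The counting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the team's actual pipeline: it assumes a toy corpus given as (year, text) pairs, splits on whitespace, and reports raw counts rather than frequencies normalized by the number of words published each year. The names `ngrams` and `ngram_counts_by_year` and the two sample sentences are hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(text, n):
    """Yield every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def ngram_counts_by_year(corpus, n):
    """Map each n-gram to a {year: count} time series.

    corpus is an iterable of (year, text) pairs; this toy format is an
    assumption made for illustration only.
    """
    table = defaultdict(Counter)
    for year, text in corpus:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table

# Hypothetical two-book corpus, made up for this example.
corpus = [
    (1801, "she felt a gleam of happiness at last"),
    (1802, "a gleam of happiness and a gleam of happiness again"),
]
table = ngram_counts_by_year(corpus, 4)
series = table["a gleam of happiness"]
print(dict(series))
```

Each key of `table` corresponds to one line of the big table mentioned in the talk, and its value is the year-by-year series for that phrase.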

04:34

ELA: So those two billion lines,

04:36

we call them two billion n-grams.

04:38

What do they tell us?

04:40

Well the individual n-grams measure cultural trends.

04:42

Let me give you an example.

04:44

Let's suppose that I am thriving,

04:46

then tomorrow I want to tell you about how well I did.

04:48

And so I might say, "Yesterday, I throve."

04:51

Alternatively, I could say, "Yesterday, I thrived."

04:54

Well which one should I use?

04:57

How to know?

04:59

As of about six months ago,

05:01

the state of the art in this field

05:03

is that you would, for instance,

05:05

go up to the following psychologist with fabulous hair,

05:07

and you'd say,

05:09

"Steve, you're an expert on the irregular verbs.

05:12

What should I do?"

05:14

And he'd tell you, "Well most people say thrived,

05:16

but some people say throve."

05:19

And you also knew, more or less,

05:21

that if you were to go back in time 200 years

05:24

and ask the following statesman with equally fabulous hair,

05:27

(Laughter)

05:30

"Tom, what should I say?"

05:32

He'd say, "Well, in my day, most people throve,

05:34

but some thrived."

05:37

So now what I'm just going to show you is raw data.

05:39

Two rows from this table of two billion entries.

05:43

What you're seeing is year by year frequency

05:45

of "thrived" and "throve" over time.

05:49

Now this is just two

05:51

out of two billion rows.

05:54

So the entire data set

05:56

is a billion times more awesome than this slide.

05:59

(Laughter)

06:01

(Applause)

06:05

JM: Now there are many other pictures that are worth 500 billion words.

06:07

For instance, this one.

06:09

If you just take influenza,

06:11

you will see peaks at the time where you knew

06:13

big flu epidemics were killing people around the globe.

06:16

ELA: If you were not yet convinced,

06:19

sea levels are rising,

06:21

so is atmospheric CO2 and global temperature.

06:24

JM: You might also want to have a look at this particular n-gram,

06:27

and that's to tell Nietzsche that God is not dead,

06:30

although you might agree that he might need a better publicist.

06:33

(Laughter)

06:35

ELA: You can get at some pretty abstract concepts with this sort of thing.

06:38

For instance, let me tell you the history

06:40

of the year 1950.

06:42

Pretty much for the vast majority of history,

06:44

no one gave a damn about 1950.

06:46

In 1700, in 1800, in 1900,

06:48

no one cared.

06:52

Through the 30s and 40s,

06:54

no one cared.

06:56

Suddenly, in the mid-40s,

06:58

there started to be a buzz.

07:00

People realized that 1950 was going to happen,

07:02

and it could be big.

07:04

(Laughter)

07:07

But nothing got people interested in 1950

07:10

like the year 1950.

07:13

(Laughter)

07:16

People were walking around obsessed.

07:18

They couldn't stop talking

07:20

about all the things they did in 1950,

07:23

all the things they were planning to do in 1950,

07:26

all the dreams of what they wanted to accomplish in 1950.

07:31

In fact, 1950 was so fascinating

07:33

that for years thereafter,

07:35

people just kept talking about all the amazing things that happened,

07:38

in '51, '52, '53.

07:40

Finally in 1954,

07:42

someone woke up and realized

07:44

that 1950 had gotten somewhat passé.

07:48

(Laughter)

07:50

And just like that, the bubble burst.

07:52

(Laughter)

07:54

And the story of 1950

07:56

is the story of every year that we have on record,

07:58

with a little twist, because now we've got these nice charts.

08:01

And because we have these nice charts, we can measure things.

08:04

We can say, "Well how fast does the bubble burst?"

08:06

And it turns out that we can measure that very precisely.

08:09

Equations were derived, graphs were produced,

08:12

and the net result

08:14

is that we find that the bubble bursts faster and faster

08:17

with each passing year.

08:19

We are losing interest in the past more rapidly.

08:24

JM: Now a little piece of career advice.

08:26

So for those of you who seek to be famous,

08:28

we can learn from the 25 most famous political figures,

08:30

authors, actors and so on.

08:32

So if you want to become famous early on, you should be an actor,

08:35

because then fame starts rising by the end of your 20s --

08:37

you're still young, it's really great.

08:39

Now if you can wait a little bit, you should be an author,

08:41

because then you rise to very great heights,

08:43

like Mark Twain, for instance: extremely famous.

08:45

But if you want to reach the very top,

08:47

you should delay gratification

08:49

and, of course, become a politician.

08:51

So here you will become famous by the end of your 50s,

08:53

and become very, very famous afterward.

08:55

So scientists also tend to get famous when they're much older.

08:58

Like for instance, biologists and physicists

09:00

tend to be almost as famous as actors.

09:02

One mistake you should not do is become a mathematician.

09:05

(Laughter)

09:07

If you do that,

09:09

you might think, "Oh great. I'm going to do my best work when I'm in my 20s."

09:12

But guess what, nobody will really care.

09:14

(Laughter)

09:17

ELA: There are more sobering notes

09:19

among the n-grams.

09:21

For instance, here's the trajectory of Marc Chagall,

09:23

an artist born in 1887.

09:25

And this looks like the normal trajectory of a famous person.

09:28

He gets more and more and more famous,

09:32

except if you look in German.

09:34

If you look in German, you see something completely bizarre,

09:36

something you pretty much never see,

09:38

which is he becomes extremely famous

09:40

and then all of a sudden plummets,

09:42

going through a nadir between 1933 and 1945,

09:45

before rebounding afterward.

09:48

And of course, what we're seeing

09:50

is the fact Marc Chagall was a Jewish artist

09:53

in Nazi Germany.

09:55

Now these signals

09:57

are actually so strong

09:59

that we don't need to know that someone was censored.

10:02

We can actually figure it out

10:04

using really basic signal processing.

10:06

Here's a simple way to do it.

10:08

Well, a reasonable expectation

10:10

is that somebody's fame in a given period of time

10:12

should be roughly the average of their fame before

10:14

and their fame after.

10:16

So that's sort of what we expect.

10:18

And we compare that to the fame that we observe.

10:21

And we just divide one by the other

10:23

to produce something we call a suppression index.

10:25

If the suppression index is very, very, very small,

10:28

then you very well might be being suppressed.

10:30

If it's very large, maybe you're benefiting from propaganda.
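
The suppression index described above can be written as a short calculation. This is a sketch under stated assumptions rather than the authors' published method: it takes the index to be observed fame divided by expected fame, which is the reading consistent with "very small" meaning suppressed, and the function name, argument names and fame values are all hypothetical.

```python
def suppression_index(fame_before, fame_during, fame_after):
    """Return observed fame divided by expected fame for one period.

    Expected fame is the average of the fame before and after the period,
    as described in the talk. The argument names and the choice of
    "observed / expected" are assumptions made for this sketch.
    """
    expected = (fame_before + fame_after) / 2
    if expected == 0:
        raise ValueError("expected fame is zero, so the index is undefined")
    return fame_during / expected

# Made-up fame values (for example, mentions per million words).
steady = suppression_index(4.0, 5.0, 6.0)   # observed matches expected
dipped = suppression_index(4.0, 0.5, 6.0)   # observed far below expected
print(steady, dipped)
```

An index near one means the observed fame is about what the surrounding years predict; an index well below one is the kind of dip described for Chagall in German books.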

10:34

JM: Now you can actually look at

10:36

the distribution of suppression indexes over whole populations.

10:39

So for instance, here --

10:41

this suppression index is for 5,000 people

10:43

picked in English books where there's no known suppression --

10:45

it would be like this, basically tightly centered on one.

10:47

What you expect is basically what you observe.

10:49

This is distribution as seen in Germany --

10:51

very different, it's shifted to the left.

10:53

People talked about it twice less as it should have been.

10:56

But much more importantly, the distribution is much wider.

10:58

There are many people who end up on the far left on this distribution

11:01

who are talked about 10 times fewer than they should have been.

11:04

But then also many people on the far right

11:06

who seem to benefit from propaganda.

11:08

This picture is the hallmark of censorship in the book record.

11:11

ELA: So culturomics

11:13

is what we call this method.

11:15

It's kind of like genomics.

11:17

Except genomics is a lens on biology

11:19

through the window of the sequence of bases in the human genome.

11:22

Culturomics is similar.

11:24

It's the application of massive-scale data collection analysis

11:27

to the study of human culture.

11:29

Here, instead of through the lens of a genome,

11:31

through the lens of digitized pieces of the historical record.

11:34

The great thing about culturomics

11:36

is that everyone can do it.

11:38

Why can everyone do it?

11:40

Everyone can do it because three guys,

11:42

Jon Orwant, Matt Gray and Will Brockman over at Google,

11:45

saw the prototype of the Ngram Viewer,

11:47

and they said, "This is so fun.

11:49

We have to make this available for people."

11:52

So in two weeks flat -- the two weeks before our paper came out --

11:54

they coded up a version of the Ngram Viewer for the general public.

11:57

And so you too can type in any word or phrase that you're interested in

12:00

and see its n-gram immediately --

12:02

also browse examples of all the various books

12:04

in which your n-gram appears.

12:06

JM: Now this was used over a million times on the first day,

12:08

and this is really the best of all the queries.

12:10

So people want to be their best, put their best foot forward.

12:13

But it turns out in the 18th century, people didn't really care about that at all.

12:16

They didn't want to be their best, they wanted to be their beft.

12:19

So what happened is, of course, this is just a mistake.

12:22

It's not that they strove for mediocrity,

12:24

it's just that the S used to be written differently, kind of like an F.

12:27

Now of course, Google didn't pick this up at the time,

12:30

so we reported this in the science article that we wrote.

12:33

But it turns out this is just a reminder

12:35

that, although this is a lot of fun,

12:37

when you interpret these graphs, you have to be very careful,

12:39

and you have to adopt the base standards in the sciences.
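
One way to guard against the "beft" artifact when reading such graphs is to fold the counts of a known misreading back into the intended word before comparing years. The sketch below is purely illustrative and is not something the talk says Google or the authors did: the mapping `LONG_S_FIXES`, the function `merge_long_s` and every count are made up for the example.

```python
from collections import Counter

# Hypothetical map from scanning misreadings of the long s to the
# intended modern spelling; a real list would need to be curated.
LONG_S_FIXES = {"beft": "best"}

def merge_long_s(counts_by_year, fixes=LONG_S_FIXES):
    """Fold per-year counts of misread words into the intended word."""
    merged = {}
    for word, series in counts_by_year.items():
        target = fixes.get(word, word)
        # Counter.update adds counts year by year instead of overwriting.
        merged.setdefault(target, Counter()).update(series)
    return merged

# Made-up counts: "best" looks rare in 1750 only because it was read as "beft".
counts = {
    "best": {1750: 3, 1950: 40},
    "beft": {1750: 25},
}
merged = merge_long_s(counts)
print(dict(merged["best"]))
```

With the toy numbers above, the apparent 18th-century gap in "best" disappears once the misread form is counted with it.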

12:42

ELA: People have been using this for all kinds of fun purposes.

12:45

(Laughter)

12:52

Actually, we're not going to have to talk,

12:54

we're just going to show you all the slides and remain silent.

12:57

This person was interested in the history of frustration.

13:00

There's various types of frustration.

13:03

If you stub your toe, that's a one A "argh."

13:06

If the planet Earth is annihilated by the Vogons

13:08

to make room for an interstellar bypass,

13:10

that's an eight A "aaaaaaaargh."

13:12

This person studies all the "arghs,"

13:14

from one through eight A's.

13:16

And it turns out

13:18

that the less-frequent "arghs"

13:20

are, of course, the ones that correspond to things that are more frustrating --

13:23

except, oddly, in the early 80s.

13:26

We think that might have something to do with Reagan.

13:28

(Laughter)

13:30

JM: There are many usages of this data,

13:33

but the bottom line is that the historical record is being digitized.

13:36

Google has started to digitize 15 million books.

13:38

That's 12 percent of all the books that have ever been published.

13:40

It's a sizable chunk of human culture.

13:43

There's much more in culture: there's manuscripts, there are newspapers,

13:46

there's things that are not text, like art and paintings.

13:48

These all happen to be on our computers,

13:50

on computers across the world.

13:52

And when that happens, that will transform the way we have

13:55

to understand our past, our present and human culture.

13:57

Thank you very much.

13:59

(Applause)

Translated by Lili Liang
Reviewed by dahong zhang

THE ORIGINAL VIDEO ON TED.COM

What we learned from 5 million books | TED Talk | TED.com