ABOUT THE SPEAKERS
Jean-Baptiste Michel - Data researcher
Jean-Baptiste Michel looks at how we can use large volumes of data to better understand our world.

Why you should listen

Jean-Baptiste Michel holds joint academic appointments at Harvard (FQEB Fellow) and Google (Visiting Faculty). His research focusses on using large volumes of data as tools that help better understand the world around us -- from the way diseases progress in patients over years, to the way cultures change in human societies over centuries. With his colleague Erez Lieberman Aiden, Jean-Baptiste is a Founding Director of Harvard's Cultural Observatory, where their research team pioneers the use of quantitative methods for the study of human culture, language and history. His research was featured on the covers of Science and Nature, on the front pages of the New York Times and the Boston Globe, in The Economist, Wired and many other venues. The online tool he helped create -- ngrams.googlelabs.com -- was used millions of times to browse cultural trends. Jean-Baptiste is an Engineer from Ecole Polytechnique (Paris), and holds an MS in Applied Mathematics and a PhD in Systems Biology from Harvard.

Erez Lieberman Aiden - Researcher
Erez Lieberman Aiden pursues a broad range of research interests, spanning genomics, linguistics, mathematics ...

Why you should listen

Erez Lieberman Aiden is a fellow at the Harvard Society of Fellows and Visiting Faculty at Google. His research spans many disciplines and has won numerous awards, including recognition for one of the top 20 "Biotech Breakthroughs that will Change Medicine", by Popular Mechanics; the Lemelson-MIT prize for the best student inventor at MIT; the American Physical Society's Award for the Best Doctoral Dissertation in Biological Physics; and membership in Technology Review's 2009 TR35, recognizing the top 35 innovators under 35. His last three papers -- two with JB Michel -- have all appeared on the cover of Nature and Science.

TEDxBoston 2011

Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

2,049,453 views

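Have you used Google Labs' Ngram Viewer? It is a highly addictive word-frequency counter for books, built on a database of five million books spanning several centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show us how to use this search tool and what these 500 billion words can reveal.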

00:15
Erez Lieberman Aiden: Everyone knows
00:17
that a picture is worth a thousand words.
00:22
But we at Harvard
00:24
were wondering if this was really true.
00:27
(Laughter)
00:29
So we assembled a team of experts,
00:33
spanning Harvard, MIT,
00:35
The American Heritage Dictionary, The Encyclopedia Britannica
00:38
and even our proud sponsors,
00:40
the Google.
00:43
And we cogitated about this
00:45
for about four years.
00:47
And we came to a startling conclusion.
00:52
Ladies and gentlemen, a picture is not worth a thousand words.
00:55
In fact, we found some pictures
00:57
that are worth 500 billion words.
01:02
Jean-Baptiste Michel: So how did we get to this conclusion?
01:04
So Erez and I were thinking about ways
01:06
to get a big picture of human culture
01:08
and human history: change over time.
01:11
So many books actually have been written over the years.
01:13
So we were thinking, well the best way to learn from them
01:15
is to read all of these millions of books.
01:17
Now of course, if there's a scale for how awesome that is,
01:20
that has to rank extremely, extremely high.
01:23
Now the problem is there's an X-axis for that,
01:25
which is the practical axis.
01:27
This is very, very low.
01:29
(Applause)
01:32
Now people tend to use an alternative approach,
01:35
which is to take a few sources and read them very carefully.
01:37
This is extremely practical, but not so awesome.
01:39
What you really want to do
01:42
is to get to the awesome yet practical part of this space.
01:45
So it turns out there was a company across the river called Google
01:48
who had started a digitization project a few years back
01:50
that might just enable this approach.
01:52
They have digitized millions of books.
01:54
So what that means is, one could use computational methods
01:57
to read all of the books in a click of a button.
01:59
That's very practical and extremely awesome.
02:03
ELA: Let me tell you a little bit about where books come from.
02:05
Since time immemorial, there have been authors.
02:08
These authors have been striving to write books.
02:11
And this became considerably easier
02:13
with the development of the printing press some centuries ago.
02:15
Since then, the authors have won
02:18
on 129 million distinct occasions,
02:20
publishing books.
02:22
Now if those books are not lost to history,
02:24
then they are somewhere in a library,
02:26
and many of those books have been getting retrieved from the libraries
02:29
and digitized by Google,
02:31
which has scanned 15 million books to date.
02:33
Now when Google digitizes a book, they put it into a really nice format.
02:36
Now we've got the data, plus we have metadata.
02:38
We have information about things like where was it published,
02:41
who was the author, when was it published.
02:43
And what we do is go through all of those records
02:46
and exclude everything that's not the highest quality data.
02:50
What we're left with
02:52
is a collection of five million books,
02:55
500 billion words,
02:58
a string of characters a thousand times longer
03:00
than the human genome --
03:03
a text which, when written out,
03:05
would stretch from here to the Moon and back
03:07
10 times over --
03:09
a veritable shard of our cultural genome.
03:13
Of course what we did
03:15
when faced with such outrageous hyperbole ...
03:18
(Laughter)
03:20
was what any self-respecting researchers
03:23
would have done.
03:26
We took a page out of XKCD,
03:28
and we said, "Stand back.
03:30
We're going to try science."
03:32
(Laughter)
03:34
JM: Now of course, we were thinking,
03:36
well let's just first put the data out there
03:38
for people to do science to it.
03:40
Now we're thinking, what data can we release?
03:42
Well of course, you want to take the books
03:44
and release the full text of these five million books.
03:46
Now Google, and Jon Orwant in particular,
03:48
told us a little equation that we should learn.
03:50
So you have five million, that is, five million authors
03:53
and five million plaintiffs is a massive lawsuit.
03:56
So, although that would be really, really awesome,
03:58
again, that's extremely, extremely impractical.
04:01
(Laughter)
04:03
Now again, we kind of caved in,
04:05
and we did the very practical approach, which was a bit less awesome.
04:08
We said, well instead of releasing the full text,
04:10
we're going to release statistics about the books.
04:12
So take for instance "A gleam of happiness."
04:14
It's four words; we call that a four-gram.
04:16
We're going to tell you how many times a particular four-gram
04:18
appeared in books in 1801, 1802, 1803,
04:20
all the way up to 2008.
04:22
That gives us a time series
04:24
of how frequently this particular sentence was used over time.
04:26
We do that for all the words and phrases that appear in those books,
04:29
and that gives us a big table of two billion lines
04:32
that tell us about the way culture has been changing.
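
The n-gram table described in this part of the talk can be illustrated with a minimal Python sketch. It is not the speakers' actual pipeline: the function names, the lowercase whitespace tokenization, and the three-book toy corpus are all assumptions made for illustration. It counts how often each four-word phrase appears per publication year, giving one time series per phrase.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def yearly_ngram_counts(books, n):
    """Map each n-gram to a {year: count} time series.

    `books` is an iterable of (year, text) pairs. Tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            table[gram][year] += 1
    return table

# Hypothetical toy corpus, invented for illustration only.
toy_books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and then a gleam of happiness again"),
    (1803, "no happiness here"),
]

table = yearly_ngram_counts(toy_books, 4)
series = table["a gleam of happiness"]
print([(year, series[year]) for year in (1801, 1802, 1803)])
# prints [(1801, 1), (1802, 2), (1803, 0)]
```

Each key of `table` plays the role of one line in the big table, and its counter is that line's year-by-year series. A real dataset would more likely report frequencies relative to the total words published each year than raw counts.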
04:34
ELA: So those two billion lines,
04:36
we call them two billion n-grams.
04:38
What do they tell us?
04:40
Well the individual n-grams measure cultural trends.
04:42
Let me give you an example.
04:44
Let's suppose that I am thriving,
04:46
then tomorrow I want to tell you about how well I did.
04:48
And so I might say, "Yesterday, I throve."
04:51
Alternatively, I could say, "Yesterday, I thrived."
04:54
Well which one should I use?
04:57
How to know?
04:59
As of about six months ago,
05:01
the state of the art in this field
05:03
is that you would, for instance,
05:05
go up to the following psychologist with fabulous hair,
05:07
and you'd say,
05:09
"Steve, you're an expert on the irregular verbs.
05:12
What should I do?"
05:14
And he'd tell you, "Well most people say thrived,
05:16
but some people say throve."
05:19
And you also knew, more or less,
05:21
that if you were to go back in time 200 years
05:24
and ask the following statesman with equally fabulous hair,
05:27
(Laughter)
05:30
"Tom, what should I say?"
05:32
He'd say, "Well, in my day, most people throve,
05:34
but some thrived."
05:37
So now what I'm just going to show you is raw data.
05:39
Two rows from this table of two billion entries.
05:43
What you're seeing is year by year frequency
05:45
of "thrived" and "throve" over time.
05:49
Now this is just two
05:51
out of two billion rows.
05:54
So the entire data set
05:56
is a billion times more awesome than this slide.
05:59
(Laughter)
06:01
(Applause)
06:05
JM: Now there are many other pictures that are worth 500 billion words.
06:07
For instance, this one.
06:09
If you just take influenza,
06:11
you will see peaks at the time where you knew
06:13
big flu epidemics were killing people around the globe.
06:16
ELA: If you were not yet convinced,
06:19
sea levels are rising,
06:21
so is atmospheric CO2 and global temperature.
06:24
JM: You might also want to have a look at this particular n-gram,
06:27
and that's to tell Nietzsche that God is not dead,
06:30
although you might agree that he might need a better publicist.
06:33
(Laughter)
06:35
ELA: You can get at some pretty abstract concepts with this sort of thing.
06:38
For instance, let me tell you the history
06:40
of the year 1950.
06:42
Pretty much for the vast majority of history,
06:44
no one gave a damn about 1950.
06:46
In 1700, in 1800, in 1900,
06:48
no one cared.
06:52
Through the 30s and 40s,
06:54
no one cared.
06:56
Suddenly, in the mid-40s,
06:58
there started to be a buzz.
07:00
People realized that 1950 was going to happen,
07:02
and it could be big.
07:04
(Laughter)
07:07
But nothing got people interested in 1950
07:10
like the year 1950.
07:13
(Laughter)
07:16
People were walking around obsessed.
07:18
They couldn't stop talking
07:20
about all the things they did in 1950,
07:23
all the things they were planning to do in 1950,
07:26
all the dreams of what they wanted to accomplish in 1950.
07:31
In fact, 1950 was so fascinating
07:33
that for years thereafter,
07:35
people just kept talking about all the amazing things that happened,
07:38
in '51, '52, '53.
07:40
Finally in 1954,
07:42
someone woke up and realized
07:44
that 1950 had gotten somewhat passé.
07:48
(Laughter)
07:50
And just like that, the bubble burst.
07:52
(Laughter)
07:54
And the story of 1950
07:56
is the story of every year that we have on record,
07:58
with a little twist, because now we've got these nice charts.
08:01
And because we have these nice charts, we can measure things.
08:04
We can say, "Well how fast does the bubble burst?"
08:06
And it turns out that we can measure that very precisely.
08:09
Equations were derived, graphs were produced,
08:12
and the net result
08:14
is that we find that the bubble bursts faster and faster
08:17
with each passing year.
08:19
We are losing interest in the past more rapidly.
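
The talk does not say which equations were used to measure how fast the bubble bursts. One plausible measure, assumed here purely for illustration, is a half-life: the number of years after the peak until mentions of a year fall to half the peak value or less. The function name and the frequencies below are hypothetical.

```python
def half_life(series):
    """Years from the peak until frequency first drops to half the peak or less.

    `series` maps year -> relative frequency. Returns None if the series
    never falls that far within the data.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None

# Hypothetical frequencies for the string "1950", invented for illustration.
mentions_of_1950 = {
    1948: 0.2, 1949: 0.5, 1950: 1.0, 1951: 0.9, 1952: 0.8,
    1953: 0.7, 1954: 0.6, 1955: 0.5, 1956: 0.45,
}
print(half_life(mentions_of_1950))  # prints 5
```

Under this assumed measure, the claim that the bubble bursts faster with each passing year would show up as shorter half-lives for later years.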
08:24
JM: Now a little piece of career advice.
08:26
So for those of you who seek to be famous,
08:28
we can learn from the 25 most famous political figures,
08:30
authors, actors and so on.
08:32
So if you want to become famous early on, you should be an actor,
08:35
because then fame starts rising by the end of your 20s --
08:37
you're still young, it's really great.
08:39
Now if you can wait a little bit, you should be an author,
08:41
because then you rise to very great heights,
08:43
like Mark Twain, for instance: extremely famous.
08:45
But if you want to reach the very top,
08:47
you should delay gratification
08:49
and, of course, become a politician.
08:51
So here you will become famous by the end of your 50s,
08:53
and become very, very famous afterward.
08:55
So scientists also tend to get famous when they're much older.
08:58
Like for instance, biologists and physicists
09:00
tend to be almost as famous as actors.
09:02
One mistake you should not do is become a mathematician.
09:05
(Laughter)
09:07
If you do that,
09:09
you might think, "Oh great. I'm going to do my best work when I'm in my 20s."
09:12
But guess what, nobody will really care.
09:14
(Laughter)
09:17
ELA: There are more sobering notes
09:19
among the n-grams.
09:21
For instance, here's the trajectory of Marc Chagall,
09:23
an artist born in 1887.
09:25
And this looks like the normal trajectory of a famous person.
09:28
He gets more and more and more famous,
09:32
except if you look in German.
09:34
If you look in German, you see something completely bizarre,
09:36
something you pretty much never see,
09:38
which is he becomes extremely famous
09:40
and then all of a sudden plummets,
09:42
going through a nadir between 1933 and 1945,
09:45
before rebounding afterward.
09:48
And of course, what we're seeing
09:50
is the fact Marc Chagall was a Jewish artist
09:53
in Nazi Germany.
09:55
Now these signals
09:57
are actually so strong
09:59
that we don't need to know that someone was censored.
10:02
We can actually figure it out
10:04
using really basic signal processing.
10:06
Here's a simple way to do it.
10:08
Well, a reasonable expectation
10:10
is that somebody's fame in a given period of time
10:12
should be roughly the average of their fame before
10:14
and their fame after.
10:16
So that's sort of what we expect.
10:18
And we compare that to the fame that we observe.
10:21
And we just divide one by the other
10:23
to produce something we call a suppression index.
10:25
If the suppression index is very, very, very small,
10:28
then you very well might be being suppressed.
10:30
If it's very large, maybe you're benefiting from propaganda.
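
The suppression index described above can be sketched in a few lines of Python. This is a rough reading of the spoken description, not the published method. Expected fame is taken as the average of the mean fame before and the mean fame after the period, and the index is observed fame divided by expected fame, so small values suggest suppression. The 10-year window, the function name, and the fame values are assumptions made for illustration.

```python
from statistics import mean

def suppression_index(fame, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of mean fame in the `window` years before
    `start` and mean fame in the `window` years after `end`. `fame` maps
    year -> frequency of the person's name. The window length is an
    assumption of this sketch.
    """
    observed = mean(fame[y] for y in range(start, end + 1))
    before = mean(fame[y] for y in range(start - window, start))
    after = mean(fame[y] for y in range(end + 1, end + 1 + window))
    expected = (before + after) / 2
    return observed / expected

# Hypothetical fame values for one name, invented for illustration.
fame = {}
fame.update({y: 4.0 for y in range(1923, 1933)})  # before the period
fame.update({y: 0.5 for y in range(1933, 1946)})  # during the period
fame.update({y: 6.0 for y in range(1946, 1956)})  # after the period

index = suppression_index(fame, 1933, 1945)
print(round(index, 3))  # prints 0.1
```

An index near 1 means observed fame matches the expectation. The toy value of 0.1 corresponds to a name mentioned about ten times less often than expected.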
10:34
JM: Now you can actually look at
10:36
the distribution of suppression indexes over whole populations.
10:39
So for instance, here --
10:41
this suppression index is for 5,000 people
10:43
picked in English books where there's no known suppression --
10:45
it would be like this, basically tightly centered on one.
10:47
What you expect is basically what you observe.
10:49
This is distribution as seen in Germany --
10:51
very different, it's shifted to the left.
10:53
People talked about it twice less as it should have been.
10:56
But much more importantly, the distribution is much wider.
10:58
There are many people who end up on the far left on this distribution
11:01
who are talked about 10 times fewer than they should have been.
11:04
But then also many people on the far right
11:06
who seem to benefit from propaganda.
11:08
This picture is the hallmark of censorship in the book record.
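
The comparison of the two distributions can be summarized numerically. The sketch below is one way to do that and is not taken from the talk: it reports the median index, to show the shift to the left, and the standard deviation of the base-10 logarithm of the index, to show the extra width. Both the choice of statistics and the sample values are assumptions made for illustration.

```python
import math
from statistics import median, pstdev

def summarize(indexes):
    """Return (median index, spread of log10(index)) for a population."""
    logs = [math.log10(x) for x in indexes]
    return median(indexes), pstdev(logs)

# Hypothetical suppression indexes, invented for illustration.
control = [0.8, 0.9, 1.0, 1.1, 1.25]   # tightly centered on one
censored = [0.1, 0.2, 0.5, 1.0, 5.0]   # shifted left and much wider

control_median, control_spread = summarize(control)
censored_median, censored_spread = summarize(censored)
print(control_median, censored_median)   # prints 1.0 0.5
print(censored_spread > control_spread)  # prints True
```

Working on a log scale treats "10 times less" and "10 times more" symmetrically, which matches how the talk describes the far left and far right of the distribution.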
11:11
ELA: So culturomics
11:13
is what we call this method.
11:15
It's kind of like genomics.
11:17
Except genomics is a lens on biology
11:19
through the window of the sequence of bases in the human genome.
11:22
Culturomics is similar.
11:24
It's the application of massive-scale data collection analysis
11:27
to the study of human culture.
11:29
Here, instead of through the lens of a genome,
11:31
through the lens of digitized pieces of the historical record.
11:34
The great thing about culturomics
11:36
is that everyone can do it.
11:38
Why can everyone do it?
11:40
Everyone can do it because three guys,
11:42
Jon Orwant, Matt Gray and Will Brockman over at Google,
11:45
saw the prototype of the Ngram Viewer,
11:47
and they said, "This is so fun.
11:49
We have to make this available for people."
11:52
So in two weeks flat -- the two weeks before our paper came out --
11:54
they coded up a version of the Ngram Viewer for the general public.
11:57
And so you too can type in any word or phrase that you're interested in
12:00
and see its n-gram immediately --
12:02
also browse examples of all the various books
12:04
in which your n-gram appears.
12:06
JM: Now this was used over a million times on the first day,
12:08
and this is really the best of all the queries.
12:10
So people want to be their best, put their best foot forward.
12:13
But it turns out in the 18th century, people didn't really care about that at all.
12:16
They didn't want to be their best, they wanted to be their beft.
12:19
So what happened is, of course, this is just a mistake.
12:22
It's not that they strove for mediocrity,
12:24
it's just that the S used to be written differently, kind of like an F.
12:27
Now of course, Google didn't pick this up at the time,
12:30
so we reported this in the science article that we wrote.
12:33
But it turns out this is just a reminder
12:35
that, although this is a lot of fun,
12:37
when you interpret these graphs, you have to be very careful,
12:39
and you have to adopt the base standards in the sciences.
12:42
ELA: People have been using this for all kinds of fun purposes.
12:45
(Laughter)
12:52
Actually, we're not going to have to talk,
12:54
we're just going to show you all the slides and remain silent.
12:57
This person was interested in the history of frustration.
13:00
There's various types of frustration.
13:03
If you stub your toe, that's a one A "argh."
13:06
If the planet Earth is annihilated by the Vogons
13:08
to make room for an interstellar bypass,
13:10
that's an eight A "aaaaaaaargh."
13:12
This person studies all the "arghs,"
13:14
from one through eight A's.
13:16
And it turns out
13:18
that the less-frequent "arghs"
13:20
are, of course, the ones that correspond to things that are more frustrating --
13:23
except, oddly, in the early 80s.
13:26
We think that might have something to do with Reagan.
13:28
(Laughter)
13:30
JM: There are many usages of this data,
13:33
but the bottom line is that the historical record is being digitized.
13:36
Google has started to digitize 15 million books.
13:38
That's 12 percent of all the books that have ever been published.
13:40
It's a sizable chunk of human culture.
13:43
There's much more in culture: there's manuscripts, there's newspapers,
13:46
there's things that are not text, like art and paintings.
13:48
These all happen to be on our computers,
13:50
on computers across the world.
13:52
And when that happens, that will transform the way we have
13:55
to understand our past, our present and human culture.
13:57
Thank you very much.
13:59
(Applause)
Translated by Lili Liang
Reviewed by dahong zhang
