ABOUT THE SPEAKERS
Jean-Baptiste Michel - Data researcher
Jean-Baptiste Michel looks at how we can use large volumes of data to better understand our world.

Why you should listen

Jean-Baptiste Michel holds joint academic appointments at Harvard (FQEB Fellow) and Google (Visiting Faculty). His research focusses on using large volumes of data as tools that help better understand the world around us -- from the way diseases progress in patients over years, to the way cultures change in human societies over centuries. With his colleague Erez Lieberman Aiden, Jean-Baptiste is a Founding Director of Harvard's Cultural Observatory, where their research team pioneers the use of quantitative methods for the study of human culture, language and history. His research was featured on the covers of Science and Nature, on the front pages of the New York Times and the Boston Globe, in The Economist, Wired and many other venues. The online tool he helped create -- ngrams.googlelabs.com -- was used millions of times to browse cultural trends. Jean-Baptiste is an Engineer from Ecole Polytechnique (Paris), and holds an MS in Applied Mathematics and a PhD in Systems Biology from Harvard.

Erez Lieberman Aiden - Researcher
Erez Lieberman Aiden pursues a broad range of research interests, spanning genomics, linguistics, mathematics ...

Why you should listen

Erez Lieberman Aiden is a fellow at the Harvard Society of Fellows and Visiting Faculty at Google. His research spans many disciplines and has won numerous awards, including recognition for one of the top 20 "Biotech Breakthroughs that will Change Medicine", by Popular Mechanics; the Lemelson-MIT prize for the best student inventor at MIT; the American Physical Society's Award for the Best Doctoral Dissertation in Biological Physics; and membership in Technology Review's 2009 TR35, recognizing the top 35 innovators under 35. His last three papers -- two with JB Michel -- have all appeared on the cover of Nature and Science.

TEDxBoston 2011

Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

2,049,453 views

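Have you ever used the Ngram Viewer developed by Google Labs? It is a fascinating tool that lets you search for words and ideas in a database of five million books spanning centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show us how the tool works, along with some of the surprising things we can learn from these 500 billion words.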
00:15
Erez Lieberman Aiden: Everyone knows
00:17
that a picture is worth a thousand words.
00:22
But we at Harvard
00:24
were wondering if this was really true.
00:27
(Laughter)
00:29
So we assembled a team of experts,
00:33
spanning Harvard, MIT,
00:35
The American Heritage Dictionary, The Encyclopedia Britannica
00:38
and even our proud sponsors,
00:40
the Google.
00:43
And we cogitated about this
00:45
for about four years.
00:47
And we came to a startling conclusion.
00:52
Ladies and gentlemen, a picture is not worth a thousand words.
00:55
In fact, we found some pictures
00:57
that are worth 500 billion words.
01:02
Jean-Baptiste Michel: So how did we get to this conclusion?
01:04
So Erez and I were thinking about ways
01:06
to get a big picture of human culture
01:08
and human history: change over time.
01:11
So many books actually have been written over the years.
01:13
So we were thinking, well the best way to learn from them
01:15
is to read all of these millions of books.
01:17
Now of course, if there's a scale for how awesome that is,
01:20
that has to rank extremely, extremely high.
01:23
Now the problem is there's an X-axis for that,
01:25
which is the practical axis.
01:27
This is very, very low.
01:29
(Applause)
01:32
Now people tend to use an alternative approach,
01:35
which is to take a few sources and read them very carefully.
01:37
This is extremely practical, but not so awesome.
01:39
What you really want to do
01:42
is to get to the awesome yet practical part of this space.
01:45
So it turns out there was a company across the river called Google
01:48
who had started a digitization project a few years back
01:50
that might just enable this approach.
01:52
They have digitized millions of books.
01:54
So what that means is, one could use computational methods
01:57
to read all of the books in a click of a button.
01:59
That's very practical and extremely awesome.
02:03
ELA: Let me tell you a little bit about where books come from.
02:05
Since time immemorial, there have been authors.
02:08
These authors have been striving to write books.
02:11
And this became considerably easier
02:13
with the development of the printing press some centuries ago.
02:15
Since then, the authors have won
02:18
on 129 million distinct occasions,
02:20
publishing books.
02:22
Now if those books are not lost to history,
02:24
then they are somewhere in a library,
02:26
and many of those books have been getting retrieved from the libraries
02:29
and digitized by Google,
02:31
which has scanned 15 million books to date.
02:33
Now when Google digitizes a book, they put it into a really nice format.
02:36
Now we've got the data, plus we have metadata.
02:38
We have information about things like where was it published,
02:41
who was the author, when was it published.
02:43
And what we do is go through all of those records
02:46
and exclude everything that's not the highest quality data.
02:50
What we're left with
02:52
is a collection of five million books,
02:55
500 billion words,
02:58
a string of characters a thousand times longer
03:00
than the human genome --
03:03
a text which, when written out,
03:05
would stretch from here to the Moon and back
03:07
10 times over --
03:09
a veritable shard of our cultural genome.
03:13
Of course what we did
03:15
when faced with such outrageous hyperbole ...
03:18
(Laughter)
03:20
was what any self-respecting researchers
03:23
would have done.
03:26
We took a page out of XKCD,
03:28
and we said, "Stand back.
03:30
We're going to try science."
03:32
(Laughter)
03:34
JM: Now of course, we were thinking,
03:36
well let's just first put the data out there
03:38
for people to do science to it.
03:40
Now we're thinking, what data can we release?
03:42
Well of course, you want to take the books
03:44
and release the full text of these five million books.
03:46
Now Google, and Jon Orwant in particular,
03:48
told us a little equation that we should learn.
03:50
So you have five million, that is, five million authors
03:53
and five million plaintiffs is a massive lawsuit.
03:56
So, although that would be really, really awesome,
03:58
again, that's extremely, extremely impractical.
04:01
(Laughter)
04:03
Now again, we kind of caved in,
04:05
and we did the very practical approach, which was a bit less awesome.
04:08
We said, well instead of releasing the full text,
04:10
we're going to release statistics about the books.
04:12
So take for instance "A gleam of happiness."
04:14
It's four words; we call that a four-gram.
04:16
We're going to tell you how many times a particular four-gram
04:18
appeared in books in 1801, 1802, 1803,
04:20
all the way up to 2008.
04:22
That gives us a time series
04:24
of how frequently this particular sentence was used over time.
04:26
We do that for all the words and phrases that appear in those books,
04:29
and that gives us a big table of two billion lines
04:32
that tell us about the way culture has been changing.
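The counting step JM describes, one row per word or phrase with a count for each publication year, can be sketched in a few lines. This is only a toy illustration and not the pipeline used for the real data set: the function names, the whitespace tokenizer, the cap at four-grams, and the two-book corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def yearly_ngram_counts(books, max_n=4):
    """Map each n-gram (as a string) to a {year: count} time series.

    `books` is an iterable of (year, text) pairs. Splitting on whitespace
    and lowercasing are simplifications made for this sketch.
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                table[" ".join(gram)][year] += 1
    return table

# Toy corpus, invented purely for illustration.
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of hope"),
]
table = yearly_ngram_counts(books)
print(dict(table["a gleam of happiness"]))  # {1801: 1, 1802: 1}
print(dict(table["a gleam of"]))            # {1801: 1, 1802: 2}
```

Each key of `table` plays the role of one line in the big table from the talk. Dividing a count by the total number of n-grams published that year would turn it into a frequency, which is a natural next step but is not shown here.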
04:34
ELA: So those two billion lines,
04:36
we call them two billion n-grams.
04:38
What do they tell us?
04:40
Well the individual n-grams measure cultural trends.
04:42
Let me give you an example.
04:44
Let's suppose that I am thriving,
04:46
then tomorrow I want to tell you about how well I did.
04:48
And so I might say, "Yesterday, I throve."
04:51
Alternatively, I could say, "Yesterday, I thrived."
04:54
Well which one should I use?
04:57
How to know?
04:59
As of about six months ago,
05:01
the state of the art in this field
05:03
is that you would, for instance,
05:05
go up to the following psychologist with fabulous hair,
05:07
and you'd say,
05:09
"Steve, you're an expert on the irregular verbs.
05:12
What should I do?"
05:14
And he'd tell you, "Well most people say thrived,
05:16
but some people say throve."
05:19
And you also knew, more or less,
05:21
that if you were to go back in time 200 years
05:24
and ask the following statesman with equally fabulous hair,
05:27
(Laughter)
05:30
"Tom, what should I say?"
05:32
He'd say, "Well, in my day, most people throve,
05:34
but some thrived."
05:37
So now what I'm just going to show you is raw data.
05:39
Two rows from this table of two billion entries.
05:43
What you're seeing is year by year frequency
05:45
of "thrived" and "throve" over time.
05:49
Now this is just two
05:51
out of two billion rows.
05:54
So the entire data set
05:56
is a billion times more awesome than this slide.
05:59
(Laughter)
06:01
(Applause)
06:05
JM: Now there are many other pictures that are worth 500 billion words.
06:07
For instance, this one.
06:09
If you just take influenza,
06:11
you will see peaks at the time where you knew
06:13
big flu epidemics were killing people around the globe.
06:16
ELA: If you were not yet convinced,
06:19
sea levels are rising,
06:21
so is atmospheric CO2 and global temperature.
06:24
JM: You might also want to have a look at this particular n-gram,
06:27
and that's to tell Nietzsche that God is not dead,
06:30
although you might agree that he might need a better publicist.
06:33
(Laughter)
06:35
ELA: You can get at some pretty abstract concepts with this sort of thing.
06:38
For instance, let me tell you the history
06:40
of the year 1950.
06:42
Pretty much for the vast majority of history,
06:44
no one gave a damn about 1950.
06:46
In 1700, in 1800, in 1900,
06:48
no one cared.
06:52
Through the 30s and 40s,
06:54
no one cared.
06:56
Suddenly, in the mid-40s,
06:58
there started to be a buzz.
07:00
People realized that 1950 was going to happen,
07:02
and it could be big.
07:04
(Laughter)
07:07
But nothing got people interested in 1950
07:10
like the year 1950.
07:13
(Laughter)
07:16
People were walking around obsessed.
07:18
They couldn't stop talking
07:20
about all the things they did in 1950,
07:23
all the things they were planning to do in 1950,
07:26
all the dreams of what they wanted to accomplish in 1950.
07:31
In fact, 1950 was so fascinating
07:33
that for years thereafter,
07:35
people just kept talking about all the amazing things that happened,
07:38
in '51, '52, '53.
07:40
Finally in 1954,
07:42
someone woke up and realized
07:44
that 1950 had gotten somewhat passé.
07:48
(Laughter)
07:50
And just like that, the bubble burst.
07:52
(Laughter)
07:54
And the story of 1950
07:56
is the story of every year that we have on record,
07:58
with a little twist, because now we've got these nice charts.
08:01
And because we have these nice charts, we can measure things.
08:04
We can say, "Well how fast does the bubble burst?"
08:06
And it turns out that we can measure that very precisely.
08:09
Equations were derived, graphs were produced,
08:12
and the net result
08:14
is that we find that the bubble bursts faster and faster
08:17
with each passing year.
08:19
We are losing interest in the past more rapidly.
08:24
JM: Now a little piece of career advice.
08:26
So for those of you who seek to be famous,
08:28
we can learn from the 25 most famous political figures,
08:30
authors, actors and so on.
08:32
So if you want to become famous early on, you should be an actor,
08:35
because then fame starts rising by the end of your 20s --
08:37
you're still young, it's really great.
08:39
Now if you can wait a little bit, you should be an author,
08:41
because then you rise to very great heights,
08:43
like Mark Twain, for instance: extremely famous.
08:45
But if you want to reach the very top,
08:47
you should delay gratification
08:49
and, of course, become a politician.
08:51
So here you will become famous by the end of your 50s,
08:53
and become very, very famous afterward.
08:55
So scientists also tend to get famous when they're much older.
08:58
Like for instance, biologists and physicists
09:00
tend to be almost as famous as actors.
09:02
One mistake you should not do is become a mathematician.
09:05
(Laughter)
09:07
If you do that,
09:09
you might think, "Oh great. I'm going to do my best work when I'm in my 20s."
09:12
But guess what, nobody will really care.
09:14
(Laughter)
09:17
ELA: There are more sobering notes
09:19
among the n-grams.
09:21
For instance, here's the trajectory of Marc Chagall,
09:23
an artist born in 1887.
09:25
And this looks like the normal trajectory of a famous person.
09:28
He gets more and more and more famous,
09:32
except if you look in German.
09:34
If you look in German, you see something completely bizarre,
09:36
something you pretty much never see,
09:38
which is he becomes extremely famous
09:40
and then all of a sudden plummets,
09:42
going through a nadir between 1933 and 1945,
09:45
before rebounding afterward.
09:48
And of course, what we're seeing
09:50
is the fact Marc Chagall was a Jewish artist
09:53
in Nazi Germany.
09:55
Now these signals
09:57
are actually so strong
09:59
that we don't need to know that someone was censored.
10:02
We can actually figure it out
10:04
using really basic signal processing.
10:06
Here's a simple way to do it.
10:08
Well, a reasonable expectation
10:10
is that somebody's fame in a given period of time
10:12
should be roughly the average of their fame before
10:14
and their fame after.
10:16
So that's sort of what we expect.
10:18
And we compare that to the fame that we observe.
10:21
And we just divide one by the other
10:23
to produce something we call a suppression index.
10:25
If the suppression index is very, very, very small,
10:28
then you very well might be being suppressed.
10:30
If it's very large, maybe you're benefiting from propaganda.
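The suppression index ELA describes can be written down directly: expected fame is the average of the fame before and after a period, and the index is the observed fame divided by that expectation. The sketch below follows that description only. The function name, the use of a single number per period, and the sample values are assumptions for illustration, not figures from the study.

```python
def suppression_index(fame_before, fame_during, fame_after):
    """Observed fame divided by expected fame for the middle period.

    Expected fame is the average of the fame before and after the period.
    Values near 1 mean nothing unusual, values far below 1 hint at
    suppression, and values far above 1 hint at promotion.
    """
    expected = (fame_before + fame_after) / 2
    if expected == 0:
        raise ValueError("no fame before or after the period; index undefined")
    return fame_during / expected

# Hypothetical relative frequencies of a name, invented for illustration.
print(suppression_index(4e-7, 4.2e-7, 4.4e-7))  # close to 1
print(suppression_index(4e-7, 0.4e-7, 4.4e-7))  # well below 1
```

How "fame" is measured in each period (for example, the mean yearly frequency of a person's name) is left open here, since the talk does not spell it out.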
10:34
JM: Now you can actually look at
10:36
the distribution of suppression indexes over whole populations.
10:39
So for instance, here --
10:41
this suppression index is for 5,000 people
10:43
picked in English books where there's no known suppression --
10:45
it would be like this, basically tightly centered on one.
10:47
What you expect is basically what you observe.
10:49
This is distribution as seen in Germany --
10:51
very different, it's shifted to the left.
10:53
People talked about it twice less as it should have been.
10:56
But much more importantly, the distribution is much wider.
10:58
There are many people who end up on the far left on this distribution
11:01
who are talked about 10 times fewer than they should have been.
11:04
But then also many people on the far right
11:06
who seem to benefit from propaganda.
11:08
This picture is the hallmark of censorship in the book record.
11:11
ELA: So culturomics
11:13
is what we call this method.
11:15
It's kind of like genomics.
11:17
Except genomics is a lens on biology
11:19
through the window of the sequence of bases in the human genome.
11:22
Culturomics is similar.
11:24
It's the application of massive-scale data collection analysis
11:27
to the study of human culture.
11:29
Here, instead of through the lens of a genome,
11:31
through the lens of digitized pieces of the historical record.
11:34
The great thing about culturomics
11:36
is that everyone can do it.
11:38
Why can everyone do it?
11:40
Everyone can do it because three guys,
11:42
Jon Orwant, Matt Gray and Will Brockman over at Google,
11:45
saw the prototype of the Ngram Viewer,
11:47
and they said, "This is so fun.
11:49
We have to make this available for people."
11:52
So in two weeks flat -- the two weeks before our paper came out --
11:54
they coded up a version of the Ngram Viewer for the general public.
11:57
And so you too can type in any word or phrase that you're interested in
12:00
and see its n-gram immediately --
12:02
also browse examples of all the various books
12:04
in which your n-gram appears.
12:06
JM: Now this was used over a million times on the first day,
12:08
and this is really the best of all the queries.
12:10
So people want to be their best, put their best foot forward.
12:13
But it turns out in the 18th century, people didn't really care about that at all.
12:16
They didn't want to be their best, they wanted to be their beft.
12:19
So what happened is, of course, this is just a mistake.
12:22
It's not that they strove for mediocrity,
12:24
it's just that the S used to be written differently, kind of like an F.
12:27
Now of course, Google didn't pick this up at the time,
12:30
so we reported this in the science article that we wrote.
12:33
But it turns out this is just a reminder
12:35
that, although this is a lot of fun,
12:37
when you interpret these graphs, you have to be very careful,
12:39
and you have to adopt the base standards in the sciences.
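One way to guard against the "beft" problem when reading counts is to fold suspected long-s misreadings back into the intended word before comparing frequencies. The sketch below is a hypothetical heuristic, not the correction applied by Google or by the speakers: the function names, the vocabulary set, the rule that a token is only rewritten when it is unknown and exactly one f-to-s spelling is known, and the counts themselves are all assumptions for illustration.

```python
from collections import Counter
from itertools import product

def long_s_candidates(token):
    """Yield every spelling obtained by turning one or more 'f's into 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    for choice in product((False, True), repeat=len(positions)):
        if not any(choice):
            continue  # skip the unchanged spelling
        chars = list(token)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        yield "".join(chars)

def merge_long_s(counts, vocabulary):
    """Fold counts of likely long-s misreadings into the intended word."""
    merged = Counter()
    for token, count in counts.items():
        target = token
        if token not in vocabulary:
            matches = [c for c in long_s_candidates(token) if c in vocabulary]
            if len(matches) == 1:  # rewrite only when unambiguous
                target = matches[0]
        merged[target] += count
    return merged

# Invented counts and vocabulary, for illustration only.
counts = Counter({"best": 120, "beft": 45, "left": 30})
vocabulary = {"best", "left", "lest"}
print(merge_long_s(counts, vocabulary))
```

Here "beft" is merged into "best", while "left" is kept as it is because it is already a known word. A real correction would also need to handle ambiguous cases and period spelling, which this sketch ignores.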
12:42
ELA: People have been using this for all kinds of fun purposes.
12:45
(Laughter)
12:52
Actually, we're not going to have to talk,
12:54
we're just going to show you all the slides and remain silent.
12:57
This person was interested in the history of frustration.
13:00
There's various types of frustration.
13:03
If you stub your toe, that's a one A "argh."
13:06
If the planet Earth is annihilated by the Vogons
13:08
to make room for an interstellar bypass,
13:10
that's an eight A "aaaaaaaargh."
13:12
This person studies all the "arghs,"
13:14
from one through eight A's.
13:16
And it turns out
13:18
that the less-frequent "arghs"
13:20
are, of course, the ones that correspond to things that are more frustrating --
13:23
except, oddly, in the early 80s.
13:26
We think that might have something to do with Reagan.
13:28
(Laughter)
13:30
JM: There are many usages of this data,
13:33
but the bottom line is that the historical record is being digitized.
13:36
Google has started to digitize 15 million books.
13:38
That's 12 percent of all the books that have ever been published.
13:40
It's a sizable chunk of human culture.
13:43
There's much more in culture: there's manuscripts, there's newspapers,
13:46
there's things that are not text, like art and paintings.
13:48
These all happen to be on our computers,
13:50
on computers across the world.
13:52
And when that happens, that will transform the way we have
13:55
to understand our past, our present and human culture.
13:57
Thank you very much.
13:59
(Applause)
Translated by Joyce Chou
Reviewed by Qi Gu
