ABOUT THE SPEAKERS

Jean-Baptiste Michel - Data researcher
Jean-Baptiste Michel looks at how we can use large volumes of data to better understand our world.

Why you should listen

Jean-Baptiste Michel holds joint academic appointments at Harvard (FQEB Fellow) and Google (Visiting Faculty). His research focusses on using large volumes of data as tools that help better understand the world around us -- from the way diseases progress in patients over years, to the way cultures change in human societies over centuries. With his colleague Erez Lieberman Aiden, Jean-Baptiste is a Founding Director of Harvard's Cultural Observatory, where their research team pioneers the use of quantitative methods for the study of human culture, language and history. His research was featured on the covers of Science and Nature, on the front pages of the New York Times and the Boston Globe, in The Economist, Wired and many other venues. The online tool he helped create -- ngrams.googlelabs.com -- was used millions of times to browse cultural trends. Jean-Baptiste is an Engineer from Ecole Polytechnique (Paris), and holds an MS in Applied Mathematics and a PhD in Systems Biology from Harvard.


Erez Lieberman Aiden - Researcher
Erez Lieberman Aiden pursues a broad range of research interests, spanning genomics, linguistics, mathematics ...

Why you should listen

Erez Lieberman Aiden is a fellow at the Harvard Society of Fellows and Visiting Faculty at Google. His research spans many disciplines and has won numerous awards, including recognition for one of the top 20 "Biotech Breakthroughs that will Change Medicine", by Popular Mechanics; the Lemelson-MIT prize for the best student inventor at MIT; the American Physical Society's Award for the Best Doctoral Dissertation in Biological Physics; and membership in Technology Review's 2009 TR35, recognizing the top 35 innovators under 35. His last three papers -- two with JB Michel -- have all appeared on the cover of Nature and Science.


TEDxBoston 2011

Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books


Filmed: 2011-07-24


2,049,453 views

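Have you used the Ngram Viewer developed by Google Labs? It is an engaging tool that lets you search for words and ideas in a database of 5 million books spanning several centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show how it works and share some of the surprising things we can learn from these 500 billion words.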


00:15

Erez Lieberman Aiden: Everyone knows

00:17

that a picture is worth a thousand words.

00:22

But we at Harvard

00:24

were wondering if this was really true.

00:27

(Laughter)

00:29

So we assembled a team of experts,

00:33

spanning Harvard, MIT,

00:35

The American Heritage Dictionary, The Encyclopedia Britannica

00:38

and even our proud sponsors,

00:40

the Google.

00:43

And we cogitated about this

00:45

for about four years.

00:47

And we came to a startling conclusion.

00:52

Ladies and gentlemen, a picture is not worth a thousand words.

00:55

In fact, we found some pictures

00:57

that are worth 500 billion words.

01:02

Jean-Baptiste Michel: So how did we get to this conclusion?

01:04

So Erez and I were thinking about ways

01:06

to get a big picture of human culture

01:08

and human history: change over time.

01:11

So many books actually have been written over the years.

01:13

So we were thinking, well the best way to learn from them

01:15

is to read all of these millions of books.

01:17

Now of course, if there's a scale for how awesome that is,

01:20

that has to rank extremely, extremely high.

01:23

Now the problem is there's an X-axis for that,

01:25

which is the practical axis.

01:27

This is very, very low.

01:29

(Applause)

01:32

Now people tend to use an alternative approach,

01:35

which is to take a few sources and read them very carefully.

01:37

This is extremely practical, but not so awesome.

01:39

What you really want to do

01:42

is to get to the awesome yet practical part of this space.

01:45

So it turns out there was a company across the river called Google

01:48

who had started a digitization project a few years back

01:50

that might just enable this approach.

01:52

They have digitized millions of books.

01:54

So what that means is, one could use computational methods

01:57

to read all of the books in a click of a button.

01:59

That's very practical and extremely awesome.

02:03

ELA: Let me tell you a little bit about where books come from.

02:05

Since time immemorial, there have been authors.

02:08

These authors have been striving to write books.

02:11

And this became considerably easier

02:13

with the development of the printing press some centuries ago.

02:15

Since then, the authors have won

02:18

on 129 million distinct occasions,

02:20

publishing books.

02:22

Now if those books are not lost to history,

02:24

then they are somewhere in a library,

02:26

and many of those books have been getting retrieved from the libraries

02:29

and digitized by Google,

02:31

which has scanned 15 million books to date.

02:33

Now when Google digitizes a book, they put it into a really nice format.

02:36

Now we've got the data, plus we have metadata.

02:38

We have information about things like where was it published,

02:41

who was the author, when was it published.

02:43

And what we do is go through all of those records

02:46

and exclude everything that's not the highest quality data.

02:50

What we're left with

02:52

is a collection of five million books,

02:55

500 billion words,

02:58

a string of characters a thousand times longer

03:00

than the human genome --

03:03

a text which, when written out,

03:05

would stretch from here to the Moon and back

03:07

10 times over --

03:09

a veritable shard of our cultural genome.

03:13

Of course what we did

03:15

when faced with such outrageous hyperbole ...

03:18

(Laughter)

03:20

was what any self-respecting researchers

03:23

would have done.

03:26

We took a page out of XKCD,

03:28

and we said, "Stand back.

03:30

We're going to try science."

03:32

(Laughter)

03:34

JM: Now of course, we were thinking,

03:36

well let's just first put the data out there

03:38

for people to do science to it.

03:40

Now we're thinking, what data can we release?

03:42

Well of course, you want to take the books

03:44

and release the full text of these five million books.

03:46

Now Google, and Jon Orwant in particular,

03:48

told us a little equation that we should learn.

03:50

So you have five million, that is, five million authors

03:53

and five million plaintiffs is a massive lawsuit.

03:56

So, although that would be really, really awesome,

03:58

again, that's extremely, extremely impractical.

04:01

(Laughter)

04:03

Now again, we kind of caved in,

04:05

and we did the very practical approach, which was a bit less awesome.

04:08

We said, well instead of releasing the full text,

04:10

we're going to release statistics about the books.

04:12

So take for instance "A gleam of happiness."

04:14

It's four words; we call that a four-gram.

04:16

We're going to tell you how many times a particular four-gram

04:18

appeared in books in 1801, 1802, 1803,

04:20

all the way up to 2008.

04:22

That gives us a time series

04:24

of how frequently this particular sentence was used over time.

04:26

We do that for all the words and phrases that appear in those books,

04:29

and that gives us a big table of two billion lines

04:32

that tell us about the way culture has been changing.
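
The counting step described in this passage can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline the speakers used: the function names `ngrams` and `yearly_counts`, the whitespace tokenizer, and the two-book toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))

def yearly_counts(books, n):
    """Map each n-gram to a Counter of {year: occurrences}.

    books is an iterable of (publication_year, text) pairs.
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        for gram in ngrams(tokens, n):
            table[" ".join(gram)][year] += 1
    return table

# Toy corpus standing in for the digitized books (hypothetical data)
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "he felt a gleam of happiness and a gleam of happiness again"),
]

table = yearly_counts(books, 4)
# One row of the table: the time series for a single four-gram
series = [table["a gleam of happiness"][year] for year in (1801, 1802, 1803)]
print(series)  # [1, 2, 0]
```

Each key of `table` plays the role of one line in the "big table" mentioned in the talk. Turning these raw counts into frequencies would additionally require dividing by the total number of n-grams seen in each year, which this sketch leaves out.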

04:34

ELA: So those two billion lines,

04:36

we call them two billion n-grams.

04:38

What do they tell us?

04:40

Well the individual n-grams measure cultural trends.

04:42

Let me give you an example.

04:44

Let's suppose that I am thriving,

04:46

then tomorrow I want to tell you about how well I did.

04:48

And so I might say, "Yesterday, I throve."

04:51

Alternatively, I could say, "Yesterday, I thrived."

04:54

Well which one should I use?

04:57

How to know?

04:59

As of about six months ago,

05:01

the state of the art in this field

05:03

is that you would, for instance,

05:05

go up to the following psychologist with fabulous hair,

05:07

and you'd say,

05:09

"Steve, you're an expert on the irregular verbs.

05:12

What should I do?"

05:14

And he'd tell you, "Well most people say thrived,

05:16

but some people say throve."

05:19

And you also knew, more or less,

05:21

that if you were to go back in time 200 years

05:24

and ask the following statesman with equally fabulous hair,

05:27

(Laughter)

05:30

"Tom, what should I say?"

05:32

He'd say, "Well, in my day, most people throve,

05:34

but some thrived."

05:37

So now what I'm just going to show you is raw data.

05:39

Two rows from this table of two billion entries.

05:43

What you're seeing is year by year frequency

05:45

of "thrived" and "throve" over time.

05:49

Now this is just two

05:51

out of two billion rows.

05:54

So the entire data set

05:56

is a billion times more awesome than this slide.

05:59

(Laughter)

06:01

(Applause)

06:05

JM: Now there are many other pictures that are worth 500 billion words.

06:07

For instance, this one.

06:09

If you just take influenza,

06:11

you will see peaks at the time where you knew

06:13

big flu epidemics were killing people around the globe.

06:16

ELA: If you were not yet convinced,

06:19

sea levels are rising,

06:21

so is atmospheric CO2 and global temperature.

06:24

JM: You might also want to have a look at this particular n-gram,

06:27

and that's to tell Nietzsche that God is not dead,

06:30

although you might agree that he might need a better publicist.

06:33

(Laughter)

06:35

ELA: You can get at some pretty abstract concepts with this sort of thing.

06:38

For instance, let me tell you the history

06:40

of the year 1950.

06:42

Pretty much for the vast majority of history,

06:44

no one gave a damn about 1950.

06:46

In 1700, in 1800, in 1900,

06:48

no one cared.

06:52

Through the 30s and 40s,

06:54

no one cared.

06:56

Suddenly, in the mid-40s,

06:58

there started to be a buzz.

07:00

People realized that 1950 was going to happen,

07:02

and it could be big.

07:04

(Laughter)

07:07

But nothing got people interested in 1950

07:10

like the year 1950.

07:13

(Laughter)

07:16

People were walking around obsessed.

07:18

They couldn't stop talking

07:20

about all the things they did in 1950,

07:23

all the things they were planning to do in 1950,

07:26

all the dreams of what they wanted to accomplish in 1950.

07:31

In fact, 1950 was so fascinating

07:33

that for years thereafter,

07:35

people just kept talking about all the amazing things that happened,

07:38

in '51, '52, '53.

07:40

Finally in 1954,

07:42

someone woke up and realized

07:44

that 1950 had gotten somewhat passé.

07:48

(Laughter)

07:50

And just like that, the bubble burst.

07:52

(Laughter)

07:54

And the story of 1950

07:56

is the story of every year that we have on record,

07:58

with a little twist, because now we've got these nice charts.

08:01

And because we have these nice charts, we can measure things.

08:04

We can say, "Well how fast does the bubble burst?"

08:06

And it turns out that we can measure that very precisely.

08:09

Equations were derived, graphs were produced,

08:12

and the net result

08:14

is that we find that the bubble bursts faster and faster

08:17

with each passing year.

08:19

We are losing interest in the past more rapidly.

08:24

JM: Now a little piece of career advice.

08:26

So for those of you who seek to be famous,

08:28

we can learn from the 25 most famous political figures,

08:30

authors, actors and so on.

08:32

So if you want to become famous early on, you should be an actor,

08:35

because then fame starts rising by the end of your 20s --

08:37

you're still young, it's really great.

08:39

Now if you can wait a little bit, you should be an author,

08:41

because then you rise to very great heights,

08:43

like Mark Twain, for instance: extremely famous.

08:45

But if you want to reach the very top,

08:47

you should delay gratification

08:49

and, of course, become a politician.

08:51

So here you will become famous by the end of your 50s,

08:53

and become very, very famous afterward.

08:55

So scientists also tend to get famous when they're much older.

08:58

Like for instance, biologists and physicists

09:00

tend to be almost as famous as actors.

09:02

One mistake you should not do is become a mathematician.

09:05

(Laughter)

09:07

If you do that,

09:09

you might think, "Oh great. I'm going to do my best work when I'm in my 20s."

09:12

But guess what, nobody will really care.

09:14

(Laughter)

09:17

ELA: There are more sobering notes

09:19

among the n-grams.

09:21

For instance, here's the trajectory of Marc Chagall,

09:23

an artist born in 1887.

09:25

And this looks like the normal trajectory of a famous person.

09:28

He gets more and more and more famous,

09:32

except if you look in German.

09:34

If you look in German, you see something completely bizarre,

09:36

something you pretty much never see,

09:38

which is he becomes extremely famous

09:40

and then all of a sudden plummets,

09:42

going through a nadir between 1933 and 1945,

09:45

before rebounding afterward.

09:48

And of course, what we're seeing

09:50

is the fact Marc Chagall was a Jewish artist

09:53

in Nazi Germany.

09:55

Now these signals

09:57

are actually so strong

09:59

that we don't need to know that someone was censored.

10:02

We can actually figure it out

10:04

using really basic signal processing.

10:06

Here's a simple way to do it.

10:08

Well, a reasonable expectation

10:10

is that somebody's fame in a given period of time

10:12

should be roughly the average of their fame before

10:14

and their fame after.

10:16

So that's sort of what we expect.

10:18

And we compare that to the fame that we observe.

10:21

And we just divide one by the other

10:23

to produce something we call a suppression index.

10:25

If the suppression index is very, very, very small,

10:28

then you very well might be being suppressed.

10:30

If it's very large, maybe you're benefiting from propaganda.
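
The suppression index described here can be written out as a short Python sketch. It assumes the ratio is observed fame divided by expected fame, which is the direction implied by "very small" meaning suppressed; the function name `suppression_index` and the sample frequencies are hypothetical and are not taken from the speakers' data.

```python
def suppression_index(before, during, after):
    """Observed fame in a period divided by the fame expected from its neighbors."""
    # Expected fame: "the average of their fame before and their fame after"
    expected = (before + after) / 2
    if expected == 0:
        raise ValueError("no mentions before or after the period; index undefined")
    return during / expected

# Hypothetical mean yearly frequencies (mentions per million words), not real data
suppressed = suppression_index(before=4.0, during=0.5, after=6.0)
typical = suppression_index(before=4.0, during=5.0, after=6.0)
print(suppressed)  # 0.1
print(typical)     # 1.0
```

A value near 1 means the observed fame matches the expectation, a value far below 1 points toward possible suppression, and a value far above 1 toward possible promotion.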

10:34

JM: Now you can actually look at

10:36

the distribution of suppression indexes over whole populations.

10:39

So for instance, here --

10:41

this suppression index is for 5,000 people

10:43

picked in English books where there's no known suppression --

10:45

it would be like this, basically tightly centered on one.

10:47

What you expect is basically what you observe.

10:49

This is distribution as seen in Germany --

10:51

very different, it's shifted to the left.

10:53

People talked about it twice less as it should have been.

10:56

But much more importantly, the distribution is much wider.

10:58

There are many people who end up on the far left on this distribution

11:01

who are talked about 10 times fewer than they should have been.

11:04

But then also many people on the far right

11:06

who seem to benefit from propaganda.

11:08

This picture is the hallmark of censorship in the book record.

11:11

ELA: So culturomics

11:13

is what we call this method.

11:15

It's kind of like genomics.

11:17

Except genomics is a lens on biology

11:19

through the window of the sequence of bases in the human genome.

11:22

Culturomics is similar.

11:24

It's the application of massive-scale data collection analysis

11:27

to the study of human culture.

11:29

Here, instead of through the lens of a genome,

11:31

through the lens of digitized pieces of the historical record.

11:34

The great thing about culturomics

11:36

is that everyone can do it.

11:38

Why can everyone do it?

11:40

Everyone can do it because three guys,

11:42

Jon Orwant, Matt Gray and Will Brockman over at Google,

11:45

saw the prototype of the Ngram Viewer,

11:47

and they said, "This is so fun.

11:49

We have to make this available for people."

11:52

So in two weeks flat -- the two weeks before our paper came out --

11:54

they coded up a version of the Ngram Viewer for the general public.

11:57

And so you too can type in any word or phrase that you're interested in

12:00

and see its n-gram immediately --

12:02

also browse examples of all the various books

12:04

in which your n-gram appears.

12:06

JM: Now this was used over a million times on the first day,

12:08

and this is really the best of all the queries.

12:10

So people want to be their best, put their best foot forward.

12:13

But it turns out in the 18th century, people didn't really care about that at all.

12:16

They didn't want to be their best, they wanted to be their beft.

12:19

So what happened is, of course, this is just a mistake.

12:22

It's not that they strove for mediocrity,

12:24

it's just that the S used to be written differently, kind of like an F.

12:27

Now of course, Google didn't pick this up at the time,

12:30

so we reported this in the science article that we wrote.

12:33

But it turns out this is just a reminder

12:35

that, although this is a lot of fun,

12:37

when you interpret these graphs, you have to be very careful,

12:39

and you have to adopt the base standards in the sciences.
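
One way to guard against the "beft" artifact when reading such a chart is to add the counts of the misread spelling back onto the intended word before comparing centuries. The sketch below is only an illustration of that idea and is not described in the talk: the helper `merge_variants` and the yearly counts are hypothetical.

```python
def merge_variants(counts, canonical, variants):
    """Add the yearly counts of misread variants onto the canonical spelling."""
    merged = dict(counts.get(canonical, {}))
    for variant in variants:
        for year, n in counts.get(variant, {}).items():
            merged[year] = merged.get(year, 0) + n
    return merged

# Hypothetical yearly counts, not values from the real dataset
counts = {
    "best": {1750: 10, 1850: 90},
    "beft": {1750: 70, 1850: 1},
}

combined = merge_variants(counts, "best", ["beft"])
print(combined)  # {1750: 80, 1850: 91}
```

In this toy data the apparent eighteenth-century dip in "best" disappears once the misread form is folded in, which is the kind of check the speakers' caution calls for. A blanket merge like this would be wrong for pairs where both spellings are real words, so the variant list has to be chosen by hand.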

12:42

ELA: People have been using this for all kinds of fun purposes.

12:45

(Laughter)

12:52

Actually, we're not going to have to talk,

12:54

we're just going to show you all the slides and remain silent.

12:57

This person was interested in the history of frustration.

13:00

There's various types of frustration.

13:03

If you stub your toe, that's a one A "argh."

13:06

If the planet Earth is annihilated by the Vogons

13:08

to make room for an interstellar bypass,

13:10

that's an eight A "aaaaaaaargh."

13:12

This person studies all the "arghs,"

13:14

from one through eight A's.

13:16

And it turns out

13:18

that the less-frequent "arghs"

13:20

are, of course, the ones that correspond to things that are more frustrating --

13:23

except, oddly, in the early 80s.

13:26

We think that might have something to do with Reagan.

13:28

(Laughter)

13:30

JM: There are many usages of this data,

13:33

but the bottom line is that the historical record is being digitized.

13:36

Google has started to digitize 15 million books.

13:38

That's 12 percent of all the books that have ever been published.

13:40

It's a sizable chunk of human culture.

13:43

There's much more in culture: there's manuscripts, there are newspapers,

13:46

there's things that are not text, like art and paintings.

13:48

These all happen to be on our computers,

13:50

on computers across the world.

13:52

And when that happens, that will transform the way we have

13:55

to understand our past, our present and human culture.

13:57

Thank you very much.

13:59

(Applause)

Translated by Joyce Chou
Reviewed by Qi Gu


THE ORIGINAL VIDEO ON TED.COM

What we learned from 5 million books | TED Talk | TED.com