ABOUT THE SPEAKER
Ben Wellington - Data scientist
Ben Wellington blends his love of statistics, the city, and comedy in his entertaining analysis of the story of New York City, told through data.

Why you should listen

Ben Wellington runs the I Quant NY blog, in which he crunches city-released data to find out what's really going on in the Big Apple. To date he has tackled topics such as measles outbreaks in New York City schools, analyzed how companies like Airbnb are really doing in NYC, and asked questions such as "does gentrification cause a reduction in laundromats?" (Answer: inconclusive.)

Ben is a visiting assistant professor in the City & Regional Planning program at the Pratt Institute in Brooklyn; his day job involves working as a quantitative analyst at the investment management firm, Two Sigma. A budding comedian and performer, he also teaches team building workshops through Cherub Improv, a non-profit that uses improv comedy for social good.

TEDxNewYork

Ben Wellington: How we found the worst place to park in New York City -- using big data

1,055,247 views

City agencies have access to a wealth of data and statistics reflecting every part of urban life. But as data analyst Ben Wellington suggests in this entertaining talk, sometimes they just don't know what to do with it. He shows how a combination of unexpected questions and smart data crunching can produce strangely useful insights, and shares tips on how to release large sets of data so that anyone can use them.

00:12
Six thousand miles of road,
00:15
600 miles of subway track,
00:17
400 miles of bike lanes
00:19
and a half a mile of tram track,
00:21
if you've ever been to Roosevelt Island.
00:23
These are the numbers that make up the infrastructure of New York City.
00:26
These are the statistics of our infrastructure.
00:29
They're the kind of numbers you can find released in reports by city agencies.
00:32
For example, the Department of Transportation will probably tell you
00:36
how many miles of road they maintain.
00:37
The MTA will boast how many miles of subway track there are.
00:40
Most city agencies give us statistics.
00:42
This is from a report this year
00:43
from the Taxi and Limousine Commission,
00:45
where we learn that there's about 13,500 taxis here in New York City.
00:49
Pretty interesting, right?
00:50
But did you ever think about where these numbers came from?
00:53
Because for these numbers to exist, someone at the city agency
00:56
had to stop and say, hmm, here's a number that somebody might want to know.
00:59
Here's a number that our citizens want to know.
01:02
So they go back to their raw data,
01:04
they count, they add, they calculate,
01:05
and then they put out reports,
01:07
and those reports will have numbers like this.
01:09
The problem is, how do they know all of our questions?
01:11
We have lots of questions.
01:13
In fact, in some ways there's literally an infinite number of questions
01:16
that we can ask about our city.
01:18
The agencies can never keep up.
01:19
So the paradigm isn't exactly working, and I think our policymakers realize that,
01:23
because in 2012, Mayor Bloomberg signed into law what he called
01:27
the most ambitious and comprehensive open data legislation in the country.
01:31
In a lot of ways, he's right.
01:33
In the last two years, the city has released 1,000 datasets
01:35
on our open data portal,
01:37
and it's pretty awesome.
01:39
So you go and look at data like this,
01:41
and instead of just counting the number of cabs,
01:43
we can start to ask different questions.
01:45
So I had a question.
01:46
When's rush hour in New York City?
01:48
It can be pretty bothersome. When is rush hour exactly?
01:51
And I thought to myself, these cabs aren't just numbers,
01:53
these are GPS recorders driving around in our city streets
01:56
recording each and every ride they take.
01:58
There's data there, and I looked at that data,
02:00
and I made a plot of the average speed of taxis in New York City throughout the day.
02:04
You can see that from about midnight to around 5:18 in the morning,
02:07
speed increases, and at that point, things turn around,
02:11
and they get slower and slower and slower until about 8:35 in the morning,
02:15
when they end up at around 11 and a half miles per hour.
02:18
The average taxi is going 11 and a half miles per hour on our city streets,
02:21
and it turns out it stays that way
02:23
for the entire day.
02:27
(Laughter)
02:28
So I said to myself, I guess there's no rush hour in New York City.
02:31
There's just a rush day.
02:33
Makes sense. And this is important for a couple of reasons.
02:36
If you're a transportation planner, this might be pretty interesting to know.
02:39
But if you want to get somewhere quickly,
02:41
you now know to set your alarm for 4:45 in the morning and you're all set.
02:45
New York, right?
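A minimal sketch of how an average-speed-by-hour figure like the one in that plot could be computed. The talk does not say how the average was taken, so this assumes total miles divided by total hours on the road for trips starting in each hour; the record layout and the sample trips are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical trip records: (pickup time, distance in miles, duration in seconds).
trips = [
    ("2013-01-07 04:50:00", 3.0, 540),
    ("2013-01-07 04:55:00", 2.0, 360),
    ("2013-01-07 08:30:00", 2.3, 720),
    ("2013-01-07 08:40:00", 1.15, 360),
]

def average_speed_by_hour(records):
    """Return {pickup hour: mph}, computed as total miles over total hours driven."""
    miles = defaultdict(float)
    seconds = defaultdict(float)
    for pickup, distance, duration in records:
        if duration <= 0:
            continue  # skip records that cannot yield a speed
        hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
        miles[hour] += distance
        seconds[hour] += duration
    return {hour: miles[hour] / (seconds[hour] / 3600.0) for hour in sorted(miles)}

speeds = average_speed_by_hour(trips)
for hour, mph in speeds.items():
    print(f"{hour:02d}:00  {mph:.1f} mph")
```

Dividing summed miles by summed time weights long trips more heavily than averaging each trip's speed would; either choice is defensible, and the plot in the talk may have used a different one.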
02:46
But there's a story behind this data.
02:47
This data wasn't just available, it turns out.
02:50
It actually came from something called a Freedom of Information Law Request,
02:53
or a FOIL Request.
02:54
This is a form you can find on the Taxi and Limousine Commission website.
02:58
In order to access this data, you need to go get this form,
03:01
fill it out, and they will notify you,
03:02
and a guy named Chris Whong did exactly that.
03:05
Chris went down, and they told him,
03:06
"Just bring a brand new hard drive down to our office,
03:09
leave it here for five hours, we'll copy the data and you take it back."
03:13
And that's where this data came from.
03:15
Now, Chris is the kind of guy who wants to make the data public,
03:18
and so it ended up online for all to use, and that's where this graph came from.
03:22
And the fact that it exists is amazing. These GPS recorders -- really cool.
03:25
But the fact that we have citizens walking around with hard drives
03:28
picking up data from city agencies to make it public --
03:31
it was already kind of public, you could get to it,
03:33
but it was "public," it wasn't public.
03:35
And we can do better than that as a city.
03:37
We don't need our citizens walking around with hard drives.
03:40
Now, not every dataset is behind a FOIL Request.
03:42
Here is a map I made with the most dangerous intersections in New York City
03:46
based on cyclist accidents.
03:48
So the red areas are more dangerous.
03:50
And what it shows is first the East Side of Manhattan,
03:52
especially in the lower area of Manhattan, has more cyclist accidents.
03:56
That might make sense
03:57
because there are more cyclists coming off the bridges there.
04:00
But there's other hotspots worth studying.
04:02
There's Williamsburg. There's Roosevelt Avenue in Queens.
04:04
And this is exactly the kind of data we need for Vision Zero.
04:07
This is exactly what we're looking for.
04:09
But there's a story behind this data as well.
04:11
This data didn't just appear.
04:13
How many of you guys know this logo?
04:16
Yeah, I see some shakes.
04:17
Have you ever tried to copy and paste data out of a PDF
04:20
and make sense of it?
04:21
I see more shakes.
04:22
More of you tried copying and pasting than knew the logo. I like that.
04:26
So what happened is, the data that you just saw was actually on a PDF.
04:29
In fact, hundreds and hundreds and hundreds of pages of PDF
04:32
put out by our very own NYPD,
04:34
and in order to access it, you would either have to copy and paste
04:38
for hundreds and hundreds of hours,
04:39
or you could be John Krauss.
04:41
John Krauss was like,
04:42
I'm not going to copy and paste this data. I'm going to write a program.
04:45
It's called the NYPD Crash Data Band-Aid,
04:47
and it goes to the NYPD's website and it would download PDFs.
04:50
Every day it would search; if it found a PDF, it would download it
04:54
and then it would run some PDF-scraping program,
04:56
and out would come the text,
04:57
and it would go on the Internet, and then people could make maps like that.
05:01
And the fact that the data's here, the fact that we have access to it --
05:04
Every accident, by the way, is a row in this table.
05:07
You can imagine how many PDFs that is.
05:08
The fact that we have access to that is great,
05:11
but let's not release it in PDF form,
05:13
because then we're having our citizens write PDF scrapers.
05:15
It's not the best use of our citizens' time,
05:18
and we as a city can do better than that.
05:20
Now, the good news is that the de Blasio administration
05:22
actually recently released this data a few months ago,
05:25
and so now we can actually have access to it,
05:27
but there's a lot of data still entombed in PDF.
05:29
For example, our crime data is still only available in PDF.
05:33
And not just our crime data, our own city budget.
05:36
Our city budget is only readable right now in PDF form.
05:40
And it's not just us that can't analyze it --
05:42
our own legislators who vote for the budget
05:45
also only get it in PDF.
05:47
So our legislators cannot analyze the budget that they are voting for.
05:51
And I think as a city we can do a little better than that as well.
05:55
Now, there's a lot of data that's not hidden in PDFs.
05:57
This is an example of a map I made,
05:59
and this is the dirtiest waterways in New York City.
06:02
Now, how do I measure dirty?
06:03
Well, it's kind of a little weird,
06:05
but I looked at the level of fecal coliform,
06:07
which is a measurement of fecal matter in each of our waterways.
06:11
The larger the circle, the dirtier the water,
06:14
so the large circles are dirty water, the small circles are cleaner.
06:17
What you see is inland waterways.
06:19
This is all data that was sampled by the city over the last five years.
06:22
And inland waterways are, in general, dirtier.
06:25
That makes sense, right?
06:26
And the bigger circles are dirty. And I learned a few things from this.
06:30
Number one: Never swim in anything that ends in "creek" or "canal."
06:33
But number two: I also found the dirtiest waterway in New York City,
06:37
by this measure, one measure.
06:39
In Coney Island Creek, which is not the Coney Island you swim in, luckily.
06:43
It's on the other side.
06:44
But Coney Island Creek, 94 percent of samples taken over the last five years
06:48
have had fecal levels so high
06:50
that it would be against state law to swim in the water.
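A rough sketch of the "percent of samples over the limit" measure described above. The site names, readings, and the `SWIM_LIMIT` value are all placeholders; the talk does not give the actual state threshold, so substitute the real figure and units before drawing any conclusion.

```python
# Hypothetical fecal coliform readings per sampling site (values made up).
samples = {
    "Site A": [150, 900, 2400, 3100, 5000],
    "Site B": [20, 35, 400, 60, 15],
}

# Placeholder limit, NOT the actual state standard.
SWIM_LIMIT = 200

def share_over_limit(readings, limit):
    """Fraction of readings strictly above the limit (0.0 for an empty list)."""
    if not readings:
        return 0.0
    return sum(1 for reading in readings if reading > limit) / len(readings)

# Rank sites from the largest share of over-limit samples to the smallest.
ranking = sorted(
    ((site, share_over_limit(values, SWIM_LIMIT)) for site, values in samples.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for site, share in ranking:
    print(f"{site}: {share:.0%} of samples over the limit")
```

Ranking by the share of samples over a limit is only one way to define "dirtiest", which matches the speaker's own caveat that this is "one measure".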
06:53
And this is not the kind of fact that you're going to see
06:56
boasted in a city report, right?
06:57
It's not going to be the front page on nyc.gov.
06:59
You're not going to see it there,
07:01
but the fact that we can get to that data is awesome.
07:04
But once again, it wasn't super easy,
07:05
because this data was not on the open data portal.
07:08
If you were to go to the open data portal,
07:10
you'd see just a snippet of it, a year or a few months.
07:12
It was actually on the Department of Environmental Protection's website.
07:16
And each one of these links is an Excel sheet, and each Excel sheet is different.
07:20
Every heading is different: you copy, paste, reorganize.
07:22
When you do you can make maps and that's great, but once again,
07:25
we can do better than that as a city, we can normalize things.
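To illustrate the copy, paste, reorganize step, here is a small sketch that merges two exports whose headings differ. It assumes the sheets have been saved as CSV text; the column names and the `CANONICAL` mapping are invented for the example and are not the actual headings on the agency's site.

```python
import csv
import io

# Two hypothetical exports of the same kind of data with different headings.
sheet_2012 = "Site,Sample Date,Fecal Coliform\nCreek 1,2012-06-01,900\n"
sheet_2013 = "Station Name,Date,FC (per 100mL)\nCreek 1,2013-06-03,1200\n"

# Assumed mapping from each heading variant (lowercased) to one canonical name.
CANONICAL = {
    "site": "site",
    "station name": "site",
    "sample date": "date",
    "date": "date",
    "fecal coliform": "fecal_coliform",
    "fc (per 100ml)": "fecal_coliform",
}

def normalize_sheet(text):
    """Read one CSV export and return its rows keyed by canonical column names."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({CANONICAL[key.strip().lower()]: value.strip()
                     for key, value in raw.items()})
    return rows

combined = normalize_sheet(sheet_2012) + normalize_sheet(sheet_2013)
for row in combined:
    print(row)
```

An unknown heading raises a `KeyError` here, which is deliberate: a new variant should be added to the mapping by hand rather than silently dropped.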
07:28
And we're getting there, because there's this website that Socrata makes
07:32
called the Open Data Portal NYC.
07:33
This is where 1,100 datasets that don't suffer
07:35
from the things I just told you live,
07:37
and that number is growing, and that's great.
07:39
You can download data in any format, be it CSV or PDF or Excel document.
07:43
Whatever you want, you can download the data that way.
07:45
The problem is, once you do,
07:47
you will find that each agency codes their addresses differently.
07:50
So one is street name, intersection street,
07:52
street, borough, address, building, building address.
07:55
So once again, you're spending time, even when we have this portal,
07:58
you're spending time normalizing our address fields.
08:01
And that's not the best use of our citizens' time.
08:03
We can do better than that as a city.
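The address problem can be sketched in a few lines. The two "agency" layouts, the field names, the sample address, and the short suffix table below are hypothetical; real agency schemas and a real abbreviation list would be longer and messier.

```python
import re

# Assumed abbreviations to expand; a real table would be much longer.
SUFFIXES = {"st": "STREET", "ave": "AVENUE", "av": "AVENUE",
            "blvd": "BOULEVARD", "rd": "ROAD"}

def normalize_street(name):
    """Drop periods and commas, collapse spaces, expand a trailing suffix, uppercase."""
    words = re.sub(r"[.,]", "", name).split()
    if words and words[-1].lower() in SUFFIXES:
        words[-1] = SUFFIXES[words[-1].lower()]
    return " ".join(word.upper() for word in words)

def from_agency_a(record):
    # Hypothetical agency A keeps building number and street in separate fields.
    return {"house_number": record["building"].strip(),
            "street": normalize_street(record["street_name"]),
            "borough": record["borough"].strip().upper()}

def from_agency_b(record):
    # Hypothetical agency B stores number and street in one address field.
    number, _, street = record["address"].strip().partition(" ")
    return {"house_number": number,
            "street": normalize_street(street),
            "borough": record["boro"].strip().upper()}

a = from_agency_a({"building": "12", "street_name": "Example Ave", "borough": "Brooklyn"})
b = from_agency_b({"address": "12  example ave.", "boro": "brooklyn "})
print(a == b, a)
```

Once both records reduce to the same dictionary, datasets from different agencies can be joined on it, which is the step every analyst currently repeats by hand.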
08:05
We can standardize our addresses,
08:07
and if we do, we can get more maps like this.
08:09
This is a map of fire hydrants in New York City,
08:11
but not just any fire hydrants.
08:13
These are the top 250 grossing fire hydrants in terms of parking tickets.
08:17
(Laughter)
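A ranking like "top grossing hydrants" comes down to summing fines per location and sorting. This sketch assumes each ticket has already been matched to a hydrant, which the talk does not describe; the locations and fine amounts are invented.

```python
from collections import defaultdict

# Hypothetical hydrant-violation tickets: (hydrant location, fine in dollars).
tickets = [
    ("Hydrant 1", 100), ("Hydrant 2", 100), ("Hydrant 1", 100),
    ("Hydrant 3", 100), ("Hydrant 1", 100), ("Hydrant 2", 100),
]

def top_grossing(records, n):
    """Sum fines per location and return the n largest as (location, total) pairs."""
    totals = defaultdict(int)
    for location, fine in records:
        totals[location] += fine
    # Sort by total descending, breaking ties by location name for stable output.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:n]

top = top_grossing(tickets, 2)
for location, total in top:
    print(f"{location}: ${total}")
```

With real data, `n` would be 250 and the hard part would be the matching step assumed away here, which depends on the address fields being consistent.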
08:19
So I learned a few things from this map, and I really like this map.
08:23
Number one, just don't park on the Upper East Side.
08:25
Just don't. It doesn't matter where you park, you will get a hydrant ticket.
08:29
Number two, I found the two highest grossing hydrants in all of New York City,
08:33
and they're on the Lower East Side,
08:35
and they were bringing in over 55,000 dollars a year in parking tickets.
08:40
And that seemed a little strange to me when I noticed it,
08:42
so I did a little digging and it turns out what you had is a hydrant
08:46
and then something called a curb extension,
08:48
which is like a seven-foot space to walk on,
08:50
and then a parking spot.
08:51
And so these cars came along, and the hydrant --
08:53
"It's all the way over there, I'm fine,"
08:55
and there was actually a parking spot painted there beautifully for them.
08:59
They would park there, and the NYPD disagreed with this designation
09:02
and would ticket them.
09:03
And it wasn't just me who found a parking ticket.
09:05
This is the Google Street View car driving by
09:07
finding the same parking ticket.
09:09
So I wrote about this on my blog, on I Quant NY, and the DOT responded,
09:13
and they said,
09:14
"While the DOT has not received any complaints about this location,
09:18
we will review the roadway markings and make any appropriate alterations."
09:22
And I thought to myself, typical government response,
09:25
all right, moved on with my life.
09:27
But then, a few weeks later, something incredible happened.
09:31
They repainted the spot,
09:34
and for a second I thought I saw the future of open data,
09:36
because think about what happened here.
09:38
For five years, this spot was being ticketed, and it was confusing,
09:44
and then a citizen found something, they told the city, and within a few weeks
09:48
the problem was fixed.
09:49
It's amazing. And a lot of people see open data as being a watchdog.
09:52
It's not, it's about being a partner.
09:54
We can empower our citizens to be better partners for government,
09:57
and it's not that hard.
09:59
All we need are a few changes.
10:01
If you're FOILing data,
10:02
if you're seeing your data being FOILed over and over again,
10:05
let's release it to the public, that's a sign that it should be made public.
10:08
And if you're a government agency releasing a PDF,
10:11
let's pass legislation that requires you to post it with the underlying data,
10:14
because that data is coming from somewhere.
10:16
I don't know where, but it's coming from somewhere,
10:19
and you can release it with the PDF.
10:20
And let's adopt and share some open data standards.
10:23
Let's start with our addresses here in New York City.
10:25
Let's just start normalizing our addresses.
10:27
Because New York is a leader in open data.
10:30
Despite all this, we are absolutely a leader in open data,
10:32
and if we start normalizing things, and set an open data standard,
10:35
others will follow. The state will follow, and maybe the federal government.
10:39
Other countries could follow,
10:40
and we're not that far off from a time where you could write one program
10:44
and map information from 100 countries.
10:46
It's not science fiction. We're actually quite close.
10:48
And by the way, who are we empowering with this?
10:51
Because it's not just John Krauss and it's not just Chris Whong.
10:54
There are hundreds of meetups going on in New York City right now,
10:57
active meetups.
10:58
There are thousands of people attending these meetups.
11:00
These people are going after work and on weekends,
11:03
and they're attending these meetups to look at open data
11:05
and make our city a better place.
11:07
Groups like BetaNYC, who just last week released something called citygram.nyc
11:11
that allows you to subscribe to 311 complaints
11:13
around your own home, or around your office.
11:15
You put in your address, you get local complaints.
11:18
And it's not just the tech community that are after these things.
11:21
It's urban planners like the students I teach at Pratt.
11:24
It's policy advocates, it's everyone,
11:25
it's citizens from a diverse set of backgrounds.
11:28
And with some small, incremental changes,
11:31
we can unlock the passion and the ability of our citizens
11:34
to harness open data and make our city even better,
11:37
whether it's one dataset, or one parking spot at a time.
11:41
Thank you.
11:43
(Applause)
