Archive for the ‘Open Data’ Category

De-anonymizing open data, just because you can… should you?

Thursday, October 23rd, 2014

If an essential part of the data reveals personally identifiable information (PII), should the data not be released? Should the users of open data be the ones responsible for ensuring proper use of the data?

I mention this question because of an article by an intrepid Gawker reporter who decided he could correlate photos of celebrities in NYC taxis (with visible taxi medallions) with the de-anonymized database of every NYC cab ride in 2013 to determine whether celebrities tipped their cab drivers. Of course, this article is another example of “Celebrities doing normal people things like using taxis”, but the underlying question here is: just because you can violate people’s privacy, does it mean you should?

Identifying celebrities and their cab rides was first done by an intern at Neustar, Anthony Tockar. In his post he recognizes that it is relatively easy to reveal personal information about people. Not only could he match cab rides to a couple of celebrities, but he also showed how you can easily see who frequently visits Hustler’s. Tockar says:

“Now while this information is relatively benign, particularly a year down the line, I have revealed information that was not previously in the public domain.”

He uses these examples to introduce a method of protecting privacy in data called “differential privacy.” Differential privacy basically adds random noise to the results you get from the data, so that when you zoom in you can’t identify specific information about an individual, but you can still get accurate results when you look at the data as a whole. This is best exemplified by the graphic below.

This shows the average speed of cab drivers throughout the day. The top half compares the actual average speed of all drivers with the average speed of all drivers after the data is run through the differential privacy algorithm. The bottom half shows the same for an individual cab driver. Click on the graphic to go to an interactive tool that lets you play around with the privacy parameter, ε.
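To make the idea concrete, here is a minimal sketch of one common way to get differential privacy for an average: the Laplace mechanism. It is not necessarily the algorithm behind Tockar's interactive tool. The function names, the assumed speed bounds of 0 to 60 mph, and the sample speeds are all made up for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noise_scale(n, epsilon, lower, upper):
    """Laplace scale for the mean of n values clipped to [lower, upper]."""
    # Changing one value can move the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return sensitivity / epsilon


def private_mean(values, epsilon, lower, upper, rng=None):
    """Return the mean of `values` plus Laplace noise calibrated to epsilon."""
    if not values:
        raise ValueError("need at least one value")
    rng = rng or random.Random()
    # Clip so that the assumed bounds really do hold for every value.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    scale = noise_scale(len(clipped), epsilon, lower, upper)
    return true_mean + laplace_noise(scale, rng)


if __name__ == "__main__":
    rng = random.Random(2013)
    # Made-up speeds in mph, NOT taken from the taxi data.
    all_drivers = [rng.uniform(5, 30) for _ in range(10_000)]
    one_driver = all_drivers[:20]
    for label, speeds in (("all drivers", all_drivers), ("one driver", one_driver)):
        print(label, private_mean(speeds, epsilon=0.1, lower=0, upper=60, rng=rng))
```

With ε = 0.1 and these assumed bounds, the noise scale is 0.06 mph for 10,000 values but 30 mph for 20 values, which is why the aggregate stays accurate while a single driver's average is swamped. A smaller ε means more noise and more privacy.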

But we’re still struggling with getting data off PDFs or, worse, out of filing cabinets. It’ll take years before we can create such privacy mechanisms for current open data! What to do in the meantime? It would seem that Gawker stopped reading after “Bradley Cooper left no tip” (actually, we don’t know, since tips are not recorded if paid in cash). Just because someone could look up ten celebrities’ cab rides, does it mean they should have? The reporter even quotes Tockar’s line about “revealing information not previously in the public domain”. The irony seems to have been lost on Gawker. I’m of the opinion that Gawker should no more have published an article about celebrities’ cab rides than it should publish their phone numbers if they were available inside a phone book. The exception would be if it was trying to make a point about privacy and open data, which would’ve made for a great conversation piece. Except it wasn’t, since it was all about tipping. They even reached out to publicists for comments on the tipping.

Ultimately, who cares about Bradley Cooper taking a taxi? But when you go “hey, let’s see how many celebrities I can ID from this data” and write an article about it without questioning the privacy implications, you’re basically saying “Yes, because you can, it means you should.”

UPDATE: OK, so apparently there is a reason it’s called “Gawker”. See this example where this same author tries to out a Fox News reporter. Today I learned.

Mexican Gov’t Tries to Buy $10M App, Coders Respond by Building It For Free

Thursday, April 4th, 2013

A few weeks ago, Grupo Reforma reported* that the Mexican Chamber of Deputies had signed a contract with an external consulting firm, Pulso Legislativo, to develop a mobile application that would allow Representatives to monitor and publish up-to-date legislative information from their mobile devices, for the outrageous sum of about $10 million (115 million pesos). Making matters worse, the Chamber of Deputies wanted to develop this app even though it already has four main agencies that generate the app’s information, five research centers, and three offices in charge of documentation.

How did the members of the Mexican tech community respond? They created a week-long hackathon and got coders to build an open-source version of the app, for free. The group Codeando México responded to this ludicrous news by setting up the #app115 challenge, for which over 160 coders signed up. Tomorrow, Codeando México will present five app submissions at the Legislative Palace of San Lázaro, the same building where the Deputies hold their sessions. For information on tomorrow’s event click here.

Other than ridiculing the Chamber of Deputies, which thinks it can get away with trying to buy a $10 million app (not sure if that was actually their intention), Codeando México is trying to highlight the importance of civic participation. It is unfortunate that the government has not yet realized the importance of engaging its citizenry, an effort which might help bridge the gap between the citizens and their representatives (and potentially save a lot of money). Hopefully initiatives like Codeando México will gather more attention in the near future. I would love to see more coders getting together on a Saturday night and coding “over some tequilas”.

*Linked to a different article; Grupo Reforma has a paywall.

Update: To read more about this, see Eric Tecayehuatl’s coverage on Gizmodo (in Spanish).

De-anonymizing Stop and Frisk Data

Wednesday, January 2nd, 2013

I started with the premise that 87% of Americans are uniquely identifiable by knowing their date of birth, zip code, and gender. The Stop and Frisk (SNF) data gives you date of birth, precinct, gender, race, height, weight, eye color, hair color, and build. The original SNF data set contains 685,724 stops for 2011. However, out of those stops, only 2/3 had valid dates of birth. By ‘valid’ I mean ages between 0 and 112 (around 275,000 stops were of people born on Dec. 31, 1900). Since date of birth is crucial to de-anonymization, I excluded the stops without a valid date of birth from the analysis. My numbers will therefore differ from the NYCLU’s report, since they did include these entries.

After cleaning the data a bit, I chose to use only D.O.B., gender, race, precinct, and height to de-anonymize the data. I did not choose the rest of the descriptors because the police officer conducting the stop might not always enter the same information for the same person. First, in only 55% of stops did the suspect provide a photo ID which could provide accurate details of their weight, hair color, etc. The police officer would have had to guess all of the person’s information correctly every time for the other 45% of stops. Second, people’s weight, build, and hair color can change over the year or may not be easily identifiable at night. Lastly, I realize that height also changes, especially in people below 20 years, but I wanted to play it a little safe somehow. I thought height would be easy for a police officer to guess correctly, so I kept height. Using these descriptors, I found 364,706 unique individuals, 22,649 of whom were stopped more than once. Here are the top 20 people stopped in 2011, and the number of times they were stopped.

[Image: the top 20 people stopped in 2011 and the number of times each was stopped]

The string of numbers is the person’s “name”. From left to right, the numbers mean precinct, gender, race, DOB, and height. You’ll notice that 18/20 precincts are precincts 60, 61, and 101 (Coney Island, Gravesend, and Far Rockaway). I’ll write more on these later, but first some numbers on people stopped more than once.

  • 6.2% of people stopped were stopped more than once (22,476 out of 364,706).
  • 60.7% of people stopped more than once were black.
  • 29.0% were Hispanic.
  • 7.9% were white.
  • 2.4% were others (Asians, Pacific Islanders, Native Americans, Others).
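The matching step described above boils down to building one composite key per stop and counting how often each key appears. Here is a minimal sketch of that idea, not the exact code used for this analysis. The field names (precinct, sex, race, dob, height) and the sample records are hypothetical and do not come from the real data file.

```python
from collections import Counter

# Hypothetical column names, in the same left-to-right order as the
# "name" string: precinct, gender, race, DOB, height.
KEY_FIELDS = ("precinct", "sex", "race", "dob", "height")


def person_key(stop):
    """Join the chosen descriptors into one string that stands in for a person."""
    return "-".join(str(stop[field]) for field in KEY_FIELDS)


def count_stops_per_person(stops):
    """Count how many stops share each composite key."""
    return Counter(person_key(stop) for stop in stops)


def repeat_share(counts):
    """Fraction of distinct keys that appear in more than one stop."""
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)


if __name__ == "__main__":
    # Invented records for illustration only.
    stops = [
        {"precinct": 101, "sex": "M", "race": "B", "dob": "1990-01-15", "height": 70},
        {"precinct": 101, "sex": "M", "race": "B", "dob": "1990-01-15", "height": 70},
        {"precinct": 61, "sex": "F", "race": "W", "dob": "1985-06-02", "height": 64},
    ]
    counts = count_stops_per_person(stops)
    print(counts.most_common(20))
    print(repeat_share(counts))
```

Because the match is exact, any inconsistency in how an officer records one of the fields splits a single person into two keys, which is one reason to keep the key short.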

Going back to precincts 60, 61, and 101: after I first noticed the overwhelming presence of these three precincts in the top 20 list, I mapped out all the people who had been stopped more than once and got a map with points pretty much all over New York City. Note that each dot represents a person, not a stop. The position of the person is the average position of all the person’s stops.
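Placing one dot per person means averaging the coordinates of that person's stops. A minimal sketch, assuming each stop record carries a person key plus x and y coordinates (the field names person, x, and y are hypothetical):

```python
from collections import defaultdict


def average_positions(stops):
    """Map each person key to the mean (x, y) of that person's stops."""
    # For every person keep a running [sum of x, sum of y, number of stops].
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for stop in stops:
        acc = sums[stop["person"]]
        acc[0] += stop["x"]
        acc[1] += stop["y"]
        acc[2] += 1
    return {person: (sx / n, sy / n) for person, (sx, sy, n) in sums.items()}
```

A person stopped in two distant places ends up drawn at a midpoint where no stop actually happened, so the dots are best read as rough centers of activity.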

Then I mapped out everyone stopped more than 5 times.
People Stopped More Than 5 Times

Out of the 22,000+ people stopped more than once, there were 340 who were stopped more than 5 times. Here is a table of the top 5 precincts with people stopped more than 5 times.

Area/Neighborhood   Precinct   # People Stopped > 5 Times
Far Rockaway        101        83
Sheepshead Bay      61         58
East New York       75         24
Williamsburg        90         23
Coney Island        60         21

These top 5 precincts contain 61% of the people stopped more than 5 times. It would be interesting to find out what is going on there, but there doesn’t seem to be an evident explanation. So far I have not found a common characteristic that these precincts share. Here are some facts about the 340 people that have been stopped more than 5 times:

  • Precinct 101 accounts for 1/4 of the 340 people stopped more than 5 times.
  • One woman was stopped 14 times in Precinct 61 (Sheepshead Bay)
  • 72% of people stopped > 5 times were African American/Black
  • 13% were Hispanic
  • 15% were White
  • The average age is 24.7 years (max 56, min 16)
  • These 340 people make up 2,686 stops.
  • 71% of those stops included a frisk.
  • Less than 5% (124) of stops led to an arrest.
  • Only 7 of those arrests were because of the criminal possession of a weapon (0.26%).
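The shares quoted above can be recomputed from the counts given in this post. A quick check, using only those counts (the helper name "share" is just for illustration):

```python
def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# People stopped more than 5 times in the five precincts in the table.
top5_people = 83 + 58 + 24 + 23 + 21

print(round(share(top5_people, 340)))   # top 5 precincts, out of 340 people
print(round(share(83, 340), 1))         # Precinct 101 alone
print(round(share(124, 2686), 1))       # stops that led to an arrest
print(round(share(7, 2686), 2))         # arrests for weapon possession
```

The results line up with the figures in the text: about 61% for the five precincts, roughly a quarter for Precinct 101, under 5% for arrests, and 0.26% for weapon possession.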

I don’t expect these numbers to be an accurate representation of all multiple stops in NYC. However, I do think that 1) they reveal a pattern, and 2) these numbers are a best-case scenario, and in fact, I think the real numbers are way worse. After all, we know that there is at least one person who was stopped more than 60 times before he turned 18.

If you would like to read about the Top 10 most stopped individuals in New York City, check out the comic book I made for my Data Rep class.