Posts Tagged ‘data’

De-anonymizing open data, just because you can… should you?

Thursday, October 23rd, 2014

If an essential part of the data reveals personally identifiable information (PII), should the data not be released? Should the users of open data be the ones responsible for ensuring proper use of the data?

I mention this question because of an article by an intrepid Gawker reporter who decided he could correlate photos of celebrities in NYC taxis (with visible taxi medallions) with the de-anonymized database of every NYC cab ride in 2013 to determine whether celebrities tipped their cab drivers. Of course, this article is another example of “Celebrities doing normal people things like using taxis”, but the underlying question here is: just because you can violate people’s privacy, does it mean you should?

Identifying celebrities and their cab rides was first done by an intern at Neustar, Anthony Tockar. In his post he recognizes that it is relatively easy to reveal personal information about people. Not only could he match cab rides to a couple of celebrities, but he also showed how you can easily see who frequently visits Hustler’s. Tockar says:

Now while this information is relatively benign, particularly a year down the line, I have revealed information that was not previously in the public domain.

He uses these examples to introduce a method of privatizing data called “differential privacy.” Differential privacy basically adds noise to the data when you zoom in on it so you can’t identify specific information about an individual, but you can still get accurate results when you look at the data as a whole. This is best exemplified by the graphic below.

This shows the average speed of cab drivers throughout the day. The top half shows the actual average speed of all drivers alongside the average speed of all drivers after the data is run through the differential privacy algorithm. The bottom half shows the same for an individual cab driver. Click on the graphic to go to an interactive tool that lets you play around with the privacy parameter, ε.
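To make the "noise when you zoom in" idea concrete, here is a minimal sketch of one common way to get that behavior, the Laplace mechanism applied to an average. This is not necessarily what Tockar's tool does; the function names, the speed bounds (0 to 60 mph), and the sample data are all assumptions made for illustration. The point is that the noise scale shrinks as the group gets bigger, so a citywide average stays accurate while a single driver's average is drowned out.

```python
import random

def laplace_noise(scale, rng=random):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def laplace_scale(n, lo, hi, epsilon):
    # Changing one record (clipped to [lo, hi]) moves the mean of n records
    # by at most (hi - lo) / n; dividing by epsilon gives the noise scale.
    return (hi - lo) / (n * epsilon)

def private_mean(values, lo, hi, epsilon, rng=random):
    # Clip each value so the bound used in laplace_scale really holds.
    clipped = [min(max(v, lo), hi) for v in values]
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(laplace_scale(len(clipped), lo, hi, epsilon), rng)

if __name__ == "__main__":
    rng = random.Random(2014)
    # Hypothetical speeds in mph; not taken from the taxi data set.
    all_drivers = [rng.uniform(5, 30) for _ in range(10_000)]
    one_driver = all_drivers[:1]
    epsilon = 0.1  # assumed privacy parameter; smaller means more noise
    print("all drivers:", private_mean(all_drivers, 0, 60, epsilon, rng))
    print("one driver: ", private_mean(one_driver, 0, 60, epsilon, rng))
```

With these assumed numbers the noise scale is 0.06 mph for the 10,000-driver average and 600 mph for the single driver, which is the same contrast the two halves of the graphic illustrate.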

But we’re still struggling with getting data off PDFs or, worse, out of filing cabinets. It’ll take years before we can build such privacy mechanisms into current open data! What to do in the meantime? It would seem that Gawker stopped reading after “Bradley Cooper left no tip” (actually, we don’t know whether he did, since tips paid in cash are not recorded). Just because someone could look up ten celebrities’ cab rides, does it mean they should have? The reporter even quotes Tockar’s line about revealing “information that was not previously in the public domain”. The irony seems to have been lost on Gawker. I’m of the opinion that Gawker should no more have published an article about celebrities’ cab rides than it should publish their phone numbers if they happened to be listed in a phone book. Had it been trying to make a point about privacy and open data, the article would’ve made for a great conversation piece. Except it wasn’t, since it was all about tipping. They even reached out to publicists for comments on the tipping.

Ultimately, who cares about Bradley Cooper taking a taxi? But when you go “hey, let’s see how many celebrities I can ID from this data” and write an article about it without questioning the privacy implications, you’re basically saying “Yes, because you can, it means you should.”

UPDATE: ok, so apparently there is a reason it’s called “Gawker”. See this example where this same author tries to out a Fox News reporter. Today I learned.

Placemeter pays YOU for your data…

Wednesday, October 8th, 2014
Note: I have set a new goal to post at least once a week, even if the posts are short.

Turns out you may have some data to offer that is actually more valuable than just your online shopping patterns: the view outside your window. Placemeter is a relatively new startup that pays New Yorkers up to $50 to place their phones against their windows and record movements on the street below. Using nifty computer vision algorithms, Placemeter extracts data from the images recorded by your phone. The short video below gives a sense of what they are trying to track.

The front page immediately addresses the issue of privacy. The company will not use the data to record anything that goes on inside your home, they will not use the data to identify people on the street, and the video they record isn’t stored. They only store raw data extracted from the video.

Their business model is simple: they pay you a little bit per month to record information which they will later sell to third parties. You provide the product they later sell (hey, at least they pay you for it). Since their goal is to sell data to businesses and city governments, they are mostly interested in views of restaurants, shops, or bars. This means lots of people like me can’t participate (I have a very lovely view of a wall). This got me thinking about who else can and can’t participate. If you happen to live in (and have a view of) Times Square, your view could be worth dozens of dollars! What about a view from a quiet Staten Island street? Or from the Bronx? Basically, in order to participate you just have to live in the right place. A place that is probably expensive too.

One redditor applied to sell his/her view and was rejected because the street wasn’t busy enough, but was told that he/she would be considered when the company started “sending out unpaid meters”. I imagine this means the company would mail you a sensor for free and you would record data for them. If this happens I can see them shifting the rhetoric towards “help us analyse and improve your urban environment”, which this article already does.

Since the most valuable views belong to a select group of New Yorkers, much of Placemeter’s most valuable data might end up coming from the video feeds that are already freely available around the city (they should fill out the survey for the OD500).

What One Database Marketing Company Knows About Me

Sunday, September 8th, 2013

It’s no surprise that marketing companies gather data about you to sell off to advertisers who then deliver targeted ads via mail, email, or while you surf the internet. Sometimes it’s even creepy how much they know about you. So far, it’s been a bit of a mystery finding out exactly how much of your information these companies have. A few days ago one marketing technology company, Acxiom, launched a new service called AboutTheData.com which allows people to take a peek into how much information the company has gathered on them.  Acxiom is no small marketing company. According to the NYTimes, it has created the world’s largest commercial database on consumers. I decided to give the service a try to see just how much data this company had about me.

Since this is such a large company, and I’m such an active internet user, I expected to find Acxiom to have gathered a lot of information about me. I was slightly disappointed–or relieved–when I found out that they didn’t have that much information on me at all (honestly, I don’t know how I should feel about this). Before going into the data, here is a little more information about where this data comes from and what we are shown.

According to Acxiom, this data is collected from:

  • Government records, public records and publicly available data – like data from telephone directories, website directories and postings, property and assessor files, and government issued licenses
  • Data from surveys and questionnaires consumers fill out
  • General data from other commercial entities where consumers have received notice of how data about them will be used, and offered a choice about whether or not to allow those uses – like demographic data

The data they show us is their “core data”. This data is used to generate the modeled insights and analytics used for marketing, which they do not show. Acxiom says that we are shown all of their core data. They make no mention of whether they hold other data that is neither core data nor modeled insights.

The site allows you to view data from six categories. Below is the information that has been gathered on me. Economic and Shopping data covers the past 24 months.

Characteristic Data: Male, Hispanic, inferred single
Home Data: No data.
Vehicle Data: No data.
Economic Data: Regular credit card holder (as opposed to Premium/Gold), Regular Visa, 2 cash purchases (includes checks), 1 Visa purchase.
Shopping Data: $139 spent on 3 purchases (the ones referred to above?), 2 offline totalling $100, average $50 each (one purchase < $50, the other >$50, so I guess it’s a coincidence they add up to $100), 1 online for $39. My supposed interests include books, magazine, Christmas gift purchase, ethnic products (??), lifestyles, interests, and passions.
Household Interests Data: No data.

It makes sense that there is not a lot of information about my home data or vehicle data, since I currently own neither (although there was no info on my previous vehicle ownership). Perhaps car owners and homeowners would have these sections filled out entirely. The household interests category is meant to include data related to the interests of me or people in my household (examples given on the site include gardening, traveling, and sports). I’m not so surprised this is also empty, but I’m not sure why they guess that my shopping interests include ethnic products and yet are not able to guess that I enjoy traveling. As for Characteristic Data? My Twitter feed should be enough to reveal that I’m a single Hispanic male. Since you have to provide your name, email, address, and the last 4 digits of your SSN, it’s pretty safe to assume that they also have this information.

**To skip Luis’ short history of shopping, jump to the next paragraph.
Economic and Shopping Data provide a few more hints as to where the data are coming from. First of all, they only have three purchases. That’s it. Out of the 3,100 card/check purchases I’ve made over the past 24 months, they have 3. I tried looking on my Mint for two offline purchases which add up to $100, but this proved to be a very difficult exercise. Even after filtering offline purchases and sorting the data, there were too many possible combinations. For now, those two offline purchases remain a mystery. I was able to find a suspect for the online payment of $39. The most suspicious purchase was a $39 seat upgrade at United Airlines. I can’t be sure this is the one, since along with the $39 upgrade I bought a plane ticket which does not show up in my AboutTheData. However, my suspicion arises from the fact that Mint had prepared a targeted ad for me by placing a flashy green dollar sign next to the purchase. This also could’ve been a coincidence.
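For the curious, the hunt for the two offline purchases boils down to a pair search, which can be sketched in a few lines. This is only an illustration of the exercise, not what I actually ran against Mint: the function name and the amounts are made up, and it assumes the $100 total is exact (Acxiom may well be rounding). Amounts are in cents to avoid floating-point surprises.

```python
from itertools import combinations

def candidate_pairs(amounts_cents, total_cents=10_000):
    """Return pairs of purchases that sum to the total, one below and one above half."""
    pairs = []
    for a, b in combinations(sorted(amounts_cents), 2):
        # After sorting, a <= b; requiring a < b rules out two equal halves,
        # matching "one purchase < $50, the other > $50".
        if a + b == total_cents and a < b:
            pairs.append((a, b))
    return pairs

if __name__ == "__main__":
    # Hypothetical offline purchases in cents; not real Mint data.
    purchases = [2500, 7500, 4000, 6000, 5000, 5000, 1234]
    for a, b in candidate_pairs(purchases):
        print(f"${a / 100:.2f} + ${b / 100:.2f}")
```

Even this tiny made-up list yields more than one candidate pair, so with a couple of thousand real purchases it is no surprise that nothing stood out.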

Conclusions/Best Guesses
Given that I spend A LOT of time on the internet and the large number of purchases I’ve made over the years (I should cut down on those), I am surprised that Acxiom does not have more data about me. Basically, they know I’m a single Hispanic male, and that’s about it. I can’t possibly imagine what they could gather from the rest of my data that’s worth $$$ to advertisers. Additionally, it seems a lot of their data comes from publicly available government data sets (home and car ownership) and, at least in my case, not a lot of it comes from either my online habits or my shopping habits. I presume most of my important data is owned by Facebook and Google, and I’m pretty confident that they do not sell/share my data with Acxiom.

Last thought: AboutTheData lets you edit your data so that you can receive more accurate targeted advertising. I’m curious to know who uses Acxiom data to target me, so I would’ve loved to enter distinctive preferences that do not apply to me (yet) such as “pregnancy”, “colonoscopies”, “underwater basket weaving”, or “Cook Islands National Women’s Football League” to see where these ads pop up. Unfortunately, AboutTheData only lets you change the above-mentioned interests to ‘true’ or ‘false’. I guess they thought about the trolls.