
De-anonymizing Stop and Frisk Data.

Wednesday, January 2nd, 2013

I started with the premise that 87% of Americans are uniquely identifiable by their date of birth, zip code, and gender. The Stop and Frisk (SNF) data gives you date of birth, precinct, gender, race, height, weight, eye color, hair color, and build. The original SNF data set contains 685,724 stops for 2011. However, only about two thirds of those stops had valid dates of birth. By ‘valid’ I mean an age between 0 and 112; around 275,000 stops were of people listed as born on Dec. 31, 1900, evidently a placeholder date. Since date of birth is crucial to de-anonymization, I excluded those records from the analysis. My numbers will therefore differ from the NYCLU’s report, which did include these entries.
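For the curious, the cleaning step looks roughly like this in pandas. This is only a sketch: the column names and the date format are my assumptions, not the exact fields of the released file.

```python
import pandas as pd

# Load the raw 2011 stops. Column names ("dob", "datestop") and the
# MMDDYYYY date format are assumptions about the released file.
stops = pd.read_csv("sqf_2011.csv", dtype=str)
stops["dob"] = pd.to_datetime(stops["dob"], format="%m%d%Y", errors="coerce")
stops["datestop"] = pd.to_datetime(stops["datestop"], format="%m%d%Y", errors="coerce")

# Age at the time of the stop.
age_years = (stops["datestop"] - stops["dob"]).dt.days / 365.25

# Keep plausible ages and drop the Dec. 31, 1900 placeholder DOB.
placeholder = pd.Timestamp("1900-12-31")
valid = stops[age_years.between(0, 112) & (stops["dob"] != placeholder)]
print(f"{len(valid):,} of {len(stops):,} stops have a usable date of birth")
```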

After cleaning the data a bit, I chose to use only D.O.B., gender, race, precinct, and height to de-anonymize the data. I left out the remaining descriptors because the police officer conducting the stop might not always enter the same information for the same person. First, in only 55% of stops did the suspect provide a photo ID that could supply accurate details such as weight and hair color; for the other 45% of stops, the officer would have had to guess all of that information correctly every time. Second, a person’s weight, build, and hair color can change over the year, or may be hard to judge at night. Height changes too, especially in people under 20, but I figured an officer could estimate it reliably, so I kept it to play it a little safe. Using these descriptors, I found 364,706 unique individuals, 22,649 of whom were stopped more than once. Here are the top 20 people stopped in 2011, and the number of times each was stopped.

[Table: the 20 most-stopped individuals of 2011 and their stop counts]

The string of numbers is the person’s “name”. From left to right, the fields are precinct, gender, race, DOB, and height. You’ll notice that 18 of the top 20 were stopped in precincts 60, 61, and 101 (Coney Island, Gravesend, and Far Rockaway). I’ll write more on these later, but first some numbers on people stopped more than once (the breakdown is sketched in code after the list).
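Concretely, the de-anonymization comes down to concatenating those five fields into a key and counting stops per key. A rough sketch, continuing from the cleaned data above (column names again assumed):

```python
# Build the composite "name": precinct, gender, race, DOB, and height
# joined into a single identifier (column names are assumptions).
key_cols = ["pct", "sex", "race", "dob", "height"]
valid = valid.assign(
    person_id=valid[key_cols].astype(str).agg("-".join, axis=1)
)

stops_per_person = valid.groupby("person_id").size()
print(f"{stops_per_person.size:,} unique individuals")
print(f"{(stops_per_person > 1).sum():,} stopped more than once")
# The top-20 table above is the head of this sorted series.
print(stops_per_person.sort_values(ascending=False).head(20))
```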

  • 6.2% of people stopped were stopped more than once (22,476 out of 364,706).
  • 60.7% of people stopped more than once were black.
  • 29.0% were Hispanic.
  • 7.9% were white.
  • 2.4% were other (Asian, Pacific Islander, Native American, other).
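These shares fall out of the same per-person series built above; a minimal sketch:

```python
# Race breakdown among people stopped more than once.
repeat_ids = stops_per_person[stops_per_person > 1].index
persons = valid.drop_duplicates("person_id").set_index("person_id")
race_share = persons.loc[repeat_ids, "race"].value_counts(normalize=True)
print((race_share * 100).round(1))  # % black, Hispanic, white, other
```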

Going back to precincts 60, 61, and 101: after I first noticed the overwhelming presence of these three precincts in the top-20 list, I mapped out all the people who had been stopped more than once and got a map with points pretty much all over New York City. Note that each dot represents a person, not a stop; a person’s position is the average position of all of that person’s stops.
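For the mapping, the per-person dot is just the average of that person’s stop coordinates. Assuming the records carry coordinates already converted to latitude/longitude (the real file would need a projection step first), the gist is:

```python
# One dot per person: the mean position of all of that person's stops.
# "lat"/"lon" are assumed, already-converted numeric coordinate columns.
coords = valid[["person_id", "lat", "lon"]].astype({"lat": float, "lon": float})
positions = coords.groupby("person_id").mean().loc[repeat_ids]
```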

Then I mapped out everyone stopped more than 5 times.
[Map: people stopped more than 5 times]

Out of the 22,000+ people stopped more than once, 340 were stopped more than 5 times. Here is a table of the top 5 precincts by number of people stopped more than 5 times.

Area/Neighborhood    Precinct    # of people stopped > 5 times
Far Rockaways        101         83
Sheepshead Bay       61          58
East New York        75          24
Williamsburg         90          23
Coney Island         60          21

These top 5 precincts contain 61% of the people stopped more than 5 times. It would be interesting to find out what is going on there, but there doesn’t seem to be an evident explanation; so far I have not found a common characteristic that these precincts share. Here are some facts about the 340 people who have been stopped more than 5 times (a sketch of how these figures can be computed follows the list):

  • Precinct 101 accounts for 1/4 of the 340 people stopped more than 5 times.
  • One woman was stopped 14 times in Precinct 61 (Sheepshead Bay).
  • 72% of people stopped > 5 times were African American/Black
  • 13% were Hispanic
  • 15% were White
  • The average age is 24.7 years (max 56, min 16)
  • These 340 people account for 2,686 stops.
  • 71% of those stops included a frisk.
  • Less than 5% of stops (124) led to an arrest.
  • Only 7 of those arrests (0.26% of all 2,686 stops) were for criminal possession of a weapon.
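Here is roughly how those figures could be computed, continuing from the earlier sketches; the two flag columns are assumptions about the file’s yes/no indicator fields:

```python
# Stops made by the people stopped more than 5 times.
heavy_ids = stops_per_person[stops_per_person > 5].index
heavy = valid[valid["person_id"].isin(heavy_ids)]

print(f"{len(heavy_ids)} people account for {len(heavy):,} stops")
# "frisked" and "arstmade" are assumed Y/N flag columns.
print(f"frisked in {(heavy['frisked'] == 'Y').mean():.0%} of stops")
print(f"arrested in {(heavy['arstmade'] == 'Y').mean():.1%} of stops")
```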

I don’t expect these numbers to be an accurate representation of all multiple stops in NYC. However, I do think that 1) they reveal a pattern, and 2) these numbers are a best-case scenario; the real numbers are likely far worse. After all, we know that there is at least one person who was stopped more than 60 times before he turned 18.

If you would like to read about the Top 10 most stopped individuals in New York City, check out the comic book I made for my Data Rep class.

Mexican Drug War Data

Sunday, November 4th, 2012

A project to gather drug war crime data.

The recent publication of stop-and-frisk data by the NYCLU has stirred a lot of controversy, particularly because the data showed strong evidence of discrimination by the NYPD against black and Hispanic New Yorkers. The NYCLU was able to conduct this study because of legislation introduced by the New York City Council requiring the NYPD to provide quarterly reports on stop-and-frisk data. Since then, the NYPD has also kept a computerized database of its stop-and-frisk program. The level of detail and granularity in these reports has made it possible for organizations such as the NYCLU to conduct successful studies of the results of the NYPD’s stop-and-frisk program. As a result, these studies have brought important social issues into the limelight and have demonstrated (as has been demonstrated countless times before) the importance of publicly accessible data.

The stop-and-frisk data’s level of detail is something to be dreamed of in Mexico. Unfortunately, after six years of violence, there is still no publicly accessible data set on the violence in Mexico with stop-and-frisk-level granularity. We think that a database with detailed reports on each incident would lead to a better understanding of what has happened, and is happening, in Mexico. Our project will attempt to create a database of detailed “incidents” that have occurred in Mexico since the start of the drug war, and we will attempt to do this in two ways.

First, populating the database will rely on people visiting the site to input past events: skimming through different news sources and providing a detailed account of what happened using a friendly form on the site. But even with enough participation to go through all the news sources and capture every reported event, there is an inherent problem: not all events get reported in the mainstream media. In fact, the media has been doing such a bad job of reporting the violence in Mexico that people have taken matters into their own hands, acting as citizen reporters to warn others about shootings. Twitter, with its prolific use of hashtags, has become one of the greatest tools for this kind of citizen reporting. Second, from the project launch we plan on keeping a live record of events as they unfold on Twitter, relying on people to confirm events and provide greater detail.

Preliminary Twitter stream analysis indicates that events are reported on Twitter more widely, and much faster, than by any news source in Mexico. Events also seem to be relatively easy to spot. The graph below is a histogram of tweets containing the word “balacera” (shooting) over a period of twelve days. The spikes represent bursts of activity that could indicate a shooting taking place. By detecting these events (essentially by watching the rate of change, ∂tweets/∂t), we could reach out to people and ask them to help validate and provide more information on the shooting.
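A minimal version of that detector, assuming we already have timestamps of matching tweets: bucket them into fixed intervals and flag counts well above a rolling baseline. The interval width, window size, and threshold are guesses to be tuned, not settled choices.

```python
import pandas as pd

def find_spikes(times: pd.Series, freq: str = "15min", k: float = 3.0) -> pd.Series:
    """Return intervals whose tweet count exceeds a rolling baseline + k * std."""
    # Count tweets per fixed-width interval.
    counts = times.dt.floor(freq).value_counts().sort_index()
    # Rolling baseline over the preceding handful of intervals.
    baseline = counts.rolling(window=8, min_periods=4).mean()
    spread = counts.rolling(window=8, min_periods=4).std()
    return counts[counts > baseline + k * spread]

# e.g.: spikes = find_spikes(pd.to_datetime(pd.Series(balacera_timestamps)))
```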

Gathering information live from tweets is not a new idea, and there is by no means an absence of information pertaining to the violence in Mexico. However, the available data contains only what is reported by the government (I, for one, find it hard to believe that in 48 months my hometown of Monterrey had only 297 murders and the border town of Nuevo Laredo only 159), and the smallest level of detail is murders per city per month. One example of a provider of crime data is the Citizen Institute of Studies on Insecurity (ICESI). The site contains several accessible data sets, but the data is organized by state per year. Whatever information they used to come up with their results is not accessible to the public, and attempts to contact them about obtaining more information have been stressful and ultimately futile. UPDATE: They no longer seem to exist.

Several attempts have been made at harnessing Twitter data for live reporting of shootings. Most notable is Retio, a project started over a year ago by a group of engineers in Mérida. While its success rate varies by city, Retio has harnessed the power of citizen reporting quite effectively in major cities like Monterrey, Guadalajara, and Mexico City. Retio relies on users actively tweeting at one of its many Twitter accounts (one account per city); the tweets are then automatically categorized by report type and retweeted from the respective city account. But the site has several shortcomings. First, because the system looks only at individual tweets mentioning certain hashtags and accounts, it does a bad job of eliminating spam. Second, reports contain very little information; Retio seems to simply map events and retweet the incident. Third, users are given no choice about anonymity (we are still debating the benefits of anonymity). Lastly, and perhaps most importantly, the information gathered by Retio is not publicly available in a machine-readable format. Another big citizen-reporting tool is the Centro de Integración Ciudadana (CIC). CIC does not rely on Twitter data, and, like Retio, its reports do not contain much information and are not freely available in a machine-readable format. Whatever the shortcomings of these tools, they prove that the Mexican citizenry is engaged and willing to participate.

The goal of our project is to create an easily accessible database that will (hopefully) provide better information than what is currently out there. We hope to gain support by actively reaching out to engaged citizens via Twitter and asking for their help. The idea is still very much in its early stages, and it might seem like we’re reaching for low-hanging fruit, but we’re fairly confident that we can provide a better service than some of the citizen-reporting projects that currently exist.