The Mexican GovBots Did NOT Take Down #YaMeCanse, But We Can Keep #YaMeCanse# Trending

December 8th, 2014

Perhaps it was the excitement of hearing about a new phenomenon in censorship that prompted me to write a little too hastily about how the government of Mexico might have used Twitter bots to spam and trash the #YaMeCanse hashtag out of the trending topics list. As reported by Lo Que Sigue, Sopitas.com, Aristegui Noticias, and me, #YaMeCanse, the hashtag used as the rallying cry for Mexico’s 43 missing students, was suddenly dropped from the trending topics list by an army of bots, presumably coordinated by the federal government. None of us provided proof that the government was behind this, but a series of videos and screenshots originally provided by Lo Que Sigue led us to believe that a swarm of bots was at least responsible.

It was NOT the bots

December 3 at 10:36 AM was the last time @TrendieMX reported #YaMeCanse to be trending. By 9:36 PM the next day, #YaMeCanse2 was already trending. Let’s take a look at what the Topsy trends for #YaMeCanse look like for the month of November and the first days of December.

Usage of #YaMeCanse

 

To make sense of what happened, we need to understand how Twitter calculates a trending topic. We don’t have access to specific information about how the trending algorithm functions, but we do know how trending algorithms work in general, and we have some clues about what Twitter has done in the past to tweak its algorithms. The relevant issue here can be described as “the Justin Bieber problem”. Many of you might remember how some years ago Justin Bieber was constantly trending due to the millions of Beliebers continuously tweeting about him. Twitter wants to tell us what’s trending right now, not one hour ago or one month ago. As Twitter is quoted saying in this Mashable article:

“The new algorithm identifies topics that are immediately popular, rather than topics that have been popular for a while or on a daily basis, to help people discover the ‘most breaking’ breaking news from across the world. (We had previously built in this ‘emergent’ algorithm for all local trends, described below.) We think that trending topics which capture the hottest emerging trends and topics of discussion on Twitter are the most interesting.”

Instead of merely looking at the volume of Bieber tweets (of which there are many), Twitter looks at the speed and “burstiness” of the tweets. However, there’s more to it. If Twitter only measured “burstiness”, you might see “Good Morning” trending every single morning. To correct for this, Twitter establishes a baseline of expected frequencies based on history. Twitter “knows” there is usually a spike of “Good Morning” tweets every morning and corrects for it. As this video on trend detection in Twitter social data explains, a ratio is calculated for each term based on the term’s past frequency and its present frequency.
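That ratio idea can be sketched in a few lines of Python. Everything here (the function names, the threshold, the tiny floor for never-seen terms) is an invented illustration, not Twitter’s actual code:

```python
# A toy version of the trending ratio: a term "trends" when its current
# frequency greatly exceeds what its history says we should expect.

def trend_ratio(current_count, expected_count):
    """Ratio of observed tweet frequency to the historical baseline."""
    # Floor the baseline so brand-new terms (expected ~0) don't divide by
    # zero; any burst on a never-seen term yields a huge ratio.
    return current_count / max(expected_count, 1e-9)

def is_trending(current_count, expected_count, threshold=5.0):
    return trend_ratio(current_count, expected_count) >= threshold

# "Good Morning" spikes every morning, so its baseline is high and it never trends:
print(is_trending(current_count=10_000, expected_count=9_000))  # False
# A brand-new hashtag with no history trends immediately:
print(is_trending(current_count=10_000, expected_count=0))      # True
```

The point is that the raw count matters far less than the count relative to expectations.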

What most likely happened is that after a couple of weeks of trending, the baseline for #YaMeCanse rose from zero (it didn’t exist before 11/7) to the frequency of people tweeting at the end of November. Twitter treated the volume and speed of the hashtag as something it would expect and dropped it off the trending list.

Baseline shift on #YaMeCanse
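The baseline-shift hypothesis can be simulated with a toy model. Assuming (purely for illustration) that Twitter keeps something like an exponential moving average of each term’s daily volume, a hashtag tweeted at a constant heavy rate trends at first and then de-trends once that volume becomes expected; the decay rate and threshold below are made up:

```python
# Toy simulation: a term trends when its count far exceeds its baseline,
# but the baseline absorbs observed volume and raises expectations.

def simulate(daily_counts, decay=0.9, threshold=5.0):
    """Return the days on which the term would be 'trending'."""
    baseline = 0.01  # near zero: the term did not exist before day 0
    trending_days = []
    for day, count in enumerate(daily_counts):
        if count / baseline >= threshold:
            trending_days.append(day)
        # Exponential moving average: today's volume becomes tomorrow's expectation.
        baseline = decay * baseline + (1 - decay) * count
    return trending_days

# Constant heavy usage trends at first, then drops off the list once
# 100 tweets/day is simply what the algorithm expects:
print(simulate([100] * 10))  # [0, 1, 2]
```

This matches the Topsy picture: #YaMeCanse did not need to lose volume to de-trend; it only needed to stop exceeding its own history.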

Spam bots should have no impact on the algorithm: Twitter’s spam team identifies the bots, and they are not counted toward the algorithm. Additionally, it’s worth mentioning that Twitter has a team of low-paid human workers manually sorting through hashtags to eliminate advertiser spam. Even so, there is no evidence of an increase in bots during the time the hashtag was dropped from the list. The team at Lo Que Sigue provided this video as proof of the presence of bots (not that we need proof of that in general).

Screenshot of Lo Que Sigue video

Why are individual and unconnected tweets labeled as bots? If I tweet and only one person RTs me, by their standards I’m a bot. You can run the simulation from the video yourself on flocker.outliers.es. Use #YaMeCanse2 and wait for the same pattern of connected and disconnected tweets to occur. Then zoom in on the disconnected tweets and look up a couple of usernames. You’ll find a lot of those disconnected nodes are real people. You’ll also run into bots, but having no one retweet your tweet does not make you a bot.

This has happened before.

This would not be the first time that people have cried censorship upon the disappearance of a hashtag from the TT list. The Mashable article quoted above was a response to Beliebers accusing Twitter of censorship. Similarly, occupiers accused Twitter of censorship when #OccupyWallStreet was taken off the list. On both occasions Twitter had to step in and say this was just a result of how the algorithm works. In some cases we should be glad the algorithm works like this, otherwise we’d see #JustinBieber constantly trending. But how about when it’s something important like #YaMeCanse?

At this point I should say that if it were possible for the Mexican government to use such a tactic to censor people on social media, they probably would. We’ve already seen how Peña Nieto’s campaign used bots to promote the candidate on Twitter. And earlier this year, an initiative put forth by Peña Nieto on Radio and Telecom caused a lot of controversy when people claimed the law would allow the government to censor online content and to interrupt cell reception during protests. There’s also the case of 1DMX.org which was censored by GoDaddy under pressure from the US Consulate in Mexico.

We’ve Discovered How To Get Around the Algorithm

I believe the immediate response by the Mexican Twitterverse in creating #YaMeCanse2 has revealed an exploitable feature in the algorithm. It took less than two days for people to adapt to the new hashtag. Topsy Trends shows that #YaMeCanse2 doesn’t have significantly more traffic than #YaMeCanse had before being taken down; the reason #YaMeCanse2 was able to trend so quickly is that its baseline at the time was zero. This means that whenever #YaMeCanse2’s baseline shifts up enough for it to de-trend, we can just start again with #YaMeCanse3. We can keep going with this as long as we keep the speed at which people tweet constant, or as long as Twitter doesn’t catch on and modify the algorithm to account for us just adding a number at the end of the phrase (in which case we can just add the word “tres”). This is also why we keep seeing so many different Bieber hashtags: they’re all different phrases that didn’t exist before.
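The workaround boils down to one simple rule, sketched below with invented names and numbers (this is just the logic, not anything Twitter or anyone else actually runs): once the current hashtag’s ratio against its own baseline falls below the trending threshold, bump the suffix and start over with a fresh, zero-baseline phrase.

```python
# Rotate to #YaMeCanse<n+1> whenever the current tag de-trends.

def next_hashtag(base, generation):
    return f"#{base}{generation}" if generation > 1 else f"#{base}"

def rotate_if_detrended(base, generation, current_count, baseline, threshold=5.0):
    """Return the generation to use next: bump it once the tag stops trending."""
    if current_count / max(baseline, 1e-9) < threshold:
        return generation + 1  # new phrase, baseline resets to zero
    return generation

# #YaMeCanse2's baseline has caught up with its volume, so we move on:
gen = rotate_if_detrended("YaMeCanse", 2, current_count=9_000, baseline=8_000)
print(next_hashtag("YaMeCanse", gen))  # #YaMeCanse3
```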

The lesson here for the Mexican folk is that if we want to continue to have a #YaMeCanse hashtag trending, we need to coordinate to increment the number at the end of the tag each time it expires. When #YaMeCanse2 falls off the list, we simply switch over to #YaMeCanse3.

Censorship on Twitter Using Bots? How #YaMeCanse Was Knocked Off Twitter Trending Topics

December 4th, 2014

UPDATE BELOW, AND ALSO THIS POST WITH MORE INFO.

In late September of this year, 43 students in the Mexican state of Guerrero went missing. In an attempt to prevent students from disrupting a political event for his wife, the mayor of Iguala ordered local police to stop and detain the students. This set in motion a series of events that left several students murdered and 43 missing. People later learned the missing students were handed over to a local cartel and were subsequently killed and burned until no traces of their bodies were left behind. The announcement was made during a press conference by Mexico’s Attorney General, Jesús Murillo Karam, who at the end of the conference, tired and exasperated, said “Ya me cansé.” I’ve had enough.

#YaMeCanse

Mexicans took to social media and responded with “We’re tired too…” Of the violence. Of the injustice. Of the impunity. Of the corruption. The #YaMeCanse hashtag became the rallying cry for discourse online and protests all over Mexico. The hashtag had been on Twitter’s trending topics almost since Murillo Karam’s press conference. Yesterday, it suddenly disappeared from the list even though usage had not waned.

Usage of #YaMeCanse

 

This sudden disappearance of such a popular hashtag raised some eyebrows. Determining trending topics is a little more complicated than simply counting the mentions of a hashtag. Twitter has an algorithm that determines trending topics based on several factors. According to Twitter, one of the rules against misuse of trending topics is “Repeatedly Tweeting the same topic/hashtag without adding value to the conversation in an attempt to get the topic trending or trending higher.” It is very likely that the overwhelming spamming of #YaMeCanse caused Twitter’s algorithms to treat the hashtag as spam and remove it from the trending list.

As reported on sopitas.com, an army of bots had been retweeting and tweeting the #YaMeCanse hashtag for several days.

“Who says that online censorship and repression does not exist online? A storm of bots tries to disappear #YaMeCanse”

Spam Tweets

Another analysis by Lo Que Sigue shows the difference between connected and disconnected tweets symbolizing real people versus bots.

#YaMeCanse2

Not to be easily dissuaded, the Mexican Twitterverse quickly came up with a simple solution: #YaMeCanse2, which is currently trending. An added cleverness to appending the number ‘2’ is that it forces people to ask “What happened to regular #YaMeCanse? Where’s #YaMeCanse1?”, which leads people to find out about the attack. It’s a sort of Barbra Streisand effect: in an attempt to censor one hashtag, not only do people evade the censorship, but in doing so they call attention to the attempt at censorship.

It’s quite possible that this is not a coordinated attack on the hashtag by some entity. It could be just regular bots hijacking a popular hashtag. And it is very tempting to attribute this “attack” to the government of Mexico. I would not be surprised at all if it was, and I’d be willing to bet that the Mexican government is behind this (it wouldn’t be the first time), but I would like to find definitive proof. The people behind Lo Que Sigue are working to start an Indiegogo campaign to try to find the origin of these bots. Perhaps we don’t have to wait around for this to get funded; we could crowdsource and collaborate to see if tracing the origin of the bots is possible. I would welcome any ideas on how to do this.

UPDATE:

So Trending Topics are more complicated than they seem. It’s hard to tell whether bots had any role in dropping the hashtag from the trending list. It seems that Twitter actually looks for “bursts” of tweets, and at how fast these tweets appear (∂Tw/∂t?). It is entirely possible that the volume of tweets remained stable but the “burstiness” was gone. I don’t know; Twitter’s algorithms are very private. Even if bots played no part in dropping the hashtag, the possibility of that happening might still exist. After all, riding hashtags to promote unrelated content is shunned by Twitter. Whether they can detect that algorithmically, I’m not sure, but I wouldn’t be surprised. If they can, then it’s entirely possible to spam a hashtag using bots. Perhaps the only way to find out is to actually measure the volume and speed of the bots. Doing this, it turns out, is very hard.
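To make the ∂Tw/∂t intuition concrete, here is one naive way to estimate “burstiness” from tweet timestamps: bucket the tweets into fixed windows and take the change in rate between consecutive windows. The window size and the toy data are my own assumptions:

```python
# Estimate tweet rate per window, then its first difference ("acceleration").

def tweet_rates(timestamps, window=60.0):
    """Tweets per window, bucketed from a sorted list of epoch seconds."""
    if not timestamps:
        return []
    start = timestamps[0]
    n_windows = int((timestamps[-1] - start) // window) + 1
    counts = [0] * n_windows
    for t in timestamps:
        counts[int((t - start) // window)] += 1
    return counts

def burstiness(rates):
    """Change in rate between windows: positive means the topic is accelerating."""
    return [b - a for a, b in zip(rates, rates[1:])]

# Three tweets per minute, minute after minute: plenty of volume, zero burst.
steady = tweet_rates([0, 10, 20, 60, 70, 80, 120, 130, 140])
print(burstiness(steady))  # [0, 0]
```

A hashtag could keep this kind of steady volume indefinitely and still read as “not bursting” to a rate-of-change detector, which is consistent with the hypothesis above.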

 

“Social Physics: How Good Ideas Spread–The Lessons From a New Science” – Alex Pentland

November 10th, 2014

I started reading Alex “Sandy” Pentland’s book, Social Physics. Several things interest me about this book. I’m very interested in how society behaves in today’s world, where we are increasingly connected to more people by weak social ties. Also interesting is that advances in data collection and analysis are bound to reach a point where we can continuously monitor and analyze people’s behavior. Who will have this knowledge? How will they use it? What will this world look like? Lastly, I’m interested in how good ideas spread and how that can help us design better organizations and institutions.

Alex Pentland thinks it is possible to create a mathematical explanation of why society behaves the way it does. He calls this discipline social physics.

“Social physics is a quantitative social science that describes reliable, mathematical connections between information and idea flow on the one hand and people’s behavior on the other. Social physics helps us understand how ideas flow from person to person through the mechanism of social learning and how this flow of ideas ends up shaping the norms, productivity, and creative output of our companies, cities, and societies.”

The goal of applying this science to society is to shape outcomes. Pentland believes we can create systems that build a society “better at avoiding market crashes, ethnic and religious violence, political stalemates, widespread corruption, and dangerous concentration of power.”

All of this would sound great, if it didn’t sound kind of scary. There are a lot of concerns about privacy, which Pentland addresses, and which I’m sure he’ll talk more about in the coming chapters. However, even if he is able to get around the privacy issues, the ability to affect how society behaves would give whoever has the ability to do so great power. This is perhaps a little paranoid on my part, but I don’t think misusing the ability to “fix” society, as he puts it, is out of the question. Pentland does write about it:

“This vision of a data-driven society implicitly assumes the data will not be abused. The ability to see the details of the market, political revolutions, and to be able to predict and control them is a case of Promethean fire—it could be used for good or for ill.”

My second concern is best summarized by Nicholas Carr in his article “The Limits of Social Engineering”.

“Pentland may be right that our behavior is determined largely by social norms and the influences of our peers, but what he fails to see is that those norms and influences are themselves shaped by history, politics, and economics, not to mention power and prejudice. People don’t have complete freedom in choosing their peer groups. Their choices are constrained by where they live, where they come from, how much money they have, and what they look like. A statistical model of society that ignores issues of class, that takes patterns of influence as givens rather than as historical contingencies, will tend to perpetuate existing social structures and dynamics. It will encourage us to optimize the status quo rather than challenge it.” (h/t to Cathy O’Neil for linking to this piece).

The case studies in the book so far take place in groups where this might not be a huge issue, like eToro, an online trading and investment network. Carr’s (and my) concern may not loom large in these scenarios, especially because Pentland is measuring very specific metrics like return on investment. However, I do believe there is real danger in applying this sort of analysis in places like, say, Ferguson, MO. It will be interesting to read the different case studies and try to identify places where this concern might arise.

It would be very unfair of me to end this without writing about the actual focus of the book (although I’m already a little nauseous from writing this on the train). The book focuses on the two most important concepts of social physics: idea flow within social networks, and social learning, that is, how we take new ideas and turn them into habits, and how learning can be accelerated and shaped by social pressure.

I like to believe that there are better systems of collaboration and cooperation that can make organizations more effective, communities more resilient, and authorities more accountable. Elinor Ostrom developed her work on governing the commons by studying how communities behaved around issues like irrigation and water management. Similarly, I do think Pentland’s insights on idea flow and social learning can help us understand how to design better organizations, communities, and institutions.

The Dangers of Evidence-Based Sentencing

October 27th, 2014
Note: This post was originally published on mathbabe.org and cross-posted on thegovlab.org.

What is Evidence-based Sentencing?

For several decades, parole and probation departments have been using research-backed assessments to determine the best supervision and treatment strategies for offenders to try to reduce the risk of recidivism. In recent years, state and county justice systems have started to apply these risk and needs assessment tools (RNAs) to other parts of the criminal process.

Of particular concern is the use of automated tools to determine imprisonment terms. This relatively new practice of applying RNA information into the sentencing process is known as evidence-based sentencing (EBS).

What the Models Do

The parameters used to determine risk vary by state, and most EBS tools use information that has been central to sentencing schemes for many years, such as an offender’s criminal history. However, an increasing number of states have been using static factors such as gender, age, marital status, education level, employment history, and other demographic information to determine risk and inform sentencing. Especially alarming is the fact that the majority of these risk assessment tools do not take an offender’s particular case into account.

This practice has drawn sharp criticism from Attorney General Eric Holder, who says “using static factors from a criminal’s background could perpetuate racial bias in a system that already delivers 20% longer sentences for young black men than for other offenders.” In its annual letter to the US Sentencing Commission, the Attorney General’s office states that “utilizing such tools for determining prison sentences to be served will have a disparate and adverse impact on offenders from poor communities already struggling with social ills.” Other concerns cite the probable unconstitutionality of using group-based characteristics in risk assessments.

Where the Models Are Used

It is difficult to precisely quantify how many states and counties currently implement these instruments, although at least 20 states have implemented some form of EBS. States that have implemented some sort of EBS, statewide or in certain counties (for any type of sentencing decision: parole, imprisonment, etc.), include Pennsylvania, Tennessee, Vermont, Kentucky, Virginia, Arizona, Colorado, California, Idaho, Indiana, Missouri, Nebraska, Ohio, Oregon, Texas, and Wisconsin.

The Role of Race, Education, and Friendship

Overwhelmingly, states do not include race in the risk assessments, since there seems to be a general consensus that doing so would be unconstitutional. However, even though these tools do not take race into consideration directly, many of the variables used, such as economic status, education level, and employment, correlate with race. African-Americans and Hispanics are already disproportionately incarcerated, and determining sentences based on these variables might cause further racial disparities.

The very socioeconomic characteristics used in risk assessments, such as income and education level, are already strong predictors of whether someone will go to prison. For example, high school dropouts are 47 times more likely to be incarcerated than people of a similar age who received a four-year college degree. It is reasonable to suspect that courts that include education level as a risk predictor will further exacerbate these disparities.

Some states, such as Texas, take peer relations into account and consider associating with other offenders a “salient problem.” Considering that Texas ranks 4th in the rate of people under some sort of correctional control (parole, probation, etc.), and that the rate is 1 in 11 for black males in the United States, it is likely that this metric would disproportionately affect African-Americans.

Sonja Starr’s paper

In some cases, socioeconomic and demographic variables receive significant weight. In her forthcoming paper in the Stanford Law Review, Sonja Starr provides a telling example of how these factors are used in presentence reports. From her paper:

For instance, in Missouri, pre-sentence reports include a score for each defendant on a scale from -8 to 7, where “4-7 is rated ‘good,’ 2-3 is ‘above average,’ 0-1 is ‘average’, -1 to -2 is ‘below average,’ and -3 to -8 is ‘poor.’ Unlike most instruments in use, Missouri’s does not include gender. However, an unemployed high school dropout will score three points worse than an employed high school graduate—potentially making the difference between “good” and “average,” or between “average” and “poor.” Likewise, a defendant under age 22 will score three points worse than a defendant over 45. By comparison, having previously served time in prison is worth one point; having four or more prior misdemeanor convictions that resulted in jail time adds one point (three or fewer adds none); having previously had parole or probation revoked is worth one point; and a prison escape is worth one point. Meanwhile, current crime type and severity receive no weight.

Starr argues that such simple point systems “linearize” a variable’s effect. In the underlying regression models used to calculate risk, a variable’s effect does not translate linearly into changes in the probability of recidivism, but the point system treats it as if it did.
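This linearization point is easy to illustrate with made-up numbers: in a logistic model, the same shift in the underlying score moves the predicted probability a lot near the middle of the curve and barely at all in the tails, which a flat point system cannot capture.

```python
# Illustration only: invented coefficients, not any state's actual model.
import math

def logistic(score):
    """Map an underlying risk score to a probability via the logistic curve."""
    return 1 / (1 + math.exp(-score))

# The same 1.5-point penalty moves probability a lot near the middle of the curve...
print(round(logistic(0.0) - logistic(-1.5), 3))   # 0.318
# ...but very little out in the tail:
print(round(logistic(-4.0) - logistic(-5.5), 3))  # 0.014
```

A flat “three points worse” rule assigns the same weight to both defendants, even though the model behind it implies very different changes in recidivism probability.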

Another criticism Starr makes is that these instruments often make predictions about an individual based on the averages of a group. Starr says these tools can predict with reasonable precision the average recidivism rate of all offenders who share the defendant’s characteristics, but that does not make them necessarily useful for individual predictions.
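A toy simulation makes the gap between group-level and individual-level prediction concrete (the rate and group size are invented):

```python
# Suppose 30% of offenders sharing some set of characteristics reoffend.
import random

random.seed(42)
group_rate = 0.3
outcomes = [1 if random.random() < group_rate else 0 for _ in range(10_000)]

# The group-level prediction is well calibrated: the observed rate is
# very close to 0.3...
print(round(sum(outcomes) / len(outcomes), 2))

# ...but for any single defendant, the model can only predict the base rate.
# In the 0/1 sense it is "wrong" for roughly 30% of individuals no matter what,
# because nothing in the group average distinguishes one member from another.
```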

The Future of EBS Tools

The Model Penal Code is currently in the process of being revised and is set to include these risk assessment tools in the sentencing process. According to Starr, this is a serious development because it reflects the increased support of these practices and because of the Model Penal Code’s great influence in guiding penal codes in other states. Attorney General Eric Holder has already spoken against the practice, but it will be interesting to see whether his successor will continue this campaign.

Even if EBS can accurately measure the risk of recidivism (which is uncertain, according to Starr), does that mean a longer prison sentence will result in fewer offenses after the offender is released? EBS does not seek to answer this question. Further, if knowing there is a harsh penalty for a particular crime deters people from committing it, wouldn’t adding more uncertainty to sentencing (EBS tools are not always transparent and are sometimes proprietary) effectively remove this deterrent?

Even though many questions remain unanswered, and while several people have been critical of the practice, there seems to be great support for the use of these instruments. They are especially easy to support when they are overwhelmingly regarded as progressive and scientific, something Starr refutes. While there is certainly a place for data analytics and actuarial methods in the criminal justice system, it is important that such research be applied with the appropriate caution. Or perhaps not at all. Even if the tools had full statistical support, the risk of further exacerbating an already disparate criminal justice system should be enough to halt this practice.

Both Starr and Holder believe there is a strong case to be made that the risk prediction instruments now in use are unconstitutional. But EBS has strong advocates, so it’s a difficult subject. Ultimately, evidence-based sentencing determines a person’s sentence based not on what the person has done, but on who that person is.

De-anonymizing open data, just because you can… should you?

October 23rd, 2014

If an essential part of the data reveals personally identifiable information (PII), should the data not be released? Should the users of open data be the ones responsible for ensuring proper use of the data?

I mention this question because of an article by an intrepid Gawker reporter who decided he could correlate photos of celebrities in NYC taxis (with visible taxi medallions) with the de-anonymized database of every NYC cab ride in 2013 to determine whether celebrities tipped their cab drivers. Of course, this article is another example of “celebrities doing normal people things, like using taxis,” but the underlying question here is: just because you can violate people’s privacy, does it mean you should?

Identifying celebrities and their cab rides was first done by Anthony Tockar, an intern at Neustar. In his post he recognizes that it is relatively easy to reveal personal information about people. Not only could he match cab rides to a couple of celebrities, but he also showed how easily you can see who frequently visits Hustler’s. Tockar says:

Now while this information is relatively benign, particularly a year down the line, I have revealed information that was not previously in the public domain.

He uses these examples to introduce a method of privatizing data called “differential privacy.” Differential privacy basically adds noise to the data when you zoom in on it so you can’t identify specific information about an individual, but you can still get accurate results when you look at the data as a whole. This is best exemplified by the graphic below.

This shows the average speed of cab drivers throughout the day. The top half is the actual average speed of all drivers and the average speed of all drivers after the data is run through the differential privacy algorithm. The bottom half shows the same for an individual cab driver. Click on the graphic to go to an interactive tool that lets you play around with the privacy parameter, ε.
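For the curious, the standard way differential privacy “adds noise” is the Laplace mechanism. The sketch below is a minimal, illustrative version for releasing an average; the driver speeds and parameter values are made up, and this is not Neustar’s implementation:

```python
# Laplace mechanism, sketched: noise is scaled to sensitivity/epsilon, where
# sensitivity is how much one person's data can move the released statistic.
import math
import random

def private_average(values, epsilon, value_range):
    """Release the average of `values` with epsilon-differential privacy."""
    true_avg = sum(values) / len(values)
    # One driver can shift the average by at most value_range / n.
    sensitivity = value_range / len(values)
    # Sample Laplace(0, sensitivity/epsilon) noise via inverse-CDF sampling.
    u = random.random() - 0.5
    noise = -(sensitivity / epsilon) * math.copysign(math.log(1 - 2 * abs(u)), u)
    return true_avg + noise

random.seed(0)
speeds = [11.2, 12.8, 14.1, 13.5, 12.2] * 200  # toy speeds for 1,000 drivers (mph)
# Averaged over many drivers, sensitivity is tiny, so even a strict epsilon
# leaves the whole-fleet answer close to the true average of 12.76:
print(round(private_average(speeds, epsilon=0.1, value_range=60.0), 1))
```

This is exactly the effect in the graphic: the fleet-wide average barely changes, while a single driver’s curve (where the sensitivity of the statistic is enormous relative to one person) gets drowned in noise.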

But we’re still struggling with getting data out of PDFs, or worse, filing cabinets. It’ll take years before we can create such privacy mechanisms for current open data! What to do in the meantime? It would seem that Gawker stopped reading after “Bradley Cooper left no tip” (actually, we don’t know, since tips are not recorded if paid in cash). Just because someone could look up ten celebrities’ cab rides, does it mean they should have? The reporter even quotes Tockar’s line about “revealing information not previously in the public domain”. The irony seems to have been lost on Gawker. I’m of the opinion that Gawker shouldn’t have published an article about celebrities’ cab rides any more than it should publish their phone numbers if they were available in a phone book. Unless it was trying to make a point about privacy and open data, which would’ve made for a great conversation piece. Except it wasn’t, since it was all about tipping. They even reached out to publicists for comments on the tipping.

Ultimately, who cares about Bradley Cooper taking a taxi. But when you go “hey, let’s see how many celebrities I can ID from this data” and write an article about it without questioning the privacy implications, you’re basically saying “Yes, because you can, it means you should.”

UPDATE: ok, so apparently there is a reason it’s called “Gawker”. See this example where this same author tries to out a Fox News reporter. Today I learned.

Reddit is NOT a failed state….

October 9th, 2014

It has its problems, for sure, but I wouldn’t be so quick to dismiss it as having failed.

I’m referring to an article in The Verge posted about a month ago, following the celebrity nude photo leaks. The main argument for FAIL is the fact that instead of chastising the users who helped spread the leaked photos, Reddit protected them under the shield of free speech. I’m not here to argue whether Reddit acted appropriately in protecting those individuals (personally, they could’ve been kicked out, banned, arrested, and I would’ve been content with that). But I do not think this transgression of privacy, abuse of free speech, and overall disgusting behavior by a small group of a larger community a failed state makes.

Is this indicative of pervasive malicious behavior across Reddit? Absolutely. We didn’t need r/TheFappening to figure that out. Just talk to women redditors about their experiences as participants.

But at least we’re talking about these issues. It’s not so much the fact that we are; it’s the fact that we have the ability to do so. Through its karma system, Reddit has built a mechanism that promotes good behavior and, sometimes, reproves the bad. It’s a primitive system, for sure, especially since it’s not immune to hivemind behavior (for example, the r/nyc hivemind apparently thinks people have ZERO responsibility to give up their seat on the subway for a pregnant woman (maybe they’re right and I’m wrong)). This system, I think, allows the hive to go through iterations of what it believes to be correct. In effect, every now and then it corrects itself. Take the terrible “detective work” conducted in the immediate aftermath of the Boston Marathon bombing. After the hive realized it was wrong (so wrong), whenever a post asked for some sort of crowdsourced detective work, it was often met with someone commenting on the terrible results from the last time they tried to play detectives. As a result, Reddit for the most part now knows: we should avoid digital vigilantism.
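As an aside, the karma system is more than raw vote totals: Reddit’s ranking code is open source, and its “hot” score (as published around this time) trades vote score against age, so the hivemind’s judgment on any one post decays. The sketch below is a paraphrase of that public formula, not an official API:

```python
# Paraphrase of Reddit's open-sourced "hot" ranking: the vote score counts
# logarithmically, while submission time counts linearly, so newer posts
# steadily displace older ones regardless of karma.
import math
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def hot(ups, downs, date):
    score = ups - downs
    order = math.log10(max(abs(score), 1))   # each 10x more votes adds only +1
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = (date - EPOCH).total_seconds() - 1134028003
    return round(sign * order + seconds / 45000, 7)

# Community judgment in action: a heavily downvoted post ranks below a
# modestly upvoted one of the same age.
when = datetime(2014, 10, 9, tzinfo=timezone.utc)
print(hot(500, 50, when) > hot(50, 500, when))  # True
```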

In the coming years we will increasingly see Nobel Prize winner Elinor Ostrom’s principles on governing the commons applied to digital spaces. Although primitively (and perhaps unintentionally), Reddit has created a space where communities are able to define their own boundaries, (sort of) align “governance” rules with their preferences, (kind of) ensure that those who participate in the community have a say in the rules, and are (barely) able to sanction those who misbehave. It has a long way to go, for sure. What happened with r/TheFappening is a case where a group of very misguided individuals were able to gather in one place and behave inappropriately as a community. What Reddit might be lacking there is greater oversight over communities and their leaders; an oversight that’s not dictatorial, but one also provided by a community (a council of communities?).

Another problem with Reddit (or any digital space, really) is that whenever someone goes through the trouble of committing a crime, say, stealing nude celebrity photos, the “morality cost” of engaging in the immoral behavior is significantly decreased by the internet’s ability to distribute information massively at very low cost. For the most part, the consequences for engaging in such behavior do not exist, especially when it costs nothing to click on a link. This is maybe one of the internet’s biggest weaknesses: its ability to facilitate engagement in immoral behavior.

We need to design digital spaces that somehow take this into account. Spaces where the community can participate more meaningfully and deal with the bad apples more effectively. Is Reddit, and the rest of the internet, full of misguided individuals who do some fucked up shit? Yes, but this doesn’t mean we need to take it to the back of the barn and shoot it. It means we need to think about how we create these digital spaces in the future. Or do away with it if you want, but then let’s take the good lessons and the bad, and make something better.

 

Placemeter pays YOU for your data…

October 8th, 2014
Note: I have set a new goal to post at least once a week, even if the posts are short.

Turns out you may have some data to offer that is actually more valuable than your online shopping patterns: the view outside your window. Placemeter is a relatively new startup that pays New Yorkers up to $50 to place their phones against their windows and record movements on the street below. Using nifty computer vision algorithms, Placemeter extracts data from the images recorded by your phone. The short video below gives a sense of what they are trying to track.

The front page immediately addresses the issue of privacy. The company will not use the data to record anything that goes on inside your home, they will not use the data to identify people on the street, and the video they record isn’t stored. They only store raw data extracted from the video.

Their business model is simple: they pay you a little bit per month to record information which they will later sell to third parties. You provide the product they later sell (hey, at least they pay you for it). Since their goal is to sell data to businesses and city governments, they are mostly interested in views of restaurants, shops, or bars. This means lots of people like me can’t participate (I have a very lovely view of a wall). This got me thinking about who else can and can’t participate. If you happen to live in (and have a view of) Times Square, your view could be worth dozens of dollars! What about a view from a quiet Staten Island street? Or from the Bronx? Basically, in order to participate you just have to live in the right place. A place that is probably expensive too.

One redditor applied to sell his/her view and was rejected because the street wasn’t busy enough, but was told he/she would be considered when the company started “sending out unpaid meters”. I imagine this means the company would mail you a sensor for free and you would record data for them. If this happens, I can see them shifting the rhetoric towards “help us analyse and improve your urban environment”, which this article already does.

Seeing as the most valuable views belong to a select group of New Yorkers, much of the company’s best data might actually come from the already freely available video feeds around the city (they should fill out the survey for the OD500).

How to Build a Website From Scratch

January 23rd, 2014

When I signed up to build the Open Data 500 website, I wanted to go through the entire process of making a website from scratch. Full stack. Just to sort of see what it was like.
After spending 5 entire 10-hour days trying to troubleshoot a feature on the site, I decided to write a post on the skills needed to build an entire website from scratch.

To build an entire website from scratch you need to know the following:

  • HTML5
  • CSS
  • JavaScript
  • jQuery
  • D3
  • ParsleyJS
  • Modernizr
  • Tornado
  • Python
  • MongoDB
  • Mongoengine
  • CSV
  • JSON
  • geoJSON
  • Regular Expressions
  • Seamless
  • Heroku
  • Command Line
  • Git / Github
  • Google Analytics
  • MailChimp
  • DNS Records (A, CNAME, MX, etc)
  • Oh yeah, go directly to hell, GoDaddy
  • Polar Vortex Survival Skills
  • Basic Pharmacology
  • UX
  • UI
  • FU
  • F712U
  • Scheme (might as well)
  • Creative Commons Licensing
  • PHP (throw in a couple more languages, just in case)
  • Java
  • Ruby
  • C#
  • C♭
  • Perl
  • .NET
  • Obviously not WordPress
  • Ballmer Peaks
  • Double-team keyboarding
  • Windows
  • Mac
  • Linux
  • Atari
  • SNES (you’re welcome)
  • Brainfuck
  • SSL
  • HTTP
  • API’s
  • SOAP
  • LDAP
  • TCP/IP
  • WOFF
  • DOM
  • Cookies
  • XSRF
  • RSS
  • XML

I think that’s about it. If you’re just beginning with web development, good luck. You’re almost there.


(Seriously, though. Keep it up, the road is long and arduous, but it’s totally worth it)

 

 

My Social Network

October 13th, 2013

I was playing around with Gephi, and I loaded my Facebook data to visualize my social network (or at least my Facebook social network). This is the result (click for full size).

 SocialNetwork

As you can see the network is pretty modular, which is to be expected since I’ve lived in 6 cities. There are 13 communities:

  1. High School, mostly my graduating class (21.97%) – Green
  2. The rest of Monterrey (17.41%) – Red
  3. ITP (16.4%) – Aqua
  4. Model UN (14.23%) – Light Blue
  5. UT Austin (12.57%) – Fuchsia?
  6. Oklahoma City (5.71%) – Purple
  7. Family, extended family, and family friends (5.13%) – Dark purple
  8. Schlumberger (3.32%) – Lime Green
  9. GovLab (1.3%) – Yellow
  10. NYCDigital (0.79%) – Orange
  11. Students For Sensible Drug Policy (0.72%) – Dark Blue
  12. Las Chilangas de Nueva York (0.22%) – Dark Blue inside ITP blob
  13. The group of Canadians I randomly befriended on a bus one day. (0.22%) – Tiny Light Green Offshoot from large Green blob

I filtered out the nodes that had less than 2 degrees (fewer than 2 mutual friends), but it was interesting to see the lonely nodes on my network. Those are mostly people that I have encountered while traveling alone or have randomly met. The graph contains 1,384 nodes (friends) with 35,226 edges (connections) between them. The longest path (network diameter) between two of my friends (without going through me) is 8. The huge blue dot in the middle is Gaby, who shares friends with 7 of my 13 communities. In second place is Chantel, who knows everyone in Monterrey.
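All of those numbers came straight out of Gephi, but the metrics themselves are easy to compute by hand. Here’s a minimal pure-Python sketch on a tiny made-up friendship graph (the names and edges are placeholders, not my actual data): node count, edge count, average degree, and the network diameter via breadth-first search.

```python
from collections import deque

# Toy undirected friendship graph (made-up data, not my real network)
graph = {
    "ana":   {"bob", "carla"},
    "bob":   {"ana", "carla", "dan"},
    "carla": {"ana", "bob"},
    "dan":   {"bob", "eve"},
    "eve":   {"dan"},
}

def bfs_distances(graph, start):
    """Hop distance from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for friend in graph[node]:
            if friend not in dist:
                dist[friend] = dist[node] + 1
                queue.append(friend)
    return dist

nodes = len(graph)
# Each undirected edge appears twice in the adjacency sets
edges = sum(len(friends) for friends in graph.values()) // 2
avg_degree = 2 * edges / nodes
# Network diameter: the longest shortest path between any two nodes
diameter = max(max(bfs_distances(graph, n).values()) for n in graph)

print(nodes, edges, avg_degree, diameter)  # → 5 5 2.0 3
```

On my real graph the same computation would report 1,384 nodes, 35,226 edges, and a diameter of 8.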

Making your own graph

If you want to do this for your own Facebook data, go to http://snacourse.com/getnet. Authorize the app. I selected all options in case I want to use that data later. Click on the ‘click here‘ link in Step 2. The app will need to scrape your Facebook and this might take a while if you have a large network.

You’ll also need to download Gephi, an open source visualization software.

Once you’ve downloaded your data and Gephi, open Gephi and File->Open your data file (default settings should be OK). You’ll see a bunch of dots arranged in a square in the middle of the screen.

Screen Shot 2013-10-13 at 12.04.02 PM

You’ll need to tell Gephi to reorganize the graph. On the bottom left you can choose a Layout. I chose ForceAtlas 2, checked Dissuade Hubs and Prevent Overlap, and set Gravity to 50.

Click Run. You’ll see the dots start to move around. Depending on the size of your network, it might take a while before you start seeing a discernible pattern. You can click on an individual node to find information about it by selecting the Edit tool in the toolbar (bottom-most tool). The node info will be displayed on the edit tab next to the Partition and Ranking tabs.

Screen Shot 2013-10-13 at 12.19.17 PM

If you want to remove the lone nodes and just show your one giant network, on the right of your screen you’ll see a Statistics and Filters tab. Click on Filter -> Topology, and drag “Giant Component” below to where it says ‘Drag filter here‘. Click Filter at the bottom. I also filtered out nodes with less than 2 degrees. Drag ‘Degree Range‘ into your Queries as well. When selected, you’ll see Degree Range Settings at the bottom. Drag the sliders or double-click the numbers to edit them. (Don’t click Filter again, the button works like an On/Off switch, and it was already on from the previous step).
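The two filters above can also be reproduced outside Gephi. A rough sketch, again in pure Python on a toy graph (the node names are placeholders): extract the giant component, then keep only nodes with degree 2 or more.

```python
# Toy graph: one main blob plus a lone pair (made-up data)
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "x": {"y"}, "y": {"x"},   # isolated pair, not in the giant component
}

def component(graph, start):
    """All nodes reachable from start (iterative depth-first search)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen

# Giant component = the largest connected component
components = []
remaining = set(graph)
while remaining:
    comp = component(graph, next(iter(remaining)))
    components.append(comp)
    remaining -= comp
giant = max(components, key=len)

# Degree-range filter: keep nodes with at least 2 neighbors
# inside the giant component, like Gephi's Degree Range filter
filtered = {n for n in giant if len(graph[n] & giant) >= 2}

print(sorted(giant), sorted(filtered))
```

Here the lone pair x–y is dropped by the giant-component filter, and d (only one neighbor) is dropped by the degree filter, mirroring what the two chained Gephi filters do.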

Degrees

Before sizing the nodes by degree (in this case, degree represents mutual friends), let’s calculate the Average Degree. Under Statistics, click on Run next to Average Degree. You’ll get a result for the average number of mutual friends across your network, and a nifty distribution graph. Usually this looks like a power-law distribution.

Now, go to the top left and click on the Ranking tab. In the drop down menu, select Degree. You can visualize with color, size, label color, or label size. I chose Size, but feel free to play around. Choose a range that fits best for your network, and hit Apply.

Screen Shot 2013-10-13 at 12.15.41 PM

By the way, if your graph isn’t changing much anymore, you can stop the ForceAtlas 2 Layout process. Click on Stop. The dots should stop moving.

Communities / Modularity

To color the different communities, you’ll need to calculate Modularity. It’s under the Statistics tab on the right. Click Run. Press OK for the default settings. Again, you’ll get a nifty distribution chart.
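Gephi’s modularity run does two things: it searches for a partition into communities and scores it. The score is Newman’s modularity Q, which rewards partitions with more intra-community edges than you’d expect by chance. Finding the partition is the hard part (Gephi uses a Louvain-style algorithm for that); scoring a given partition is simple. A minimal sketch on a toy graph with a hand-assigned partition (all data made up):

```python
# Toy graph as an edge list: two triangles joined by one bridge (made-up data)
edges = [("a", "b"), ("b", "c"), ("a", "c"),   # community 1
         ("d", "e"), ("e", "f"), ("d", "f"),   # community 2
         ("c", "d")]                            # bridge between them
community = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}

m = len(edges)          # total number of edges
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# Newman's modularity:
#   Q = (1/2m) * sum_ij (A_ij - k_i*k_j / 2m) * delta(c_i, c_j)
adj = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
nodes = list(degree)
Q = 0.0
for i in nodes:
    for j in nodes:
        if community[i] == community[j]:
            a_ij = 1.0 if (i, j) in adj else 0.0
            Q += a_ij - degree[i] * degree[j] / (2 * m)
Q /= 2 * m

print(round(Q, 3))  # → 0.357
```

A positive Q means the partition groups nodes more densely than random; the number Gephi reports after its run is exactly this score for the partition it found.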

Go to the Partition tab on the top left. Under Nodes, click the Refresh button. Select Modularity Class from the drop-down menu. If you don’t like the colors, you can right-click inside that window and select Randomize Colors. Or click on the individual colors and manually select your own. Once you’re happy with the colors, click Apply.

Awesome! Your own social network graph. Gephi is a lot of fun to play around with, and I encourage you to do so. The Gephi website has a bunch of tutorials you can follow that will teach you some of the awesome things you can do. To save your graph as a PDF, click on Preview on the top-top left. Feel free to play around with the settings, they’re pretty straightforward. When you’re done, just click on Export SVG/PDF/PNG at the bottom left.

I’ll try to make my graph prettier. As soon as I can get Illustrator to open up this tiny file.

What One Database Marketing Company Knows About Me

September 8th, 2013

It’s no surprise that marketing companies gather data about you to sell off to advertisers who then deliver targeted ads via mail, email, or while you surf the internet. Sometimes it’s even creepy how much they know about you. So far, it’s been a bit of a mystery finding out exactly how much of your information these companies have. A few days ago one marketing technology company, Acxiom, launched a new service called AboutTheData.com which allows people to take a peek into how much information the company has gathered on them.  Acxiom is no small marketing company. According to the NYTimes, it has created the world’s largest commercial database on consumers. I decided to give the service a try to see just how much data this company had about me.

Since this is such a large company, and I’m such an active internet user, I expected to find Acxiom to have gathered a lot of information about me. I was slightly disappointed–or relieved–when I found out that they didn’t have that much information on me at all (honestly, I don’t know how I should feel about this). Before going into the data, here is a little more information about where this data comes from and what we are shown.

According to Acxiom, this data is collected from:

  • Government records, public records and publicly available data – like data from telephone directories, website directories and postings, property and assessor files, and government issued licenses
  • Data from surveys and questionnaires consumers fill out
  • General data from other commercial entities where consumers have received notice of how data about them will be used, and offered a choice about whether or not to allow those uses – like demographic data

The data they show us is their “core data”. This data is used to generate the modeled insights and analytics used for marketing, which they do not show. Acxiom says that we are shown all of their core data. They make no mention of whether there is other non-core, non-modeled-insights data.

The site allows you to view data from six categories. Below is the information that has been gathered on me. Economic and Shopping data is over the past 24 months.

Characteristic Data: Male, Hispanic, inferred single
Home Data: No data.
Vehicle Data: No data.
Economic Data: Regular credit card holder (as opposed to Premium/Gold), Regular Visa, 2 cash purchases (includes checks), 1 Visa purchase.
Shopping Data: $139 spent on 3 purchases (the ones referred to above?), 2 offline totaling $100, average $50 each (one purchase < $50, the other > $50, so I guess it’s a coincidence they add up to $100), 1 online for $39. My supposed interests include books, magazine, Christmas gift purchase, ethnic products (??), lifestyles, interests, and passions.
Households Interests Data: No data.

It makes sense that there is not a lot of information about my home data or vehicle data, since I currently own neither (although there was no info on my previous vehicle ownership). Perhaps car owners and homeowners would have these sections filled out entirely. The household interests category is meant to include data related to interests of me or people in my household (examples given on the site include gardening, traveling, and sports). Not so surprised this is also empty, but I’m not sure why they guess that my shopping interests include ethnic products and yet they are not able to guess that I enjoy traveling. As for Characteristic Data? My Twitter feed should be enough to reveal that I’m a single male hispanic. Since you have to provide your name, email, address, and last 4 digits of your SSN, it’s pretty safe to assume that they also have this information.

**To skip Luis’ short history of shopping, jump to the next paragraph.
Economic and Shopping Data provide a few more hints as to where the data are coming from. First of all, they only have three purchases. That’s it. Out of the 3,100 card/check purchases I’ve made over the past 24 months, they have 3. I tried looking for two offline purchases in my Mint account which add up to $100, but this proved to be a very difficult exercise. Even after filtering offline purchases and sorting the data, there were too many possible combinations. For now, those two offline purchases remain a mystery. I was able to find a suspect for the online payment of $39: a $39 seat upgrade at United Airlines. I can’t be sure this is the one, since I happened to buy the $39 upgrade plus a plane ticket which does not show up in my AboutTheData. However, my suspicion arises from the fact that Mint had prepared a targeted ad for me by placing a flashy green dollar sign next to the purchase. This also could’ve been a coincidence.

Conclusions/Best Guesses
Given the fact that I spend A LOT of time on the internet and the high number of purchases I’ve made over the years (I should cut down on those), I am surprised that Acxiom does not have more data about me. Basically, they know I’m a single, male, hispanic, and that’s about it. I can’t possibly imagine what they could gather from the rest of my data that’s worth $$$ to advertisers. Additionally, it seems a lot of their data comes from publicly available government data sets (home and car ownership), and–at least in my case–not a lot of data comes from either my online habits or my shopping habits. I presume most of my important data is owned by Facebook and Google, and I’m pretty confident that they do not sell/share my data with Acxiom.

Last thought: AboutTheData lets you edit your data so that you can receive more accurate targeted advertising. I’m curious to know who uses Acxiom data to target me, so I would’ve loved to enter distinctive preferences that do not apply to me (yet) such as “pregnancy”, “colonoscopies”, “underwater basket weaving”, or “Cook Islands National Women’s Football League” to see where these ads pop up. Unfortunately, AboutTheData only lets you change the above-mentioned interests to ‘true’ or ‘false’. I guess they thought about the trolls.