May 13, 2013

FAQ: The Geography of Hate

Dear Readers,

Thanks to everyone (well, almost everyone) for their comments and constructive critiques on our Geography of Hate map. In light of all of the different directions these comments have come from, we wanted to respond to some of the more common questions and misunderstandings all at once. Before commenting or emailing about the map, please keep the following in mind...

1. First, read our original post. Second, read through this FAQ. Third, read the "Details about this map" section included in the interactive map itself. We specifically spent time on these things in order to explain our approach, and they go into some detail about the methods we used. Nearly all of the critiques of our map are already addressed in one of these venues. We're happy to engage and confident in our methodology (not that any approach is perfect), but please, use the skills your first teacher gave you and take the time to read.

2. If you are offended by these words, and we sincerely hope that you are, remember that they are the object of a research project. As such, we felt compelled to reproduce the words in full in order to be as clear as possible about our project. While we agree that the use of these slurs can be hurtful to some, especially the groups that they are targeted at, we believe that there is a difference between including them as the object of our study and using them as they are 'meant' to be used.

3. The map is based solely on geocoded data from Twitter, and does not reflect our personal attitudes about a given place. The map represents real tweets sent by real people, and is evidence that the feeling of anonymity provided by Twitter can manifest itself in an ugly way. If you feel that the place you live is more or less racist than somewhere else and this isn't reflected in the map, please start a conversation with your community about these issues.

4. In order to produce this map, we took the number of geotagged hateful tweets, aggregated them to the county level and then normalized this count by the overall number of tweets in that county. This means that the spatial distributions you see for the different variables are decidedly NOT showing population density. As we mentioned above, this is clearly stated in all of the previously written material accompanying the map. And because we are specifically looking at the geographic patterns of Twitter activity, it makes more sense to normalize by overall levels of Twitter activity than by population.

Were that not enough, however, the fact that there is so little activity on the map in California - home to an eighth of the entire US population, including the cities of Los Angeles, San Francisco and San Diego - should be a clue that something else besides population is at work in explaining these distributions. While we share with the infamous xkcd cartoon a distaste for non-normalized data, just because you thought for a second that maybe it was relevant in this case doesn't make it so. There are many possible explanations for some of the distributions that you can see, and we don't pretend to have all of the explanations. But population just isn't one.  

5. This map includes ALL geotagged tweets for each of these words that were determined as negative. This is not a sample of tweets containing these words, but rather the entire population that meets our criteria. That being said, only around 1.5% of all tweets are geotagged, as it requires opting in to Twitter's location services. Sure enough, that subset might be biased in a multitude of ways when compared with the entire body of tweets or even with the general population. But that does not mean that the spatial patterns we discover based on geotagged tweets should automatically be discarded - see for example some of our earlier posts on earthquakes and flooding.


6. 150,000 is in no way a "small" number. Yes, it is less than the total population of earth. Yes, it is less than the number of atoms in the universe. But no, it is not a small number, especially as it is the total population of the phenomenon rather than a sample (see #5). And if these 150,000 geotagged hateful tweets are only around 1.5% of the total number of hateful tweets, then extrapolating suggests that the actual number of tweets (both geotagged and not) containing such hateful words is quite a bit larger. Regardless, we think that 150,000 is a sufficiently large number to be quite depressed about the state of bigotry in our country.


7. Furthermore, given that each and every geotagged tweet including the words listed was read and manually coded by actual human beings (if you consider undergraduates to be human beings!), rather than automatically by a piece of software, 150,000 isn't an especially small number. For students to read just these 150,000 tweets, it took approximately 150 hours of labor. This isn't insignificant.

8. The original lists of words included were derived from http://en.wikipedia.org/wiki/List_of_ethnic_slurs and http://en.wikipedia.org/wiki/List_of_LGBT_slang and included the following words:

bitch
nigger
fag*
homo*
queer
dyke
Darky OR darkey OR darkie
gook*
gringo
honky OR honkey OR honkie
injun OR indian
monkey
towel head
Wigger OR Whigger OR Wigga
wet back OR wetback
cripple
cracker
honkey
fairy
fudge packer
tranny

A * indicates a list of lexeme variations was used, which accounts for alternate spellings of words. For example, "fag" was not just "fag," but also "fags", "faggot", "faggie", and "fagging", among other things. All geotagged tweets containing these terms were examined. All tweets that were not used in a derogatory manner were discarded during coding, and as a result some words no longer achieved a minimum number to be displayed on the map. For example, honky/honkey/honkie was discarded, as most of the tweets were positive references towards honky-tonk music and not slurs aimed at white people.  

In the end we were also constrained by which words could feasibly be manually coded and which could not. For instance, the 5.5 million tweets with reference to "bitch" were excluded from the list. Students were paid roughly $10 per 1000 coded tweets, and therefore including the word "bitch" alone would have cost roughly $55,000 to manually check for sentiment. Tranny/tranney would have been under $200. While we're obviously interested in including a wider range of hateful terms in our analysis, our research funds, and thus the scope of this project, are extremely limited. It's not like we have billions of dollars in funding lying around. If you feel strongly, feel free to donate to http://humboldt.edu/giving and enter "The Geography of Hate Project" in your comments.

9. If you are a disgruntled white male who feels that the persistence of hatred towards minority groups is a license to complain about how discrimination against you is being ignored, just stop. You can refer to all of our previous commentary on this issue from November. Though we have typically refrained from deleting asinine comments to this effect - those who choose to make these comments do more to prove themselves to be fools than we ever could - we fully reserve the right to delete any and all comments we believe to be unnecessary.
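For readers who want to check the arithmetic behind items 6 through 8, here is a minimal sketch in Python. It uses only figures quoted above; the variable and function names are ours, and the extrapolation assumes that hateful tweets are geotagged at the same roughly 1.5% rate as tweets in general, which is an assumption rather than something we measured.

```python
# Figures quoted in the FAQ above; all names here are illustrative.
GEOTAGGED_SHARE = 0.015          # roughly 1.5% of all tweets are geotagged
HATEFUL_GEOTAGGED = 150_000      # geotagged tweets coded as hateful
CODING_HOURS = 150               # approximate labor quoted in item 7
COST_PER_1000_TWEETS = 10        # dollars paid per 1000 coded tweets


def extrapolate_total(geotagged_count, geotagged_share):
    """Estimate the total count, ASSUMING the same geotagging rate applies."""
    return geotagged_count / geotagged_share


def coding_cost(tweet_count, cost_per_1000=COST_PER_1000_TWEETS):
    """Dollar cost of manually coding tweet_count tweets."""
    return tweet_count / 1000 * cost_per_1000


estimated_all_hateful = extrapolate_total(HATEFUL_GEOTAGGED, GEOTAGGED_SHARE)
tweets_per_hour = HATEFUL_GEOTAGGED / CODING_HOURS
bitch_cost = coding_cost(5_500_000)

print(f"Estimated hateful tweets overall: about {estimated_all_hateful:,.0f}")
print(f"Coding pace: about {tweets_per_hour:,.0f} tweets per hour")
print(f"Cost to code 5.5 million tweets: about ${bitch_cost:,.0f}")
```

Under that assumption the total comes out at roughly ten million hateful tweets, and the coding cost for "bitch" matches the $55,000 figure quoted in item 8.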

May 10, 2013

The Geography of Hate

UPDATE (5/13/13 @ 10:45pm): We have written and published a FAQ to respond to some of the questions and concerns raised in the comments here and elsewhere. Please review our comments there before commenting or emailing.

Following the 2012 US Presidential election, we created a map of tweets that referred to President Obama using a variety of racist slurs. In the wake of that map, we received a number of criticisms - some constructive, others not - about how we were measuring what we determined to be racist sentiments. In that work, we showed that the states with the highest relative amount of racist content referencing President Obama - Mississippi and Alabama - were notable not only for being starkly anti-Obama in their voting patterns, but also for their problematic histories of racism. That is, even a fairly crude and cursory analysis can show how contemporary expressions of racism on social media can be tied to any number of contextual factors which explain their persistence.

The prominence of debates around online bullying and the censorship of hate speech prompted us to examine how social media has become an important conduit for hate speech, and how particular terminology used to degrade a given minority group is expressed geographically. As we’ve documented in a variety of cases, the virtual spaces of social media are intensely tied to particular socio-spatial contexts in the offline world, and as this work shows, the geography of online hate speech is no different.

Rather than focusing just on hate directed towards a single individual at a single point in time, we wanted to analyze a broader swath of discriminatory speech in social media, including the usage of racist, homophobic and ableist slurs.

Using DOLLY to search for all geotagged tweets in North America between June 2012 and April 2013, we discovered 41,306 tweets containing the word ‘nigger’ and 95,123 referencing ‘homo’, among other terms. In order to address one of the earlier criticisms of our map of racism directed at Obama, students at Humboldt State manually read and coded the sentiment of each tweet to determine if the given word was used in a positive, negative or neutral manner. This allowed us to avoid using any algorithmic sentiment analysis or natural language processing, as many algorithms would have simply classified a tweet as ‘negative’ when the word was used in a neutral or positive way. For example, the word ‘dyke’, while often negative when referring to an individual person, was also used in positive ways (e.g. “dykes on bikes #SFPride”). The students were able to discern which were negative, neutral, or positive. Only those tweets used in an explicitly negative way are included in the map.

Tweets negatively referring to "Dyke"
Altogether, the students determined over 150,000 geotagged tweets with a hateful slur to be negative. Hateful tweets were aggregated to the county level and then normalized by the total number of tweets in each county. The map therefore shows places with disproportionately high amounts of a particular hate word relative to all tweeting activity. For example, Orange County, California has the highest absolute number of tweets mentioning many of the slurs, but because of its significant overall Twitter activity, such hateful tweets make up a smaller share and therefore do not appear as prominently on our map. So when viewing the map at a broad scale, it’s best not to be covered with the blue smog of hate, as even the lower end of the scale includes the presence of hateful tweeting activity.
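The normalization described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to build the map: the county names and tweet records are made up, and the function name is ours.

```python
from collections import Counter

# Hypothetical records: (county, was the tweet coded as hateful?).
# "big_county" has more hateful tweets in absolute terms, but far more
# tweets overall; "small_county" has fewer of both.
tweets = (
    [("big_county", True)] * 2
    + [("big_county", False)] * 6
    + [("small_county", True)] * 1
    + [("small_county", False)] * 1
)


def hate_rate_by_county(records):
    """Aggregate tweets per county, then divide hateful by total tweets."""
    total = Counter()
    hateful = Counter()
    for county, is_hateful in records:
        total[county] += 1
        if is_hateful:
            hateful[county] += 1
    return {county: hateful[county] / total[county] for county in total}


rates = hate_rate_by_county(tweets)
for county, rate in sorted(rates.items()):
    print(f"{county}: {rate:.0%} of tweets coded as hateful")
```

In this toy data the larger county has twice as many hateful tweets but half the rate, which is the same effect that keeps a high-volume county from standing out on the map.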

Even when normalized, many of the slurs included in our analysis display little meaningful spatial distribution. For example, tweets referencing ‘nigger’ are not concentrated in any single place or region in the United States; instead, quite depressingly, there are a number of pockets of concentration that demonstrate heavy usage of the word. In addition to looking at the density of hateful words, we also examined how many unique users were tweeting these words. For example, in the Quad Cities (East Iowa), 31 unique Twitter users tweeted the word “nigger” in a hateful way 41 times. There are two likely reasons for the higher proportion of such slurs in rural areas: demographic differences and differing social practices with regard to the use of Twitter. We will be testing the clusters of hate speech against the demographic composition of an area in a later phase of this project.

Hotspots for "wetback" Tweets
Perhaps the most interesting concentration comes for references to ‘wetback’, a slur meant to degrade Latino immigrants to the US by tying them to ‘illegal’ immigration. Ultimately, this term is used most in different areas of Texas, showing the state’s centrality to debates about immigration in the US. But the areas with significant concentrations aren’t necessarily that close to the border, nor do other border states that feature prominently in debates about immigration contain significant concentrations.

Ultimately, some of the slurs included in our analysis might not have particularly revealing spatial distributions. But, unfortunately, they show the significant persistence of hatred in the United States and the ways that the open platforms of social media have been adopted and appropriated to allow for these ideas to be propagated.

Funding for this map was provided by the University Research and Creative Activities Fellowship at HSU. Geography students Amelia Egle, Miles Ross and Matthew Eiben at Humboldt State University coded tweets and created this map.

The full interactive map is available here: http://users.humboldt.edu/mstephens/hate/hate_map.html

May 06, 2013

Tweeting the AAGs

Now that we've all had a couple of weeks after the AAGs to relax and make fun of certain unnamed party-animals, we thought we would reflect on how the conference itself was reflected in the Twittersphere. With comments abounding that there was more conference-related Twitter activity than ever before, we wanted to see if we couldn't uncover some more specific trends.

Thanks to an enterprising geographer, we have an archive of all 3,154 tweets with the official conference hashtag, #AAG2013. We know from this database that those tweets came from a total of 697 users, of which the top 10 users contributed about 23% of the total number of tweets.

But by cross-referencing the Eventifier database with DOLLY's archive of geotagged tweets with the conference hashtag, we can try to understand how and where some geographers tweet and whether geographers fit the overall profile of Twitter users in terms of geotagging. Do geographers geotag their tweets at a higher rate than the average user because of their heightened awareness of spatial issues? Or do they intentionally avoid geotagging their tweets due to sensitivity to location privacy?

According to DOLLY, there were just 137 geotagged tweets with #AAG2013, coming from just 41 users. So, rather than adhering to the oft-cited rule of ~1.5% of all tweets being geotagged, geographers in Los Angeles for the AAGs actually geotagged more than 4% of their conference-related tweets. Of the 137, 127 actually have exact lat/lon coordinates, so we're able to do some mapping at the urban scale in order to see where geographers were tweeting about the conference.
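The geotagging rate quoted above is easy to reproduce. The following Python sketch uses only the counts given in this post; the function and variable names are ours.

```python
TYPICAL_GEOTAG_RATE = 0.015   # the oft-cited ~1.5% of all tweets


def geotag_rate(geotagged, total):
    """Share of tweets that carry a geotag."""
    return geotagged / total


# Counts quoted in this post for the #AAG2013 hashtag.
aag_rate = geotag_rate(geotagged=137, total=3154)
exact_share = geotag_rate(geotagged=127, total=137)  # tweets with exact lat/lon

print(f"#AAG2013 geotag rate: {aag_rate:.1%}")
print(f"Ratio to the typical rate: {aag_rate / TYPICAL_GEOTAG_RATE:.1f}x")
print(f"Geotagged tweets with exact coordinates: {exact_share:.0%}")
```

With such a small number of geotagged tweets, this is a descriptive comparison rather than evidence of a general pattern among geographers.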

And because only 8 tweets came before the AAG started on April 9, and only 5 came after it ended on April 13, and these are roughly congruent with the 16 tweets outside of Los Angeles County, we'll focus on the 113 of 127 tweets with exact coordinates which were located in downtown LA. In other words, because most of the AAG-related tweeting happened during the conference and in its general proximity, it isn't too interesting to focus on the other locations from which the hashtag was being used.

AAG-related Tweeting Activity in Downtown Los Angeles
As is evident from this map, the vast majority of the tweets referencing #AAG2013 came from the Westin Bonaventure Hotel, the primary site of the conference. The second highest concentration of tweeting activity came from the Millennium Biltmore Hotel and LARTA, the secondary conference site and location of our IronSheep event, respectively, which were just half-a-block or so apart, and immediately adjacent to Pershing Square. But given the lack of free conference Wi-Fi and general lack of cell phone service in the Biltmore, it's even less surprising that it had quite a bit less geotagged tweeting activity. Other small pockets of tweeting activity around the downtown seem to be located in the general vicinity of bars that were known to be frequented by geographers, such as the Library Bar, which hosted multiple conference-related parties over the course of the week.

As is the case with many of our maps, there's nothing too surprising here. Of course it makes sense that people tweet about the conference from the location of the conference. But we'd still be careful about reading too much into these results. More specifically, we shouldn't get the impression that geographers go to the AAGs primarily to sit in stuffy hotel rooms giving paper presentations rather than gallivant around town with old friends. Instead, it seems more plausible that geographers are simply having too great a time at various drinking establishments to tweet about it, or are too smart to use the official conference hashtag when doing so!

May 02, 2013

DOLLY's Birthday!

We recently added a page outlining in more detail the DOLLY (Data On Local Life and You) project at the University of Kentucky to provide an overview to this ongoing and exciting project to make the massive datasets associated with geosocial media data (such as Twitter) accessible and explorable.  Yesterday we archived the 3 billionth tweet and it seemed worth recognizing DOLLY (along with all her algorithmic stream and process workers, since it was May Day) by declaring it to be her official one year birthday.  And since few of us can carry a tune (even with handles) we thought we'd let Satchmo serenade DOLLY.



We've posted some of our work based on DOLLY here, including an analysis of tweets after the Boston bombing, Premier League fandom in the UK, flooding in the UK, Thanksgiving tweets, earthquakes in Kentucky and racist tweets after the 2012 election.

Now that the Spring semester is winding down we will be stepping up our work and posts here.  We have a couple of really great posts that will be appearing over the next week or so.

We see DOLLY both as a key tool for our own work and as a means to break down the technological barrier that is often present for researchers who would like to study big data but do not necessarily possess the required technical skills.  So stay tuned.