Do you enjoy the maps and research posted at the FloatingSheep blog? Interested in discovering how the Internet, geoweb and social media are changing the way we use and understand places? Would you like the opportunity to use the DOLLY project to explore geo-social media?
If so, the Department of Geography at the University of Kentucky is currently accepting applications for graduate study at the Masters and Ph.D. level in the exciting arenas of online mapping, big data and critical social analysis. We're particularly interested in folks who blend experience in the technical/coding side of things with a desire to think carefully through the big socio-spatial theoretical questions that arise in concert with these technologies. To get a better sense of what this program of study might entail, take a closer look at some of the recent academic publications that have emerged from FloatingSheep such as work on augmented reality, user-generated geographies of religion and disaster relief as well as the virtual economy and economic flows. You can also check a full list of my publications.
In addition, be sure to examine the work of other University of Kentucky Geography professors, namely Jeremy Crampton and Matthew Wilson, who are doing really exciting work in the related areas of critical cartography, online mapping and participatory GIS.
More information on the program and application process is available here. Students admitted to the graduate program receive full tuition waivers and stipends in exchange for working as teaching assistants. Fellowships and other funding are also possible. Applicants should submit their materials by January 15 to ensure a complete review.
If you are interested (or want more information) please email me (zook@uky.edu) directly.
November 29, 2012
November 28, 2012
Digital Data Trails of the UK Floods
What do data scraped from the Internet tell us about a range of social, economic, political, and even environmental processes and practices? As ever more people take to social media to share and communicate, we are seeing that the data shadows of any particular story or event become increasingly well defined.
The ongoing UK floods offer a useful example of some of the links between digital data trails and the phenomena they represent. In the graphics below, we mapped every geocoded tweet between Nov 20 and Nov 27, 2012 that mentioned the word "flood" (or variations like "flooded" or "flooding").
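The keyword match described above can be sketched with a regular expression. This is only an illustration of the idea, not the query that was actually run: the pattern, the function name `mentions_flood` and the sample tweets are all assumptions.

```python
import re

# Assumed pattern: "flood" plus a few common suffixes, matched as a whole word
# so that unrelated words such as "floodlights" are not counted.
FLOOD_RE = re.compile(r"\bflood(?:s|ed|ing)?\b", re.IGNORECASE)

def mentions_flood(text: str) -> bool:
    """Return True if the text contains "flood" or a simple variation of it."""
    return FLOOD_RE.search(text) is not None

# Made-up sample tweets, for illustration only.
samples = [
    "Our road is flooded again",
    "Flooding near the river tonight",
    "Floodlights on at the stadium",
]
matches = [t for t in samples if mentions_flood(t)]
```

A stricter or looser pattern would change which tweets are counted, so the choice of variations is itself part of the method.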
Unlike many maps of online phenomena (relevant XKCD), careful analysis and mapping of Twitter data does NOT simply mirror population densities. Instead, concentrations of Twitter activity (in this case tweets containing the keyword flood) seem to closely reflect the actual locations of floods and flood alerts, even when we simply look at the total counts. This pattern becomes even clearer when we normalise the map: the second map is a location quotient, where every value greater than 1 indicates that there are more tweets related to flooding than one would expect based on normal Twitter usage in that area. Once normalised, the data even more closely mirror the UK Environment Agency's flooding map.
As we demonstrated with our maps of Hurricane Sandy, it is important to approach these sorts of maps with caution. At least in the information-dense Western world, they are often able to reflect the broad contours of large phenomena. But, because we are still necessarily measuring subsets of subsets, our big data shadows start to become quite small and unrepresentative at more local levels. This is particularly an issue when the use of the relevant technology is unevenly distributed across demographic sectors, as was the case in post-Katrina New Orleans.
Nonetheless, with every new large event, movement, and phenomenon, we are undoubtedly going to see much more research into both the potentials and limitations of mapping and measuring digital data shadows. This is because physical phenomena like hurricanes and floods don't just leave physical trails, but create digital ones as well.
November 23, 2012
Sheepallenge Deadline in One Week!
For those of you taking part in our Sheepallenge competition, we have over 40 teams and people signed up for this challenge and are looking forward to the variety of submissions. A few quick reminders and updates:
1. Your final visualizations need to be submitted to Monica (monica.stephens@humboldt.edu) by midnight EST on November 30 (one week from today) to be forwarded on to the judges for consideration.
2. We ask that those of you using Sheepallenge as a class project limit the submissions from your students to the ones you think are award worthy.
3. Seriously, no bribing the judges with chocolate.
We've heard rumors of exciting visualizations utilizing this data (from sinful surfaces to gluttonous glory) and will post the best results in the coming weeks.
November 22, 2012
Do People Tweet of Mashed Turnips?, and other Thanksgiving Day Mysteries
While trying to avoid the hard work of stuffing the turkey or the pain of listening to relatives who want to rehash the election, we decided to take a look at Thanksgiving-related geocoded tweets across the United States. We're not doing a lot of interpretation of these, but hopefully the maps do a decent job speaking for themselves, though it is important to note that all maps show raw counts without any kind of normalization.
Since turkey tweets are everywhere, we thought it might be fun to take a closer look at some of the more off-beat or regionally-specific Thanksgiving traditions using some new tools being developed to extend the capabilities of the DOLLY project. Some rather off the cuff observations:
- Grits and okra have their strongest regional tweet clusters in the South, while hot dish clusters in the Upper Midwest.
- Very few people are tweeting about mashed turnips (who knew?), but those who are, are doing it in the areas around New York City.
- Oyster and chestnut stuffing have the strongest concentrations in the Northeast.
- Texas prefers pecan pie relative to apple or pumpkin pie.
- People are still tweeting about turducken.
- NPR listeners really are concentrated in the Northeast (as per the Mama Sternberg Cranberry Relish Twitter index).
Search for Grits
Search for Okra
Search for "hot dish" OR "hotdish"
Search for "mashed turnips"
Search for stuffing
Search for oyster* AND stuffing
Search for chestnut AND stuffing
Search for apple AND pie
Search for pecan AND pie
Search for pumpkin AND pie
Search for turduckling OR turducken
Search for "cranberry" AND "sternberg",
inspired by Mama Sternberg's Cranberry Relish Recipe
HO! HO! HO! Oh, wait... that's something different, isn't it?
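The search captions above combine quoted phrases, AND/OR and a trailing wildcard. A minimal sketch of how such matching could work follows. The exact semantics of the search tool are not described in this post, so the helper names and the reading of `*` as "any word ending" are assumptions.

```python
import re

def has_term(text: str, term: str) -> bool:
    """Case-insensitive word or phrase match; a trailing * allows any word ending."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def match_all(text: str, terms) -> bool:
    """AND: every term must be present."""
    return all(has_term(text, t) for t in terms)

def match_any(text: str, terms) -> bool:
    """OR: at least one term must be present."""
    return any(has_term(text, t) for t in terms)

# Made-up tweet, for illustration only.
tweet = "Grandma's oysters in the stuffing again this year"
oyster_stuffing = match_all(tweet, ["oyster*", "stuffing"])  # oyster* AND stuffing
hot_dish = match_any(tweet, ["hot dish", "hotdish"])         # "hot dish" OR "hotdish"
```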
November 12, 2012
Mapping the Eastern Kentucky Earthquake
Last week's post on racist tweets in the wake of the US presidential election received much more attention than we ever expected. A number of questions about and critiques of our method were raised, which we attempted to respond to in a special FAQ with the post (first time we had to do that). Nonetheless, we thought it might be useful to demonstrate the utility of our technique on a less controversial subject in order to demonstrate how we can leverage a relatively small number of geocoded tweets in order to understand particular offline phenomena, and maybe even assuage some concerns about such an approach.
The 4.3 magnitude earthquake that occurred on Saturday, November 10th around 12:08pm EST, about eight miles west of Whitesburg, Kentucky, provides just such an example. Given our own connections to Kentucky, and the significant number of our own friends and family who tweeted or updated their statuses about the earthquake, we were naturally interested in what we might be able to bring to such an analysis.
But before showing our own results, it is useful to note that the US Geological Survey also collects user-generated data on earthquakes through their "Did You Feel It?" reporting system in which individuals contribute their location and experience with the quake. The USGS then aggregates these reports into a crowdsourced map like the one below in order to visualize an approximation of how the earthquake was experienced in different locations.
Rather than use such a direct system of user-generated data collection, we fired up DOLLY in order to gather geocoded tweets referencing the earthquake in its immediate aftermath. We were able to collect 795 geotagged tweets referencing "earthquake" from 12:08pm -- where the first tweet we uncovered near Hyden in Leslie County, KY simply said "EARTHQUAKE HOLY SHAT" -- until around 4:05pm in an area comprising most of central and eastern Kentucky, southern Ohio, West Virginia, southwest Virginia, western North Carolina and east Tennessee (we limited our query based on a bounding box drawn around the epicenter of the quake).
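The bounding-box restriction mentioned above amounts to a simple coordinate test. The sketch below is an illustration only: the box corners are hypothetical rather than the ones used for the actual query, the `BBox` class is introduced here, and the city coordinates are approximate.

```python
from typing import NamedTuple

class BBox(NamedTuple):
    """Axis-aligned bounding box in decimal degrees (hypothetical helper)."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        """True if the point falls inside the box, edges included."""
        return self.west <= lon <= self.east and self.south <= lat <= self.north

# Hypothetical box roughly centred on eastern Kentucky; NOT the real query box.
box = BBox(west=-86.0, south=35.0, east=-79.0, north=39.5)

lexington_inside = box.contains(-84.50, 38.04)  # Lexington, KY (approximate)
chicago_inside = box.contains(-87.63, 41.88)    # Chicago, IL (approximate)
```

A rectangular box like this is also why the resulting study area need not follow state or county boundaries.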
This area includes several cities such as Louisville and Lexington in Kentucky and Knoxville, TN, as well as many more rural areas. As much of our earlier work has clearly shown, population centers typically possess a greater level of online activity simply by virtue of population size, so it was important to look beyond just the raw numbers of earthquake-related tweeting. Therefore, in order to normalize the data, we also collected a 1% sample of all geotagged tweets from the month of October within the same area. This totaled 30,699 tweets, which we used to normalize the tweets about the earthquake and construct a location quotient measurement in exactly the same way as with the racist tweet analysis [1]. We again aggregated from individual tweets to a larger areal unit, in this case, counties.
First and foremost, though we did not use an entirely contiguous area, it is easy to notice that our map roughly conforms with the map of crowdsourced reports from the USGS, generally confirming the relevance of a relatively small set of user-generated data to understanding such an event.
Second, by looking at the blue dots representing each individual tweet, we can see concentrations within the counties containing the largest cities in the specified search area. These include Knox Co., TN (Knoxville), Jefferson Co., KY (Louisville), Fayette Co., KY (Lexington), Madison Co., KY (Richmond), and Cabell Co., WV (Huntington). None of these localities is particularly close to the epicenter of the quake in eastern Kentucky; the concentrations are more likely a product of the higher population in these cities (increasing the likelihood that Twitter users would feel the quake and take to Twitter to report it), as well as their importance as regional centers with close social and economic connections to eastern Kentucky.
Third, and interestingly enough, there were only six counties where there were more earthquake tweets than there were tweets within the given 1% sample from October [2]. Leading this group of counties is Letcher County, where the earthquake epicenter was located. Letcher County also has a location quotient of nearly 100, indicating that the earthquake generated far more tweets there than one would expect from its usual share of Twitter activity. Each of the other counties, though possessing many fewer tweets in both the earthquake and reference datasets, is also located in close proximity to Letcher County and the epicenter of the earthquake. These are Bath Co., KY, Leslie Co., KY, Polk Co., TN, Johnson Co., TN and Rockingham Co., VA.
We can also look at patterns of tweets without aggregating to an administrative unit. In this case, we estimate the intensity of the earthquake tweet pattern (again normalized for what would be expected based on a random sample of tweets) in the region using Gaussian kernel smoothing. Interestingly, the 'epicenter' of earthquake tweets is only 6.7 miles away from the real epicenter of the earthquake (indicated by the red star). Not coincidentally, the center of intensity of our tweet map is located in the nearby town of Hazard, KY, which has a higher population density (resulting in more twitter users) than the more rural town of Whitesburg, the epicenter as measured by the USGS.
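One common way to build a normalized intensity surface of this kind is to divide a kernel estimate of the event tweets by a kernel estimate of the reference tweets. The post does not spell out the exact procedure, so the sketch below is an assumption rather than the method used for the map. The function names, the bandwidth and the planar coordinates are all made up for illustration.

```python
import math

def gaussian_intensity(x, y, points, bandwidth):
    """Sum of 2-D Gaussian kernels centred on each point, evaluated at (x, y)."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return norm * total

def relative_intensity(x, y, event_pts, reference_pts, bandwidth, eps=1e-9):
    """Event share of intensity divided by reference share of intensity at (x, y).

    Values above 1 mean more event tweets than the reference pattern predicts.
    eps only guards against division by zero far away from every reference point.
    """
    ev = gaussian_intensity(x, y, event_pts, bandwidth) / len(event_pts)
    ref = gaussian_intensity(x, y, reference_pts, bandwidth) / len(reference_pts)
    return ev / (ref + eps)

# Made-up planar coordinates (e.g. kilometres in a projected system).
event_pts = [(0, 0), (1, 0), (0, 1), (10, 10)]                 # event tweets
reference_pts = [(0, 0), (10, 10), (10, 11), (11, 10), (9, 10)]  # reference tweets

near_cluster = relative_intensity(0, 0, event_pts, reference_pts, bandwidth=2.0)
near_city = relative_intensity(10, 10, event_pts, reference_pts, bandwidth=2.0)
```

In this toy data the ratio is above 1 near the event cluster and below 1 near the "city" with many reference tweets. Ratios of this kind become unstable where reference tweets are sparse, so the bandwidth choice matters.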
Ultimately, these results are not necessarily surprising, as they indicate both the extremely localized nature of a phenomenon like reporting an earthquake as evidenced by the greater location quotient values nearer the epicenter, as well as the essentially networked nature of such phenomena mediated by the internet in the clustering of user-generated internet content in cities quite distant from the earthquake's origin.
From a methodological standpoint, it shows that the fairly simple technique of calculating location quotients, or even the more involved technique of Gaussian kernel smoothing, can provide powerful ways of uncovering the spatial dimensions of online reflections of essentially offline phenomena.
We hope that this example -- which uses about the same number of tweets (particularly relative to the number of administrative units) as our racist tweets map -- will help alleviate some of the methodological concerns raised in our previous post.
---------
[1] The equation used to calculate the location quotient is as follows:
# of tweets referencing "earthquake" per county / total # of tweets referencing "earthquake"
------------------------------------------------
# of reference tweets per county / total # of reference tweets
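The formula above can be sketched in a few lines of code. The county names and counts below are hypothetical, and skipping areas with no reference tweets is an assumption of this sketch, not something stated in the post.

```python
def location_quotients(event_counts, reference_counts):
    """Location quotient per area: (share of event tweets) / (share of reference tweets).

    Areas with no reference tweets are skipped, since their quotient is undefined.
    """
    total_events = sum(event_counts.values())
    total_reference = sum(reference_counts.values())
    if total_events == 0 or total_reference == 0:
        raise ValueError("both datasets must contain at least one tweet")
    lq = {}
    for area, ref in reference_counts.items():
        if ref == 0:
            continue
        event_share = event_counts.get(area, 0) / total_events
        reference_share = ref / total_reference
        lq[area] = event_share / reference_share
    return lq

# Hypothetical counts, for illustration only.
events = {"County A": 25, "County B": 50, "County C": 25}
reference = {"County A": 125, "County B": 500, "County C": 375}
lq = location_quotients(events, reference)
# County A holds 25% of event tweets but 12.5% of reference tweets, so its LQ is about 2.
# County B holds 50% of both, so its LQ is about 1.
```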
[2] We should note that this doesn't mean that there were more earthquake-related tweets in the given time period on Saturday than total tweets in the entire month of October. Rather, this simply represents an indicator of how many earthquake-related tweets there were relative to the expected amount of content in that place.
November 08, 2012
Mapping Racist Tweets in Response to President Obama's Re-election
Note: for questions about the methodology/approach of this post, see the FAQ (added 16:20 EST 11/9/2012).
Note: as of 11:00 EST 11/10/2012, we have disabled commenting on this post.
Note: at 10:00 am EST 11/12/2012 we posted an analysis using the same methodology as this post to locate the epicenter of an earthquake in Eastern Kentucky over the weekend.
During the day after the 2012 presidential election we took note of a spike in hate speech on Twitter referring to President Obama's re-election, as chronicled by Jezebel (thanks to Chris Van Dyke for bringing this to our attention). It is a useful reminder that technology reflects the society in which it is based, both the good and the bad. Information space is not divorced from everyday life and racism extends into the geoweb and helps shape its contours; and in turn, data from the geoweb can be used to reflect the geographies of racist practice back onto the places from which they emerged.
Using DOLLY we collected all the geocoded tweets from the last week (beginning November 1) with racist terms that also reference the election in order to understand how these everyday acts of explicit racism are spatially distributed. Given the nature of these search terms, we've buried the details at the bottom of this post in a footnote [1].
Given our interest in the geography of information, we wanted to see how this type of hate speech overlaid on physical space. To do this we aggregated the 395 hate tweets to the state level and then normalized them by comparing them to the total number of geocoded tweets coming out of that state in the same time period [2]. We used a location quotient-inspired measure (LQ) that indicates each state's share of election hate speech tweets relative to its share of all tweets [3]. A score of 1.0 indicates that a state's share of hate speech tweets is the same as its share of all tweets. Scores above 1.0 indicate that the state accounts for a larger share of hate speech tweets than of all tweets, suggesting that the state's "twitterspace" contains more racist post-election tweets than the norm.
So, are these tweets relatively evenly distributed? Or do some states have higher specializations in racist tweets? The answer is shown in the map below (also available here in an interactive version) in which the locations of individual tweets (indicated by red dots)[4] are overlaid on color coded states. Yellow shading indicates states that have a relatively lower amount of post-election hate tweets (compared to their overall tweeting patterns) and all states shaded in green have a higher amount. The darker the green color, the higher the location quotient measure for hate tweets.
Click here to access an interactive version of the map at GeoCommons
A couple of findings from this analysis:
- Mississippi and Alabama have the highest LQ measures with scores of 7.4 and 8.1, respectively.
- Other southern states (Georgia, Louisiana, Tennessee) surrounding these two core states also have very high LQ scores and form a fairly distinctive cluster in the southeast.
- The prevalence of post-election racist tweets is not strictly a southern phenomenon as North Dakota (3.5), Utah (3.5) and Missouri (3) have very high LQs. Other states such as West Virginia, Oregon and Minnesota don't score as high but have a relatively higher number of hate tweets than their overall twitter usage would suggest.
- The Northeast and West coast (with the exception of Oregon) have a relatively lower number of hate tweets.
- States shaded in grey had no geocoded hate tweets within our database. Many of these states (Montana, Idaho, Wyoming and South Dakota) have relatively low levels of Twitter use as well. Rhode Island has much higher numbers of geocoded tweets but had no hate tweets that we could identify.
But lest anyone elsewhere become too complacent, the unfortunate fact is that most states are not immune from this kind of activity. Racist behavior, particularly directed at African Americans in the U.S., is all too easy to find both offline and in information space.
--------------------- State Level Data ---------------------
The table below outlines the values for the location quotients for post-election hate tweets.
State | LQ of Racist Tweets | Notes |
Alabama | 8.1 | |
Mississippi | 7.4 | |
Georgia | 3.6 | |
North Dakota | 3.5 | |
Utah | 3.5 | |
Louisiana | 3.3 | |
Tennessee | 3.1 | |
Missouri | 3.0 | |
West Virginia | 2.8 | |
Minnesota | 2.7 | |
Kansas | 2.4 | |
Kentucky | 1.9 | |
Arkansas | 1.9 | |
Wisconsin | 1.9 | |
Colorado | 1.9 | |
New Mexico | 1.6 | |
Maryland | 1.6 | |
Illinois | 1.5 | |
North Carolina | 1.5 | |
Virginia | 1.5 | |
Oregon | 1.5 | |
District of Columbia | 1.5 | |
Ohio | 1.4 | |
South Carolina | 1.4 | |
Texas | 1.3 | |
Florida | 1.3 | |
Delaware | 1.3 | |
Nebraska | 1.1 | |
Washington | 1.0 | |
Maine | 0.9 | |
New Hampshire | 0.8 | |
Pennsylvania | 0.7 | |
Michigan | 0.6 | |
Massachusetts | 0.5 | |
New Jersey | 0.5 | |
California | 0.5 | |
Oklahoma | 0.5 | |
Connecticut | 0.5 | |
Nevada | 0.5 | |
Iowa | 0.4 | |
Indiana | 0.3 | |
New York | 0.3 | |
Arizona | 0.2 | |
Alaska | - | see note 1 |
Idaho | - | see note 1 |
South Dakota | - | see note 1 |
Wyoming | - | see note 1 |
Montana | - | see note 1 |
Hawaii | - | see note 1 |
Vermont | - | see note 1 |
Rhode Island | - | see note 2 |
Note 1: no racist tweets, SMALL number of total geocoded tweets
Note 2: no racist tweets, LARGE number of total geocoded tweets
-----------------
[1] Using the examples of tweets chronicled by Jezebel blog post we collected tweets that contained the text "monkey" or "nigger" AND also contain the text "Obama" OR "reelected" OR "won". A quick, and very unsettling, examination of the search results revealed that this indeed was a good match for our target of election-related hate speech. We end up with a total of 395 of some of the nastiest tweets you might possibly imagine. And given that we're talking about the Internet, that is really saying something.
[2] To be precise, we took a 0.05% sample of all geocoded tweets in November 2012 aggregated to the state level.
[3] The formula for this location quotient is
(# of Hate Tweets in State / # of Hate Tweets in USA)
------------------------------------------------------------
(# of ALL Tweets in State / # of ALL Tweets in USA)
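One consequence of this formula is worth spelling out, since footnote [2] notes that the reference counts come from a small sample. As a sketch, assume the sample draws the same fraction s of geocoded tweets in every state. Write H_i for the hate tweets in state i and A_i for all geocoded tweets in state i; these symbols are introduced here for illustration only. The sampling fraction then cancels:

```latex
\mathrm{LQ}_i
  = \frac{H_i / H_{\mathrm{USA}}}{(s\,A_i) / (s\,A_{\mathrm{USA}})}
  = \frac{H_i / H_{\mathrm{USA}}}{A_i / A_{\mathrm{USA}}}
```

In expectation, then, the quotient does not depend on the size of the sample. What the sample does add is noise, which is likely to be most noticeable in states with few reference tweets.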
[4] We should also note that the precision of the individual tweet locations is variable. Often the specific location shown in a map is the centroid of an area that is several tens or hundreds of meters across, so while the tweet came from near the point location shown, it did not necessarily come from that exact spot on the map.
FAQ: Mapping Racist Tweets in Response to President Obama's Re-election
Note: This FAQ was posted at 4:20 EST on 11/9/12
What about the sample size? 395 doesn’t seem like that many?
The 395 tweets mentioned are the number of geocoded tweets referencing the given keywords from November 1 until November 7 at approximately 4:00 pm EST. This is NOT a sample, but the total population of geocoded tweets that matched our search criteria as outlined in the post. Geocoded tweets make up a tiny fraction of overall Twitter activity (could be as large as 5% or as small as less than 1%), so the actual number of tweets referencing these keywords is likely much, much larger, though we are not sure as to this number.
That said, we don't know what the geographical distribution of non-geocoded tweets is. However, given that many geocoded tweets are the product of GPS-enabled smart phones, it is likely that geocoded tweets tend to come from wealthier locations. All things being equal, this means that the geocoded data likely underrepresents relatively poorer and more rural locations. Should this actually be the case, the location quotients for Mississippi and Alabama would actually be even higher than our initial study showed, but the exact nature of this phenomenon is unknown.
Note: People concerned about our methodology should also check out our post on 11/12/2012 using geocoded tweets to locate the epicenter of an earthquake in Kentucky (this paragraph added at 10:20 am EST 11/12/2012).
Why didn’t you map references to hateful comments towards Mitt Romney?
First, the motivation for this posting was the observations posted on the Jezebel blog linked in our original post, noting the uptick in racist tweets following President Obama’s re-election. Second, we focus on racist language directed at President Obama not only because racism directed at black Americans is historically more significant, but also because it highlights the persistence of explicitly racist attitudes in what some have (fallaciously) termed ‘post-racial America’. Third, we did check for both the number of tweets referencing Mitt Romney containing some racially charged terms and the number of derogatory comments about white people. Depending on the terminology used, the results show that there are 7-15 times as many hateful tweets directed towards President Obama as towards Mitt Romney.
Finally, if this is your first response to our map, and not “that’s really f---ed up!”, then we probably have more important issues to deal with than the minutiae of our methodology. Though we do not endorse hatred, discrimination or violence against anyone, we refuse to acknowledge the equivalence of the terms being used to describe President Obama and Mitt Romney.
Did you remove uses of the “N word” that were positive?
No. We didn’t filter the tweets used in this database; however, a quick look at the data reveals that most are derogatory in nature. By leaving the data as is, we are more easily able to compare the number of references to, say, the kinds of comments about Mitt Romney people are clamoring for us to map, without inserting ourselves into an undoubtedly subjective filtering process. Regardless, even if we were to filter tweets, it very well might not change the overall spatial distribution: a filtered tweet could just as easily be from California as from Alabama, leaving the map looking essentially the same as it currently does.
A further point is that the term ‘n----r’ is almost universally associated with negative, derogatory intent, as opposed to the more colloquialized (and appropriated by the black community) ‘n---a’, which a quick inspection of the data shows is used more positively. References to ‘n---a’ were not included in the study.
What about multiple tweets by the same individual?
As with our decision not to filter tweets based on their context, we did not filter out multiple tweets by the same individual. However, a quick look at the map indicates that tweeting activity is not concentrated at any individual point, meaning that, barring the remote possibility of a hyper-mobile tweeter fixated on racist slurs or a racist Twitter bot, this is not enough of an issue to undermine our findings.
Moreover, when we returned to the data and looked at users rather than tweets, very little changes in the location quotients, with Alabama’s being even higher. We thus see this as being a moot point.
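For readers who want to see what the tweets-versus-users check looks like, here is a minimal sketch. It assumes the standard location quotient (a state's share of matching tweets divided by its share of all tweets) and uses made-up records, not our data. The names `location_quotients`, `matching_tweets` and `all_tweets_by_state` are hypothetical.

```python
from collections import Counter

# Hypothetical (user_id, state) records for matching tweets; not real data.
matching_tweets = [
    ("u1", "AL"), ("u1", "AL"), ("u2", "AL"),
    ("u3", "CA"), ("u4", "CA"),
]
# Hypothetical counts of all geocoded tweets per state (the baseline).
all_tweets_by_state = {"AL": 100, "CA": 900}


def location_quotients(counts, baseline):
    """Return each state's share of `counts` divided by its share of `baseline`.

    A value above 1 means the state is over-represented relative to its
    overall tweeting activity.
    """
    total_count = sum(counts.values())
    total_base = sum(baseline.values())
    return {
        state: (counts.get(state, 0) / total_count) / (baseline[state] / total_base)
        for state in baseline
    }


# Count every tweet, then count each (user, state) pair only once.
by_tweet = Counter(state for _, state in matching_tweets)
by_user = Counter(state for _, state in set(matching_tweets))

lq_tweets = location_quotients(by_tweet, all_tweets_by_state)
lq_users = location_quotients(by_user, all_tweets_by_state)
print(lq_tweets)
print(lq_users)
```

One simplification to flag: the user-level version here still divides by a tweet-based baseline. A stricter comparison would use unique users per state in the denominator as well.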
Are you saying I’m racist because I didn’t vote for Obama? Are you saying that everyone in a state that had more racist tweets is racist?
No and no. Nor do we imply such a thing anywhere in our original posts or our reactions to comments. However, we believe that the concentration of racist tweets in the South is indicative of the persistence of racism in the South, which is correlated with, though not necessarily causally related to, statewide voting for Mitt Romney. Just because you live in Mississippi or Alabama does not make you a terrible person. If, however, you use the “N word” to degrade an individual or group of people, as the tweets we are talking about here do, it’s a different story altogether.
What else do you have to say for yourself?
This map and blog post have received more attention than we could have imagined, most of it positive and thought-provoking. Though racism undoubtedly remains a touchy subject, and one perhaps not best dealt with by fairly simple maps, we hoped to use this exercise to show the persistence of racism in the US, even with the country’s first black president being re-elected to a second term, and the need to address this head on, rather than counter such explicitly racist language and behavior with claims of ‘reverse racism’ as many of the critics of our map have done.
Of course, our map does not encompass the entirety of racism as it is experienced by black Americans, much less members of other groups who are systemically discriminated against, whether through explicit language directed at these individuals and groups or through structural forms of racism that continually limit the ability of people to live happy, healthy and comfortable lives. As geographers, we like to think of ourselves as especially attuned to such issues. However, as this blog is dedicated to studying the world through the lens of the geoweb, we limit ourselves in this forum to analyses like those presented in the original post.
November 05, 2012
Can Twitter Predict the US Presidential Election?
Can Twitter predict the outcome of tomorrow's US presidential election? If the results of our preliminary analysis are anything to go by, then Barack Obama will be easily re-elected. The data presented below, including all geocoded tweets referencing Obama or Romney between October 1st and November 1st, out of a sample of about 30 million, give some insight into the visibility of each of the candidates on Twitter.
We see that if the election were decided purely based on Twitter mentions, then Obama would be re-elected quite handily. In fact, the only states in the electoral college that Romney would win are Maine, Massachusetts, New Mexico, Oregon, Pennsylvania, Utah, and Vermont. Romney also wins in the District of Columbia, and we unfortunately didn't collect data on Alaska or Hawaii. Some of the results seem to be interesting reflections of social and political characteristics of particular places. It makes sense that Romney has captured more of the public imagination in Utah, likely due to the state's considerable conservatism and large Mormon population, and Massachusetts, the state that he governed not all that long ago.
However, this drubbing that Romney receives in the Twitter electoral college belies the close nature of the final popular (Twitter) vote, re-raising the issue of whether the electoral college is the most suitable means of deciding the country's political future. There are a total of 132,771 tweets mentioning Obama and 120,637 mentioning Romney, giving Obama only 52.4% of the total and Romney 47.6%, a breakdown that is remarkably similar to current opinion polls, though not reflected when looking at the state-level aggregations in absolute terms. If you want to explore the data in more detail, please play around with the interactive map below:
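The popular (Twitter) vote shares quoted above follow directly from the two totals. A quick sketch of the arithmetic, using only the counts given in this post:

```python
# Tweet totals reported in the post.
obama = 132_771
romney = 120_637

total = obama + romney
obama_share = 100 * obama / total
romney_share = 100 * romney / total

print(f"Obama {obama_share:.1f}%, Romney {romney_share:.1f}%")
# prints "Obama 52.4%, Romney 47.6%"
```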
We can also visualize the data using a sliding scale, so as to see how close the margin of victory is for each candidate in a given state.
Romney's largest margins of victory are in Pennsylvania and Massachusetts, while Obama's largest victories are in California and, strangely, Texas. The cases of Massachusetts and Texas, not to mention large portions of the South and the Plains states, likely point to the fact that many references on Twitter tend to be negative.
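One plausible way to put states on a sliding scale like this is a signed margin: the difference in mentions divided by the total. The sketch below uses invented state names and counts, and the `margin` function is our illustration rather than the exact calculation behind the map.

```python
# Hypothetical per-state mention counts as (obama, romney); not the post's data.
state_mentions = {
    "State A": (1200, 800),
    "State B": (450, 550),
    "State C": (300, 300),
}


def margin(obama: int, romney: int) -> float:
    """Signed margin in [-1, 1]: positive favours Obama, negative Romney."""
    total = obama + romney
    return 0.0 if total == 0 else (obama - romney) / total


margins = {state: margin(o, r) for state, (o, r) in state_mentions.items()}

# List states from the most Romney-leaning to the most Obama-leaning.
for state, m in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{state}: {m:+.2f}")
```

A margin near zero marks a state where the two candidates are mentioned about equally, whichever of them nominally "wins" it.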
It is also worth noting that we compared Twitter mentions of both Vice-Presidential candidates: Biden and Ryan. Ryan, interestingly, wins the head-to-head competition in every single state. This makes for a rather boring map, so we decided to instead compare references to Ryan and Romney in the map below (Romney shaded in grey for his ebullient personality, and Ryan in pink as a result of his staunch support for gay rights).
As might be expected, there are more references to Romney in most states (Kansas, Michigan, North Dakota, Rhode Island, South Dakota, and Vermont being the exceptions here). However, when looking at total references, we again don't see a large gap between the two men. Ryan has 94,707 tweets compared to Romney's 120,637.
What do these data really tell us? Ultimately, I doubt that they will accurately predict the election, as Obama's seeming victory in Texas or Romney's in Massachusetts will almost certainly not come to pass. But they do certainly reveal that many internet users in California, Texas, and much of the rest of the country for that matter, tend to talk more about Obama than Romney. And, of course, in order to truly equate tweets with votes, we would need to employ sentiment analysis or manually read a large number of the election-related tweets in order to figure out whether we are seeing messages of support or more critical posts, as has been done in a couple of interesting projects by Twitter available here and here and another project by Esri available here.
Maybe the most revealing aspect of these data is that the 'popular vote' is split between the two candidates. While the social and political data shadows that we are picking up may not accurately tell us much about the electoral college results, when aggregated across the country they may be a rough indicator of tomorrow's outcome, pointing to the more-or-less equal and evenly divided nature of the American two-party political system. While this work may seem like a contemporary attempt at soothsaying, something we tend to shy away from, the data more appropriately serve as a useful benchmark in order to allow us to analyze what social media data shadows might actually reflect, as no matter the level of participation, they remain distorted mirrors on the offline material world.
November 01, 2012
The seven deadly sins: Sheepallenge 2012
Over the past couple of months/weeks we've been having a lot of fun with the Twitter data we've been pulling in through our DOLLY project. We've looked at beer vs. church, binders full of women, and even Big Bird. But why should we have all the fun? Wouldn't you like to be a sheeple too?
So in that vein and despite the frankenstorm on the East Coast which has reduced Taylor to nibbling on dry Ramen as he makes maps, we're pushing forward with our November Sheepallenge. Building upon the idea of IronSheep 2012 (in which teams were given the same datasets and tasked with making "tasty maps") we have provided Sheepallenge participants with a set of Twitter-derived data as part of a fantastical, allegorical, mapitorital competition taking place this month. This is going to be so wicked cool!
After some brain-storming we decided to go with the theme of the Seven Deadly Sins (Envy, Gluttony, Greed, Lust, Pride, Sloth, Wrath) inspired in part by the cool mapping exercise by Mitchel Stimers and others at Kansas State University (here). After all, the Twitter data from which we were pulling reflects the commentary of daily life. What better source for uncovering the sins that lurk within the hearts and microblogging activities of Internet users? So we sat down and came up with a range of terms that we thought did a decent job of representing a sin (e.g., the term Big Mac for Gluttony or honor student for Pride) and compiled them into a "sindex" for each of the seven sins. The sindex can be used as an aggregate measure or divided into its component parts (see metadata below).
(btw, Stimers et al maps are NOT based on tweets but indicators such as crime, income, etc.)
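To make the aggregate-versus-components idea concrete, here is a minimal sketch of how a sindex could be rolled up from keyword counts. It assumes the aggregate is a plain sum of term counts, which is our simplification for illustration and not a description of how the data package is built. Only "Big Mac" and "honor student" come from this post; the other terms and all counts are invented.

```python
# Hypothetical keyword counts for one location, grouped by sin.
term_counts = {
    "Gluttony": {"Big Mac": 40, "buffet": 10},
    "Pride": {"honor student": 5, "so proud": 15},
}


def sindex(counts_by_sin):
    """Aggregate each sin's component term counts into a single score.

    Assumes a simple unweighted sum; a real index might weight terms
    or correct for sampling rates.
    """
    return {sin: sum(terms.values()) for sin, terms in counts_by_sin.items()}


aggregate = sindex(term_counts)
print(aggregate)                         # one score per sin
print(term_counts["Gluttony"]["Big Mac"])  # a single component
```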
The challenge to you is to make your own map(s) of the 7-deadly sins with our data.
Those of you who registered as research participants should have received an email with a link to download the data. If you are just reading this now and are thinking "Man, I should have signed up," email Monica (monica.stephens@humboldt.edu) and ask to be hooked up. We can probably accommodate more participants, but no guarantees. Currently we have 33 visualization groups/classes/people signed up from around the world, so we can't wait to see what we end up with.
Whoever creates the most interesting, fun, informative and aesthetically pleasing visualization or data-driven artwork, will receive a prize and will have their visualization posted about here on FloatingSheep.org.
Rules for the Sheepallenge 2012
The rules are as follows:
- You cannot post the raw data on the internet or redistribute it to others. Please contact us before using the data for any other research purpose. Commercial use is prohibited.
- Maps may be created in a range of formats, from static maps (e.g., choropleths, cartograms or cartoons) to animations or interactive maps. Maps can be submitted as an attached .jpg or .pdf or can be a link to an interactive or animated map.
- Your visualization needs to use at least one of the data files included in the 7-sins data package. Adding additional data from other sources (e.g. census, crime stats) is definitely allowed. You can choose one sin, one aspect of a sin, or all the sins.
- For your visualization to be considered by the judges, you must email it to monica.stephens@humboldt.edu by November 30, 2012. Monica will forward it on to the judges for consideration.
- Include a jpg/pdf of the actual map (or series of maps) or a link to an interactive or animated map.
- Include a Word document that has (a) your name (or group), (b) your contact info (c) the specific seven sin dataset(s) you used and (d) a title/name for the map. Although it is not required, feel free to include a short description/abstract of what you did, especially if you think it is cool/important.
- Multiple entries are allowed but submit each map separately as outlined above.
- Judges should not be bribed with chocolate. (It is OK to bribe your local cartographer/GIS expert for help as long as you credit them in your work).
- The judges will select winners across a range of categories and map types.
- Winners will receive bragging rights, a carefully constructed electronic certificate (suitable for framing) and perhaps some FloatingSheep paraphernalia we have kicking around.
- By submitting a map you give the FloatingSheep.org blog permission to post it under the creative commons attribution-noncommercial-sharealike license we use for all our stuff.
- Goto rule 1.
May the best map win!
Metadata for Sheepallenge 2012
Much more extensive metadata is available with the data but the basics are:
- The database is about 70 MB in size
- Data covers all geotagged tweets made within the United States between June 26 and October 30
- The keywords used (and associated sampling rates when appropriate) are available online
- We have also included a range of other terms that may (or may not) fit in well with a particular sin (e.g., does Justin Bieber represent Lust? Or Pride? Or Envy?). Some of them (such as a random selection of tweets) will be useful for standardizing purposes.
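As a hint for the standardizing mentioned in the last item, here is a minimal sketch of one way to do it: divide the sin-related tweet count for a place by the count of randomly selected tweets from the same place. The place names, counts and the `rate_per_1000` helper are all hypothetical.

```python
# Hypothetical counts per place: sin-related tweets and randomly sampled tweets.
places = {
    "Place X": {"sin": 30, "random": 600},
    "Place Y": {"sin": 30, "random": 3000},
}


def rate_per_1000(sin_count: int, random_count: int):
    """Sin tweets per 1,000 random tweets, or None if there is no baseline."""
    if random_count == 0:
        return None
    return 1000 * sin_count / random_count


rates = {
    name: rate_per_1000(c["sin"], c["random"]) for name, c in places.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per 1,000 random tweets")
```

Both places have the same raw count, but the standardized rates differ by a factor of five, which is the point of having the random selection in the package.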