Analyzing Twitter Data on the Cloud

What Can Twitter Data Tell Us About Hurricane Sandy?

Social media is an enormous source of information. Twitter alone sees more than 350 million Tweets per day, so there is a lot of information to be mined. The question is: “Where do you start?”

As many of you know, Hurricane Sandy wreaked havoc in Haiti and on the East Coast of the United States in the fall of 2012. We collected Tweets containing the word “hurricane” into Treasure Data and asked a few questions of the data.

Prerequisites

  • Basic knowledge of Treasure Data, including the toolbelt.
  • Basic knowledge of td-agent.

Step 1: Installing td-agent

You must first set up td-agent on your application servers. Please refer to the following articles. For Linux systems, we provide deb/rpm packages.

If you have...            Please refer to...
Debian / Ubuntu Systems   Installing td-agent for Debian and Ubuntu
Redhat / CentOS Systems   Installing td-agent for Redhat and CentOS
td-agent is fully open-sourced under the fluentd project. td-agent extends fluentd with custom plugins for Treasure Data.

Step 2: Upload Tweets

We assume that you have already installed the td toolbelt and td-agent.

We will be uploading Tweets to Treasure Data continuously using td-agent. Luckily, there are already a couple of input plugins for td-agent that take care of collecting Tweets for us.

For the purpose of this walkthrough, we are using Jeff Yuan’s twitterstream input plugin (there is also fluent-plugin-twitter by y-ken, who has contributed a number of useful plugins). Please put the file into the plugin directory /etc/td-agent/plugin/.

$ sudo /usr/lib/fluent/ruby/bin/fluent-gem install twitter
$ sudo curl -L 'http://bitly.com/TIDLvO' -o /etc/td-agent/plugin/in_twitterstream.rb

The twitterstream input plugin requires just five parameters:

  1. consumer_key
  2. consumer_secret
  3. access_token_key
  4. access_token_secret
  5. one of follow/track/locations

The first four parameters are generic authorization parameters for Twitter applications. The fifth parameter decides how you filter the Twitter stream.

Here is our td-agent configuration file.

<source>
  type twitterstream
  consumer_key YOUR_CONSUMER_KEY
  consumer_secret YOUR_CONSUMER_SECRET
  access_token_key YOUR_ACCESS_TOKEN_KEY
  access_token_secret YOUR_ACCESS_TOKEN_SECRET

  tag twitterstream.hurricane_sandy
  track hurricane
</source>

<match twitterstream.*>
  type tdlog
  endpoint api.treasuredata.com
  apikey YOUR_TD_API_KEY_HERE
  auto_create_table
  buffer_type file
  buffer_path /var/log/td-agent/buffer/td
  use_ssl true
</match>

Note that we are tracking the word “hurricane” here. Now just remember to…

  1. make sure you have the twitterstream input plugin (in_twitterstream.rb) in your plugin path.
  2. make sure td-agent has write access to the buffer path (buffer_path in the configuration above, here /var/log/td-agent/buffer/td).
  3. start td-agent and upload Tweets.
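The first two checks in the list above can be automated before starting td-agent. The following is a minimal Python sketch, not part of td-agent or the td toolbelt. The function name preflight is hypothetical, and the paths are simply the ones used in this article. Run it as the user td-agent runs as, since write permission depends on the user.

```python
import os


def preflight(plugin_path, buffer_path):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper for illustration only.
    """
    problems = []

    # Check 1: the input plugin file must exist in the plugin directory.
    if not os.path.isfile(plugin_path):
        problems.append("plugin file not found: " + plugin_path)

    # Check 2: the buffer path may not exist yet, so test write access on
    # the nearest existing ancestor directory instead.
    probe = os.path.abspath(buffer_path)
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:  # reached the filesystem root
            break
        probe = parent
    if not os.access(probe, os.W_OK):
        problems.append("no write access to: " + probe)

    return problems


if __name__ == "__main__":
    # Paths taken from this article; adjust them to your installation.
    for problem in preflight("/etc/td-agent/plugin/in_twitterstream.rb",
                             "/var/log/td-agent/buffer/td"):
        print(problem)
```

An empty result only means the file and directory checks passed; it does not validate the Twitter credentials or the Treasure Data API key.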

Step 3: Analyze & Visualize!

Once td-agent is up and running, you can start querying the data on Treasure Data almost immediately (allow for a delay of about five minutes due to td-agent’s buffering interval).

I would now like to show some queries I ran. Please note that the schema has already been set up to streamline the queries (please refer to the Schema article to learn more about schema management on Treasure Data).

Which Tweet was Retweeted the Most?

This is probably the first question you would ask. Which hurricane-related Tweet drew the most eyeballs (at least by one metric)? Here is the query to get the top 10.

$ td query -w -d twitterstream \
  "SELECT MAX(retweet_count) AS rt_ct, text
     FROM hurricane_sandy GROUP BY text
     ORDER BY rt_ct DESC LIMIT 10"
+-------+----------------------------------------------------------------------------------------------------------------------------------------------+
| rt_ct | text                                                                                                                                         |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------+
| 61028 | RT @justinbieber: everyone dealing with the hurricane up north be safe                                                                       |
| 24868 | RT @hurricannesandy: WHAT IF GANGAM STYLE WAS ACTUALLY JSST A GIANT RAIN DANCE AND WE BROUGHT THIS HURRICANE ON OURSELVES?                   |
| 21809 | RT @ElectionParody: FOR EVERY 100 RETWEETS, WE WILL BE DONATING $1,000 TO HELP REBUILT COMMUNITIES DAMAGED BY HURRICANE SANDY. PLEASE RE ... |
| 21808 | RT @HurricaneSandyw: FOR EVERY 100 RETWEETS, WE WILL BE DONATING $1,000 TO HELP REBUILT COMMUNITIES DAMAGED BY HURRICANE SANDY. PLEASE R ... |
| 19926 | RT @RepublicanTips: Everyone in the path of the hurricane should head to their second or third home to safety #Sandy #RomneyStormTips        |
| 19924 | RT @ALiberalPundit: Everyone in the path of the hurricane should head to their second or third home to safety #Sandy #RomneyStormTips        |
| 15389 | RT @SandysHurricane: R.I.P to the 65 victims who lost their lives because of Hurricane Sandy. RT for respect &lt;3                           |
| 12549 | RT @Seth_Fried: If your apartment is hit by a dolphin, DO NOT GO OUT TO SEE IF THE DOLPHIN IS OKAY. That's how the hurricane tricks you  ... |
| 11084 | RT @AHurricaneSandy: I WENT TO HIGHSCHOOL WIT IRENE. SHE CAN'T EVEN TWERK. SHE AIN'T BOUT DAT HURRICANE LYFE.                                |
| 10779 | RT @FillWerrell: Hurricane Sandy wouldn't be here if Patrick would've just stopped making fun of Texas.                                      |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------+

I suppose it is no surprise that Justin Bieber’s Tweet has been retweeted the most. What’s curious is how this Gangnam Style meme ranked second. The Internet often associates two seemingly disparate entities in creative ways. This is all very fun until you find your company or brand as one of the two entities: at that point, you want to identify the association quickly and act on it if it negatively positions your company or brand.
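To make the logic of the top-10 query above concrete, here is a rough Python equivalent that runs on a small in-memory list. It is only a sketch: the function name top_retweeted and the sample records are hypothetical, and it assumes each record carries text and retweet_count fields as in the query. Taking the maximum per text matters because the same Tweet text can appear in many rows, presumably each with the retweet count observed at collection time.

```python
def top_retweeted(tweets, limit=10):
    """Mimic: SELECT MAX(retweet_count), text ... GROUP BY text
    ORDER BY rt_ct DESC LIMIT n (illustrative helper, not a td API)."""
    best = {}
    for tweet in tweets:
        text = tweet["text"]
        count = tweet["retweet_count"]
        # Keep the largest retweet_count seen for each distinct text.
        if count > best.get(text, -1):
            best[text] = count
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(count, text) for text, count in ranked[:limit]]


if __name__ == "__main__":
    # Hypothetical sample rows, not real data from the article.
    sample = [
        {"text": "RT @a: stay safe", "retweet_count": 10},
        {"text": "RT @a: stay safe", "retweet_count": 25},
        {"text": "RT @b: storm update", "retweet_count": 7},
    ]
    for count, text in top_retweeted(sample):
        print(count, text)
```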

Here is another look at the Tweets. The following word cloud was generated from the 30 most popular bigrams (the query below retrieves the top 100). Font sizes are proportional to frequencies, with some thresholding to make the layout look nicer.

$ td query -w -d twitterstream \
  "SELECT NGRAMS(SENTENCES(LOWER(text)), 2, 100)
     FROM hurricane_sandy"

A few observations here:

  1. Predictably, the most popular bigram was “Hurricane Sandy”.
  2. Note that “http t.co” is up next. Of course, this comes from Twitter’s official URL shortener. What’s interesting is how other URL shorteners such as “http bit.ly” or “http goo.gl” are absent from the top 30.
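For readers who want to experiment with bigram counts outside Treasure Data, the following Python sketch approximates the idea behind the NGRAMS query and the word-cloud sizing. It makes several assumptions: the regular-expression tokenizer is a stand-in and does not reproduce the tokenization of the SENTENCES function, each Tweet is treated as one sequence of words, and the clamping limits in font_size are invented, since the article does not say which thresholds were used. The names top_bigrams and font_size are hypothetical.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters/digits, keeping inner dots so that
# "http://t.co/abc" yields the tokens "http", "t.co", "abc".
TOKEN = re.compile(r"[a-z0-9]+(?:\.[a-z0-9]+)*")


def top_bigrams(texts, k=30):
    """Return the k most frequent adjacent word pairs as (bigram, count)."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN.findall(text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), n) for pair, n in counts.most_common(k)]


def font_size(count, max_count, min_pt=10, max_pt=48):
    """Scale linearly with frequency, clamped to [min_pt, max_pt].

    The limits are illustrative assumptions, not the article's values.
    """
    size = round(max_pt * count / max_count)
    return max(min_pt, min(max_pt, size))


if __name__ == "__main__":
    # Hypothetical sample Tweets for illustration.
    sample = [
        "Hurricane Sandy is here http://t.co/abc",
        "hurricane sandy update http://t.co/xyz",
    ]
    bigrams = top_bigrams(sample, k=5)
    largest = bigrams[0][1]
    for bigram, n in bigrams:
        print(bigram, n, font_size(n, largest))
```

With this tokenizer, every shortened link contributes one “http t.co” pair, which is consistent with that bigram ranking so high.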

What Time Are People Tweeting?

Twitter has become an increasingly important marketing tool. One natural question to ask is, “when are people active on Twitter?”

$ td query -w -d twitterstream \
  "SELECT COUNT(*) AS ct, TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'PDT') AS time
     FROM hurricane_sandy
     GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'PDT')
     ORDER BY time"

The number of Tweets by Hour



At first glance, there is nothing too surprising here: a gradual increase toward early afternoon, peaking at 2PM. However, note that Twitter has a global presence and people tweet from many timezones. Yet this dataset suggests that one timezone dwarfs all other timezones. Can you guess which timezone this is? Pacific Time.

In hindsight, this is not too surprising. The Pacific timezone includes both Silicon Valley and Hollywood, two groups with a large number of Twitter power users.
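The hourly grouping used in the query above can also be reproduced locally. The Python sketch below buckets Unix timestamps by hour in Pacific Daylight Time. It assumes that the time column holds Unix timestamps in seconds and models PDT as a fixed UTC-7 offset, which is adequate for late October 2012 but does not handle daylight-saving transitions. The function name tweets_per_hour is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset stands in for Pacific Daylight Time.
PDT = timezone(timedelta(hours=-7), "PDT")


def tweets_per_hour(timestamps, tz=PDT):
    """Count events per local hour, returning sorted (hour, count) pairs."""
    counts = Counter()
    for ts in timestamps:
        local = datetime.fromtimestamp(ts, tz)
        # %H is the 24-hour clock, so 2 AM and 2 PM land in separate buckets.
        counts[local.strftime("%Y-%m-%d %H")] += 1
    return sorted(counts.items())


if __name__ == "__main__":
    # Hypothetical timestamps: three Tweets starting at 21:00 UTC on 2012-10-29.
    start = int(datetime(2012, 10, 29, 21, 0, tzinfo=timezone.utc).timestamp())
    for hour, ct in tweets_per_hour([start, start + 1800, start + 3600]):
        print(hour, ct)
```

Using a 24-hour pattern is the important detail: with a 12-hour pattern, morning and afternoon Tweets would be merged into the same bucket and the afternoon peak would be distorted.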

Conclusion

Social media provides valuable datasets, but the challenge is in collecting and analyzing the data quickly. With Treasure Data and its versatile td-agent, you can get up and running with your social media analysis in a few hours, not weeks or months.


Last modified: Aug 03 2015 00:01:48 UTC
