DGT Walkthrough #2: Hashtags in Twitter

In this walkthrough, we will be working with public stream data from Twitter. First, we are going to ask the question, “which hashtags are used together?” Then, we will visualize the resulting hashtag co-occurrence graph, and also suggest other explorations, such as how hashtag relationships differ based on the gender of the author.

Through this example, we will introduce the basic concepts and commands of a DGT script. We’ll show how to load data, extract fields and derived features from social media, and project and aggregate the results.

Getting the Discussion Graph Tool

Step 1. Download the Discussion Graph Tool (DGT) from http://research.microsoft.com/dgt/

If you haven’t already, download and install the Discussion Graph Tool (see the detailed installation instructions). The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.

To double-check the installation, open a new command-line window and type the command "dgt --help". You should see the following output:

>dgt --help
Discussion Graph Tool Version 1.0
More info: http://research.microsoft.com/dgt/
Contact: discussiongraph@microsoft.com

Usage: dgt.exe filename.dgt [options]
    --target=local|... Specify target execution environment.
    --config=filename.xml Specify non-default configuration file

Step 2. Create a new directory for this walkthrough. Here, we'll use the directory E:\dgt-sample\

>mkdir e:\dgt-sample\


Getting Twitter Data

First, let’s get some data to analyze. We’ll be using Twitter data for this walkthrough.  Twitter doesn't allow redistribution of its data, but does have an API for retrieving a sample stream of tweets.  There are a number of steps you'll have to complete, including registering for API keys and access tokens from Twitter.  We've put up full instructions.

Step 3. Install twitter-tools package.  See our instructions.

Step 4. Download a sample of tweets.  Run GatherStatusStream.bat for "a while", then press Ctrl-C to stop the download.  This will generate a file (or files) called statuses.log.YYYY-MM-DD-HH, where YYYY-MM-DD-HH represents the current date and hour.  The files may be compressed (indicated by a .gz file suffix).

Each line in this file represents a tweet (*), in JSON format, and includes all available metadata about the tweet, the tweet author, etc.  (* The file also includes some other information, such as tweet deletions.  There's no need to worry about those for this walkthrough.)

> twitter-tools-master\twitter-tools-core\target\appassembler\bin\GatherStatusStream.bat
1000 messages received.
2000 messages received.
3000 messages received.
4000 messages received.
5000 messages received.
6000 messages received.
7000 messages received.
8000 messages received.
9000 messages received.
10000 messages received.
Terminate batch job (Y/N)? Y
> dir statuses*
 Volume in drive C is DISK
 Volume Serial Number is AAAA-AAAA

 Directory of E:\dgt-sample\twitter-tools-core

06/13/2014  12:53 PM        49,665,736 statuses.log.2014-06-13-12
               1 File(s)     49,665,736 bytes
               0 Dir(s)  43,039,879,168 bytes free


Writing the Script

As we saw in walkthrough #1, there are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let’s take things step-by-step.

Step 5. Create a new file, twitter-hashtags.dgt. Use notepad.exe, emacs, vi or your favorite text editor.

e:\dgt-sample\> notepad twitter-hashtags.dgt


Step 6. LOAD the data.

The first command in the script loads the data file. The tweets we downloaded are in a line-oriented JSON format, where each line in the file is a JSON-formatted record describing one tweet. The LOAD Twitter() command can parse this file. Add the following line as the first command in the script file:

LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");

The Twitter data source already knows about the key fields in the Twitter JSON data file, so we don’t have to specify any more information. The twitter-tools package adds some non-JSON lines to its output, so we'll also set the ignoreErrors flag to true. This tells DGT to ignore malformed lines in the input.
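
To make the effect of ignoreErrors more concrete, here is a minimal Python sketch of tolerant, line-by-line parsing. This is not DGT's implementation: the helper name parse_status_lines and the sample lines are hypothetical, and the only assumption taken from the walkthrough is that each valid line holds one JSON record.

```python
import json

def parse_status_lines(lines):
    """Parse one JSON record per line, skipping lines that are not valid JSON.

    Hypothetical helper for illustration only; it mimics the idea behind
    ignoreErrors and is not DGT's actual loader.
    """
    records, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # malformed line: count it and move on
            continue
        if isinstance(record, dict):
            records.append(record)
        else:
            skipped += 1  # valid JSON, but not a key-value record
    return records, skipped

# Invented sample input: a tweet, a non-JSON line, and a deletion notice.
sample = [
    '{"text": "Great match! #NED #ESP"}',
    'not a JSON line',
    '{"delete": {"status": {"id": 1}}}',
]
records, skipped = parse_status_lines(sample)
print(len(records), "records parsed;", skipped, "lines skipped")
```

Counting the skipped lines, rather than dropping them silently, makes it easier to notice when a file is more badly damaged than expected.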

Step 7. EXTRACT higher-level features from the raw data

Add the following line as the second command in the script file:

EXTRACT AffectDetector(), Gender(), hashtag;

This EXTRACT statement generates 3 higher-level features:

  • The AffectDetector() call infers the affect, or mood, of a text.  By default, the AffectDetector() looks for a field named "text" in the raw data, though we could set the "field" argument to make it look at other fields instead.
  • The Gender() call infers the gender of the author, based on the author’s first name. By default, the Gender() extractor looks for a field named "username" in the raw data.  Again, we could override this using the "field" argument.
  • By naming the hashtag field---without parentheses---we tell the script to pass the hashtag field through without modification. 


Note: The output of twitter-tools includes hashtags, user mentions, URLs and stock symbols as explicit fields, already parsed out of the raw text. We'll see in the further explorations how we can use exact phrase matching and regular expression matching to pull values out of the text ourselves.
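
As a rough illustration of the two approaches this note mentions, the Python sketch below first reads hashtags from the pre-parsed entities.hashtags list found in Twitter's JSON for a tweet, and falls back to a simple regular expression over the text when that list is missing. The helper name hashtags_of and the sample tweets are hypothetical, the regular expression is a simplification of Twitter's real hashtag rules, and none of this is meant to describe how DGT's own extractors work.

```python
import re

# Simplified pattern: '#' followed by word characters. Real hashtag rules are stricter.
HASHTAG_RE = re.compile(r"#(\w+)")

def hashtags_of(tweet):
    """Return the hashtags of a tweet record (without the leading '#').

    Prefers the pre-parsed entities list; falls back to regex matching on the text.
    """
    entities = tweet.get("entities") or {}
    parsed = [h["text"] for h in entities.get("hashtags", []) if "text" in h]
    if parsed:
        return parsed
    return HASHTAG_RE.findall(tweet.get("text", ""))

# Invented sample records for illustration.
with_entities = {
    "text": "Go #NED!",
    "entities": {"hashtags": [{"text": "NED", "indices": [3, 7]}]},
}
text_only = {"text": "#NED vs #ESP tonight"}

print(hashtags_of(with_entities))
print(hashtags_of(text_only))
```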


Step 8. PROJECT the data to focus on the relationships of importance

Now, we tell the script what relationships we care about. Here, we want to extract the pair-wise co-occurrence relationships among hashtags.  That is, which hashtags are used together? Add the following line as the third command in the script file:

PLANAR PROJECT TO hashtag;

By projecting to “hashtag”, we are telling DGT to build a co-occurrence graph among hashtags. By default, DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record.

In this exercise, we're choosing to use a PLANAR PROJECT command because we're going to visually display the resulting hashtag graph at the end of this walkthrough, and planar graphs are simply easier to render.  However, it's worth noting that the planar representation is incomplete.  For example, if 3 hashtags always co-occur together, that information will be lost, because a planar graph records only pair-wise edges.  A hyper-graph, however, can represent such complex co-occurrences.  For this reason, the PROJECT command defaults to a hyper-graph, and we recommend using that representation if you are going to be computing on the result.
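
The information loss described here can be seen in a small Python sketch. It uses a simplified model that is our own assumption, not DGT's internal representation: a planar graph counts unordered pairs of hashtags per record, while a hyper-graph counts the full set of hashtags per record. The function names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def planar_counts(records):
    """Count each unordered pair of distinct hashtags that share a record."""
    counts = Counter()
    for tags in records:
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts

def hyper_counts(records):
    """Count each full set of co-occurring hashtags (one hyper-edge per record)."""
    counts = Counter()
    for tags in records:
        if len(set(tags)) >= 2:
            counts[frozenset(tags)] += 1
    return counts

# Two invented samples: three hashtags always used together, versus the same
# hashtags only ever used two at a time.
three_way = [["NED", "ESP", "WorldCup"]] * 2
pairwise = [["NED", "ESP"], ["ESP", "WorldCup"], ["NED", "WorldCup"]] * 2

# The planar view cannot tell the two samples apart; the hyper-graph view can.
print(planar_counts(three_way) == planar_counts(pairwise))
print(hyper_counts(three_way) == hyper_counts(pairwise))
```

Both samples produce the same three pair-wise edges with the same counts, so only the hyper-graph view preserves the fact that the three hashtags appeared together in a single record.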

Step 9. OUTPUT the results to disk

Finally, we add the following command to the script to save the results:

OUTPUT TO "twitter_hashtags.graph";


If you haven't already, now would be a good time to save your script file... The whole script should look like this:

LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
EXTRACT AffectDetector(), Gender(), hashtag;
PLANAR PROJECT TO hashtag;
OUTPUT TO "twitter_hashtags.graph";


Run the Script

Step 10. From the command line, run DGT against the script twitter-hashtags.dgt:

e:\dgt-sample\> dgt.exe twitter-hashtags.dgt

The output file "twitter_hashtags.graph" should now be in the e:\dgt-sample\ directory. Each row of the output file represents a relationship between a pair of hashtags, since we projected to the planar relationship between co-occurring hashtags in our script. Columns are tab-separated: the first column of each row is the name of the edge in the graph (the edge name is simply the concatenation of the two node names, in this case the two hashtags); the second column is the count of tweets seen with the pair of hashtags; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.
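
If you want to inspect the output file yourself, the three-column layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the helper name parse_graph_row is hypothetical, and the sample row is an invented placeholder, since the exact edge-name format and the keys inside the JSON column are not shown in this walkthrough.

```python
import json

def parse_graph_row(row):
    """Split one tab-separated row into (edge name, count, distributions)."""
    # Split on the first two tabs only, so the JSON column stays intact.
    name, count, payload = row.rstrip("\n").split("\t", 2)
    return name, int(count), json.loads(payload)

# Placeholder row invented for illustration; real edge names and JSON keys will differ.
sample_row = 'EDGE_NAME\t42\t{"example_key": {"a": 30, "b": 12}}\n'

name, count, distributions = parse_graph_row(sample_row)
print(name, count, sorted(distributions))
```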

To import this data into visualization and analysis tools, we have included two command-line utilities dgt2tsv.exe and dgt2gexf.exe that can extract specific values into a tab-separated values (TSV) file or a Graph Exchange XML Format (GEXF) file.

We'll use the dgt2gexf command and visualize the result with the Gephi graph visualization tool:

e:\dgt-sample\> dgt2gexf.exe twitter_hashtags.graph count twitter_hashtags.gexf


If your Twitter sample is large, you might consider adding the option "filtercount=N" (without the quotes) to the command line.  This will include only edges that were seen at least N times in your sample.  Use an appropriate number, from 10 to 1000 or higher, depending on the size of your sample.
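
The thresholding idea behind filtercount can be sketched in Python as follows. The function name filter_edges and the edge counts are hypothetical, invented for illustration; dgt2gexf.exe applies its own filter internally, and this only shows the "at least N times" rule.

```python
def filter_edges(edge_counts, min_count):
    """Keep only edges observed at least min_count times."""
    return {edge: n for edge, n in edge_counts.items() if n >= min_count}

# Invented counts for illustration.
edge_counts = {
    ("ESP", "NED"): 120,
    ("NED", "WorldCup"): 45,
    ("WorldCup", "followme"): 3,
}

kept = filter_edges(edge_counts, 10)
print(kept)
```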

Here's the resulting hashtag graph.  Each of the clusters represents a group of hashtags that are frequently co-mentioned in our tiny sample of Twitter data...


For clarity and fun, we'll filter out low-frequency edges and zoom into one of the clusters of hashtags about World Cup-related topics.  We see from the thickness of the edges that #NED and #ESP are the most frequently co-occurring hashtags, and each also co-occurs relatively frequently with #WorldCup.  We also see a number of people piggy-backing on the popular #worldcup hashtag with topically unrelated hashtags (#followers, #followback, #retweet, #followme) to solicit followers and retweets.


Further Explorations

There are many interesting things to explore in hashtag relationships, such as the evolution of hashtag relationships over time --- for example, use PROJECT TO hashtag,absoluteday; --- hashtag relationships conditioned on gender --- PROJECT TO hashtag,Gender(); --- and inspections of token distributions, moods and other features associated with hashtags and their relationships.
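
To give a feel for what conditioning hashtag relationships on time might produce, here is a small Python sketch that counts hashtag pairs separately for each calendar day. It is a stand-in, not a description of DGT: we assume the timestamp format of the created_at field in Twitter's JSON, we use the calendar date in place of DGT's absoluteday feature (whose exact definition is not covered in this walkthrough), and the function name and sample tweets are hypothetical.

```python
from collections import Counter
from datetime import datetime
from itertools import combinations

# Assumed timestamp layout of Twitter's created_at field.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def pairs_by_day(tweets):
    """Count hashtag pairs per calendar day.

    `tweets` is an iterable of (created_at string, list of hashtags).
    """
    counts = Counter()
    for created_at, tags in tweets:
        day = datetime.strptime(created_at, CREATED_AT_FORMAT).date().isoformat()
        for pair in combinations(sorted(set(tags)), 2):
            counts[(day, pair)] += 1
    return counts

# Invented sample tweets for illustration.
tweets = [
    ("Fri Jun 13 19:05:00 +0000 2014", ["NED", "ESP"]),
    ("Fri Jun 13 20:40:00 +0000 2014", ["ESP", "NED", "WorldCup"]),
    ("Sat Jun 14 09:15:00 +0000 2014", ["NED", "ESP"]),
]

by_day = pairs_by_day(tweets)
print(by_day[("2014-06-13", ("ESP", "NED"))])
print(by_day[("2014-06-14", ("ESP", "NED"))])
```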

What are you going to explore next? Let us know what you do! My twitter handle is @emrek, or you can reach the whole team by emailing us at discussiongraph@microsoft.com. Thanks! - Emre