In this walkthrough, we will be working with Amazon review data for fine food products. First, we are going to ask the question, “what are the moods associated with positive and negative reviews?” Then, we will go a little deeper into the data and see how the mood distributions differ based on the gender of the reviewer, and also suggest other explorations.
Through this example, we will introduce the basic concepts and commands of a DGT script. We’ll show how to load data, extract fields and derived features from social media, and project and aggregate the results.
Getting the Discussion Graph Tool
Step 1. Download the Discussion Graph Tool (DGT) from http://research.microsoft.com/dgt/
If you haven’t already, download and install the discussion graph tool from http://research.microsoft.com/dgt/ . The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.
To double-check the installation, open a new command-line window and type the command “dgt --help”. You should see the following output:
Discussion Graph Tool Version 0.5
More info: http://research.microsoft.com/dgt/
Usage: dgt.exe filename.dgt [options]
--target=local|... Specify target execution environment.
--config=filename.xml Specify non-default configuration file
Step 2. Create a new directory for this walkthrough. Here, we'll use the directory E:\dgt-sample\
Getting the Data
Before we start to write our first script, let’s get some data to analyze. We’ll be using Amazon review data collected by McAuley and Leskovec. This dataset includes over 500k reviews of 74k food-related products. Each review record includes a product id, user id, user name, review score, helpfulness rating, timestamp, and both review and summary text. The user names are often real names, and review scores are integers on a scale from 1 to 5.
> cd e:\dgt-sample\
> dir
Volume in drive E is DISK
Volume Serial Number is AAAA-AAAA
Directory of E:\dgt-sample\
06/10/2014 11:17 AM <DIR> .
06/10/2014 11:17 AM <DIR> ..
06/10/2014 11:16 AM 122,104,202 finefoods.txt.gz
1 File(s) 122,104,202 bytes
2 Dir(s) 45,007,622,144 bytes free
Writing the Script
There are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let’s take things step-by-step.
Step 4. Create a new file mood-reviews.dgt
Use notepad.exe, emacs, vi or your favorite text editor.
Step 5. LOAD the data.
The first command in the script loads the data file. The reviews we downloaded are in a multi-line record format, where each line in the file represents a key-value field of a record, and records are separated by blank lines. The LOAD Multiline() command will parse this data file. Add the following line as the first command in the script file:
LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
Since the multi-line format naturally embeds the schema within the data file, we don’t have to specify it in the LOAD command. There are some spurious newlines in the finefoods.txt.gz data, so we do need to set the ignoreErrors flag to true. This tells DGT to skip malformed data instead of stopping on it.
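To make the multi-line record format concrete, here is a minimal Python sketch of a parser for it. This is not how DGT is implemented; the function name parse_multiline_records, the "key: value" separator, and the sample records are assumptions made for illustration only.

```python
import io

def parse_multiline_records(lines, ignore_errors=True):
    """Yield one dict per record from 'key: value' lines separated by blank lines."""
    record = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current record.
            if record:
                yield record
                record = {}
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            # No "key: value" separator: treat the line as malformed.
            if ignore_errors:
                continue
            raise ValueError(f"malformed line: {line!r}")
        record[key] = value
    if record:
        # The last record may not be followed by a blank line.
        yield record

# Made-up sample data; the field names follow the ones used in this walkthrough.
sample = (
    "product_productId: P1\n"
    "review_score: 5.0\n"
    "review_text: Great snack\n"
    "a stray line left behind by a spurious newline\n"
    "\n"
    "product_productId: P2\n"
    "review_score: 2.0\n"
)
records = list(parse_multiline_records(io.StringIO(sample)))
print(records)
```

The stray line is dropped because ignore_errors is true, which mirrors the role of the ignoreErrors flag described above. To try the sketch on the real file, one could pass it gzip.open("finefoods.txt.gz", "rt", errors="replace") instead of the StringIO object, assuming the file's separators match the assumption made here.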
Step 6. EXTRACT higher-level features from the raw data
Add the following line as the second command in the script file:
EXTRACT AffectDetector(field:"review_text"),Gender(field:"review_profileName"),review_score;
This EXTRACT statement generates 3 higher-level features:
- The AffectDetector() call infers the affect, or mood, of a text. The field argument tells it which of the raw fields to analyze. We’ll choose the long review field but could just as easily have selected the summary field. If you don’t pass a field argument, then the AffectDetector() extractor will by default look for a field named “text” in the raw data.
- The Gender() call infers the gender of the author, based on the author’s first name. The field argument tells it which field includes the author’s name. If you don’t pass a field argument, then the Gender() extractor will by default look for a field named “username” in the raw data.
- By naming the review_score field (without parentheses), we tell the script to pass the review_score field through without modification.
Step 7. PROJECT the data to focus on the relationships of importance
Now, we tell the script what relationships we care about. Often, we’ll be using DGT to extract a graph of co-occurrence relations from a set of data. In this first example, we’re going to ask for a simpler result set, essentially using DGT as a simple aggregator or “group by” style function. Add the following line to the script:
PROJECT TO review_score;
By projecting to “review_score”, we are telling DGT to build a co-occurrence graph among review scores. By default DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record. Since in this dataset every record has at most one review score, that means that there are no co-occurrence relationships. The resulting graph is then simply the degenerate graph of 5 nodes (1 for each score from 1 to 5). For each of these nodes, DGT aggregates the affect and gender information that we extracted.
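The “group by” behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DGT's implementation: the function project_to, the record layout, and the sample values are all hypothetical, and a real affect detector may emit a distribution per review rather than a single label.

```python
from collections import Counter, defaultdict

# Hypothetical extracted records; the values are invented for illustration.
records = [
    {"review_score": "5", "mood": "joviality", "gender": "f"},
    {"review_score": "5", "mood": "serenity", "gender": "m"},
    {"review_score": "1", "mood": "hostility", "gender": "m"},
    {"review_score": "5", "mood": "joviality", "gender": "m"},
]

def project_to(records, key, context_fields):
    """Group records by `key`; count records and tally context values per node."""
    nodes = defaultdict(lambda: {"count": 0, "context": defaultdict(Counter)})
    for rec in records:
        node = nodes[rec[key]]
        node["count"] += 1
        for field in context_fields:
            node["context"][field][rec[field]] += 1
    return nodes

nodes = project_to(records, "review_score", ["mood", "gender"])
for score in sorted(nodes):
    print(score, nodes[score]["count"], dict(nodes[score]["context"]["mood"]))
```

Each distinct review score becomes one node, and the mood and gender tallies attached to it play the role of the aggregated context.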
Step 8. OUTPUT the results to disk
Finally, we add the following command to the script to save the results:
OUTPUT TO "finefoods_reviewscore_context.graph";
If you haven't already, now would be a good time to save your script file... The whole script should look like this:
LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
EXTRACT AffectDetector(field:"review_text"),Gender(field:"review_profileName"),review_score;
PROJECT TO review_score;
OUTPUT TO "finefoods_reviewscore_context.graph";
Run the Script
Step 9. From the command line, run DGT against the script mood-reviews.dgt:
> dgt mood-reviews.dgt
The output file "finefoods_reviewscore_context.graph" should now be in the e:\dgt-sample\ directory. Each row of the output file represents a review score, since that is what we projected to in our script. Columns are tab-separated. The first column of each row is the name of the edge (or nodes) in the graph, the second column is the count of records seen with the given review score, and the third column is a JSON-formatted bag of data distributions for gender and affect observations.
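If you would rather read the output file directly from a script, the three-column layout described above can be parsed along these lines. This is a sketch under assumptions: the function parse_graph_row is hypothetical, the count is assumed to be an integer, and the JSON keys and numbers in the sample row are invented, so check them against your own output file.

```python
import json

def parse_graph_row(line):
    """Split one tab-separated row into (name, count, context dict)."""
    # Split on the first two tabs only, so the JSON column stays intact.
    name, count, context_json = line.rstrip("\n").split("\t", 2)
    # Assumption: the count column holds an integer.
    return name, int(count), json.loads(context_json)

# Hypothetical row; the JSON structure and values are made up for illustration.
row = '5\t3\t{"gender": {"m": 2, "f": 1}, "mood": {"joviality": 2, "serenity": 1}}'
name, count, context = parse_graph_row(row)
print(name, count, context["mood"])
```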
To import this data into R, Excel or other tools, we have included a command-line utility dgt2tsv.exe that can pull out specific values. Use the following command to build a TSV file that summarizes the gender and mood for each review score:
Here’s a quick graph of the results about how mood varies across review scores.
We see that joviality increases and sadness decreases with higher review scores. There is more hostility in lower review scores and more serenity in higher review scores. While most moods increase or decrease monotonically with review score, guilt peaks in 2- and 3-star reviews.
The design goal of DGT is to make it easy to explore the relationships embedded in social media data and capture the context of the discussions from which the relationships were inferred.
Are the distributions of mood across review scores different for men and women? Conditioning the mood distributions on gender as well as review score gives us this information. We can do this simply by adding the gender field to our PROJECT command, as follows (only the PROJECT and OUTPUT commands change from the original script):
PROJECT TO review_score, gender;
OUTPUT TO "finefoods_reviewscore_gender_context.graph";
Here's a quick look at the results. Here, I've graphed the joviality (solid line) and sadness (dashed line) for men (orange) and women (green). We see that the general trends hold, though there are some differences that one might continue digging deeper into...
How are products related to each other by reviewer? For example, how many people who wrote a review of "Brand A Popcorn" also wrote about "Brand X chocolate candies"? We can answer this question by defining a co-occurrence relationship based on user id. That is, we'll say that two product ids are related if the same user reviewed both products. Here's how we do that in the script:
EXTRACT product_productId, review_userId;
RELATE BY review_userId;
PLANAR PROJECT TO product_productId AGGREGATE();
OUTPUT TO "finefoods_products_relateby_user.graph";
(We'll learn more about the RELATE BY and PLANAR PROJECT commands in the next walkthroughs.) This will generate a discussion graph that connects pairs of products that were reviewed by the same person. We can convert this into a file readable by the Gephi graph visualization tool using the dgt2gexf command:
The dgt2gexf command mirrors the dgt2tsv command. In this case, we decided to use a filterbycount option to only output edges that have at least 1000 users who have co-reviewed the pair of products. This filter helps keep the visualization relatively manageable.
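To see what relating products by user id and then filtering by count amounts to, here is a small Python sketch. It illustrates the idea only and is not DGT's algorithm: the functions co_review_counts and filter_by_count and the sample reviews are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Made-up (user id, product id) pairs for illustration.
reviews = [
    ("u1", "popcorn"), ("u1", "candy"),
    ("u2", "popcorn"), ("u2", "candy"), ("u2", "tea"),
    ("u3", "tea"),
]

def co_review_counts(reviews):
    """Count, for each product pair, how many users reviewed both products."""
    products_by_user = defaultdict(set)
    for user_id, product_id in reviews:
        products_by_user[user_id].add(product_id)
    edges = Counter()
    for products in products_by_user.values():
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(products), 2):
            edges[pair] += 1
    return edges

def filter_by_count(edges, min_count):
    """Keep only edges supported by at least `min_count` users."""
    return {pair: n for pair, n in edges.items() if n >= min_count}

edges = co_review_counts(reviews)
print(filter_by_count(edges, 2))
```

With a threshold of 2, only the popcorn and candy pair survives, since it is the only pair that two different users reviewed. The walkthrough's threshold of 1000 works the same way on the full dataset.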
Here's the resulting product graph, laid out using Gephi's Fruchterman-Reingold algorithm. Each of the clusters represents a group of food products that are frequently co-reviewed on Amazon...
What are you going to explore next? Let us know what you do! My twitter handle is @emrek, or you can reach the whole team by emailing us at email@example.com. Thanks! - Emre