This download contains sets of 10, 20, 50, 100, 200, and 500 representative phrases from the Enron corpus. The phrases contain four words. The original Enron data comes from a data set collected and prepared by the CALO (A Cognitive Assistant that Learns and Organizes) Project. It contains data from about 150 users, mostly Enron senior management, organized into folders. The corpus contains about half a million messages. The data was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation of Enron. The data set does not include attachments, and some messages have been deleted “as part of a redaction effort due to requests from affected employees.” To make the stimuli more representative of “general email,” as opposed to email specific to an Enron setting, we filtered the data to remove all email addresses and phone numbers. Furthermore, because a large portion of the emails contained replies quoting the original message, we removed duplicate sentences. This step might also have removed duplicate sentences that were not quotations.
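The filtering described above can be illustrated with a minimal Python sketch. It is not the script used to prepare this download. The function name `filter_sentences`, the two regular expressions, the choice to drop a whole sentence that contains an email address or phone number, and the case-insensitive comparison for duplicates are all assumptions made for illustration.

```python
import re

# Hypothetical patterns: rough approximations of email addresses and
# North American style phone numbers, not the patterns used for this data set.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(
    r"(?:\+?\d{1,2}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def filter_sentences(sentences):
    """Drop sentences with contact details, then drop repeated sentences.

    Assumption: a sentence containing an email address or phone number is
    removed entirely, and duplicates are detected after collapsing
    whitespace and lowercasing. The first occurrence is kept.
    """
    seen = set()
    kept = []
    for sentence in sentences:
        text = " ".join(sentence.split())  # collapse runs of whitespace
        if not text:
            continue
        if EMAIL_RE.search(text) or PHONE_RE.search(text):
            continue
        key = text.lower()
        if key in seen:
            # Duplicate, e.g. a reply quoting the original message.
            continue
        seen.add(key)
        kept.append(text)
    return kept


sample = [
    "Please call me at 713-853-1234 today.",
    "Send it to john.doe@enron.com please.",
    "See you at lunch.",
    "see you at  lunch.",
    "Thanks for the update.",
]
print(filter_sentences(sample))
```

As the description notes, deduplicating this way cannot tell a quoted reply from a sentence that two people happened to write independently, so both are treated as duplicates.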
Note: By installing, copying, or otherwise using this software, you agree to be bound by the terms of its license. Read the license.