The Enron corpus

March 5, 2011 at 12:16 pm (nlp)

One of the problems that people doing NLP and machine learning struggle with is getting access to large quantities of data relevant to their task. (The other is getting that data labeled so it can be used for training.) If you are interested in doing NLP, machine learning, or sentiment analysis on email, you should know about the Enron corpus.

As part of the Enron investigation, the federal government collected a large archive of Enron emails; the widely distributed version of the corpus contains roughly half a million messages. A copy of the database was purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst, and made freely available. Several versions exist: you can get the raw data or a normalized MySQL database. A small portion of it has even been annotated.
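
If you go with the raw data, a first step is usually turning each message into structured fields. Here is a minimal sketch in Python using only the standard library `email` module. It assumes the raw version stores one plain-text, RFC 822-style message per file under some root directory; the function names `parse_raw_email` and `iter_emails`, the Latin-1 decoding, and the sample message are my own illustrations, not part of the corpus.

```python
from email import message_from_string
from pathlib import Path


def parse_raw_email(raw_text: str) -> dict:
    """Split one raw message into a few header fields and the body."""
    msg = message_from_string(raw_text)
    # Assumption: messages are single-part plain text; skip the body otherwise.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "body": body.strip(),
    }


def iter_emails(root: str):
    """Yield a parsed record for every regular file under root (hypothetical layout)."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Latin-1 never fails to decode, which is convenient for old mail dumps.
            yield parse_raw_email(path.read_text(encoding="latin-1"))


# A made-up message, just to show the shape of the output.
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "Please send the report by Friday.\n"
)
record = parse_raw_email(sample)
print(record["subject"], "|", record["body"])
```

From here the records can be fed into whatever tokenizer or classifier you are experimenting with. If you use the MySQL version instead, the parsing step is presumably already done for you.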
