The Enron corpus

March 5, 2011 at 12:16 pm (nlp)

One issue that people who do NLP and machine learning struggle with is getting access to large quantities of data relevant to their task. (The other is getting that data labeled so it can be used for training.) If you are interested in doing NLP, machine learning, or sentiment analysis on email, you should know about the Enron corpus.

As part of the Enron investigation, the federal government collected a large archive of Enron emails (the commonly distributed version contains roughly half a million messages). A copy of this database was purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts, Amherst, and made freely available. Several versions exist: you can get the raw data or a normalized MySQL database. A small portion of it has even been annotated.
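
If you grab the raw data, a first step is usually turning each message into structured fields. Here is a minimal sketch using Python's standard `email` package, assuming each raw message is a plain-text file in the usual header-then-body email format. The `parse_message` helper and the sample message are made up for illustration and are not part of the corpus or any of its tooling.

```python
from email import message_from_string
from email.utils import getaddresses


def parse_message(raw_text):
    """Parse one raw email (headers, blank line, body) into a small dict.

    Hypothetical helper for illustration; assumes a single-part,
    plain-text message.
    """
    msg = message_from_string(raw_text)
    # getaddresses splits a comma-separated To header into (name, address) pairs.
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return {
        "from": msg.get("From", ""),
        "to": recipients,
        "subject": msg.get("Subject", ""),
        "body": msg.get_payload(),
    }


# A made-up message, not taken from the corpus.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Meeting\n"
    "\n"
    "See you at 3pm.\n"
)

parsed = parse_message(SAMPLE)
print(parsed["subject"], parsed["to"])
```

From there you could loop over the message files on disk, call `parse_message` on each file's contents, and feed the subject and body fields into whatever classifier or sentiment model you are training.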

