1. UNIX User Data: This file contains 9 sets of sanitized user data drawn from the command histories of 8 UNIX computer users at Purdue over the course of up to 2 years.
2. Twenty Newsgroups: This data set consists of 20,000 messages taken from 20 newsgroups.
3. OpinRank Review Dataset: This data set contains user reviews of hotels and cars collected from Tripadvisor (~259,000 reviews) and Edmunds (~42,230 reviews), respectively.
4. Opinosis Opinion/Review: This data set contains sentences extracted from user reviews on a given topic. Example topics are “performance of Toyota Camry” and “sound quality of iPod Nano”.
5. NSF Research Award Abstracts 1990-2003: This data set consists of (a) 129,000 abstracts describing NSF awards for basic research, (b) bag-of-word data files extracted from the abstracts, (c) a list of words used for indexing the bag-of-word