1. Northix: Northix is designed to be a schema matching benchmark problem for data integration of two entity-relationship databases. 2. Russian Corpus of Biographical Texts: Sentence classification (Russian). The corpus contains Wikipedia texts split into sentences. Each sentence has a topic label. 3. Labeled Text Forum Threads Dataset: The dataset is a collection of text forum threads with class labels that reflect the quality of each reply relative to the initial post: 3 for completely relevant, 2 for partially relevant, and 1 for irrelevant. 4. Reuters Transcribed Subset: This dataset was created by reading out 200 files from the 10 largest Reuters
classes and using an Automatic Speech Recognition system to create
corresponding transcriptions. 5. Syskill and Webert Web Page Ratings: This database contains the HTML source of web pages plus the ratings of a single user on these web pages. The web pages cover four separate subjects: Bands (recording artists), Goats, Sheep, and BioMedical. 6. Paper Reviews: This sentiment analysis data set contains scientific paper reviews from an international conference on computing and informatics. The task is to predict the orientation or the evaluation of a review. 7. Dresses_Attribute_Sales: This dataset contains attributes of dresses and their recommendations according to their sales. Sales are monitored on alternate days. 8. Turkish Spam V01: The TurkishSpam data set contains spam and normal emails written in Turkish. |