1. Reuters Transcribed Subset: This dataset was created by reading aloud 200 files from the 10 largest Reuters classes and using an Automatic Speech Recognition system to produce the corresponding transcriptions.
2. Syskill and Webert Web Page Ratings: This database contains the HTML source of web pages plus a single user's ratings of those pages. The web pages cover four separate subjects (Bands, i.e. recording artists; Goats; Sheep; and BioMedical).
3. Northix: Northix is designed as a schema-matching benchmark problem for data integration of two entity-relationship databases.