Reuters Transcribed Subset
Donated on 3/7/2008
This dataset was created by reading out 200 files from the 10 largest Reuters classes and using an Automatic Speech Recognition system to create the corresponding transcriptions.
Dataset Characteristics
Text
Subject Area
Business
Associated Tasks
Classification
Feature Type
-
# Instances
200
# Features
-
Dataset Information
Additional Information
Data Characteristics:
--------------------
This data was created by selecting 20 files each from the 10 largest classes in the Reuters-21578 collection (http://www.daviddlewis.com/resources/testcollections/reuters21578/). The files were read out by 3 Indian speakers, and an Automatic Speech Recognition (ASR) system was used to generate the transcripts. More about the ASR system can be found in [1]. Such a dataset is useful for studying the effect of speech recognition noise on text mining algorithms. The first work that referred to this dataset was on noisy text classification [2].

Data Format:
----------
There are 10 directories labeled by topic name. Each contains 20 files of transcriptions.

References:
----------
[1] L. R. Bahl, S. Balakrishnan-Aiyer, J. Bellegarda, M. Franz, P. Gopalakrishnan, D. Nahamoo, M. Novak, M. Padmanabhan, M. Picheny, and S. Roukos, Performance of the IBM large vocabulary continuous speech recognition system on the ARPA Wall Street Journal task. In Proc. of ICASSP ’95, pages 41–44, Detroit, MI, 1995.
[2] S. Agarwal, S. Godbole, D. Punjani and S. Roy, How Much Noise is too Much: A Study in Automatic Text Classification. In Proc. of ICDM 2007.
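The directory layout described under Data Format (one directory per topic, each holding transcription files) can be read into parallel lists of texts and labels. The following is a minimal sketch, not an official loader: the function name `load_transcripts`, the root folder name in the usage note, and the choice of UTF-8 decoding with replacement of undecodable bytes are all assumptions, since the page does not state the file encoding or the name of the extracted folder.

```python
from pathlib import Path


def load_transcripts(root):
    """Read a root/<topic>/<file> layout into (texts, labels).

    Hypothetical helper: each immediate subdirectory of `root` is
    treated as one topic, and every regular file inside it as one
    transcript labeled with that directory's name.
    """
    texts, labels = [], []
    topic_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for topic_dir in topic_dirs:
        files = sorted(p for p in topic_dir.iterdir() if p.is_file())
        for path in files:
            # Assumption: the encoding is not documented, so undecodable
            # bytes are replaced rather than raising an error.
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(topic_dir.name)
    return texts, labels
```

For example, after extracting ReutersTranscribedSubset.zip into a folder (called `ReutersTranscribedSubset` here as an assumption), `texts, labels = load_transcripts("ReutersTranscribedSubset")` should yield 200 transcripts across 10 distinct labels if the archive matches the description above.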
Has Missing Values?
No
Dataset Files
File | Size |
---|---|
ReutersTranscribedSubsetOld.zip | 153.2 KB |
ReutersTranscribedSubset.zip | 145.1 KB |
README.txt | 1.7 KB |
reuters_transcribed.html | 987 Bytes |
pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    reuters_transcribed_subset = fetch_ucirepo(id=163)

    # data (as pandas dataframes)
    X = reuters_transcribed_subset.data.features
    y = reuters_transcribed_subset.data.targets

    # metadata
    print(reuters_transcribed_subset.metadata)

    # variable information
    print(reuters_transcribed_subset.variables)
Godbole, S. & Roy, S. (2007). Reuters Transcribed Subset [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5M90K.
Creators
Shantanu Godbole
Shourya Roy
DOI
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.