Center for Machine Learning and Intelligent Systems



Reuters-21578 Text Categorization Collection Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.

Return to Reuters-21578 Text Categorization Collection data set page.


Stanley Robson de Medeiros Oliveira. Data Transformation For Privacy-Preserving Data Mining. Doctor of Philosophy thesis, University of Alberta. 2005.

Retail for condition C1. B.20 Results of misses cost on the dataset Reuters for condition C1. B.21 Results of misses cost on the dataset BMS-1 for condition C1. B.22 Results of misses cost on the Kosarark dataset for


David Littau and Daniel Boley. Using Low-Memory Representations to Cluster Very Large Data Sets. SDM. 2003.

which are already sparse to begin with, such as the document datasets k1 and reuters. However, a larger document data set would probably see a higher percentage of memory savings, since the number of possible attributes in document data sets is limited by the number


Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. KDD. 2002.

estimates using a sigmoid function. rate class membership probability estimates, while being faster. The same method can be applied to naive Bayes. This was proposed by Bennett [4] for the Reuters dataset. In Figure 4 we show the sigmoidal fit to the naive Bayes scores for the Adult and TIC datasets. The sigmoidal shape does not appear to fit naive Bayes scores as well as it fits SVM scores, for


Vijay S. Iyengar and Chidanand Apté and Tong Zhang. Active learning using adaptive resampling. KDD. 2000.

the ALAR method are shown in Figure 5. Both ALAR-3-nn and ALAR-vote-E achieve the accuracy goal with only 8000 labeled instances. The last benchmark used is the Mod-Apte split of the Reuters data set available from [20]. Only the top ten categories are considered. For each of them we solve the binary classification problem of being in or out of that category. We used the notion of information gain


Dmitry Pavlov and Jianchang Mao and Byron Dom. Scaling-Up Support Vector Machines Using Boosting Algorithm. ICPR. 2000.

by the standard SVM training algorithms. 3. Experiments We compared performance of linear classifiers trained with the Boost-SMO and the Full-SMO (conventional SMO algorithm) on the following three data sets: the Reuters Data, the Microsoft Web Data and the UCI Adult Data. For the Reuters Data we looked at the classes "acq" and "earn" that have the greatest number of positive examples. The Microsoft


Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. ICML. 1996.

from the UCI repository (Murphy & Aha 1995); and two datasets which are a subset of the Reuters document collection (Reuters 1995). These datasets are detailed in Table 1. We selected these datasets as they are either well understood in terms of feature


Thomas T. Osugi. Exploration-Based Active Machine Learning. M.S. thesis, Faculty of The Graduate College at the University of Nebraska.

of e-mail, a user can label good and bad examples. The user won't have much patience in this training process, and AL can minimize the amount of input needed. In experiments, for the Reuters 21578 dataset, only 10% of 6 unlabeled examples had to be labeled in order to get the same accuracy as the entire labeled pool. For image classification, Luo et al. [16] use active learning to recognize different


Vikas Sindhwani and P. Bhattacharya and Subrata Rakshit. Information Theoretic Feature Crediting in Multiclass Support Vector Machines.

that we constructed, LED-24, Waveform-40, DNA, Vehicle, and Satellite Images (SAT) drawn from the UCI repository [19]; and three datasets which are a subset of the Reuters document collection [20]. Table 5 lists the details of these datasets. We first examine the informativeness of features using SVM-infoprop in each of these


Omid Madani and David M. Pennock and Gary William Flake. Co-Validation: Using Model Disagreement to Validate Classification Algorithms. Yahoo! Research Labs.

unlabeled data does not tend to wildly underestimate error, even though it's theoretically possible. 3 Experiments We conducted experiments on the 20 Newsgroups and Reuters 21578 text categorization datasets (available from http://www.ics.uci.edu/ and http://www.daviddlewis.com/resources/testcollections/), and the Votes, Chess, Adult, and Optics datasets from the UCI collection [BKM98]. We chose two


