Reuter_50_50
Donated on 9/7/2011
The dataset is used for authorship identification in online Writeprint, a new research field of pattern recognition.
Dataset Characteristics
Multivariate, Text, Domain-Theory
Subject Area
Computer Science
Associated Tasks
Classification, Clustering
Feature Type
Real
# Instances
2500
# Features
10000
Dataset Information
Additional Information
The dataset is a subset of RCV1, a corpus that has already been used in author identification experiments. The top 50 authors (by total size of articles) were selected from among the authors of texts labeled with at least one subtopic of the class CCAT (corporate/industrial). Restricting the selection to a single topic class is intended to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author), and the test corpus contains another 2,500 texts (50 per author) that do not overlap with the training texts.
Has Missing Values?
No
Variable Information
The attributes of the dataset are character n-grams (n = 1 to 5).
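As a rough illustration of what character n-gram attributes look like, the sketch below counts every character n-gram of length 1 to 5 in a string. The function name `char_ngrams` and its parameters are hypothetical. The page does not say how the feature set was selected or weighted, so this shows only the raw counting step, not the dataset's actual feature pipeline.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count every character n-gram of length n_min..n_max in text.

    Hypothetical helper: illustrates raw counting only, with no
    feature selection or normalization.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Toy input, not taken from the corpus.
demo_counts = char_ngrams("abab", n_min=1, n_max=2)
print(demo_counts["ab"], demo_counts["ba"])
```

On real articles the number of distinct n-grams grows quickly with n, so some form of pruning (for example, keeping only the most frequent n-grams) would normally follow this step.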
Dataset Files
File | Size |
---|---|
C50test/JimGilchrist/252113newsML.txt | 8.7 KB |
C50test/JimGilchrist/327974newsML.txt | 8.5 KB |
C50train/JimGilchrist/126597newsML.txt | 8.4 KB |
C50train/SarahDavison/396743newsML.txt | 7.8 KB |
C50train/JimGilchrist/123461newsML.txt | 7.5 KB |
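The file listing suggests a layout of `<split>/<author>/<file>.txt`, with the author name as the directory name. A minimal sketch for reading one split into parallel lists of texts and author labels, assuming that layout holds for the whole archive, could look like this. The function name `load_split` is hypothetical, and the UTF-8 encoding is an assumption rather than something the page states.

```python
from pathlib import Path


def load_split(split_dir):
    """Read every <author>/<file>.txt under split_dir.

    Returns two parallel lists: document texts and author labels.
    Hypothetical helper; assumes one subdirectory per author.
    """
    texts, authors = [], []
    for author_dir in sorted(Path(split_dir).iterdir()):
        if not author_dir.is_dir():
            continue  # skip stray files at the top level
        for path in sorted(author_dir.glob("*.txt")):
            # Encoding is an assumption; errors="replace" avoids
            # failing on bytes that are not valid UTF-8.
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            authors.append(author_dir.name)
    return texts, authors
```

If the archive matches the description above, `load_split("C50train")` and `load_split("C50test")` would each return 2,500 texts, 50 per author.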
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
reuter_50_50 = fetch_ucirepo(id=217)

# data (as pandas dataframes)
X = reuter_50_50.data.features
y = reuter_50_50.data.targets

# metadata
print(reuter_50_50.metadata)

# variable information
print(reuter_50_50.variables)
Liu, Z. (2006). Reuter_50_50 [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DS42.
Creators
Zhi Liu
DOI
10.24432/C5DS42
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.