Reuter_50_50

Donated on 9/7/2011

This dataset is used for authorship identification in online Writeprint analysis, a new research field within pattern recognition.

Dataset Characteristics

Multivariate, Text, Domain-Theory

Subject Area

Computer Science

Associated Tasks

Classification, Clustering

Feature Type

Real

# Instances

2500

# Features

10000

Dataset Information

Additional Information

The dataset is a subset of RCV1 (Reuters Corpus Volume 1). This corpus has already been used in author identification experiments. The top 50 authors (with respect to total size of articles) were selected from among the authors of texts labeled with at least one subtopic of the class CCAT (corporate/industrial). This selection is intended to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author), and the test corpus contains another 2,500 texts (50 per author) that do not overlap with the training texts.

Has Missing Values?

No

Variable Information

The attributes of the dataset are character n-grams (n = 1 to 5).
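As an illustration of what a character n-gram attribute is, the following minimal sketch counts every character n-gram of length 1 to 5 in a string. The function name char_ngrams and its parameters are hypothetical and chosen for this example. How the dataset's features were selected or weighted from such counts is not stated on this page, so the sketch covers only the raw counting step.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count every character n-gram of length n_min..n_max in text.

    Hypothetical helper for illustration only; it is not the
    feature-extraction procedure used by the dataset creator.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts
```

For example, char_ngrams("abab") counts "a" and "b" twice each, "ab" twice, and "ba", "aba", "bab", and "abab" once each. No 5-gram is produced because the string has only four characters.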

Dataset Files

File                                      Size
C50test/JimGilchrist/252113newsML.txt     8.7 KB
C50test/JimGilchrist/327974newsML.txt     8.5 KB
C50train/JimGilchrist/126597newsML.txt    8.4 KB
C50train/SarahDavison/396743newsML.txt    7.8 KB
C50train/JimGilchrist/123461newsML.txt    7.5 KB

Showing 5 of 5000 files
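
The file paths above suggest a layout of one directory per split (C50train, C50test), one subdirectory per author, and one .txt file per article. Under that assumption, a split could be read into parallel lists of texts and author labels roughly as follows. The function name load_split is hypothetical, and the UTF-8 encoding with replacement of undecodable bytes is an assumption, since the page does not state the file encoding.

```python
from pathlib import Path


def load_split(split_dir):
    """Return (texts, authors) for one split directory such as C50train.

    Assumes the layout <split>/<author>/<article>.txt inferred from the
    file listing; the author label is taken from the subdirectory name.
    """
    texts, authors = [], []
    for author_dir in sorted(Path(split_dir).iterdir()):
        if not author_dir.is_dir():
            continue  # skip any stray files at the top level
        for path in sorted(author_dir.glob("*.txt")):
            # Encoding is an assumption; undecodable bytes are replaced.
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            authors.append(author_dir.name)
    return texts, authors
```

Calling load_split("C50train") on an extracted copy of the archive would then be expected to return 2,500 texts with 50 entries per author label, matching the description of the training corpus.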


Creators

Zhi Liu
