Reuter_50_50

Donated on 9/7/2011

The dataset is used for authorship identification in online Writeprint, a new research field of pattern recognition.

Dataset Characteristics

Multivariate, Text, Domain-Theory

Subject Area

Computer Science

Associated Tasks

Classification, Clustering

Feature Type

Real

# Instances

2500

# Features

10000

Dataset Information

Additional Information

The dataset is a subset of RCV1. This corpus has already been used in author identification experiments. The top 50 authors (with respect to total size of articles) were selected from among the authors of texts labeled with at least one subtopic of the class CCAT (corporate/industrial). This selection is intended to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author), and the test corpus includes another 2,500 texts (50 per author) that do not overlap with the training texts.

Has Missing Values?

Variable Information

Attributes of the dataset are character n-grams (n = 1 to 5).
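
To illustrate what character n-gram attributes look like, here is a minimal sketch that counts every character n-gram of length 1 to 5 in a string. The function name `char_ngrams` and its parameters are hypothetical, chosen for this example only. The page does not say how the raw counts were weighted or reduced to the listed number of features, so this shows only the counting step.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count all character n-grams of length n_min..n_max in text.

    Hypothetical helper for illustration; not the dataset creator's code.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


counts = char_ngrams("abab")
print(counts["ab"])  # the 2-gram "ab" occurs twice in "abab"
```

Whitespace and punctuation are kept as ordinary characters here, which is a common choice for stylometric features but is an assumption about this dataset.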

Dataset Files

File	Size
C50test/JimGilchrist/252113newsML.txt	8.7 KB
C50test/JimGilchrist/327974newsML.txt	8.5 KB
C50train/JimGilchrist/126597newsML.txt	8.4 KB
C50train/SarahDavison/396743newsML.txt	7.8 KB
C50train/JimGilchrist/123461newsML.txt	7.5 KB

Showing 5 of 5000 files
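
The listing above suggests a layout of one directory per split (`C50train`, `C50test`), one subdirectory per author, and one `.txt` file per article. Assuming that layout holds for the whole archive, a minimal loading sketch could look like the following. The function names `load_split` and `is_balanced` are hypothetical, and the UTF-8 decoding with replacement of undecodable bytes is an assumption, since the page does not state the file encoding.

```python
from collections import Counter
from pathlib import Path


def load_split(split_dir):
    """Return (texts, authors) for one split directory, e.g. C50train.

    Assumes the layout <split_dir>/<AuthorName>/<article>.txt.
    """
    texts, authors = [], []
    for author_dir in sorted(Path(split_dir).iterdir()):
        if not author_dir.is_dir():
            continue
        for path in sorted(author_dir.glob("*.txt")):
            # Encoding is an assumption; undecodable bytes are replaced.
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            authors.append(author_dir.name)
    return texts, authors


def is_balanced(authors, n_authors=50, per_author=50):
    """Check that the labels match the described 50 authors x 50 texts."""
    counts = Counter(authors)
    return len(counts) == n_authors and all(
        c == per_author for c in counts.values()
    )
```

After extracting the archive, `texts, authors = load_split("C50train")` followed by `is_balanced(authors)` would be a quick way to confirm that a split matches the description of 50 texts for each of 50 authors.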

Download (7.8 MB)

Creators

Zhi Liu

DOI

10.24432/C5DS42

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.