Center for Machine Learning and Intelligent Systems

Reuter_50_50 Data Set

Abstract: The dataset is used for authorship identification in online writeprint analysis, a new research field in pattern recognition.

Data Set Characteristics: Multivariate, Text, Domain-Theory
Number of Instances:
Attribute Characteristics:
Number of Attributes:
Date Donated:
Associated Tasks: Classification, Clustering
Missing Values?

Dataset creator and donor: Zhi Liu, e-mail: liuzhi8673 '@', institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China

Data Set Information:

The dataset is a subset of RCV1. This corpus has already been used in author identification experiments. The top 50 authors (with respect to total size of articles) were selected from texts labeled with at least one subtopic of the class CCAT (corporate/industrial). Restricting the texts to one class in this way is an attempt to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author) and the test corpus includes another 2,500 texts (50 per author) that do not overlap with the training texts.
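
The train/test split described above can be loaded and sanity-checked with a short script. The following is a minimal sketch, not part of the dataset itself. It assumes each split is stored as one directory per author containing plain-text files (`<split_dir>/<author>/<file>.txt`); that layout, and the function names `load_split` and `check_balance`, are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def load_split(split_dir):
    """Read one split, assuming the layout <split_dir>/<author>/<file>.txt."""
    texts, authors = [], []
    for author_dir in sorted(Path(split_dir).iterdir()):
        if not author_dir.is_dir():
            continue
        for path in sorted(author_dir.glob("*.txt")):
            # errors="replace" keeps loading robust if a file is not valid UTF-8
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            authors.append(author_dir.name)  # the directory name is the label
    return texts, authors


def check_balance(authors, n_authors=50, per_author=50):
    """Return True if exactly n_authors appear, each with per_author texts."""
    counts = Counter(authors)
    return len(counts) == n_authors and all(
        c == per_author for c in counts.values()
    )
```

With the full corpus, `check_balance` should hold for both the training and the test split if each contains 50 authors with 50 texts apiece, as stated above.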

Attribute Information:

Attributes of the dataset are character n-grams (n = 1 to 5).
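
To make the attribute definition concrete, the sketch below counts all character n-grams of length 1 to 5 in a text. It is an illustration only: the description does not say whether the attributes are raw counts or normalized frequencies, or whether case and whitespace are preserved, so raw counts over the unmodified text are assumed here, and the function name `char_ngrams` is hypothetical.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text; texts shorter than n
        # contribute no n-grams of that length.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


profile = char_ngrams("the cat")
print(profile["t"])    # 2
print(profile["the"])  # 1
```

Each text then becomes a sparse vector indexed by n-gram, which can be fed to a classifier or clustering method.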

Relevant Papers:

J. Houvardas, E. Stamatatos, “N-gram Feature Selection for Authorship Identification,” in Proc. of the 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications, vol. 4183, pp. 77-86, September 12-15, 2006, Varna, Bulgaria.
E. Stamatatos, “Author Identification Using Imbalanced and Limited Training Texts,” in Proc. of the 4th International Workshop on Text-based Information Retrieval, September 3-7, 2007, Regensburg, Germany.

Citation Request:

Please credit the donor, Zhi Liu, of the National Engineering Research Center for E-Learning Technology, China.
