Source: Dataset creator and donor: Zhi Liu, e-mail: liuzhi8673 '@' gmail.com, institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China
Data Set Information: The dataset is a subset of RCV1. This corpus has already been used in author identification experiments. The top 50 authors (with respect to total size of articles) of texts labeled with at least one subtopic of the class CCAT (corporate/industrial) were selected. Restricting the texts to a single class is an attempt to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author), and the test corpus includes another 2,500 texts (50 per author) that do not overlap with the training texts.
Attribute Information: Attributes of the dataset are character n-grams (n = 1 to 5).
Relevant Papers: J. Houvardas, E. Stamatatos, "N-gram Feature Selection for Authorship Identification," in Proc. of the 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications, vol. 4183, pp. 77-86, (2006) September 12-15; Varna, Bulgaria.
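As an illustration of the attribute type, the following is a minimal sketch of counting character n-grams for n = 1 to 5 in a single text. The function name `char_ngrams` and its parameters are hypothetical, and the dataset's actual preprocessing (case handling, whitespace treatment, any frequency-based feature selection) is not described here, so this should not be read as the procedure used to build the dataset.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count every contiguous character n-gram with n_min <= n <= n_max.

    Hypothetical helper for illustration only; the text is used as-is,
    with no lowercasing or whitespace normalization (an assumption).
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # A text of length L contains L - n + 1 n-grams (none if L < n).
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    # Show the five most frequent n-grams in the sample sentence.
    for gram, freq in char_ngrams(sample).most_common(5):
        print(repr(gram), freq)
```

One count vector per text, built over a shared n-gram vocabulary, would give a feature matrix with one row per article and one column per n-gram.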
Citation Request: Please refer to the donor, Zhi Liu, from National Engineering Research Center for E-Learning Technology, China.