Center for Machine Learning and Intelligent Systems

KDC-4007 dataset Collection Data Set

Abstract: The KDC-4007 dataset collection is a Kurdish Documents Classification text corpus made up of Kurdish Sorani news and articles sorted into categories.

Data Set Characteristics: Multivariate, Text
Number of Instances: 4007
Area: Computer
Attribute Characteristics: Integer
Number of Attributes: N/A
Date Donated: 2017-04-27
Associated Tasks: Classification, Regression
Missing Values?: N/A
Number of Web Hits: 40420


Source:

Arazo M. Mustafa (arazo.2007 '@' yahoo.com),
School of Computer Science, University of Sulaimania, Kurdistan, Iraq


Data Set Information:

The main strengths of this dataset are that it is simple to use and well documented, which makes it suitable for a wide range of text analysis studies on Kurdish Sorani news and articles.
The documents fall into eight categories: Sport, Religion, Art, Economic, Education, Social, Style, and Health. Each category contains about 500 text documents, and the corpus as a whole contains 4,007 text files.
The dataset and its documents are freely accessible so that experimental results can be reproduced.


Attribute Information:

There are four collections:

- ST-Ds: only stop-word elimination is applied, using the Kurdish preprocessing-step approach.
- Pre-Ds: the full Kurdish preprocessing-step approach is applied.
- Pre+TW-Ds: TF×IDF term weighting is applied to the Pre-Ds dataset.
- Orig-Ds: the original dataset, with no processing applied.
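
The two processing steps named above, stop-word elimination and TF×IDF term weighting, can be illustrated with a minimal sketch. This is not the authors' pipeline: the stop-word list, the English toy documents, the function names, and the particular TF×IDF variant (relative term frequency times the natural log of N/df) are all assumptions chosen for illustration. The page does not specify which tokenization or weighting formula was used to build ST-Ds or Pre+TW-Ds.

```python
import math
from collections import Counter


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed variant (not confirmed by the dataset description):
      tf  = term count / number of tokens in the document
      idf = ln(number of documents / documents containing the term)
    Returns one dict (term -> weight) per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical stop words and toy English documents (placeholders only;
# the real corpus is Kurdish Sorani text with its own stop-word list).
STOP_WORDS = {"the", "a", "of"}
raw_docs = [
    "the team won the match".split(),
    "the price of oil rose".split(),
    "the team lost a match".split(),
]

filtered = [remove_stop_words(doc, STOP_WORDS) for doc in raw_docs]
weights = tf_idf(filtered)

for doc_weights in weights:
    print(doc_weights)
```

In this sketch, terms that appear in fewer documents receive larger weights than terms shared across documents, which is the general effect TF×IDF weighting is meant to have on a classification feature vector.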


Relevant Papers:

[1] Arazo M. Mustafa and Tarik A. Rashid, "Kurdish Stemmer Pre-processing Steps for Improving Information Retrieval", Journal of Information Science, first published 1 January 2017, doi: 10.1177/0165551516683617.
[2] Tarik A. Rashid, Arazo M. Mustafa and Ari M. Saeed, "A Robust Categorization System for Kurdish Sorani Text Documents", Information Technology Journal, 16: 27-34, 2017.
[3] Tarik A. Rashid, Arazo M. Mustafa and Ari M. Saeed, "Automatic Kurdish Text Classification Using KDC 4007 Dataset", accepted in Advances in Internetworking, Data & Web Technologies, Lecture Notes on Data Engineering and Communications Technologies, Springer, 2017.


