KDC-4007 Dataset Collection

Donated on 4/26/2017

The KDC-4007 dataset collection is a Kurdish Documents Classification text corpus made up of categorized Kurdish Sorani news and articles.

Dataset Characteristics

Multivariate, Text

Subject Area

Computer Science

Associated Tasks

Classification, Regression

Feature Type

Integer

# Instances

4007

# Features

-

Dataset Information

Additional Information

The main strengths of this dataset are its ease of use and its documentation, which make it suitable for a wide range of text analysis studies on Kurdish Sorani news and articles. The documents fall into eight categories: Sport, Religion, Art, Economic, Education, Social, Style, and Health. Each category contains 500 text documents, and the corpus totals 4,007 text files. The dataset and documents are freely accessible so that experimental results can be reproduced.

Has Missing Values?

No

Variable Information

There are four collections:
- ST-Ds: only stop-word elimination is performed, using the Kurdish preprocessing-step approach.
- Pre-Ds: the Kurdish preprocessing-step approach is applied.
- Pre+TW-Ds: TF×IDF term weighting is applied to the Pre-Ds dataset.
- Orig-Ds: the original dataset, with no processing applied.
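
The stop-word elimination and TF×IDF term weighting steps mentioned above can be illustrated with a minimal sketch. It is not the creators' implementation: the helper names `remove_stop_words` and `tf_idf`, the placeholder English tokens, and the specific weighting variant (raw term count multiplied by ln(N / document frequency)) are all assumptions, since this page does not say which TF×IDF formula or stop-word list was used.

```python
import math
from collections import Counter


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def tf_idf(docs):
    """Weight each term by raw count * ln(N / document frequency).

    docs is a list of token lists; the result is one dict per document
    mapping term -> weight. This is one common TF x IDF variant, chosen
    here as an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        weighted.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weighted


# Placeholder tokens and stop words (hypothetical, not taken from the dataset).
docs = [
    ["the", "team", "won", "the", "match"],
    ["the", "market", "fell"],
    ["the", "team", "traded", "shares"],
]
stop_words = {"the"}

cleaned = [remove_stop_words(d, stop_words) for d in docs]
weights = tf_idf(cleaned)
```

In this sketch, a term that occurs in every document receives a weight of zero, while a term confined to a single document is weighted most heavily, which is the general behaviour expected of TF×IDF weighting.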

Dataset Files

File                    Size
KDC-4007-Dataset.rar    852.6 KB

Creators

Arazo Mustafa
