KDC-4007 dataset Collection

Donated on 4/26/2017

KDC-4007 is a Kurdish Documents Classification text collection used for categorizing Kurdish Sorani news and articles.

Dataset Characteristics

Multivariate, Text

Subject Area

Computer Science

Associated Tasks

Classification, Regression

Feature Type

Integer

# Instances

4007

# Features

Dataset Information

Additional Information

The main strengths of this dataset are its ease of use and its thorough documentation, which make it suitable for a wide range of text analysis studies on Kurdish Sorani news and articles. The documents fall into eight categories: Sport, Religion, Art, Economic, Education, Social, Style, and Health. Each category contains about 500 text documents, for a total corpus size of 4,007 text files. The dataset and documents are freely accessible so that experimental results can be reproduced.

Has Missing Values?

Variable Information

There are four collections:
- ST-Ds: only stop-word elimination is performed, using the Kurdish preprocessing-step approach.
- Pre-Ds: the Kurdish preprocessing-step approach is applied.
- Pre+TW-Ds: TF×IDF term weighting is applied to the Pre-Ds dataset.
- Orig-Ds: the original dataset, with no processing applied.
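
The stop-word elimination and TF×IDF term weighting named above can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the exact TF×IDF variant, the Kurdish stop-word list, and the tokenization they used are not stated here, so the code assumes raw term counts multiplied by log(N / document frequency) and uses hypothetical English placeholder tokens. The function names `remove_stop_words` and `tf_idf` are illustrative only.

```python
import math
from collections import Counter


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append(
            {
                term: count * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Hypothetical toy data standing in for tokenized Kurdish Sorani documents.
stop_words = {"the", "a"}
raw_docs = [
    "the team won the match".split(),
    "a team lost a match".split(),
    "the market fell".split(),
]

docs = [remove_stop_words(d, stop_words) for d in raw_docs]
weights = tf_idf(docs)
```

In this sketch, terms that occur in fewer documents receive larger weights, and a term that occurs in every document would receive a weight of zero.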

Dataset Files

File	Size
KDC-4007-Dataset.rar	852.6 KB

Creators

Arazo Mustafa

DOI

10.24432/C5X021

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.