KDC-4007 dataset Collection
Donated on 4/26/2017
The KDC-4007 dataset collection is a Kurdish document classification corpus made up of Kurdish Sorani news and articles organized into categories.
Dataset Characteristics
Multivariate, Text
Subject Area
Computer Science
Associated Tasks
Classification, Regression
Feature Type
Integer
# Instances
4007
# Features
-
Dataset Information
Additional Information
The main strengths of this dataset are its ease of use and its thorough documentation, which make it suitable for a wide range of text-analysis studies on Kurdish Sorani news and articles. The documents fall into eight categories: Sport, Religion, Art, Economic, Education, Social, Style, and Health. Each category contains about 500 text documents, for a total corpus size of 4,007 text files. The dataset and documents are freely accessible so that experimental results can be reproduced.
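As a quick sanity check on the category structure described above, the following minimal sketch counts documents per category and flags labels outside the eight listed names. It assumes the labels are available as plain strings spelled exactly as in the description; the actual label encoding in the distributed files may differ, and `category_counts` and `unknown_labels` are hypothetical helper names.

```python
from collections import Counter

# Category names as listed in the dataset description.
# Assumption: labels in the files use these exact spellings.
CATEGORIES = [
    "Sport", "Religion", "Art", "Economic",
    "Education", "Social", "Style", "Health",
]

def category_counts(labels):
    """Count documents per known category, reporting zero for absent ones."""
    counts = Counter(labels)
    return {name: counts.get(name, 0) for name in CATEGORIES}

def unknown_labels(labels):
    """Return the sorted set of labels that are not among the eight categories."""
    return sorted(set(labels) - set(CATEGORIES))

# Toy labels for illustration only; these are not taken from the dataset.
sample = ["Sport", "Art", "Sport", "Health", "Politics"]
counts = category_counts(sample)
extras = unknown_labels(sample)
```

Running the same two helpers on the real label column would show how the 4,007 files are spread across the eight categories.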
Has Missing Values?
No
Variable Information
There are four collections:
- ST-Ds dataset: only stop-word elimination is performed, using the Kurdish preprocessing-step approach.
- Pre-Ds dataset: the Kurdish preprocessing-step approach is applied.
- Pre+TW-Ds dataset: TF×IDF term weighting is applied to the Pre-Ds dataset.
- Orig-Ds dataset: the original dataset, with no processing applied.
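To make the stop-word elimination and TF×IDF term weighting steps named above more concrete, here is a minimal sketch in plain Python. It is not the dataset authors' pipeline: the stop-word list is a hypothetical English placeholder rather than the Kurdish list, and the weighting uses the classic raw term frequency times log(N / df) variant, which may differ from the one used to build Pre+TW-Ds.

```python
import math
from collections import Counter

# Hypothetical placeholder stop words; the actual Kurdish list is not given here.
STOP_WORDS = {"the", "of", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy documents for illustration only.
raw_docs = [
    "the team won the match".split(),
    "the price of oil and gold".split(),
    "the team and the price".split(),
]
cleaned = [remove_stop_words(d) for d in raw_docs]
weighted = tf_idf(cleaned)
```

In this toy corpus, a term that occurs in only one document (such as `won`) receives a higher weight than one shared by two documents (such as `team`), which is the behavior term weighting is meant to provide for classification features.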
Dataset Files
File | Size |
---|---|
KDC-4007-Dataset.rar | 852.6 KB |
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
kdc_4007_dataset_collection = fetch_ucirepo(id=376)

# data (as pandas dataframes)
X = kdc_4007_dataset_collection.data.features
y = kdc_4007_dataset_collection.data.targets

# metadata
print(kdc_4007_dataset_collection.metadata)

# variable information
print(kdc_4007_dataset_collection.variables)
Mustafa, A. (2017). KDC-4007 dataset Collection [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5X021.
Creators
Arazo Mustafa
DOI
10.24432/C5X021
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.