TTC-3600: Benchmark dataset for Turkish text categorization
Donated on 2/7/2017
The TTC-3600 data set is a collection of 3,600 categorized Turkish news items and articles from 6 well-known portals in Turkey. It is provided in 4 different versions, all in Weka ARFF format.
Dataset Characteristics
Text
Subject Area
Computer Science
Associated Tasks
Classification, Clustering
Feature Type
Integer
# Instances
3600
# Features
4814
Dataset Information
Additional Information
The dataset consists of a total of 3,600 documents: 600 news items/texts from each of six categories (economy, culture-arts, health, politics, sports and technology), obtained from six well-known news portals and agencies (Hurriyet, Posta, Iha, HaberTurk, Radikal and Zaman). The documents were collected between May and July 2015 via Rich Site Summary (RSS) feeds for these six categories on the respective portals. All JavaScript, HTML tags (<img>, <a>, <p>, <strong>, etc.), operators, punctuation, non-printable characters and irrelevant data such as advertising were removed.
Three additional versions of TTC-3600 were created by applying different stemming methods. In all versions, removal-based pre-processing, which is explained in detail in Section 3.2 of the accompanying study, is applied first. Turkish stop words that have no discriminatory power for text categorization (TC), such as pronouns, prepositions and conjunctions, are then removed from every version except the original one. The study uses a semi-automatically constructed stop-word list that contains 147 words.
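The removal-based cleaning and stop-word filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the regular expressions, the `clean_text` helper and the five-word `STOP_WORDS` set are assumptions made for the example, and the actual 147-word list is not reproduced here.

```python
import re

# Hypothetical stand-in for the 147-word stop-word list (NOT the real list).
STOP_WORDS = {"ve", "ile", "bu", "için", "ama"}

# Whole <script>...</script> blocks, so embedded JavaScript is dropped too.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Any remaining HTML tag such as <img>, <a>, <p>, <strong>.
TAG_RE = re.compile(r"<[^>]+>")
# Everything that is neither a word character nor whitespace:
# punctuation, operators and most non-printable characters.
NON_WORD_RE = re.compile(r"[^\w\s]")


def clean_text(raw, remove_stop_words=True):
    """Return a whitespace-normalized, tag-free version of `raw`."""
    text = SCRIPT_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    tokens = text.split()
    if remove_stop_words:
        # str.lower() is not Turkish-locale-aware (I/İ); adequate for a sketch.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


sample = "<p>Ekonomi <strong>ve</strong> spor haberleri!</p><script>var x = 1;</script>"
cleaned = clean_text(sample)
original_form = clean_text(sample, remove_stop_words=False)
```

Passing `remove_stop_words=False` mirrors the statement that the original version keeps its stop words, while the other versions would have them filtered before stemming.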
Has Missing Values?
No
Variable Information
ARFF (Attribute-Relation File Format) Weka format
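To show what an ARFF file with integer features and a nominal class looks like, here is a minimal sketch of a reader for small, dense ARFF content. The sample relation, the attribute names, the class labels and the assumption that the class is the last attribute are all hypothetical and are not taken from the TTC-3600 files. The `read_dense_arff` helper ignores quoting and sparse rows, so an established reader such as `scipy.io.arff.loadarff` is likely a better fit for the real files.

```python
import io


def read_dense_arff(stream):
    """Parse a small, dense ARFF stream into (attribute_names, rows)."""
    names, rows, in_data = [], [], False
    for line in stream:
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            names.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


# Hypothetical two-document sample; real files have thousands of attributes.
SAMPLE = """\
@relation ttc_sample
@attribute ekonomi numeric
@attribute futbol numeric
@attribute class {ekonomi,spor}
@data
3,0,ekonomi
0,5,spor
"""

names, rows = read_dense_arff(io.StringIO(SAMPLE))
X = [[int(v) for v in row[:-1]] for row in rows]  # integer term features
y = [row[-1] for row in rows]                     # class label, assumed last
```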
Dataset Files
File | Size |
---|---|
TTC-3600 Turkish Text Classification Dataset.rar | 2.5 MB |
pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    ttc_3600_benchmark_dataset_for_turkish_text_categorization = fetch_ucirepo(id=407)

    # data (as pandas dataframes)
    X = ttc_3600_benchmark_dataset_for_turkish_text_categorization.data.features
    y = ttc_3600_benchmark_dataset_for_turkish_text_categorization.data.targets

    # metadata
    print(ttc_3600_benchmark_dataset_for_turkish_text_categorization.metadata)

    # variable information
    print(ttc_3600_benchmark_dataset_for_turkish_text_categorization.variables)
Kilin, D. (2015). TTC-3600: Benchmark dataset for Turkish text categorization [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5P32W.
Creators
Deniz Kilin
DOI
10.24432/C5P32W
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.