TTC-3600: Benchmark dataset for Turkish text categorization

Donated on 2/7/2017

The TTC-3600 dataset is a collection of 3,600 categorized Turkish news articles and texts from 6 well-known news portals in Turkey. It is provided in 4 versions, all in Weka ARFF format.

Dataset Characteristics

Text

Subject Area

Computer Science

Associated Tasks

Classification, Clustering

Feature Type

Integer

# Instances

3600

# Features

4814

Dataset Information

Additional Information

The dataset consists of 3,600 documents in total: 600 news items/texts from each of six categories (economy, culture-arts, health, politics, sports and technology), obtained from six well-known news portals and agencies (Hurriyet, Posta, Iha, HaberTurk, Radikal and Zaman). The documents were collected between May and July 2015 via Rich Site Summary (RSS) feeds from the six categories of the respective portals. All JavaScript, HTML tags (<img>, <a>, <p>, <strong>, etc.), operators, punctuation, non-printable characters and irrelevant data such as advertising were removed. Three additional versions of the dataset were created from TTC-3600 by applying different stemming methods. In all versions, removal-based pre-processing, which is explained in detail in Section 3.2 of the original study, is applied first. Then Turkish stop-words that have no discriminatory power for text categorization (TC), such as pronouns, prepositions and conjunctions, are removed from every version except the original one. A semi-automatically constructed list of 147 stop-words is used for this purpose.
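The removal-based cleaning and stop-word filtering described above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the creators' actual pipeline: the function names, the regular expressions and the six-word stop-word set are assumptions made for illustration, and the real 147-word list is not reproduced here.

```python
import re

# Hypothetical mini stop-word set for illustration only; the dataset's
# actual 147-word list is not reproduced here.
STOP_WORDS = {"ve", "ile", "bu", "için", "de", "da"}

# Whole <script>...</script> blocks, including their contents.
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
# Any remaining HTML tag such as <img>, <a>, <p>, <strong>.
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is neither a word character nor whitespace:
# operators, punctuation and non-printable control characters.
NON_WORD_RE = re.compile(r"[^\w\s]")


def clean_text(raw_html):
    """Strip scripts, tags and non-word characters; return a token list."""
    text = SCRIPT_RE.sub(" ", raw_html)
    text = TAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return text.split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set.

    Caveat: str.lower() is not locale-aware, so the Turkish dotted and
    dotless I are not folded the way Turkish casing rules require.
    """
    return [t for t in tokens if t.lower() not in stop_words]


tokens = clean_text(
    "<p>Ekonomi <strong>ve</strong> spor haberleri!</p>"
    "<script>var x = 1;</script>"
)
filtered = remove_stop_words(tokens)
```

Stemming, which distinguishes the three additional dataset versions, is deliberately left out of this sketch.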

Has Missing Values?

No

Variable Information

The data are provided in Weka's ARFF (Attribute-Relation File Format).
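For readers unfamiliar with ARFF, the sketch below shows one way such a file could be read without Weka. It assumes a dense layout (one comma-separated row per document after the `@data` line) and unquoted attribute names without spaces; sparse rows written as `{index value, ...}` and quoted values containing commas are not handled. The function name `read_arff` and the sample content are hypothetical and are not taken from the TTC-3600 files.

```python
import io


def read_arff(lines):
    """Parse a dense ARFF stream into (attribute_names, rows).

    Values are returned as strings; no type conversion is attempted.
    """
    names, rows, in_data = [], [], False
    for line in lines:
        line = line.strip()
        # Skip blank lines and ARFF comments, which start with '%'.
        if not line or line.startswith("%"):
            continue
        lower = line.lower()
        if not in_data:
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name.
                names.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
        else:
            rows.append([value.strip() for value in line.split(",")])
    return names, rows


# Made-up sample with two term-count features and a class label.
sample = """@relation demo
@attribute haber numeric
@attribute spor numeric
@attribute class {ekonomi,spor}
@data
2,0,ekonomi
0,3,spor
"""
names, rows = read_arff(io.StringIO(sample))
```

To read a file from disk, the same function can be passed an open file object instead of the `io.StringIO` wrapper.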

Creators

Deniz Kilin
