Center for Machine Learning and Intelligent Systems
About  Citation Policy  Donate a Data Set  Contact

Repository Web            Google
View ALL Data Sets

× Check out the beta version of the new UCI Machine Learning Repository we are currently testing! Contact us if you have any issues, questions, or concerns. Click here to try out the new site.

TTC-3600: Benchmark dataset for Turkish text categorization Data Set
Download: Data Folder, Data Set Description

Abstract: The TTC-3600 data set is a collection of Turkish news and articles including categorized 3,600 documents from 6 well-known portals in Turkey. It has 4 different forms in ARFF Weka format.

Data Set Characteristics:  


Number of Instances:




Attribute Characteristics:


Number of Attributes:


Date Donated


Associated Tasks:

Classification, Clustering

Missing Values?


Number of Web Hits:



Assist.Prof.Dr. Deniz KILINÇ, Faculty of Technology, Celal Bayar University, Turkey

Data Set Information:

The dataset consists of a total of 3600 documents including 600 news/texts from six categories – economy, culture-arts, health, politics, sports and technology – obtained
from six well-known news portals and agencies (Hurriyet,Posta,Iha,HaberTurk,Radikal and Zaman). Documents of TTC-3600 dataset were collected between May and July 2015 via Rich Site Summary (RSS) feeds from six categories of the respective portals. All java scripts, HTML tags ( < img> , < a > , < p > , < strong> etc.), operators, punctuations, non-printable characters and irrelevant data such as advertising are removed.

Three additional dataset versions are created on TTC-3600 by implementing different stemming methods. In all versions of datasets, first, removal-based pre-processing, which is explained in Section 3.2 in detail, is used. Then Turkish stop-words that have no discriminatory power (pronouns, prepositions, conjunctions, etc.) in regard to TC are removed
from datasets except for the original one. In this study, a semi-automatically constructed stop-words list that contains 147 words is utilized.

Attribute Information:

ARFF (Attribute-Relation File Format) Weka format

Relevant Papers:

[Web Link]

Citation Request:

Kılınç, Deniz, et al. 'TTC-3600: A new benchmark dataset for Turkish text categorization.' Journal of Information Science (2015): 0165551515620551.

Supported By:

 In Collaboration With:

About  ||  Citation Policy  ||  Donation Policy  ||  Contact  ||  CML