Center for Machine Learning and Intelligent Systems

TTC-3600: Benchmark dataset for Turkish text categorization Data Set

Abstract: The TTC-3600 data set is a collection of Turkish news and articles comprising 3,600 categorized documents from 6 well-known news portals in Turkey. It is provided in 4 different forms, each in Weka ARFF format.

Data Set Characteristics: Text
Number of Instances: 3600
Area: Computer
Attribute Characteristics: Integer
Number of Attributes: 4814
Date Donated: 2017-02-08
Associated Tasks: Classification, Clustering
Missing Values? N/A
Number of Web Hits: 17340


Source:

Assist.Prof.Dr. Deniz KILINÇ, Faculty of Technology, Celal Bayar University, Turkey
drdenizkilinc'@'gmail.com


Data Set Information:

The dataset consists of a total of 3600 documents: 600 news items/texts from each of six categories (economy, culture-arts, health, politics, sports and technology), obtained from six well-known news portals and agencies (Hurriyet, Posta, Iha, HaberTurk, Radikal and Zaman). The documents were collected between May and July 2015 via Rich Site Summary (RSS) feeds from the six categories of the respective portals. All JavaScript, HTML tags (<img>, <a>, <p>, <strong>, etc.), operators, punctuation, non-printable characters and irrelevant data such as advertising are removed.
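The removal-based cleaning described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `clean_document` and the regular expressions are assumptions chosen for illustration, and advertising removal is not attempted because the page does not say how it was done.

```python
import html
import re

# Hypothetical patterns; the original cleaning rules are not published on this page.
_SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")
# Anything that is neither a word character nor whitespace: punctuation,
# operators, and control characters such as NUL.
_NON_WORD_RE = re.compile(r"[^\w\s]")
_WS_RE = re.compile(r"\s+")


def clean_document(raw: str) -> str:
    """Strip scripts, HTML tags, punctuation and control characters."""
    text = _SCRIPT_RE.sub(" ", raw)      # drop <script> blocks with their bodies
    text = _TAG_RE.sub(" ", text)        # drop remaining tags, keep inner text
    text = html.unescape(text)           # decode entities such as &amp;
    text = _NON_WORD_RE.sub(" ", text)   # drop punctuation and operators
    return _WS_RE.sub(" ", text).strip() # collapse runs of whitespace


example = "<p>Merhaba, <strong>dünya</strong>!</p><script>var x = 1;</script>"
cleaned = clean_document(example)
```

Because `\w` is Unicode-aware in Python 3, Turkish letters such as ü, ş and ı are kept as word characters.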

Three additional versions of TTC-3600 are created by applying different stemming methods. In all versions of the dataset, removal-based pre-processing, which is explained in detail in Section 3.2 of the paper given under Citation Request, is applied first. Then Turkish stop-words that have no discriminatory power for text categorization (TC), such as pronouns, prepositions and conjunctions, are removed from every version except the original one. In this study, a semi-automatically constructed stop-word list containing 147 words is used.
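As a rough illustration of the stop-word step, the sketch below filters whitespace-separated tokens against a stop-word set. The 147-word list used for TTC-3600 is not reproduced on this page, so `ILLUSTRATIVE_STOP_WORDS` is a small hypothetical stand-in, and the function name is likewise an assumption. The input is assumed to be already cleaned and lowercased.

```python
# Hypothetical stand-in for the 147-word list, which is not given on this page.
ILLUSTRATIVE_STOP_WORDS = frozenset({"ve", "bir", "bu", "ile", "için", "de", "da"})


def remove_stop_words(text, stop_words=ILLUSTRATIVE_STOP_WORDS):
    """Return the text with every token found in stop_words dropped."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


filtered = remove_stop_words("ekonomi ve spor için bir haber")
```

Lowercasing is left to the caller on purpose: Python's `str.lower()` maps "I" to "i" rather than to the Turkish dotless "ı", so Turkish text needs locale-aware case handling before this step.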


Attribute Information:

ARFF (Attribute-Relation File Format) Weka format
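For readers unfamiliar with ARFF, the following sketch reads a small dense ARFF stream using only the Python standard library. It is a deliberately limited illustration, not a full ARFF parser: it assumes dense (non-sparse) data rows, attribute names without spaces, and values without embedded commas. Whether the TTC-3600 files meet these assumptions is not stated on this page, and the sample content below is invented.

```python
import io


def read_dense_arff(stream):
    """Parse a dense ARFF stream into (attribute_names, rows of strings)."""
    names, rows, in_data = [], [], False
    for line in stream:
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
            continue
        lowered = line.lower()
        if lowered.startswith("@attribute"):
            # "@attribute <name> <type>"; quoted names with spaces are not handled.
            names.append(line.split(None, 2)[1])
        elif lowered.startswith("@data"):
            in_data = True
    return names, rows


# Invented sample: two term-count attributes plus a nominal class attribute.
sample = io.StringIO(
    "@relation demo\n"
    "@attribute ekonomi numeric\n"
    "@attribute spor numeric\n"
    "@attribute class {economy,sports}\n"
    "@data\n"
    "2,0,economy\n"
    "0,3,sports\n"
)
names, rows = read_dense_arff(sample)
```

Values are returned as strings; converting the integer attributes with `int()` is left to the caller.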


Relevant Papers:

[Web Link]



Citation Request:

Kılınç, Deniz, et al. 'TTC-3600: A new benchmark dataset for Turkish text categorization.' Journal of Information Science (2015): 0165551515620551.

