Center for Machine Learning and Intelligent Systems

TTC-3600: Benchmark dataset for Turkish text categorization Data Set
Download: Data Folder, Data Set Description

Abstract: The TTC-3600 data set is a collection of 3,600 categorized Turkish news items and articles from 6 well-known news portals in Turkey. It is provided in 4 different versions in Weka ARFF format.

Data Set Characteristics: Text
Number of Instances: 3600
Area: Computer
Attribute Characteristics: Integer
Number of Attributes: 4814
Date Donated: 2017-02-08
Associated Tasks: Classification, Clustering
Missing Values? N/A
Number of Web Hits: 2032


Source:

Assist. Prof. Dr. Deniz KILINÇ, Faculty of Technology, Celal Bayar University, Turkey
drdenizkilinc'@'gmail.com


Data Set Information:

The dataset consists of 3,600 documents: 600 news items/texts in each of six categories (economy, culture-arts, health, politics, sports and technology), obtained from six well-known news portals and agencies (Hurriyet, Posta, Iha, HaberTurk, Radikal and Zaman). The documents were collected between May and July 2015 via Rich Site Summary (RSS) feeds from the six categories of the respective portals. All JavaScript, HTML tags (<img>, <a>, <p>, <strong>, etc.), operators, punctuation, non-printable characters and irrelevant data such as advertising were removed.
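
The removal-based cleaning described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `clean_text`, the regular expressions, and the decision to keep digits are assumptions made for illustration only.

```python
import re
import unicodedata

# Assumption: script blocks are dropped together with their contents.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Assumption: anything between '<' and '>' is treated as an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Strip scripts, HTML tags, punctuation, operators and non-printable characters."""
    text = SCRIPT_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    # Keep letters, digits and combining marks (Unicode categories L*, N*, M*);
    # replace everything else (punctuation, symbols, control characters) with a space.
    kept = [
        ch if unicodedata.category(ch)[0] in {"L", "N", "M"} else " "
        for ch in text
    ]
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(kept).split())


sample = "<p>Dolar <strong>yükseldi</strong>!</p><script>var a = 1;</script>"
cleaned = clean_text(sample)
```

Filtering by Unicode category rather than by an ASCII character class keeps Turkish letters such as ç, ğ, ı, ö, ş and ü intact.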

Three additional versions of TTC-3600 were created by applying different stemming methods. In all versions, the removal-based pre-processing described in detail in Section 3.2 of the associated paper is applied first. Turkish stop-words that have no discriminatory power with regard to text categorization (pronouns, prepositions, conjunctions, etc.) are then removed from every version except the original one. The study uses a semi-automatically constructed stop-word list containing 147 words.
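
Stop-word filtering of this kind can be sketched in a few lines of Python. The five words below are a hypothetical sample chosen for illustration, not the 147-word list used in the study, and the function name `remove_stop_words` is likewise an assumption. Tokens are assumed to be lowercased already.

```python
# Hypothetical sample of Turkish function words; NOT the study's 147-word list.
SAMPLE_STOP_WORDS = frozenset({"ve", "ile", "bu", "için", "ama"})


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [token for token in tokens if token not in stop_words]


tokens = "ekonomi ve spor için yeni bu haber".split()
filtered = remove_stop_words(tokens)
```

A set is used for the stop-word list so that each membership check takes constant time regardless of the list's length.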


Attribute Information:

ARFF (Attribute-Relation File Format) Weka format
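
For readers unfamiliar with ARFF, the sketch below parses a tiny dense ARFF document using only the Python standard library. The sample content is hypothetical and is not taken from TTC-3600, and the parser `parse_dense_arff` is an illustrative assumption: it ignores quoted values and the sparse `{index value, ...}` row syntax, so a full reader such as Weka itself or `scipy.io.arff.loadarff` may be more appropriate for the actual files.

```python
def parse_dense_arff(text):
    """Parse a simple dense ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, type) pairs in declaration order
    rows = []        # one list of string values per data line
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Hypothetical miniature file: two term-count attributes and a class label.
SAMPLE = """\
% Illustrative sample, not taken from TTC-3600
@relation sample_docs
@attribute dolar numeric
@attribute mac numeric
@attribute class {ekonomi,spor}
@data
3,0,ekonomi
0,2,spor
"""

relation, attributes, rows = parse_dense_arff(SAMPLE)
```

Each `@attribute` line declares one column, and every line after `@data` is one instance whose values follow the declared column order.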


Relevant Papers:

[Web Link]



Citation Request:

Kılınç, Deniz, et al. 'TTC-3600: A new benchmark dataset for Turkish text categorization.' Journal of Information Science (2015): 0165551515620551.

