Wisesight Sentiment Corpus Data Set
Download: Data Folder, Data Set Description

Abstract: Social media messages in the Thai language, each with a sentiment label (positive, neutral, negative, question).

Data Set Characteristics: Multivariate, Text
Number of Instances: 26737
Area: Social
Attribute Characteristics: N/A
Number of Attributes: 4
Date Donated: 2020-08-25
Associated Tasks: Classification
Missing Values? N/A
Number of Web Hits: 14462


Source:

https://github.com/PyThaiNLP/wisesight-sentiment/

Source: Facebook, Twitter, web forums.

Size: 26,737 messages

Language: Central Thai

Style: Informal and conversational, with some news headlines and advertisements.

Time period: Around 2016 to early 2019, with a small amount from other periods.

Domains: Mixed. The majority are consumer products and services (restaurants, cosmetics, drinks, cars, hotels), with some current affairs.

Privacy:
- Only messages that were made available to the public on the internet (websites, blogs, social network sites) are included.
- For Facebook, this means public comments (visible to everyone) made on a public page.
- Private/protected messages and messages in groups, chats, and inboxes are not included.

Alterations and modifications:
- Keep in mind that this corpus does not statistically represent anything in the language register.
- A large number of messages are not in their original form. Personal data are removed or masked.
- Duplicated, leading, and trailing whitespace is removed. Other punctuation, symbols, and emojis are kept intact.
- (Mis)spellings are kept intact.
- Messages longer than 2,000 characters are removed.
- Long non-Thai messages are removed. Duplicated messages (exact match) are removed.
- More characteristics of the data can be explored in this notebook: https://github.com/PyThaiNLP/wisesight-sentiment/blob/master/exploration.ipynb
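
The whitespace, length, and duplicate rules listed above can be sketched in a few lines of Python. This is only an illustration of those rules, not the corpus authors' actual pipeline. The function names, the order of the steps, measuring length in Unicode code points after whitespace normalization, and dropping empty messages are all assumptions. The rule about long non-Thai messages is left out because the source gives no threshold for it.

```python
import re

MAX_CHARS = 2000  # length limit stated in the list above


def normalize_whitespace(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_messages(messages):
    """Normalize whitespace, then drop over-long messages and exact duplicates.

    Hypothetical helper: the step order and the handling of empty
    messages are assumptions, not documented behavior.
    """
    seen = set()
    cleaned = []
    for raw in messages:
        msg = normalize_whitespace(raw)
        if not msg or len(msg) > MAX_CHARS:
            continue  # empty after trimming, or longer than the limit
        if msg in seen:
            continue  # exact duplicate of an earlier message
        seen.add(msg)
        cleaned.append(msg)
    return cleaned
```

Punctuation, symbols, emojis, and spellings are untouched by this sketch, which matches the items above that say they are kept intact.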


Data Set Information:

For wisesight-160 and wisesight-1000, which are samples from this corpus in a tokenized form, see [Web Link]

For data exploration and classification examples, see Thai Text Classification Benchmarks. [Web Link]

Personal data
- We are trying to exclude any known personally identifiable information from this data set.
- Usernames and the names of non-public figures are removed.
- Phone numbers are masked (e.g. 088-888-8888, 09-9999-9999, 0-2222-2222).
- If you see any personal data still remaining in the set, please tell us so we can remove it.
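
For readers who want to check their copy for leftover phone numbers, the following is a rough audit sketch. It is not the masking procedure used to build the corpus, which the source does not describe. The regular expression covers only the three hyphenated layouts shown in the examples above, and the rule that a masked number repeats one digit after the leading 0 is an assumption inferred from those same examples. All names are hypothetical.

```python
import re

# Matches only the three hyphenated layouts shown in the examples above.
PHONE_LIKE = re.compile(
    r"(?<!\d)(?:0\d{2}-\d{3}-\d{4}|0\d-\d{4}-\d{4}|0-\d{4}-\d{4})(?!\d)",
    re.ASCII,
)


def looks_masked(number):
    """True if every digit after the leading 0 is the same, as in 088-888-8888.

    Assumption: this pattern is inferred from the three examples only.
    """
    digits = [c for c in number if c.isdigit()]
    return len(set(digits[1:])) == 1


def find_unmasked_numbers(text):
    """Return phone-like substrings that do not look like masked placeholders."""
    return [m for m in PHONE_LIKE.findall(text) if not looks_masked(m)]
```

Numbers written without hyphens or in other layouts are not detected, so an empty result does not prove that a message is free of phone numbers.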

Sentiment value annotation methodology
- Sentiment values are assigned by human annotators.
- Each human annotator does their best to assign just one label, out of three, to a message.
- A message can be ambiguous. When possible, the judgement will be based solely on the text itself.
- In some situations, such as when the context is missing, the annotator may have to rely on their own world knowledge and guess.
- In some cases, the human annotator may have access to the message's context, such as an image. This additional information is not included as part of this corpus.
- Agreement, enjoyment, and satisfaction are positive. Disagreement, sadness, and disappointment are negative.
- Showing interest in a topic or in a product is counted as positive.
- In this sense, a question about a particular product could have a positive sentiment value if it shows interest in the product.
- Saying that another product or service is better is counted as negative.
- General information and news titles tend to be counted as neutral.


Attribute Information:

A message can have only one label: Question, Negative, Neutral, or Positive.

All messages are kept in plaintext files. All files are UTF-8 encoded.

One message per line. A newline character in the original message is replaced with a space.

One label per file.

q.txt Questions (575 messages)
neg.txt Messages with negative sentiment (6,823 messages)
neu.txt Messages with neutral sentiment (14,561 messages)
pos.txt Messages with positive sentiment (4,778 messages)
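
Because the layout is one UTF-8 file per label with one message per line, loading the corpus takes little code. The sketch below is a minimal example: the `load_corpus` function, the lowercase label strings, and the `data_dir` argument are hypothetical names chosen for illustration, and skipping blank lines is a defensive assumption rather than something the file format requires.

```python
from pathlib import Path

# File name -> label, following the file list above.
LABEL_FILES = {
    "q.txt": "question",
    "neg.txt": "negative",
    "neu.txt": "neutral",
    "pos.txt": "positive",
}


def load_corpus(data_dir):
    """Return a list of (message, label) pairs read from the four label files."""
    pairs = []
    for filename, label in LABEL_FILES.items():
        path = Path(data_dir) / filename
        with open(path, encoding="utf-8") as f:
            for line in f:  # one message per line
                message = line.rstrip("\n")
                if message:  # assumption: ignore blank lines, if any
                    pairs.append((message, label))
    return pairs
```

If the files match the counts listed above, `len(load_corpus(data_dir))` should come to 26,737 (575 + 6,823 + 14,561 + 4,778), which is a quick way to confirm that a download is complete.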


Relevant Papers:

N/A



Citation Request:

Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, and Charin Polpanumas. 2019. PyThaiNLP/wisesight-sentiment: First release. September.

BibTeX:

@software{bact_2019_3457447,
  author    = {Suriyawongkul, Arthit and
               Chuangsuwanich, Ekapol and
               Chormai, Pattarawat and
               Polpanumas, Charin},
  title     = {PyThaiNLP/wisesight-sentiment: First release},
  month     = sep,
  year      = 2019,
  publisher = {Zenodo},
  version   = {v1.0},
  doi       = {10.5281/zenodo.3457447},
  url       = {[Web Link]}
}

