Center for Machine Learning and Intelligent Systems

Wisesight Sentiment Corpus Data Set
Download: Data Folder, Data Set Description

Abstract: Social media messages in Thai language with sentiment label (positive, neutral, negative, question).

Data Set Characteristics: Multivariate, Text
Number of Instances: 26737
Area: Social
Attribute Characteristics: N/A
Number of Attributes: 4
Date Donated: 2020-08-25
Associated Tasks: Classification
Missing Values? N/A
Number of Web Hits: 2717


Source:

https://github.com/PyThaiNLP/wisesight-sentiment/

Source: Facebook, Twitter, web forums.

Size: 26,737 messages

Language: Central Thai

Style: Informal and conversational, with some news headlines and advertisements.

Time period: Around 2016 to early 2019, with a small amount from other periods.

Domains: Mixed. The majority are consumer products and services (restaurants, cosmetics, drinks, cars, hotels), with some current affairs.

Privacy:
- Only messages that were made available to the public on the internet (websites, blogs, social network sites).
- For Facebook, this means public comments (visible to everyone) made on a public page.
- Private/protected messages and messages in groups, chat, and inbox are not included.

Alterations and modifications:
- Keep in mind that this corpus does not statistically represent anything in the language register.
- A large number of messages are not in their original form. Personal data are removed or masked.
- Duplicated, leading, and trailing whitespace is removed. Other punctuation, symbols, and emojis are kept intact.
- (Mis)spellings are kept intact.
- Messages longer than 2,000 characters are removed.
- Long non-Thai messages are removed. Duplicate messages (exact matches) are removed.
- More characteristics of the data can be explored in this notebook: https://github.com/PyThaiNLP/wisesight-sentiment/blob/master/exploration.ipynb
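
The whitespace, length, and exact-duplicate rules listed above can be sketched in a few lines of Python. This is only an illustration, not the corpus authors' actual pipeline: the function names, the order of the steps, and the choice to drop empty messages are assumptions. The removal of long non-Thai messages and the masking of personal data are not covered.

```python
import re

MAX_CHARS = 2000  # length cutoff stated in the list above

_WS_RUN = re.compile(r"\s+")


def normalize(message: str) -> str:
    """Collapse each run of whitespace (newlines included) to one space and trim both ends."""
    return _WS_RUN.sub(" ", message).strip()


def clean(messages):
    """Return normalized messages, skipping over-long ones and exact duplicates.

    Assumptions of this sketch: the length check runs after normalization,
    empty messages are dropped, and the first occurrence of a duplicate wins.
    """
    seen = set()
    kept = []
    for raw in messages:
        text = normalize(raw)
        if not text or len(text) > MAX_CHARS:
            continue
        if text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept
```

Punctuation, symbols, emojis, and spellings pass through untouched, which matches the list above. Only whitespace is rewritten.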


Data Set Information:

For wisesight-160 and wisesight-1000, which are samples from this corpus in a tokenized form, see [Web Link]

For data exploration and classification examples, see Thai Text Classification Benchmarks. [Web Link]

Personal data
- We try to exclude any known personally identifiable information from this data set.
- Usernames and non-public figure names are removed.
- Phone numbers are masked (e.g. 088-888-8888, 09-9999-9999, 0-2222-2222).
- If you see any personal data remaining in the set, please tell us so we can remove it.

Sentiment value annotation methodology
- Sentiment values are assigned by human annotators.
- A human annotator makes a best effort to assign just one label, out of three, to a message.
- A message can be ambiguous. When possible, the judgement will be based solely on the text itself.
- In some situations, such as when the context is missing, the annotator may have to rely on their own world knowledge and guess.
- In some cases, the human annotator may have access to the message's context, such as an image. This additional information is not included as part of this corpus.
- Agreement, enjoyment, and satisfaction are positive. Disagreement, sadness, and disappointment are negative.
- Showing interest in a topic or in a product is counted as positive.
- In this sense, a question about a particular product could have a positive sentiment value if it shows interest in the product.
- Saying that another product or service is better is counted as negative.
- General information or news titles tend to be counted as neutral.


Attribute Information:

A message can have only one label: Question, Negative, Neutral, or Positive.

All messages are kept in plaintext files. All files are UTF-8 encoded.

One message per line. A newline character in the original message is replaced with a space.

One label per file.

q.txt Questions (575 messages)
neg.txt Messages with negative sentiment (6,823 messages)
neu.txt Messages with neutral sentiment (14,561 messages)
pos.txt Messages with positive sentiment (4,778 messages)
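
Given the layout described above (UTF-8 plaintext, one message per line, one label per file), the corpus could be read into (message, label) pairs roughly as follows. This is a minimal sketch: `data_dir`, `load_corpus`, `label_shares`, and the lowercase label strings are hypothetical names chosen for illustration, and skipping blank lines is an assumption.

```python
from collections import Counter
from pathlib import Path

# File names as listed above; the label strings are this sketch's own choice.
LABEL_FILES = {
    "q.txt": "question",
    "neg.txt": "negative",
    "neu.txt": "neutral",
    "pos.txt": "positive",
}


def load_corpus(data_dir):
    """Return a list of (message, label) pairs read from the four label files."""
    pairs = []
    for filename, label in LABEL_FILES.items():
        path = Path(data_dir) / filename
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                message = line.rstrip("\r\n")
                if message:  # assumption: blank lines carry no message
                    pairs.append((message, label))
    return pairs


def label_shares(pairs):
    """Return each label's fraction of the loaded messages."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

With the counts listed above, the four files sum to 575 + 6,823 + 14,561 + 4,778 = 26,737 messages. Neutral messages make up roughly 54% of the corpus and questions about 2%, so the classes are noticeably imbalanced.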


Relevant Papers:

N/A



Citation Request:

Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, and Charin Polpanumas. 2019. PyThaiNLP/wisesight-sentiment: First release. September.

BibTeX:

@software{bact_2019_3457447,
author = {Suriyawongkul, Arthit and
Chuangsuwanich, Ekapol and
Chormai, Pattarawat and
Polpanumas, Charin},
title = {PyThaiNLP/wisesight-sentiment: First release},
month = sep,
year = 2019,
publisher = {Zenodo},
version = {v1.0},
doi = {10.5281/zenodo.3457447},
url = {[Web Link]}
}

