Wisesight Sentiment Corpus

Donated on 8/24/2020

Social media messages in the Thai language, each with a sentiment label (positive, neutral, negative, or question).

Dataset Characteristics

Multivariate, Text

Subject Area

Social Science

Associated Tasks


Feature Type


# Instances


# Features


Dataset Information

Additional Information

For wisesight-160 and wisesight-1000, which are tokenized samples from this corpus, see https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization

For data exploration and classification examples, see Thai Text Classification Benchmarks: https://github.com/PyThaiNLP/classification-benchmarks

Personal data

- We try to exclude any known personally identifiable information from this data set.
- Usernames and names of non-public figures are removed.
- Phone numbers are masked (e.g. 088-888-8888, 09-9999-9999, 0-2222-2222).
- If you see any personal data still remaining in the set, please tell us so we can remove it.

Sentiment value annotation methodology

- Sentiment values are assigned by human annotators.
- A human annotator makes his/her best effort to assign just one label, out of three, to a message.
- A message can be ambiguous. When possible, the judgement is based solely on the text itself.
- In some situations, such as when the context is missing, the annotator may have to rely on his/her own world knowledge and just guess.
- In some cases, the human annotator may have access to the message's context, such as an image. This additional information is not included as part of this corpus.
- Agreement, enjoyment, and satisfaction are positive. Disagreement, sadness, and disappointment are negative.
- Showing interest in a topic or in a product is counted as positive.
- In this sense, a question about a particular product could have a positive sentiment value if it shows interest in the product.
- Saying that another product or service is better is counted as negative.
- General information and news titles tend to be counted as neutral.

Has Missing Values?


Variable Information

A message can have only one label: Question, Negative, Neutral, or Positive.

All messages are kept in plaintext files:

- All files are UTF-8 encoded.
- One message per line. A newline character in the original message is replaced with a space.
- One label per file.

Files:

- q.txt: Questions (575 messages)
- neg.txt: Messages with negative sentiment (6,823)
- neu.txt: Messages with neutral sentiment (14,561)
- pos.txt: Messages with positive sentiment (4,778)
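The file layout described above (one UTF-8 file per label, one message per line) can be read with a few lines of Python. The following is a minimal sketch, not part of the corpus distribution: the function names `load_corpus` and `label_counts`, the lowercase label strings, and the choice to skip blank lines are all assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path

# File name -> label, following the file list in the description.
# The lowercase label strings are an illustrative choice.
LABEL_FILES = {
    "q.txt": "question",
    "neg.txt": "negative",
    "neu.txt": "neutral",
    "pos.txt": "positive",
}


def load_corpus(data_dir):
    """Return a list of (message, label) pairs read from the four label files."""
    pairs = []
    for filename, label in LABEL_FILES.items():
        path = Path(data_dir) / filename
        # Files are UTF-8 encoded with one message per line.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                message = line.rstrip("\n")
                if message:  # assumption: blank lines carry no message
                    pairs.append((message, label))
    return pairs


def label_counts(pairs):
    """Count how many messages carry each label."""
    return Counter(label for _, label in pairs)
```

Calling `label_counts(load_corpus(path_to_files))` on an unmodified copy of the four files gives per-label totals that can be compared against the message counts listed above, as a quick check that the download is complete.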


