
Wisesight Sentiment Corpus
Donated on 8/24/2020
Social media messages in Thai, each with a sentiment label (positive, neutral, negative, or question).
Dataset Characteristics
Multivariate, Text
Subject Area
Social Science
Associated Tasks
Classification
Feature Type
-
# Instances
26737
# Features
4
Dataset Information
Additional Information
For wisesight-160 and wisesight-1000, which are samples from this corpus in tokenized form, see https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization

For data exploration and classification examples, see Thai Text Classification Benchmarks: https://github.com/PyThaiNLP/classification-benchmarks

Personal data
- We try to exclude any known personally identifiable information from this data set.
- Usernames and non-public figure names are removed.
- Phone numbers are masked (e.g. 088-888-8888, 09-9999-9999, 0-2222-2222).
- If you see any personal data remaining in the set, please tell us so we can remove it.

Sentiment value annotation methodology
- Sentiment values are assigned by human annotators.
- A human annotator makes a best effort to assign just one label, out of three, to a message.
- A message can be ambiguous. When possible, the judgement is based solely on the text itself.
- In some situations, such as when the context is missing, the annotator may have to rely on their own world knowledge and guess.
- In some cases, the human annotator may have access to the message's context, such as an image. This additional information is not included in the corpus.
- Agreement, enjoyment, and satisfaction are positive. Disagreement, sadness, and disappointment are negative.
- Showing interest in a topic or in a product is counted as positive.
- In this sense, a question about a particular product can have a positive sentiment value if it shows interest in the product.
- Saying that another product or service is better is counted as negative.
- General information and news titles tend to be counted as neutral.
Has Missing Values?
No
Variable Information
A message can have only one label: Question, Negative, Neutral, or Positive.

- All messages are kept in plaintext files.
- All files are UTF-8 encoded.
- One message per line. A newline character in the original message is replaced with a space.
- One label per file.

q.txt: Questions (575 messages)
neg.txt: Messages with negative sentiment (6,823)
neu.txt: Messages with neutral sentiment (14,561)
pos.txt: Messages with positive sentiment (4,778)
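The file layout described above (one UTF-8 file per label, one message per line) can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `load_wisesight`, the lowercase label strings, and the directory argument are assumptions made for illustration, and it presumes the four files have already been downloaded into one local directory.

```python
from pathlib import Path

# File name -> label, following the file list in the variable information.
# The lowercase label strings are an illustrative choice, not part of the corpus.
LABEL_FILES = {
    "q.txt": "question",
    "neg.txt": "negative",
    "neu.txt": "neutral",
    "pos.txt": "positive",
}

# Message counts as stated in the dataset description.
EXPECTED_COUNTS = {
    "question": 575,
    "negative": 6823,
    "neutral": 14561,
    "positive": 4778,
}


def load_wisesight(data_dir):
    """Return a list of (message, label) pairs read from the four label files."""
    pairs = []
    for file_name, label in LABEL_FILES.items():
        path = Path(data_dir) / file_name
        # Files are UTF-8 encoded with one message per line.
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                message = line.rstrip("\n")
                if message:  # skip blank lines, such as a trailing empty line
                    pairs.append((message, label))
    return pairs
```

Calling `load_wisesight("path/to/corpus")` (a hypothetical path) returns the pairs in file order. Comparing the per-label totals against `EXPECTED_COUNTS`, which sum to the 26,737 instances listed for the dataset, is a quick way to confirm that the download is complete.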
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
wisesight_sentiment_corpus = fetch_ucirepo(id=600)

# data (as pandas dataframes)
X = wisesight_sentiment_corpus.data.features
y = wisesight_sentiment_corpus.data.targets

# metadata
print(wisesight_sentiment_corpus.metadata)

# variable information
print(wisesight_sentiment_corpus.variables)
Wisesight Sentiment Corpus. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5PG7K.
@misc{misc_wisesight_sentiment_corpus_600,
  title        = {{Wisesight Sentiment Corpus}},
  year         = {2020},
  howpublished = {UCI Machine Learning Repository},
  note         = {{DOI}: https://doi.org/10.24432/C5PG7K}
}
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.