Wisesight Sentiment Corpus
Donated on 8/24/2020
Social media messages in Thai, each labeled with a sentiment (positive, neutral, negative, or question).
Dataset Characteristics
Multivariate, Text
Subject Area
Social Science
Associated Tasks
Classification
Feature Type
-
# Instances
26737
# Features
4
Dataset Information
Additional Information
For wisesight-160 and wisesight-1000, which are samples from this corpus in tokenized form, see https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization

For data exploration and classification examples, see Thai Text Classification Benchmarks: https://github.com/PyThaiNLP/classification-benchmarks

Personal data
- We try to exclude any known personally identifiable information from this data set.
- Usernames and names of non-public figures are removed.
- Phone numbers are masked (e.g. 088-888-8888, 09-9999-9999, 0-2222-2222).
- If you see any personal data remaining in the set, please tell us so we can remove it.

Sentiment value annotation methodology
- Sentiment values are assigned by human annotators.
- A human annotator makes a best effort to assign just one label, out of three, to a message.
- A message can be ambiguous. When possible, the judgement is based solely on the text itself.
- In some situations, such as when the context is missing, the annotator may have to rely on their own world knowledge and guess.
- In some cases, the human annotator may have access to the message's context, such as an image. This additional information is not included in the corpus.
- Agreement, enjoyment, and satisfaction are positive. Disagreement, sadness, and disappointment are negative.
- Showing interest in a topic or a product is counted as positive.
- In this sense, a question about a particular product can have a positive sentiment value if it shows interest in the product.
- Saying that another product or service is better is counted as negative.
- General information and news titles tend to be counted as neutral.
Has Missing Values?
No
Variable Information
A message can have only one label: Question, Negative, Neutral, or Positive.

- All messages are kept in plaintext files.
- All files are UTF-8 encoded.
- One message per line. A newline character in the original message is replaced with a space.
- One label per file.

Files and message counts:
- q.txt: Questions (575 messages)
- neg.txt: Messages with negative sentiment (6,823)
- neu.txt: Messages with neutral sentiment (14,561)
- pos.txt: Messages with positive sentiment (4,778)
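Because each label lives in its own UTF-8 file with one message per line, the corpus can be read into (message, label) pairs with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `load_corpus`, the `LABEL_FILES` mapping, and the lowercase label strings are illustrative assumptions, and the directory you pass in is assumed to contain the four files named above.

```python
from pathlib import Path

# File name -> label string. The file names come from the corpus description;
# the lowercase label strings are an illustrative choice.
LABEL_FILES = {
    "q.txt": "question",
    "neg.txt": "negative",
    "neu.txt": "neutral",
    "pos.txt": "positive",
}


def load_corpus(corpus_dir):
    """Return a list of (message, label) pairs read from the four label files."""
    samples = []
    for filename, label in LABEL_FILES.items():
        path = Path(corpus_dir) / filename
        # The files are UTF-8 encoded, with one message per line.
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                message = line.rstrip("\n")
                if message:  # skip blank lines, if any
                    samples.append((message, label))
    return samples


# Example call (assumes the archive was extracted to this directory):
# samples = load_corpus("wisesight-sentiment-master")
```

If the files match the counts listed above, the returned list should hold 26,737 pairs (575 + 6,823 + 14,561 + 4,778), which is a quick way to check that nothing was lost while reading.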
Dataset Files
File | Size |
---|---|
wisesight-sentiment-master/kaggle-competition/train.txt | 5.4 MB |
wisesight-sentiment-master/neu.txt | 3.6 MB |
wisesight-sentiment-master/exploration.ipynb | 2.1 MB |
wisesight-sentiment-master/neg.txt | 1.6 MB |
wisesight-sentiment-master/pos.txt | 743.4 KB |
    pip install ucimlrepo

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    wisesight_sentiment_corpus = fetch_ucirepo(id=600)

    # data (as pandas dataframes)
    X = wisesight_sentiment_corpus.data.features
    y = wisesight_sentiment_corpus.data.targets

    # metadata
    print(wisesight_sentiment_corpus.metadata)

    # variable information
    print(wisesight_sentiment_corpus.variables)
Wisesight Sentiment Corpus [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5PG7K.
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.