TamilSentiMix
Donated on 8/13/2023
We created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube.
Dataset Characteristics
Multivariate, Text
Subject Area
Computer Science
Associated Tasks
Classification
Feature Type
Categorical
# Instances
15744
# Features
-
Dataset Information
Additional Information
Understanding the sentiment of a comment on a video or an image is an essential task in many applications, and sentiment analysis of text can inform various decision-making processes. One such application is analysing the popular sentiment towards videos on social media based on viewer comments. However, social media comments do not follow strict rules of grammar, and they often mix more than one language, frequently written in non-native scripts. The lack of annotated code-mixed data for a low-resourced language like Tamil adds to the difficulty. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.
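The description above mentions inter-annotator agreement but does not name a statistic on this page. As a minimal sketch, assuming two annotators labelled the same comments, Cohen's kappa could be computed as follows. The function name, the label names, and the sample annotations are hypothetical illustrations, and the paper may use a different agreement measure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations, not taken from the corpus.
annotator_1 = ["positive", "negative", "positive", "mixed", "positive", "negative"]
annotator_2 = ["positive", "negative", "mixed", "mixed", "positive", "positive"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one sentiment class dominates the comments.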
Has Missing Values?
No
Introductory Paper
By Bharathi Raja Chakravarthi, V. Muralidaran, R. Priyadharshini, John P. McCrae. 2020
Published in Workshop on Spoken Language Technologies for Under-resourced Languages
Dataset Files
File | Size |
---|---|
mcs_ds_edited_iter_shuffled.csv | 4 KB |
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
tamilsentimix = fetch_ucirepo(id=845)

# data (as pandas dataframes)
X = tamilsentimix.data.features
y = tamilsentimix.data.targets

# metadata
print(tamilsentimix.metadata)

# variable information
print(tamilsentimix.variables)
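Once the targets are loaded, a common first check is the class balance of the sentiment labels. The following is a minimal sketch using only the standard library. The helper name `label_distribution` and the sample labels are hypothetical, and the actual label names in the corpus are not listed on this page.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        label: (count, count / total)
        for label, count in counts.most_common()
    }


# Hypothetical labels standing in for the fetched target column.
sample = ["positive", "positive", "negative", "mixed", "positive"]
dist = label_distribution(sample)
for label, (count, share) in dist.items():
    print(f"{label}: {count} ({share:.0%})")
```

The helper accepts any iterable of labels, so with the fetched data it could be given the target column, for example `y.iloc[:, 0]`, assuming `y` holds a single target column.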
Raja, B., Muralidaran, V., Priyadharshini, R., & P., J. (2020). TamilSentiMix [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5TW4T.
Creators
Bharathi Raja Chakravarthi
bharathiraja.akr@gmail.com
National University of Ireland Galway
V. Muralidaran
R. Priyadharshini
John P. McCrae
DOI
10.24432/C5TW4T
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.