Bengali Hate Speech Detection Dataset
Donated on 4/23/2022
The dataset can be used for hate speech detection in Bengali social media texts. The texts are categorized as political, personal, geopolitical, religious, or gender abusive hate, either directed at a specific person or entity or generalized towards a group. The data and lexicons contain content that is racist, sexist, homophobic, and offensive in many different ways. The dataset was collected and annotated for research purposes only. The authors accept no liability for statements that are very offensive or hateful, whether directed at a specific person or entity or generalized towards a group. Please use it at your own risk.
Dataset Characteristics
Text
Subject Area
Computer Science
Associated Tasks
Classification
Feature Type
-
# Instances
4500
# Features
-
Dataset Information
For what purpose was the dataset created?
Bengali is spoken by 230 million people in Bangladesh and India, making it one of the major languages in the world. As in other major languages such as English, anti-social behavior in Bengali, hate speech in particular, is becoming more pervasive. Bengali hateful statements can be very severe, and they show that hate speech is being contextualized from the personal level to religious, political, and geopolitical levels. As hate speech grows more pervasive, it could lead to serious consequences such as hate crimes, regardless of language, geographic location, or ethnicity. Automatic identification of hate speech in social media, along with raising public awareness, is therefore a challenging and nontrivial task for Bengali too. We prepared the first hate speech detection dataset for this kind of problem in the under-resourced Bengali language. The main objective was to promote NLP research for Bengali and to foster reproducible research by making the datasets, source code, models, and notebooks available.
What do the instances in this dataset represent?
Each instance is a Bengali social media text with its associated label: one of political, personal, geopolitical, religious, or gender abusive hate.
Are there recommended data splits?
Training, validation, test
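The entry above names the three splits but gives no ratios. If the splits have to be built by hand, one option is a stratified split that keeps the label proportions roughly equal across the three parts. The sketch below is only an illustration: the function name `stratified_split` and the 80/10/10 ratios are assumptions, not something stated by the dataset authors.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train_idx, val_idx, test_idx) index lists.

    Indices are grouped by label and each group is split separately,
    so every split sees roughly the same label proportions.
    The default fractions are assumed, not taken from the dataset.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train, val, test = [], [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_val = int(len(idxs) * val_frac)
        n_test = int(len(idxs) * test_frac)
        val.extend(idxs[:n_val])
        test.extend(idxs[n_val:n_val + n_test])
        train.extend(idxs[n_val + n_test:])
    return train, val, test
```

Calling `stratified_split(labels)` on the list of category labels returns index lists that can be used to select the matching texts for each split.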
Does the dataset contain data that might be considered sensitive in any way?
The data and lexicons contain content that is racist, sexist, homophobic, and offensive in many different ways. The dataset was collected and annotated for research purposes only. The authors accept no liability for statements that are very offensive or hateful, whether directed at a specific person or entity or generalized towards a group. Please use it at your own risk.
Was there any data preprocessing performed?
PoS tagging, removal of proper nouns, hashtag normalization, stemming, removal of emojis and duplicates, removal of infrequent words
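Some of the steps listed above can be approximated with the Python standard library. The sketch below covers only hashtag normalization, emoji removal, duplicate removal, and infrequent-word removal. PoS tagging, proper-noun removal, and stemming need Bengali-specific tools and are left out. All function names, the emoji ranges, and the `min_count` threshold are assumptions for illustration and are not the authors' actual pipeline.

```python
import re
from collections import Counter

# Rough emoji code-point ranges; an assumption, not an exhaustive filter.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]+"
)


def normalize_hashtags(text):
    """Drop the '#' and turn underscores inside the tag into spaces."""
    return re.sub(r"#(\S+)", lambda m: m.group(1).replace("_", " "), text)


def strip_emojis(text):
    """Remove characters that fall in the emoji ranges above."""
    return EMOJI_PATTERN.sub("", text)


def drop_duplicates(texts):
    """Keep the first occurrence of each text, preserving order."""
    seen, unique = set(), []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def drop_infrequent_words(texts, min_count=2):
    """Remove whitespace-separated tokens seen fewer than min_count times."""
    counts = Counter(word for text in texts for word in text.split())
    return [
        " ".join(w for w in text.split() if counts[w] >= min_count)
        for text in texts
    ]


def preprocess(texts, min_count=2):
    """Apply the four steps in order and collapse extra whitespace."""
    cleaned = [
        " ".join(strip_emojis(normalize_hashtags(t)).split()) for t in texts
    ]
    cleaned = drop_duplicates(cleaned)
    return drop_infrequent_words(cleaned, min_count)
```

Tokens are split on whitespace rather than matched with `\w`, because Python's `\w` does not match Bengali combining vowel signs and would cut words apart.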
Additional Information
Please check https://github.com/rezacsedu/Bengali-Hate-Speech-Dataset for more details about this dataset.
Has Missing Values?
No
Introductory Paper
By Md. Rezaul Karim, Bharathi Raja Chakravarthi, Mihael Arcan, John P. McCrae, Michael Cochez. 2020
Published in 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)
Dataset Files
File | Size |
---|---|
Capture.PNG | 482.8 KB |
Bengali_hate_speech_dataset.zip | 217.2 KB |
    pip install ucimlrepo

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    bengali_hate_speech_detection_dataset = fetch_ucirepo(id=719)

    # data (as pandas dataframes)
    X = bengali_hate_speech_detection_dataset.data.features
    y = bengali_hate_speech_detection_dataset.data.targets

    # metadata
    print(bengali_hate_speech_detection_dataset.metadata)

    # variable information
    print(bengali_hate_speech_detection_dataset.variables)

Kanti Dey, S., Cochez, M., & Karim, M. (2020). Bengali Hate Speech Detection Dataset [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5PD07.
Creators
Sumon Kanti Dey
sumonkantidey23@gmail.com
Noakhali Science and Technology University, Bangladesh
Michael Cochez
michaelcochez@gmail.com
Vrije Universiteit Amsterdam
Md. Rezaul Karim
rezaul.karim.fit@gmail.com
DOI
10.24432/C5PD07
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.