Bengali Hate Speech Detection Dataset

Donated on 4/23/2022

The dataset can be used for hate speech detection in Bengali social media texts. Texts are categorized as political, personal, geopolitical, religious, or gender abusive hate, either directed at a specific person or entity or generalized towards a group. The data and lexicons contain content that is racist, sexist, homophobic, and offensive in many different ways. The dataset was collected and subsequently annotated for research purposes only. The authors accept no liability for statements that are very offensive or hateful, whether directed at a specific person or entity or generalized towards a group. Please use it at your own risk.

Dataset Characteristics

Text

Subject Area

Computer Science

Associated Tasks

Classification

Feature Type

-

# Instances

4500

# Features

-

Dataset Information

For what purpose was the dataset created?

Bengali is spoken by 230 million people in Bangladesh and India, making it one of the major languages of the world. As in other major languages such as English, anti-social behavior in Bengali, hate speech in particular, is becoming more pervasive. Bengali hateful statements can be very severe, and they show that hate speech is being contextualized from the personal level to religious, political, and geopolitical levels. Because such hate speech is becoming more pervasive, it could lead to serious consequences such as hate crimes, regardless of language, geographic location, or ethnicity. Automatically identifying hate speech in social media and raising public awareness is therefore a challenging, nontrivial task for Bengali too. We prepared the first hate speech detection dataset of this kind for the under-resourced Bengali language. The main objective was to promote NLP research for Bengali and to foster reproducible research by making the datasets, source code, models, and notebooks available.

What do the instances in this dataset represent?

Each instance is a Bengali social media text and its associated label, one of political, personal, geopolitical, religious, or gender abusive hate.
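As a rough illustration of the instance structure described above, the following Python sketch models one record as a text plus a category label. The field names `text` and `label`, the `Instance` class, and the exact label strings are assumptions made for illustration; the released files may use a different layout or encoding.

```python
from dataclasses import dataclass

# Category names follow the description above; the exact label strings
# used in the released files are an assumption.
CATEGORIES = ("political", "personal", "geopolitical", "religious", "gender abusive")


@dataclass(frozen=True)
class Instance:
    """Hypothetical container for one dataset record."""
    text: str   # a Bengali social media text
    label: str  # one of CATEGORIES

    def __post_init__(self):
        # Reject labels outside the five documented categories.
        if self.label not in CATEGORIES:
            raise ValueError(f"unknown label: {self.label!r}")


# Placeholder text only; not taken from the dataset.
example = Instance(text="<Bengali text here>", label="political")
```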

Are there recommended data splits?

Training, validation, test
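The page does not state the split ratios, so the sketch below only illustrates one common way to produce training, validation, and test partitions while keeping the label distribution similar in each. The function name `stratified_split`, the 80/10/10 ratios, and the toy labels are all assumptions, not the authors' procedure.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, splitting each label group by `ratios`."""
    # Group instance indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = int(len(indices) * ratios[0])
        n_val = int(len(indices) * ratios[1])
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy labels only; this is not the real class distribution.
toy_labels = ["political"] * 50 + ["religious"] * 30 + ["personal"] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

If the repository ships predefined split files, those should take precedence over any split computed this way.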

Does the dataset contain data that might be considered sensitive in any way?

The data and lexicons contain content that is racist, sexist, homophobic, and offensive in many different ways. The dataset was collected and subsequently annotated for research purposes only. The authors accept no liability for statements that are very offensive or hateful, whether directed at a specific person or entity or generalized towards a group. Please use it at your own risk.

Was there any data preprocessing performed?

PoS tagging, removal of proper nouns, hashtag normalization, stemming, removal of emojis and duplicates, and removal of infrequent words.
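A few of the steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' pipeline: PoS tagging, proper noun removal, and stemming are omitted because they need Bengali-specific tools, and the emoji ranges, the treatment of hashtags (dropping only the `#`), the whitespace tokenization, and the `min_count` threshold are all assumptions.

```python
import re
from collections import Counter

# Rough emoji ranges; an approximation, not an exhaustive Unicode emoji list.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F]"
)


def normalize(text):
    """Drop '#' from hashtags, remove emojis, and collapse whitespace."""
    text = text.replace("#", " ")   # keep the hashtag word, drop the marker
    text = EMOJI_RE.sub(" ", text)
    return " ".join(text.split())


def preprocess(texts, min_count=2):
    """Return token lists after normalization, deduplication, and rare-word removal."""
    cleaned = [normalize(t) for t in texts]
    unique = list(dict.fromkeys(cleaned))  # drop exact duplicates, keep order
    counts = Counter(tok for t in unique for tok in t.split())
    return [[tok for tok in t.split() if counts[tok] >= min_count] for t in unique]


# Neutral toy sentences, not taken from the dataset.
sample = ["#ঢাকা ভালো 😀", "ঢাকা ভালো", "ঢাকা খারাপ"]
tokens = preprocess(sample)
```

In this toy run the first two texts become identical after normalization and are merged, and only the word that appears in both remaining texts survives the frequency filter.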

Additional Information

Please check https://github.com/rezacsedu/Bengali-Hate-Speech-Dataset for more details about this dataset.

Has Missing Values?

No

Introductory Paper

Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network

By Md. Rezaul Karim, Bharathi Raja Chakravarthi, Mihael Arcan, John P. McCrae, Michael Cochez. 2020

Published in 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)


Keywords

NLP, Bengali, Text classification, fairness

Creators

Sumon Kanti Dey

sumonkantidey23@gmail.com

Noakhali Science and Technology University, Bangladesh

Michael Cochez

michaelcochez@gmail.com

Vrije Universiteit Amsterdam

Md. Rezaul Karim

rezaul.karim.fit@gmail.com

License
