Browse Datasets

Predict Students' Dropout and Academic Success

A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies. The dataset includes information known at the time of student enrollment (academic path, demographics, and socio-economic factors) and the students' academic performance at the end of the first and second semesters. The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three-category classification task in which there is a strong imbalance towards one of the classes.
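Because the task is described as strongly imbalanced, one common mitigation is to weight each class inversely to its frequency. The sketch below is illustrative only: the label list is hypothetical and not drawn from the dataset, and the formula `n_samples / (n_classes * count)` is the widely used "balanced" heuristic, not something the dataset authors prescribe.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical three-class label list with one dominant class (assumption).
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = balanced_class_weights(labels)

# Rarer classes receive larger weights.
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

Weights like these can be passed to a classifier's loss function so that errors on the minority classes count for more.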

Multivariate Gait Data

Bilateral (left, right) joint angle (ankle, knee, hip) time series data collected from 10 healthy subjects under 3 walking conditions (unbraced, knee braced, ankle braced). For each condition, each subject’s data consists of 10 consecutive gait cycles.

Product Classification and Clustering

This dataset was collected from PriceRunner, a popular product comparison platform. It includes 35311 product offers from 10 categories, provided by 306 different merchants. It is well suited to evaluating classification, clustering, and entity matching algorithms. Although it contains product-related data, it can also be applied to other problems involving text or short-text mining.

Recipe Reviews and User Feedback

The "Recipe Reviews and User Feedback Dataset" is a comprehensive repository of data encompassing various aspects of recipe reviews and user interactions. It includes essential information such as the recipe name, its ranking on the top 100 recipes list, a unique recipe code, and user details like user ID, user name, and an internal user reputation score. Each review comment is uniquely identified with a comment ID and comes with additional attributes, including the creation timestamp, reply count, and the number of up-votes and down-votes received. Users' sentiment towards recipes is quantified on a 1 to 5 star rating scale, with a score of 0 denoting an absence of rating. This dataset is a valuable resource for researchers and data scientists, facilitating endeavors in sentiment analysis, user behavior analysis, recipe recommendation systems, and more. It offers a window into the dynamics of recipe reviews and user feedback within the culinary website domain.
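Since a score of 0 denotes an absent rating rather than a very low one, aggregate statistics should usually exclude it. The following minimal sketch shows the difference; the `stars` list is made-up example data and `mean_rating` is a hypothetical helper, not part of the dataset.

```python
def mean_rating(stars):
    """Average star rating, treating 0 as 'not rated' and skipping it."""
    rated = [s for s in stars if s != 0]
    if not rated:
        return None  # no actual ratings available
    return sum(rated) / len(rated)


# Made-up ratings: two of the five reviews carry no rating (0).
stars = [5, 0, 4, 0, 3]

naive = sum(stars) / len(stars)  # counts the zeros as real ratings
adjusted = mean_rating(stars)    # ignores the zeros

print(naive, adjusted)
```

Counting the zeros as ratings pulls the average down, which would understate how well the recipe was actually received.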

RealWaste

An image classification dataset of waste items across 9 major material types, collected within an authentic landfill environment.

Bengali Hate Speech Detection Dataset

The dataset can be used for hate speech detection in Bengali social media texts. It is categorized into political, personal, geopolitical, religious, and gender-abusive hate that is either directed or generalized towards a specific person, entity, or group. The data and lexicons contain content that is racist, sexist, homophobic, and offensive in many different ways. The dataset was collected and subsequently annotated for research purposes only. The authors accept no liability for very offensive or hateful statements it contains, whether directed at a specific person or entity or generalized towards a group. Please use it at your own risk.

Sundanese Twitter Dataset

This dataset contains tweets in Sundanese, the second-largest local language in Indonesia, and is used for emotion classification.

Printed Circuit Board Processed Image

This CSV dataset, originally used for test-pad coordinate retrieval from PCB images, has potential applications such as classification (e.g., grey test pad detection), anomaly detection (e.g., fake test pads), or clustering for grey test pad discovery. The dataset includes X and Y, representing pixel positions, and R, G, B values determining pixel color (min-max normalized from 0-255). A 'Grey' field indicates approximately grey pixels. The dataset was originally used for a 2-stage discovery of a high number of test pad clusters (>100) in a dataset presented in: Swee Chuan Tan and Schumann Tong Wei Kit, "Fast retrievals of test-pad coordinates from photo images of printed circuit boards," 2016 International Conference on Advanced Mechatronic Systems (ICAMechS), 2016, pp. 464-467, https://api.semanticscholar.org/CorpusID:38544897. This dataset contains more pixels than the one in the paper because a different extraction method was used.
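As a rough illustration of the color fields, the sketch below min-max normalizes 8-bit channel values from the 0-255 range to [0, 1] and flags a pixel as approximately grey when its three channels are nearly equal. The grey test and its tolerance `tol` are assumptions made for this example; the description does not state how the 'Grey' field was actually computed.

```python
def normalize_channel(value, lo=0, hi=255):
    """Min-max normalize a channel value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def is_approx_grey(r, g, b, tol=0.05):
    """Assumed rule: grey if the normalized channels differ by at most tol."""
    return max(r, g, b) - min(r, g, b) <= tol


# Made-up 8-bit pixel whose channels are close to each other.
pixel = (128, 130, 127)
r, g, b = (normalize_channel(c) for c in pixel)

print(is_approx_grey(r, g, b))
print(is_approx_grey(1.0, 0.0, 0.0))  # saturated red is not grey
```

A stricter or looser tolerance changes how many pixels count as grey, so it would need tuning against the provided 'Grey' field.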
