
DeliciousMIL: A Data Set for Multi-Label Multi-Instance Learning with Instance Labels
Donated on 10/26/2016
This dataset includes 1) 12234 documents (8251 training, 3983 test) extracted from the DeliciousT140 dataset, 2) class labels for all documents, and 3) labels for a subset of the sentences in the test documents.
Dataset Characteristics
Text
Subject Area
Computer Science
Associated Tasks
Classification
Feature Type
Integer
# Instances
12234
# Features
8519
Dataset Information
Additional Information
This dataset provides ground-truth class labels for evaluating multi-instance learning models on both instance-level and bag-level label predictions. DeliciousMIL was first used in [1] to evaluate the performance of MLTM, a multi-label multi-instance learning method, for document classification and sentence labeling.
Multi-instance learning is a class of weakly supervised machine learning methods in which the learner receives a collection of labeled bags, each containing multiple instances. A bag is assigned a particular class label if and only if at least one of its instances has that class label.
DeliciousMIL consists of a subset of tagged web pages from the social bookmarking site delicious.com. The original web pages were obtained from the DeliciousT140 dataset, which was collected by [2] from delicious.com in June 2008. Users of delicious.com bookmarked each page with word tags. From this dataset, we extracted the text of each web page and chose 20 common tags as class labels: reference, design, programming, internet, computer, web, java, writing, English, grammar, style, language, books, education, philosophy, politics, religion, science, history, and culture. We randomly selected 12234 pages and randomly divided them into 8251 training and 3983 test documents. We also applied Porter stemming and standard stopword removal.
Each text document is a bag within the multi-instance learning framework, consisting of multiple sentences (instances). The goal is to predict document-level and sentence-level class labels on the test set using a model trained with only the document-level class labels of the training set. To evaluate the performance of such a model, we manually labeled 1468 randomly selected sentences from the test documents. Please see [1] for more information.
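The bag-labeling rule described above (a document carries class c if and only if at least one of its sentences does) can be sketched in a few lines. This is only an illustration of that rule, not code from [1]; the function name `bag_labels` and the toy label values are made up for the example.

```python
from typing import List


def bag_labels(instance_labels: List[List[int]]) -> List[int]:
    """Derive bag-level labels from per-instance binary label vectors.

    Class c is marked present in the bag if and only if at least one
    instance has a 1 in position c.
    """
    if not instance_labels:
        return []
    num_classes = len(instance_labels[0])
    return [
        int(any(row[c] == 1 for row in instance_labels))
        for c in range(num_classes)
    ]


# Toy document with three sentences and four classes (made-up values).
sentences = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
print(bag_labels(sentences))  # [1, 0, 1, 0]
```

In this dataset the training side gives only the bag-level vector, so a model has to work backwards from it to the sentence-level labels.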
Has Missing Values?
No
Variable Information
1) train-data.dat and test-data.dat: These files contain the bag-of-words representation of the training and test documents. Each line is of the form: <S_d> sentence_1 sentence_2 … sentence_{S_d} where S_d is the number of sentences in document d. Each sentence s is in the following format: <L_s> w_{1s} w_{2s} … w_{L_s s} where L_s is the number of words in sentence s, and w_{is} is an integer which indexes the i-th term in sentence s.
2) vocabs.txt: This file contains the list of words used for indexing the document representations in the data files. Each line contains: word, index.
3) train-label.dat and test-label.dat: Each file contains a D by C binary matrix, where D is the number of documents in that file and C=20 is the number of classes. The element b_{dc} is 1 if class c is present in document d and 0 otherwise.
4) test-sentlabel.dat and labeled_test_sentences.dat:
test-sentlabel.dat: This file contains class labels for the sentences of the test documents. Each line d is of the form: <y_{11d} y_{12d} … y_{1Cd}><y_{21d} y_{22d} … y_{2Cd}>...<y_{S_d1d} y_{S_d2d} … y_{S_dCd}> where y_{scd} is the binary indicator of class c for sentence s of document d. y_{scd} is 1 if class c is present in sentence s and 0 otherwise. Note that only 1468 randomly selected sentences are manually labeled. For the remaining, unlabeled sentences, we set y_{scd}=-1.
labeled_test_sentences.dat: This file contains the class labels for only the 1468 manually labeled sentences. Each line of this file is of the form: d s y_{s1d} y_{s2d} … y_{sCd} where d and s are the document and sentence indices, respectively.
5) labels.txt: This file contains the list of all class labels in this dataset. Each line is of the form: label, index.
Please see https://github.com/hsoleimani/MLTM for example Python code for reading these files.
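As a rough illustration of the line formats described above, the following sketch parses one line of a data file and one line of test-sentlabel.dat. It is not the code from the MLTM repository. It assumes the angle brackets appear literally in the files, as the format description suggests, and also accepts count tokens without brackets. The function names and the sample lines are hypothetical.

```python
import re
from typing import List


def _count(token: str) -> int:
    """Read a count token such as '<3>' (brackets optional) as an integer."""
    return int(token.strip("<>"))


def parse_document_line(line: str) -> List[List[int]]:
    """Parse one line of train-data.dat or test-data.dat.

    Assumed layout: <S_d> <L_1> w w ... <L_2> w w ...
    Returns one list of integer term indices per sentence.
    """
    tokens = line.split()
    if not tokens:
        return []
    num_sentences = _count(tokens[0])
    pos = 1
    sentences = []
    for _ in range(num_sentences):
        if pos >= len(tokens):
            raise ValueError("line ended before all sentences were read")
        length = _count(tokens[pos])
        words = [int(t) for t in tokens[pos + 1 : pos + 1 + length]]
        if len(words) != length:
            raise ValueError("line ended before the sentence was complete")
        sentences.append(words)
        pos += 1 + length
    return sentences


def parse_sentence_label_line(line: str) -> List[List[int]]:
    """Parse one line of test-sentlabel.dat.

    Each <...> group holds the C label values of one sentence;
    -1 marks a sentence that was not manually labeled.
    """
    groups = re.findall(r"<([^<>]*)>", line)
    return [[int(v) for v in group.split()] for group in groups]


# Made-up sample lines, shortened to three classes for readability.
print(parse_document_line("<2> <3> 10 4 7 <1> 42"))    # [[10, 4, 7], [42]]
print(parse_sentence_label_line("<0 1 0><-1 -1 -1>"))  # [[0, 1, 0], [-1, -1, -1]]
```

Reading the counts sequentially rather than splitting on brackets lets the document parser check that each sentence has exactly L_s terms.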
Dataset Files
| File | Size | 
|---|---|
| Data/train-data.dat | 5.2 MB | 
| Data/test-sentlabel.dat | 4.2 MB | 
| Data/test-data.dat | 2.5 MB | 
| Data/train-label.dat | 322.3 KB | 
| Data/test-label.dat | 155.6 KB | 
    pip install ucimlrepo

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    deliciousmil = fetch_ucirepo(id=418)

    # data (as pandas dataframes)
    X = deliciousmil.data.features
    y = deliciousmil.data.targets

    # metadata
    print(deliciousmil.metadata)

    # variable information
    print(deliciousmil.variables)
Miller, D., Soleimani, H., Zubiaga, A., & Fresno, V. (2016). DeliciousMIL: A Data Set for Multi-Label Multi-Instance Learning with Instance Labels [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DK74.
Creators
David Miller
Hossein Soleimani
Arkaitz Zubiaga
Víctor Fresno
DOI
10.24432/C5DK74
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.