DeliciousMIL: A Data Set for Multi-Label Multi-Instance Learning with Instance Labels

Donated on 10/26/2016

This dataset includes 1) 12234 documents (8251 training, 3983 test) extracted from the DeliciousT140 dataset, 2) class labels for all documents, and 3) labels for a subset of the sentences in the test documents.

Dataset Characteristics

Text

Subject Area

Computer Science

Associated Tasks

Classification

Feature Type

Integer

# Instances

12234

# Features

8519

Dataset Information

Additional Information

This dataset provides ground-truth class labels for evaluating multi-instance learning models on both instance-level and bag-level label predictions. DeliciousMIL was first used in [1] to evaluate MLTM, a multi-label multi-instance learning method, on document classification and sentence labeling.

Multi-instance learning is a class of weakly supervised machine learning methods in which the learner receives a collection of labeled bags, each containing multiple instances. A bag is said to have a particular class label if and only if at least one of its instances has that class label.

DeliciousMIL consists of a subset of tagged web pages from the social bookmarking site delicious.com. The original web pages were obtained from the DeliciousT140 dataset, which was collected by [2] from delicious.com in June 2008. Users of delicious.com bookmarked each page with word tags. From this dataset, we extracted the text of each web page and chose 20 common tags as class labels. These class labels are: reference, design, programming, internet, computer, web, java, writing, English, grammar, style, language, books, education, philosophy, politics, religion, science, history, and culture. We randomly selected 12234 pages and randomly divided them into 8251 training and 3983 test documents. We also applied Porter stemming and standard stopword removal.

Each text document is a bag within the multi-instance learning framework and consists of multiple sentences (instances). The goal is to predict document-level and sentence-level class labels on the test set using a model trained with only the document-level class labels of the training set. To evaluate such a model, we manually labeled 1468 randomly selected sentences from the test documents. Please see [1] for more information.
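The multi-instance assumption described above (a bag carries a class label if and only if at least one of its instances does) can be illustrated with a minimal Python sketch. The function name `bag_labels` is hypothetical and is not part of the dataset or of the MLTM code. The sketch assumes every sentence has a fully observed binary label vector, which in this dataset holds only for the manually labeled sentences.

```python
def bag_labels(instance_labels):
    """Derive a bag-level label vector from per-instance label vectors.

    instance_labels: list of equal-length lists of 0/1 values, one list
    per instance (sentence). Hypothetical helper for illustration only.
    """
    if not instance_labels:
        return []
    # A class is present in the bag iff at least one instance has it.
    return [int(any(column)) for column in zip(*instance_labels)]


# Three sentences, three classes (the real dataset has C = 20 classes).
sentences = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
]
document_labels = bag_labels(sentences)
```

In the example, the first two classes each appear in at least one sentence, so the document receives them; the third class appears in no sentence and is absent at the document level.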

Has Missing Values?

Variable Information

1) train-data.dat and test-data.dat: These files contain the bag-of-words representation of the training and test documents. Each line is of the form: <S_d> sentence_1 sentence_2 … sentence_{S_d} where S_d is the number of sentences in document d. Each sentence s is in the following format: <L_s> w_{1s} w_{2s} … w_{L_s s} where L_s is the number of words in sentence s, and w_{is} is an integer which indexes the i-th term in sentence s.

2) vocabs.txt: This file contains the list of words used for indexing the document representations in the data files. Each line contains: word, index.

3) train-label.dat and test-label.dat: Each file contains a D by C binary matrix, where D is the number of documents in that file and C=20 is the number of classes. The element b_{dc} is 1 if class c is present in document d and zero otherwise.

4) test-sentlabel.dat and labeled_test_sentences.dat:

test-sentlabel.dat: This file contains class labels for the sentences of the test documents. Each line d is of the form: <y_{11d} y_{12d} … y_{1Cd}><y_{21d} y_{22d} … y_{2Cd}>...<y_{S_d1d} y_{S_d2d} … y_{S_dCd}> where y_{scd} is the binary indicator of class c for sentence s of document d. y_{scd} is 1 if class c is present in sentence s and zero otherwise. Note that only 1468 randomly selected sentences are manually labeled. For the remaining, unlabeled sentences, we set y_{scd}=-1.

labeled_test_sentences.dat: This file contains the class labels for only the 1468 manually labeled sentences. Each line of this file is of the form: d s y_{s1d} y_{s2d} … y_{sCd} where d and s are the document and sentence indices, respectively.

5) labels.txt: This file contains the list of all class labels in this dataset. Each line is of the form: label, index.

Please see https://github.com/hsoleimani/MLTM for example Python code for reading these files.
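The line formats described above can be read with a short Python sketch. This is not the code from the MLTM repository. The function names `parse_document` and `parse_sentence_labels` are hypothetical, and the parsing follows only the format stated on this page. Both functions read the integers on a line and group them by the declared counts, so they should work whether or not the angle brackets appear literally in the files; that tolerance is an assumption that has not been checked against the actual files.

```python
import re

NUM_CLASSES = 20  # C in the description above


def parse_document(line):
    """Parse one line of train-data.dat or test-data.dat.

    Returns a list of sentences, each a list of integer term indices.
    Hypothetical helper; assumes the layout <S_d> <L_s> w ... <L_s> w ...
    """
    tokens = [int(t) for t in re.findall(r"\d+", line)]
    if not tokens:
        return []
    num_sentences, pos = tokens[0], 1
    sentences = []
    for _ in range(num_sentences):
        if pos >= len(tokens):
            raise ValueError("fewer sentences than the declared S_d")
        length = tokens[pos]
        words = tokens[pos + 1:pos + 1 + length]
        if len(words) != length:
            raise ValueError("sentence shorter than its declared L_s")
        sentences.append(words)
        pos += 1 + length
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the last sentence")
    return sentences


def parse_sentence_labels(line, num_classes=NUM_CLASSES):
    """Parse one line of test-sentlabel.dat.

    Returns one entry per sentence: a list of 0/1 labels, or None when
    the sentence is unlabeled (all values equal to -1).
    """
    values = [int(t) for t in re.findall(r"-?\d+", line)]
    if len(values) % num_classes != 0:
        raise ValueError("label count is not a multiple of num_classes")
    labels = []
    for start in range(0, len(values), num_classes):
        vector = values[start:start + num_classes]
        labels.append(None if all(v == -1 for v in vector) else vector)
    return labels
```

For instance, `parse_document("<2> <3> 5 7 9 <1> 4")` returns two sentences, `[5, 7, 9]` and `[4]`. Returning `None` for unlabeled sentences is a design choice of this sketch so that callers cannot mistake the -1 placeholder for a class indicator.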

Dataset Files

File	Size
Data/train-data.dat	5.2 MB
Data/test-sentlabel.dat	4.2 MB
Data/test-data.dat	2.5 MB
Data/train-label.dat	322.3 KB
Data/test-label.dat	155.6 KB

Creators

David Miller

Hossein Soleimani

Arkaitz Zubiaga

Víctor Fresno

DOI

10.24432/C5DK74

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.