Bag of Words
Donated on 3/11/2008
This data set contains five text collections in the form of bags-of-words.
Dataset Characteristics
Text
Subject Area
Other
Associated Tasks
Clustering
Feature Type
Integer
# Instances
8000000
# Features
100000
Dataset Information
Additional Information
For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words). After tokenization and removal of stopwords, the vocabulary of unique words was truncated by keeping only words that occurred more than ten times.

These data sets have no class labels, and for copyright reasons no document names (i.e. an identifier for each docID), filenames, or other document-level metadata are provided. They are well suited to clustering and topic modeling experiments.

For each text collection we provide docword.*.txt (the bag-of-words file in sparse format) and vocab.*.txt (the vocab file).

Collection | Original source | D | W | N |
---|---|---|---|---|
Enron Emails | www.cs.cmu.edu/~enron | 39861 | 28102 | 6,400,000 (approx) |
NIPS full papers | books.nips.cc | 1500 | 12419 | 1,900,000 (approx) |
KOS blog entries | dailykos.com | 3430 | 6906 | 467714 |
NYTimes news articles | ldc.upenn.edu | 300000 | 102660 | 100,000,000 (approx) |
PubMed abstracts | www.pubmed.gov | 8200000 | 141043 | 730,000,000 (approx) |
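The vocabulary truncation described above can be illustrated with a minimal sketch. This is not the creator's actual preprocessing: the tokenizer, the `STOPWORDS` set, and the names `tokenize` and `build_vocab` are all hypothetical, and the sketch assumes "occurred more than ten times" means total occurrences across the collection rather than document frequency.

```python
import re
from collections import Counter

# Hypothetical stopword list; the list used to build this dataset is not documented here.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase and split on runs of non-letters (an assumed, simplistic tokenizer)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def build_vocab(documents, min_count=10):
    """Return the sorted words whose total count across all documents exceeds min_count."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    # Strictly greater than min_count, matching "more than ten times".
    return sorted(w for w, c in counts.items() if c > min_count)


# Toy collection: "cat" and "sat" occur 11 times, "dog" and "ran" only 10 times.
docs = ["the cat sat"] * 11 + ["a dog ran"] * 10
vocab = build_vocab(docs)
print(vocab)
```

In the toy collection only the words that occur eleven times survive the cut; words at exactly ten occurrences are dropped.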
Has Missing Values?
No
Variable Information
The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:

    ---
    D
    W
    NNZ
    docID wordID count
    docID wordID count
    docID wordID count
    ...
    docID wordID count
    docID wordID count
    ---

The format of the vocab.*.txt file is: line n contains the word with wordID=n.
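As an illustration of the two file formats, here is a minimal parsing sketch using only the Python standard library. The function names `parse_docword` and `parse_vocab` are hypothetical, the sample data is made up, and the sketch assumes that line numbering in the vocab file starts at 1.

```python
from collections import defaultdict


def parse_docword(lines):
    """Parse docword lines: 3 header lines (D, W, NNZ), then 'docID wordID count' triples."""
    it = iter(lines)
    n_docs = int(next(it))
    n_words = int(next(it))
    nnz = int(next(it))
    bags = defaultdict(dict)  # docID -> {wordID: count}
    n_triples = 0
    for line in it:
        if not line.strip():
            continue  # tolerate blank lines
        doc_id, word_id, count = map(int, line.split())
        bags[doc_id][word_id] = count
        n_triples += 1
    if n_triples != nnz:
        raise ValueError(f"expected {nnz} triples, found {n_triples}")
    return n_docs, n_words, nnz, dict(bags)


def parse_vocab(lines):
    """Map wordID -> word, assuming line n (counting from 1) holds wordID=n."""
    return {n: line.strip() for n, line in enumerate(lines, start=1)}


# Made-up sample: 2 documents, 3 vocabulary words, 4 nonzero counts.
sample = ["2", "3", "4", "1 1 2", "1 3 1", "2 2 5", "2 3 1"]
vocab = parse_vocab(["apple", "banana", "cherry"])
D, W, NNZ, docs = parse_docword(sample)

# N, the total number of words, is the sum of all counts.
N = sum(sum(bag.values()) for bag in docs.values())
print(D, W, NNZ, N)
print({vocab[w]: c for w, c in docs[2].items()})
```

A nested dictionary keeps the sketch dependency-free; for the larger collections a sparse matrix type from a numerical library would likely be a better fit.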
Dataset Files
File | Size |
---|---|
docword.pubmed.txt.gz | 1.7 GB |
docword.nytimes.txt.gz | 223.4 MB |
docword.enron.txt.gz | 11.7 MB |
docword.nips.txt.gz | 2.2 MB |
vocab.pubmed.txt | 1.4 MB |
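
Because the docword files are gzip-compressed and the largest is 1.7 GB, it can be useful to read only the three header lines before deciding how to load a file. The sketch below is one way to do that with the standard library; `read_header` and `density` are hypothetical helper names, and the demo writes a tiny stand-in file (`docword.toy.txt.gz`, not part of the dataset) so that it runs without any download.

```python
import gzip
import os
import tempfile


def read_header(path):
    """Return (D, W, NNZ) from the first three lines of a gzip-compressed docword file."""
    with gzip.open(path, "rt", encoding="ascii") as f:
        n_docs = int(f.readline())
        n_words = int(f.readline())
        nnz = int(f.readline())
    return n_docs, n_words, nnz


def density(n_docs, n_words, nnz):
    """Fraction of the D x W document-word matrix that is nonzero."""
    return nnz / (n_docs * n_words)


# Demo on a tiny stand-in file written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "docword.toy.txt.gz")
with gzip.open(path, "wt", encoding="ascii") as f:
    f.write("2\n3\n4\n1 1 2\n1 3 1\n2 2 5\n2 3 1\n")

header = read_header(path)
print(header, density(*header))
```

Only the start of the compressed stream is decompressed here, so the header check stays cheap even for the largest file.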
pip install ucimlrepo

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    bag_of_words = fetch_ucirepo(id=164)

    # data (as pandas dataframes)
    X = bag_of_words.data.features
    y = bag_of_words.data.targets

    # metadata
    print(bag_of_words.metadata)

    # variable information
    print(bag_of_words.variables)

Newman, D. (2008). Bag of Words [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5ZG6P.
Creators
David Newman
DOI
10.24432/C5ZG6P
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.