Bag of Words

Donated on 3/11/2008

This data set contains five text collections in the form of bags-of-words.

Dataset Characteristics

Text

Subject Area

Other

Associated Tasks

Clustering

Feature Type

Integer

# Instances

8000000

# Features

100000

Dataset Information

Additional Information

For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words). After tokenization and removal of stopwords, the vocabulary of unique words was truncated by keeping only words that occurred more than ten times. These data sets have no class labels, and for copyright reasons no individual document names (i.e. an identifier for each docID), filenames, or other document-level metadata are provided. These data sets are ideal for clustering and topic modeling experiments.

For each text collection we provide docword.*.txt (the bag-of-words file in sparse format) and vocab.*.txt (the vocab file).

Enron Emails: orig source: www.cs.cmu.edu/~enron, D=39861, W=28102, N=6,400,000 (approx)
NIPS full papers: orig source: books.nips.cc, D=1500, W=12419, N=1,900,000 (approx)
KOS blog entries: orig source: dailykos.com, D=3430, W=6906, N=467714
NYTimes news articles: orig source: ldc.upenn.edu, D=300000, W=102660, N=100,000,000 (approx)
PubMed abstracts: orig source: www.pubmed.gov, D=8200000, W=141043, N=730,000,000 (approx)
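The vocabulary truncation described above can be illustrated with a minimal sketch. The tokenizer (lowercasing and splitting on whitespace), the stopword set, and the function name build_vocab are assumptions made for illustration; the source does not specify them. The sketch also assumes that "occurred more than ten times" refers to total occurrences across the whole collection.

```python
from collections import Counter


def build_vocab(docs, stopwords, min_count=10):
    """Return the sorted vocabulary of words occurring more than min_count times.

    Assumptions (not from the dataset description): tokens are produced by
    lowercasing and splitting on whitespace, and occurrences are counted
    across the whole collection rather than per document.
    """
    counts = Counter(
        token
        for doc in docs
        for token in doc.lower().split()
        if token not in stopwords
    )
    # Strictly "more than" min_count, matching "more than ten times".
    return sorted(word for word, count in counts.items() if count > min_count)
```

A word that occurs exactly ten times is dropped under this reading, because the threshold is strict.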

Has Missing Values?

Variable Information

The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---
The format of the vocab.*.txt file is: line <n> contains wordID=n.
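As a rough illustration of the two file formats, the following sketch parses them from any iterable of text lines. The function names parse_docword and load_vocab are hypothetical, and the sketch assumes that vocab line numbering starts at 1 and that the triples are whitespace-separated integers; neither detail is stated explicitly in the description.

```python
from collections import defaultdict


def parse_docword(lines):
    """Parse docword lines into (D, W, NNZ, {docID: {wordID: count}}).

    `lines` is any iterable of text lines, such as an open file handle.
    The first three lines are the header values D, W, and NNZ.
    """
    it = iter(lines)
    n_docs = int(next(it))
    n_words = int(next(it))
    nnz = int(next(it))
    bags = defaultdict(dict)
    for line in it:
        if not line.strip():
            continue  # tolerate trailing blank lines
        doc_id, word_id, count = map(int, line.split())
        bags[doc_id][word_id] = count
    return n_docs, n_words, nnz, dict(bags)


def load_vocab(lines):
    """Map wordID to word, assuming line n (counting from 1) holds wordID n."""
    return {n: line.strip() for n, line in enumerate(lines, start=1)}


# Usage with a locally downloaded file (path is an assumption):
# import gzip
# with gzip.open("docword.nips.txt.gz", "rt") as fh:
#     n_docs, n_words, nnz, bags = parse_docword(fh)
```

A nested dictionary keeps the sketch dependency-free. For the larger collections a sparse matrix type would likely be a better fit, since holding every triple in Python dictionaries is memory-hungry.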

Dataset Files

File	Size
docword.pubmed.txt.gz	1.7 GB
docword.nytimes.txt.gz	223.4 MB
docword.enron.txt.gz	11.7 MB
docword.nips.txt.gz	2.2 MB
vocab.pubmed.txt	1.4 MB


Papers Citing this Dataset

Sparse Matrix to Matrix Multiplication: A Representation and Architecture for Acceleration (long version)

By Pareesa Golnari, Sharad Malik. 2019

Published in ArXiv.

Compressed Vector Set: A Fast and Space-Efficient Data Mining Framework

By Masafumi Oyamada, Jianquan Liu, Shinji Ito, Kazuyo Narita, Takuya Araki, Hiroyuki Kitagawa. 2018

Published in JIP.

Agreeing to disagree: active learning with noisy labels without crowdsourcing

By Mohamed-Rafik Bouguelia, Slawomir Nowaczyk, K. Santosh, Antanas Verikas. 2018

Published in Int. J. Machine Learning & Cybernetics.

Ontology Based Document Clustering Using MapReduce

By Abdelrahman Elsayed, Hoda Mokhtar, Osama Ismail. 2015

Published in The International Journal of Database Management Systems (IJDMS), April 2015, Volume 7, Number 2.

Netgram: Visualizing Communities in Evolving Networks

By Raghvendra Mall, Rocco Langone, Johan Suykens, Renaud Lambiotte. 2015

Published in PloS one.



Creators

David Newman

DOI

10.24432/C5ZG6P

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.