Bag of Words

Donated on 3/11/2008

This data set contains five text collections in the form of bags-of-words.

Dataset Characteristics

Text

Subject Area

Other

Associated Tasks

Clustering

Feature Type

Integer

# Instances

8000000

# Features

100000

Dataset Information

Additional Information

For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words). After tokenization and removal of stopwords, the vocabulary of unique words was truncated by keeping only words that occurred more than ten times. Individual document names (i.e. an identifier for each docID) are not provided for copyright reasons. These data sets have no class labels and, also for copyright reasons, no filenames or other document-level metadata. These data sets are ideal for clustering and topic modeling experiments.

For each text collection we provide docword.*.txt (the bag of words file in sparse format) and vocab.*.txt (the vocab file).

Enron Emails: orig source: www.cs.cmu.edu/~enron, D=39861, W=28102, N=6,400,000 (approx)
NIPS full papers: orig source: books.nips.cc, D=1500, W=12419, N=1,900,000 (approx)
KOS blog entries: orig source: dailykos.com, D=3430, W=6906, N=467714
NYTimes news articles: orig source: ldc.upenn.edu, D=300000, W=102660, N=100,000,000 (approx)
PubMed abstracts: orig source: www.pubmed.gov, D=8200000, W=141043, N=730,000,000 (approx)
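
The vocabulary truncation described above (drop stopwords, then keep only words that occur more than ten times) can be illustrated with a minimal sketch. The tokenizer (lowercasing and whitespace splitting), the `STOPWORDS` set, and the function name `build_vocab` are assumptions made for illustration; the tokenizer and stopword list actually used to build these collections are not given here.

```python
from collections import Counter

# Hypothetical stopword list; the list used for these collections is not specified.
STOPWORDS = {"the", "a", "an", "of", "and", "to"}


def build_vocab(docs, min_count=10):
    """Return the sorted words that occur MORE than min_count times overall."""
    counts = Counter()
    for text in docs:
        # Assumed tokenizer: lowercase, then split on whitespace.
        for token in text.lower().split():
            if token not in STOPWORDS:
                counts[token] += 1
    # Strict inequality: a word seen exactly min_count times is dropped.
    return sorted(word for word, c in counts.items() if c > min_count)


# "cat" and "sat" occur 11 times, "dog" exactly 10 times, "the" is a stopword.
docs = ["the cat sat"] * 11 + ["dog"] * 10
vocab = build_vocab(docs)
```

Here `vocab` holds only `cat` and `sat`, because `dog` occurs exactly ten times and so does not pass the "more than ten" threshold.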

Has Missing Values?

No

Variable Information

The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---

The format of the vocab.*.txt file is that line n contains the word with wordID=n.
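
As a rough sketch of how the two formats fit together, the following parses the three header lines and the NNZ triples, then maps each wordID back to its word. The function names `parse_docword` and `load_vocab` are hypothetical, the sample data is made up, and the code assumes that wordIDs and vocab line numbers both start at 1.

```python
def parse_docword(lines):
    """Parse docword lines into (D, W, NNZ, triples)."""
    it = iter(lines)
    # The three header lines: D, W, NNZ.
    d = int(next(it))
    w = int(next(it))
    nnz = int(next(it))
    triples = []
    for line in it:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        doc_id, word_id, count = map(int, parts)
        triples.append((doc_id, word_id, count))
    if len(triples) != nnz:
        raise ValueError(f"expected {nnz} triples, found {len(triples)}")
    return d, w, nnz, triples


def load_vocab(lines):
    """Map wordID n to the word on line n (assumed to be 1-based)."""
    return {n: line.strip() for n, line in enumerate(lines, start=1)}


# Made-up sample: 2 documents, 3 vocabulary words, 3 nonzero counts.
sample = ["2", "3", "3", "1 1 2", "1 3 1", "2 2 4"]
D, W, NNZ, triples = parse_docword(sample)
vocab = load_vocab(["apple", "banana", "cherry"])

# N, the total number of words, is the sum of all counts.
N = sum(count for _, _, count in triples)
words_in_doc1 = [vocab[w] for d, w, _ in triples if d == 1]
```

Holding every triple in a list is only practical for the smaller collections; for the larger files a streaming reader is a better fit.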

Dataset Files

File                     Size
docword.pubmed.txt.gz    1.7 GB
docword.nytimes.txt.gz   223.4 MB
docword.enron.txt.gz     11.7 MB
docword.nips.txt.gz      2.2 MB
vocab.pubmed.txt         1.4 MB
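
Because the docword files are distributed gzip-compressed and the largest is 1.7 GB, it may be preferable to stream them rather than decompress them fully. The sketch below reads only the header, or iterates over the triples lazily. The function names are hypothetical, and the code assumes the files are ASCII text in the docword format with three header lines followed by triples.

```python
import gzip
from itertools import islice


def read_header(path):
    """Return (D, W, NNZ) from the three header lines of a gzipped docword file."""
    with gzip.open(path, "rt", encoding="ascii") as f:
        d, w, nnz = (int(line) for line in islice(f, 3))
    return d, w, nnz


def iter_triples(path):
    """Yield (docID, wordID, count) one at a time, skipping the header."""
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in islice(f, 3, None):
            parts = line.split()
            if parts:
                doc_id, word_id, count = map(int, parts)
                yield doc_id, word_id, count


def total_words(path):
    """Compute N, the total number of words, without holding triples in memory."""
    return sum(count for _, _, count in iter_triples(path))
```

For example, `read_header("docword.kos.txt.gz")` would return the D, W, and NNZ values of that file, assuming it has been downloaded to the working directory under that name.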


Papers Citing this Dataset

Compressed Vector Set: A Fast and Space-Efficient Data Mining Framework

By Masafumi Oyamada, Jianquan Liu, Shinji Ito, Kazuyo Narita, Takuya Araki, Hiroyuki Kitagawa. 2018

Published in JIP.

Agreeing to disagree: active learning with noisy labels without crowdsourcing

By Mohamed-Rafik Bouguelia, Slawomir Nowaczyk, K. Santosh, Antanas Verikas. 2018

Published in Int. J. Machine Learning & Cybernetics.

Ontology Based Document Clustering Using MapReduce

By Abdelrahman Elsayed, Hoda Mokhtar, Osama Ismail. 2015

Published in The International Journal of Database Management Systems (IJDMS), April 2015, Volume 7, Number 2.

Netgram: Visualizing Communities in Evolving Networks

By Raghvendra Mall, Rocco Langone, Johan Suykens, Renaud Lambiotte. 2015

Published in PloS one.


Creators

David Newman
