Open Web Text Corpus

External

Linked on 12/2/2021

We started by extracting all Reddit post URLs from the Reddit submissions dataset. These links were deduplicated, filtered to exclude non-HTML content, and then shuffled randomly. The links were distributed to several machines in parallel for download, and the text of each web page was extracted using the newspaper Python package. Non-English pages were filtered out with Facebook's fastText language classifier. Near-duplicate documents were then identified using locality-sensitive hashing (LSH): each document was hashed into a set of 5-grams, and any document whose estimated similarity to an already-kept document exceeded 0.5 was removed. The remaining documents were tokenized, and documents with fewer than 128 tokens were removed. This left 38 GB of text (40 GB using SI units) across 8,013,769 documents.
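
As a rough illustration of the download, extraction, and language-filtering steps, the sketch below uses the newspaper and fasttext Python packages on a single machine (the actual corpus was built with downloads parallelized across several machines). The model file name lid.176.bin and the simple English-label check are assumptions, not details taken from the original pipeline.

# Minimal single-machine sketch of the extraction and language-filter steps.
# Assumption: a pretrained fastText language-ID model ("lid.176.bin") stands
# in for whatever classifier the original pipeline used.
import fasttext
from newspaper import Article

LID_MODEL = fasttext.load_model("lid.176.bin")

def extract_text(url):
    """Download a page and return its main article text, or None on failure."""
    try:
        article = Article(url)
        article.download()
        article.parse()
        return article.text
    except Exception:
        return None

def is_english(text):
    """Keep only pages that fastText identifies as English."""
    labels, _ = LID_MODEL.predict(text.replace("\n", " "))
    return labels[0] == "__label__en"

def harvest(urls):
    """Yield (url, text) pairs for English pages with extractable content."""
    for url in urls:
        text = extract_text(url)
        if text and is_english(text):
            yield url, text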
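
The near-duplicate removal and length-filtering steps could look roughly as follows, here using MinHash LSH from the datasketch library, word-level 5-gram shingles, and whitespace tokenization; the original shingling and tokenization details are not specified on this page, so treat those choices as assumptions.

# Sketch of near-duplicate removal (MinHash LSH over 5-gram shingles,
# similarity threshold 0.5) followed by the 128-token length filter.
# The datasketch library, word-level shingles, and whitespace tokenization
# are assumptions standing in for the original implementation.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of MinHash permutations (assumed)

def shingles(text, n=5):
    """Word-level n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(doc_shingles):
    m = MinHash(num_perm=NUM_PERM)
    for s in doc_shingles:
        m.update(s.encode("utf8"))
    return m

def dedup_and_filter(docs, min_tokens=128):
    """Drop documents whose estimated similarity to an already-kept document
    exceeds 0.5, then drop documents with fewer than min_tokens tokens."""
    lsh = MinHashLSH(threshold=0.5, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs:
        m = minhash(shingles(text))
        if lsh.query(m):  # near-duplicate of a document already kept
            continue
        lsh.insert(doc_id, m)
        kept.append((doc_id, text))
    # Length filter; whitespace tokenization is a stand-in for the tokenizer
    # actually used to produce the 128-token cutoff.
    return [(d, t) for d, t in kept if len(t.split()) >= min_tokens]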

Dataset Characteristics

Text

Subject Area

Other

Associated Tasks

Classification, Regression, Clustering

Feature Type

-

# Instances

8,013,769

# Features

-

Dataset Information

Has Missing Values?

No

Introductory Paper

Language Models are Unsupervised Multitask Learners

By Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, 2019

Published as an OpenAI technical report

Citations/Acknowledgements

If you use this dataset, please follow the acknowledgment policy on the original dataset website.
