Open Web Text Corpus


Linked on 12/2/2021

We started by extracting all Reddit post URLs from the Reddit submissions dataset. These links were deduplicated, filtered to exclude non-HTML content, and then shuffled randomly. The links were then distributed to several machines for parallel download, and the text of each web page was extracted using the newspaper Python package. Non-English web pages were filtered out using Facebook FastText. Near-duplicate documents were then identified using locality-sensitive hashing (LSH): documents were hashed into sets of 5-grams, and all documents whose similarity exceeded a threshold of 0.5 were removed. The remaining documents were tokenized, and documents with fewer than 128 tokens were removed. This left 38GB of text data (40GB using SI units) from 8,013,769 documents.
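
The near-duplicate and length filters described above can be illustrated with a minimal Python sketch. It is not the original pipeline: it computes exact Jaccard similarity over 5-gram sets, which is the quantity LSH approximates at scale, rather than using LSH itself. The function names are hypothetical, and several details are assumptions the description leaves open: 5-grams are taken over whitespace-separated words, tokens are counted by whitespace splitting, and the first document of each near-duplicate group is kept.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 5) -> Set[Shingle]:
    """Return the set of word-level n-grams in `text` (assumption: word n-grams)."""
    words = text.lower().split()
    if len(words) < n:
        # Documents shorter than n words yield a single short shingle, or none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Exact Jaccard similarity |a & b| / |a | b| between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_corpus(docs: List[str],
                  threshold: float = 0.5,
                  min_tokens: int = 128) -> List[str]:
    """Drop near-duplicates, then drop documents with fewer than `min_tokens` tokens.

    Assumption: the first document of each near-duplicate group is kept.
    The pairwise comparison is quadratic, so this is only for small inputs.
    """
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc in docs:
        current = shingles(doc)
        if any(jaccard(current, seen) > threshold for seen in kept_shingles):
            continue  # near-duplicate of a document already kept
        kept.append(doc)
        kept_shingles.append(current)
    # Whitespace token count stands in for the unspecified tokenizer.
    return [doc for doc in kept if len(doc.split()) >= min_tokens]
```

Calling filter_corpus on a list of page texts returns the surviving documents in their original order. At the scale of millions of documents, the pairwise comparison would be replaced by an LSH index so that only candidate pairs are compared.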

Associated Tasks

Classification, Regression, Clustering

Introductory Paper

Language Models are Unsupervised Multitask Learners

By Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. 2019

If you use this dataset, please follow the acknowledgment policy on the original dataset website.

