Open Web Text Corpus
Linked on 12/2/2021
We started by extracting all Reddit post URLs from the Reddit submissions dataset. These links were deduplicated, filtered to exclude non-HTML content, and then shuffled randomly. The links were then distributed to several machines in parallel for download, and all web pages were extracted using the newspaper Python package. Non-English web pages were filtered out using Facebook's fastText. Subsequently, near-duplicate documents were identified using locality-sensitive hashing (LSH). Documents were hashed into sets of 5-grams, and all documents with a similarity greater than 0.5 were removed. The remaining documents were tokenized, and documents with fewer than 128 tokens were removed. This left 38 GB of text data (40 GB using SI units) from 8,013,769 documents.
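The near-duplicate removal and length filter described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline that produced the corpus: it swaps LSH for an exact pairwise Jaccard comparison (which LSH approximates at scale), assumes word-level 5-grams and whitespace tokenization, and assumes the first document of each near-duplicate group is kept. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 5) -> Set[Shingle]:
    """Return the set of word n-grams in text (whitespace tokens, an assumption)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Documents shorter than n tokens get one shingle holding all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_and_filter(
    docs: List[str],
    threshold: float = 0.5,   # similarity cutoff stated in the description
    min_tokens: int = 128,    # minimum length stated in the description
    n: int = 5,
) -> List[str]:
    """Drop near-duplicates first, then drop documents that are too short."""
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc in docs:
        s = shingles(doc, n)
        # Assumption: keep the first document seen, drop later near-duplicates.
        if any(jaccard(s, other) > threshold for other in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(s)
    return [d for d in kept if len(d.split()) >= min_tokens]
```

The pairwise loop is quadratic in the number of documents, so it is only practical for small samples. For a corpus of this size, an LSH scheme such as MinHash would be the usual way to find candidate pairs without comparing every document to every other.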
Dataset Characteristics: Text
Subject Area: Other
Associated Tasks: Classification, Regression, Clustering
Feature Type: -
# Instances: 8013769
# Features: -
Dataset Information
Has Missing Values? No
Introductory Paper
By Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. 2019
Published in Conference
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
open_web_text_corpus = fetch_ucirepo(id=696)

# data (as pandas dataframes)
X = open_web_text_corpus.data.features
y = open_web_text_corpus.data.targets

# metadata
print(open_web_text_corpus.metadata)

# variable information
print(open_web_text_corpus.variables)
Open Web Text Corpus [Dataset]. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5KK7P.
Citations/Acknowledgements
If you use this dataset, please follow the acknowledgment policy on the original dataset website.