Victorian Era Authorship Attribution
Donated on 5/30/2018
To create the largest authorship attribution dataset, we extracted works of 50 well-known authors. To keep the learning task non-exhaustive, the training set covers 45 authors, whereas the testing set covers all 50.
Dataset Characteristics
Text
Subject Area
Computer Science
Associated Tasks
Classification
Feature Type
-
# Instances
93600
# Features
1000
Dataset Information
Additional Information
To decrease bias and create a reliable authorship attribution dataset, the following criteria were used to filter authors in the GDELT database: authors writing in English, authors with enough books available (at least 5), and 19th-century authors. These criteria yielded 50 authors, whose books were queried through the BigQuery GDELT database.

The next task was cleaning the dataset, because the original raw text contains OCR reading errors. First, all books were scanned to obtain the overall number of unique words and the frequency of each word. While scanning the texts, the first 500 words and the last 500 words of each book were removed to take out specific features such as the name of the author, the name of the book, and other word-specific features that could make the classification task easier. After this step, we chose the top 10,000 words occurring across the text corpus of all 50 authors. Words outside the top 10,000 were removed, while the rest of the sentence structure was kept intact.

Each book was then split into text fragments of 1000 words each. The author and book identification numbers for each fragment were maintained separately in different arrays. Text segments with fewer than 1000 words were padded with zeros so that they could be kept in the dataset as well. 1000 words amount to approximately 2 pages of writing, which is long enough to extract a variety of features from the document.

Each instance in the training set consists of a text piece of 1000 words with an author id attached. In the testing set, only the text piece of 1000 words is given, and the task is authorship attribution. The training data covers 45 authors and the testing data covers 50. Text by authors unseen in training makes up 34% of the testing set.
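The cleaning and fragmentation steps described above can be sketched roughly as follows. This is a minimal illustration, not the creators' actual code: the function name `preprocess`, its parameters, whitespace tokenization, counting word frequencies after trimming, and the use of the string "0" as the padding value are all assumptions made for the example.

```python
from collections import Counter


def preprocess(books, vocab_size=10_000, trim=500, fragment_len=1000, pad="0"):
    """Hypothetical sketch of the described pipeline.

    books: list of (author_id, book_id, text) tuples.
    Returns three parallel lists: fragments, author ids, book ids.
    """
    # Drop the first and last `trim` words of every book
    # (assumption: words are separated by whitespace).
    trimmed = []
    for author_id, book_id, text in books:
        words = text.split()
        trimmed.append((author_id, book_id, words[trim:len(words) - trim]))

    # Keep only the `vocab_size` most frequent words of the whole corpus.
    counts = Counter(w for _, _, words in trimmed for w in words)
    vocab = {w for w, _ in counts.most_common(vocab_size)}

    fragments, author_ids, book_ids = [], [], []
    for author_id, book_id, words in trimmed:
        # Remove out-of-vocabulary words, preserving the order of the rest.
        kept = [w for w in words if w in vocab]
        # Split into fixed-length fragments; pad the last short one.
        for start in range(0, len(kept), fragment_len):
            piece = kept[start:start + fragment_len]
            piece = piece + [pad] * (fragment_len - len(piece))
            fragments.append(piece)
            author_ids.append(author_id)
            book_ids.append(book_id)
    return fragments, author_ids, book_ids


# Toy run with tiny sizes so the behaviour is easy to follow.
toy = [(1, 10, "x a b a b c a x")]
frags, authors, book_nums = preprocess(toy, vocab_size=2, trim=1, fragment_len=3)
print(frags)  # [['a', 'b', 'a'], ['b', 'a', '0']]
```

In the toy run, the outer "x" words are trimmed, "c" falls outside the two-word vocabulary, and the final fragment is padded to the fixed length, mirroring the zero-filling described for segments shorter than 1000 words.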
Has Missing Values?
No
Variable Information
Each instance is a 1000-word sequence taken from one of an author's books. In the training set, the author id is also provided.
Dataset Files
File | Size |
---|---|
dataset.zip | 156.2 MB |
Data Description.pdf | 90.1 KB |
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
victorian_era_authorship_attribution = fetch_ucirepo(id=454)

# data (as pandas dataframes)
X = victorian_era_authorship_attribution.data.features
y = victorian_era_authorship_attribution.data.targets

# metadata
print(victorian_era_authorship_attribution.metadata)

# variable information
print(victorian_era_authorship_attribution.variables)
Gungor, A. (2018). Victorian Era Authorship Attribution [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5SW4H.
Creators
Abdulmecit Gungor
DOI
10.24432/C5SW4H
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.