Russian Corpus of Biographical Texts
Donated on 6/2/2020
Sentence classification (Russian). The corpus contains Wikipedia texts split into sentences. Each sentence has a topic label.
Dataset Characteristics
Text
Subject Area
Other
Associated Tasks
Classification
Feature Type
-
# Instances
200
# Features
2
Dataset Information
Additional Information
The corpus was created for the task of automatically finding fragments that contain biographical information in natural-language text. It includes 200 Russian biographical articles (Wikipedia, 2018).

Text pre-processing and selection included the following steps:
- the initial collection of texts was carried out automatically using open Python libraries;
- short texts containing only the years of a person’s life and a list of their places of work were deleted;
- all sections except the 'Biography' section were deleted, because biographical articles on Wikipedia contain lists of awards, scientific works, and other sections that are inconvenient to annotate.

The corpus includes biographies of individuals whose main activity is related to one of the following areas:
- military and law enforcement officers;
- figures of culture and art;
- figures of science, technology and education;
- politicians and public figures;
- entrepreneurs and managers;
- religious figures.
Has Missing Values?
No
Variable Information
The corpus is a text collection divided into sentences. Each sentence is assigned one or two thematic classes:
- non-biographical fact (none);
- personal events (personal_events);
- professional events (professional_events);
- birth, death, nationality, information about the parental family (parenting);
- affiliation, education, family, place of residence (residence);
- occupation, position (occupation);
- other biographical facts (other).

The corpus of biographical texts consists of the following elements:
- texts in .xml format (each sentence has the attributes 'text' and 'type' (thematic class) and, if available, 'additional_type' (additional thematic class));
- a .csv file describing the corpus, which contains information about the texts (name of the person, years of life, area of main activity).
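As an illustration of the sentence structure described above, here is a minimal sketch of reading sentences and their one or two thematic classes from a parsed XML tree. Only the names 'text', 'type' and 'additional_type' come from the description; the element names (`document`, `sentence`), the sample content, and the assumption that these fields are stored as XML attributes are hypothetical, so the reader simply picks up any element that carries a 'text' attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample. The element names and sentences are made up;
# only 'text', 'type' and 'additional_type' come from the description.
SAMPLE_XML = """<document>
  <sentence text="First example sentence." type="occupation"/>
  <sentence text="Second example sentence." type="residence" additional_type="personal_events"/>
  <sentence text="Third example sentence." type="none"/>
</document>"""


def read_sentences(root):
    """Return (text, labels) pairs for every element that has a 'text' attribute."""
    pairs = []
    for elem in root.iter():
        text = elem.get("text")
        if text is None:
            continue  # not a sentence element under our assumption
        # 'type' is the main class; 'additional_type' is present only sometimes.
        labels = [elem.get(name) for name in ("type", "additional_type")]
        pairs.append((text, [label for label in labels if label]))
    return pairs


sentences = read_sentences(ET.fromstring(SAMPLE_XML))

# Count how often each thematic class occurs (main and additional together).
label_counts = Counter(label for _, labels in sentences for label in labels)
```

For a file on disk, the same function could be called with `ET.parse(path).getroot()`. If the real files store 'text' and 'type' as child elements rather than attributes, the lookups would need to be adapted.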
Dataset Files
File | Size |
---|---|
corpus/Сурков, Владислав Юрьевич.xml | 27 KB |
corpus/Адамс, Артур Александрович.xml | 16.3 KB |
corpus/Абденанова, Алиме.xml | 15.2 KB |
corpus/Чашник, Илья Григорьевич.xml | 14.5 KB |
corpus/Кутиков, Александр Викторович.xml | 14 KB |
Showing 5 of 202 files.
Install the package:

    pip install ucimlrepo

Import the dataset in Python:

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    russian_corpus_of_biographical_texts = fetch_ucirepo(id=576)

    # data (as pandas dataframes)
    X = russian_corpus_of_biographical_texts.data.features
    y = russian_corpus_of_biographical_texts.data.targets

    # metadata
    print(russian_corpus_of_biographical_texts.metadata)

    # variable information
    print(russian_corpus_of_biographical_texts.variables)

Citation:
Russian Corpus of Biographical Texts [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5C60C.
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the dataset for any purpose, provided that appropriate credit is given.