Record Linkage Comparison Patterns
Donated on 3/9/2011
Element-wise comparison of records with personal data from a record linkage setting. The task is to decide from a comparison pattern whether the underlying records belong to one person.
Dataset Characteristics
Multivariate
Subject Area
Other
Associated Tasks
Classification
Feature Type
Real
# Instances
5749132
# Features
-
Dataset Information
Additional Information
The records represent individual data including first and family name, sex, date of birth and postal code, which were collected through iterative insertions over the course of several years. The comparison patterns in this data set are based on a sample of 100,000 records dating from 2005 to 2008. Data pairs were classified as 'match' or 'non-match' during an extensive manual review involving several documentarists. The resulting classification formed the basis for assessing the quality of the registry’s own record linkage procedure.

To limit the number of patterns, a blocking procedure was applied, which selects only record pairs that meet specific agreement conditions. The results of the following six blocking iterations were merged:

1. Phonetic equality of first name and family name, equality of date of birth.
2. Phonetic equality of first name, equality of day of birth.
3. Phonetic equality of first name, equality of month of birth.
4. Phonetic equality of first name, equality of year of birth.
5. Equality of complete date of birth.
6. Phonetic equality of family name, equality of sex.

This procedure resulted in 5,749,132 record pairs, of which 20,931 are matches. The data set is split into 10 blocks of (approximately) equal size and ratio of matches to non-matches.

The separate file frequencies.csv contains, for every predictive attribute, the average number of values in the underlying records. These values can, for example, be used as u-probabilities in weight-based record linkage following the framework of Fellegi and Sunter.
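The weight-based approach mentioned above can be illustrated with a minimal sketch. It assumes binary comparison values, and the m- and u-probabilities below are hypothetical placeholders chosen for illustration; they are not taken from frequencies.csv. The function name `fs_weight` is likewise an assumption. Missing comparisons are skipped, which is one possible convention rather than a prescribed one.

```python
import math

# Hypothetical probabilities for illustration only (not from frequencies.csv).
# m: probability that the attribute agrees given the pair is a true match.
# u: probability that the attribute agrees given the pair is a non-match.
M_PROB = {"cmp_sex": 0.95, "cmp_bd": 0.95, "cmp_bm": 0.95, "cmp_by": 0.95}
U_PROB = {"cmp_sex": 0.5, "cmp_bd": 1 / 31, "cmp_bm": 1 / 12, "cmp_by": 0.02}


def fs_weight(pattern, m_prob=M_PROB, u_prob=U_PROB):
    """Sum the log2 agreement/disagreement weights of a binary comparison pattern."""
    total = 0.0
    for attr, value in pattern.items():
        if value is None:
            # Assumption: a missing comparison contributes no evidence.
            continue
        m, u = m_prob[attr], u_prob[attr]
        if value == 1:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


all_agree = fs_weight({"cmp_sex": 1, "cmp_bd": 1, "cmp_bm": 1, "cmp_by": 1})
all_disagree = fs_weight({"cmp_sex": 0, "cmp_bd": 0, "cmp_bm": 0, "cmp_by": 0})
print(all_agree, all_disagree)
```

Pairs whose total weight exceeds an upper threshold would be declared matches and those below a lower threshold non-matches; the thresholds themselves must be chosen separately, for example against the manually reviewed is_match labels.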
Has Missing Values?
Yes
Variables Table
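Variable Name | Role | Type | Description | Units | Missing Values |
---|---|---|---|---|---|
id_1 | ID | | internal identifier of first record | | no |
id_2 | ID | | internal identifier of second record | | no |
cmp_fname_c1 | Feature | Real in [0,1] | agreement of first name, first component | | no |
cmp_fname_c2 | Feature | Real in [0,1] | agreement of first name, second component | | no |
cmp_lname_c1 | Feature | Real in [0,1] | agreement of family name, first component | | no |
cmp_lname_c2 | Feature | Real in [0,1] | agreement of family name, second component | | no |
cmp_sex | Feature | Binary (0/1) | agreement of sex | | no |
cmp_bd | Feature | Binary (0/1) | agreement of date of birth, day component | | no |
cmp_bm | Feature | Binary (0/1) | agreement of date of birth, month component | | no |
cmp_by | Feature | Binary (0/1) | agreement of date of birth, year component | | no |
cmp_plz | Feature | Binary (0/1) | agreement of postal code | | |
is_match | Target | Binary (TRUE/FALSE) | matching status | | |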
Additional Variable Information
1. id_1: internal identifier of first record
2. id_2: internal identifier of second record
3. cmp_fname_c1: agreement of first name, first component
4. cmp_fname_c2: agreement of first name, second component
5. cmp_lname_c1: agreement of family name, first component
6. cmp_lname_c2: agreement of family name, second component
7. cmp_sex: agreement of sex
8. cmp_bd: agreement of date of birth, day component
9. cmp_bm: agreement of date of birth, month component
10. cmp_by: agreement of date of birth, year component
11. cmp_plz: agreement of postal code
12. is_match: matching status (TRUE for matches, FALSE for non-matches)

The agreement of name components is measured as a real number in the interval [0,1], where 0 denotes maximal disagreement and 1 denotes equality of the underlying values. For the other comparisons, only the values 0 (not equal) and 1 (equal) are used. is_match is the outcome variable. id_1 and id_2 are not used for prediction but could be used to construct connected components from the found matches.
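Constructing connected components from matched pairs could look like the following minimal sketch, which uses a union-find structure over (id_1, id_2) pairs. The function name `connected_components` and the sample identifiers are hypothetical; in practice the pairs would be those rows for which is_match is TRUE (or which a classifier predicts as matches).

```python
def connected_components(pairs):
    """Group record identifiers that are linked, directly or transitively, by matches."""
    parent = {}

    def find(x):
        # Return the representative of x's component, compressing the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical matched pairs: records 1, 2 and 3 belong to one person, 7 and 8 to another.
matched_pairs = [(1, 2), (2, 3), (7, 8)]
components = connected_components(matched_pairs)
print(components)
```

Each resulting component is a set of records that are presumed to refer to the same person, even when not every pair inside it was compared directly.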
Dataset Files
File | Size |
---|---|
donation.zip | 53.8 MB |
documentation | 4.4 KB |
```shell
pip install ucimlrepo
```

```python
from ucimlrepo import fetch_ucirepo

# fetch dataset
record_linkage_comparison_patterns = fetch_ucirepo(id=210)

# data (as pandas dataframes)
X = record_linkage_comparison_patterns.data.features
y = record_linkage_comparison_patterns.data.targets

# metadata
print(record_linkage_comparison_patterns.metadata)

# variable information
print(record_linkage_comparison_patterns.variables)
```
Schmidtmann, I., Hammer, G., Sariyar, M., & Gerhold-Ay, A. (2009). Record Linkage Comparison Patterns [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C51K6B.
Creators
Irene Schmidtmann
Gael Hammer
Murat Sariyar
Aslihan Gerhold-Ay
DOI
10.24432/C51K6B
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.