University College Dublin

Record linkage approaches for matching databases with nested records

Author(s)
Pacheco Menezes, Thais  
URI
http://hdl.handle.net/10197/29386
Date Issued
2025
Date Available
2025-10-24T09:01:11Z
Abstract
This thesis explores methodologies for matching records across databases when unique identifiers are unavailable, which can be particularly useful in linking census or medical record data. This area of study, known as record linkage, has expanded rapidly with the digitalization of historical records and large-scale surveys. The primary objective of record linkage is to match information from various datasets that are believed to refer to the same individual. While many methodologies focus solely on individual-level matching, they often overlook group information, such as household structure, that can significantly enhance matching accuracy. In many real-life situations, such as matching records from a census, household structure plays a key role in identifying true matches. Individuals with similar names and ages can often be distinguished by examining their household relationships. This thesis aims to incorporate such group-level information to improve individual matches.

Chapter 2 introduces a multi-step record linkage procedure that incorporates household-level data to estimate matches between databases. By using the Hausdorff distance to measure household discrepancies, logistic regression models, and linear programming, the method first identifies household matches, followed by individual matches within these households. Applied to data from the Italian Survey of Household Income and Wealth, the results show a significant improvement in individual match quality compared to approaches that disregard household information.

Chapter 3 presents a novel unsupervised approach that jointly estimates household and individual match statuses. The model classifies individuals into three categories: matching individuals from matching households, non-matching individuals from matching households, and non-matching individuals from non-matching households. Using a Bernoulli model and a Classification EM algorithm, the approach achieves an F1 score of 80% on the Italian survey data.

Finally, Chapter 4 extends this approach to accommodate matching fields of mixed types. For string variables, the computed distances between individual responses are discretized and modelled with a multinomial distribution, while the differences between numerical variables are captured using a Gaussian distribution. Applied to the 1901 and 1911 Irish census data, this method shows improved precision and recall when household information is integrated. The model’s unsupervised design underscores its broad applicability across datasets, eliminating the need for prior knowledge of true matches. This thesis demonstrates the significant impact of leveraging household data to enhance record linkage performance across diverse contexts.
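
As an illustration of the household comparison mentioned for Chapter 2, the following is a minimal sketch of a Hausdorff distance between two households, each treated as a set of individual records. It is not the thesis's implementation: the record layout (first name, sex, birth year), the field-disagreement count used as the individual-level distance, and all function names are assumptions made for this example only.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical record layout for illustration: (first name, sex, birth year).
Person = Tuple[str, str, int]


def person_distance(a: Person, b: Person) -> float:
    """Assumed individual-level distance: the number of fields that disagree."""
    return float(sum(x != y for x, y in zip(a, b)))


def directed_hausdorff(
    src: Sequence[Person],
    dst: Sequence[Person],
    dist: Callable[[Person, Person], float],
) -> float:
    """Largest distance from any member of src to its nearest member of dst."""
    return max(min(dist(a, b) for b in dst) for a in src)


def hausdorff(
    h1: Sequence[Person],
    h2: Sequence[Person],
    dist: Callable[[Person, Person], float] = person_distance,
) -> float:
    """Symmetric Hausdorff distance between two non-empty households."""
    if not h1 or not h2:
        raise ValueError("households must be non-empty")
    return max(directed_hausdorff(h1, h2, dist), directed_hausdorff(h2, h1, dist))


# Made-up households, not taken from any dataset named in the abstract.
household_a = [("maria", "F", 1950), ("luca", "M", 1948)]
household_b = [("maria", "F", 1950), ("luca", "M", 1948), ("anna", "F", 1975)]
household_c = [("paolo", "M", 1960)]

d_same = hausdorff(household_a, household_a)    # identical households
d_grown = hausdorff(household_a, household_b)   # one extra member
d_other = hausdorff(household_a, household_c)   # unrelated household
```

Under these assumptions, identical households are at distance 0, and a household that gains one member stays closer to its earlier self than to an unrelated household. A pairwise matrix of such distances is the kind of input that a later household-matching step could work from.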
Type of Material
Doctoral Thesis
Qualification Name
Doctor of Philosophy (Ph.D.)
Publisher
University College Dublin. School of Mathematics and Statistics
Copyright (Published Version)
2025 the Author
Subjects

Household information...
Linear programming
Matching databases
Record linkage

Language
English
Status of Item
Peer reviewed
This item is made available under a Creative Commons License
https://creativecommons.org/licenses/by-nc-nd/3.0/ie/
File(s)
Name
Theis_ThaisMenezes_final.pdf
Size
6.38 MB
Format
Adobe PDF
Checksum (MD5)
414f03c761c02045e85a7b35a1790405

Owning collection
Mathematics and Statistics Theses

Item descriptive metadata is released under a CC-0 (public domain) license: https://creativecommons.org/public-domain/cc0/.
All other content is subject to copyright.

For all queries please contact research.repository@ucd.ie.
