Optimal Parameter Choice for Bloom Filter-based Privacy-preserving Record Linkage
Record linkage is an increasingly popular set of methods for gathering or enriching research data in most scientific disciplines. Because perfectly unique personal identifier numbers (PIDs) are not available in most countries, linkage has to rely on attributes such as names and dates of birth to discriminate between records and their corresponding real-life entities.

However, these identifiers are often legally required to be encrypted, which gave rise to the field of Privacy-preserving Record Linkage (PPRL). Recently, Bloom filters have gained much attention in PPRL research.

Their widespread use is hindered by the fact that choosing the right parameters for a private linkage operation currently requires in-depth expert knowledge of the data, because the quality of Bloom filter-based PPRL depends heavily on the choice of encryption parameters.

Since there is currently no literature on the optimal choice of these parameters, this thesis aims to develop a method that automates the parameter choice for best linkage quality, using model estimates based on simulations of the entire parameter space. After an overview of the state of the art in PPRL, the approach is described in detail. The resulting models are then tested on simulated and real-world data sets: a naive approach based on current recommendations is tested against the encryption parameters resulting from the model estimates, and the results are compared for each data set.

The optimal parameter choices are shown to consistently outperform current best-practice parameter settings, sometimes drastically. The thesis concludes with an outlook on open research questions and updated recommendations for Bloom filter (BF)-based Privacy-preserving Record Linkage.