A Comparison of Privacy Metrics for Synthetic Data Generation

Authors

Hansen, Astrid Melodi ; Stær, Frederik

Term

4th term

Education

Computer Science, Master

Publication year

2024

Submitted on

2024-06-17

Abstract

Data sharing has become a major factor in the development of robust new machine learning models, especially in the health sector, for example for disease prediction. However, sharing such data presents privacy risks for the individuals represented in it. Privacy laws such as the GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act) have therefore been introduced to protect these individuals. These laws, however, make it difficult to share data between institutions. To overcome this issue, anonymisation techniques have been suggested. Anonymisation techniques are essential in safeguarding sensitive data while still allowing its use for research, analysis, and other purposes. They aim to remove or obscure personally identifiable information in datasets, thus reducing the risk of leaking sensitive information while preserving the data’s utility. Several anonymisation methods exist, each with its own strengths, limitations, and suitability for different data types and use cases. Evaluating anonymisation techniques usually revolves around testing the utility and privacy of the anonymised data. However, the current literature pays little attention to testing privacy: some papers test only the utility of the anonymised dataset, and others test only a limited number of privacy attacks. Therefore, in this paper, we evaluate state-of-the-art privacy metrics, covering different privacy attacks, in order to establish which privacy metrics are necessary to thoroughly test anonymised datasets. We perform two sets of experiments. The first tests whether the privacy metrics work as expected, measured by whether the privacy level of a given anonymisation technique correlates with the score of a given privacy metric. The second investigates whether different privacy metrics capture different aspects of the anonymisation. This is measured by calculating the correlation between the scores of the individual privacy metrics and by clustering these scores. The experiments are conducted on two tabular datasets: MedOnc (8,630 rows) and Texas (25,000 rows). Through a metric selection process, the results show that 7 out of 21 metrics (1) work as expected to some degree and (2) capture different aspects of anonymisation. These metrics are therefore deemed sufficient for evaluating the privacy of anonymised tabular data.
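
The two experiments described in the abstract can be illustrated with a minimal sketch. The abstract does not state which correlation coefficient or clustering algorithm was used, so the sketch assumes Spearman rank correlation and substitutes a simple greedy threshold grouping for the clustering step. The metric names, privacy levels, scores, and the 0.9 threshold are all hypothetical and are not taken from the thesis.

```python
def ranks(values):
    """Return 1-based average ranks; tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def group_metrics(scores, threshold=0.9):
    """Greedy grouping (a stand-in for clustering, assumed here):
    a metric joins the first group whose first member it correlates
    with at |rho| >= threshold, otherwise it starts a new group."""
    groups = []
    for name in scores:
        for group in groups:
            if abs(spearman(scores[name], scores[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical data: five anonymisation settings ordered by privacy level,
# and made-up scores from three made-up privacy metrics.
privacy_level = [1, 2, 3, 4, 5]
scores = {
    "metric_a": [0.10, 0.20, 0.35, 0.50, 0.80],
    "metric_b": [0.12, 0.25, 0.30, 0.55, 0.70],
    "metric_c": [0.40, 0.10, 0.50, 0.20, 0.30],
}

# Experiment 1: does each metric track the privacy level?
tracks_privacy = {name: spearman(privacy_level, s) for name, s in scores.items()}

# Experiment 2: which metrics behave alike and are therefore redundant?
metric_groups = group_metrics(scores)
```

Under these assumptions, a metric with a correlation near 1 in `tracks_privacy` behaves as expected, and metrics that land in the same group in `metric_groups` carry largely the same information, so one representative per group would suffice.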

Keywords

Health data ; Machine learning ; Privacy metrics ; Anonymisation ; Anonymisation techniques ; Synthetic data generation ; Differential privacy ; Data sharing ; Privacy attacks

A master's thesis from Aalborg University
