Author(s)
Term
4th term
Education
Publication year
2024
Submitted on
2024-06-17
Pages
56 pages
Abstract
Data sharing has become a major factor in the development of robust new machine learning models, especially in the health sector, e.g. for disease prediction. However, sharing such data poses a privacy risk for the individuals represented in it. Privacy laws such as the GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act) have therefore been introduced to protect these individuals, but in the context of data sharing they also make it difficult to exchange data between institutions. To overcome this issue, anonymisation techniques have been suggested. Anonymisation techniques are essential for safeguarding sensitive data while still allowing its use for research, analysis, and other purposes. They aim to remove or obscure personally identifiable information from datasets, reducing the risk of leaking sensitive information while preserving the data’s utility. Several anonymisation methods exist, each with its own strengths, limitations, and suitability for different data types and use cases. Evaluating anonymisation techniques usually revolves around testing the utility and privacy of the anonymised data. However, the current literature pays little attention to testing privacy: some papers test only the utility of the anonymised dataset, while others test only a limited number of privacy attacks. In this paper, we therefore evaluate state-of-the-art privacy metrics covering different privacy attacks in order to establish which privacy metrics are necessary to thoroughly test anonymised datasets. We perform two sets of experiments. The first tests whether the privacy metrics work as expected, measured as whether the privacy level of a given anonymisation technique correlates with the score of a given privacy metric. The second investigates whether different privacy metrics capture different aspects of the anonymisation, measured by calculating the correlation between the scores of the individual privacy metrics and by clustering these scores. The experiments are conducted on two tabular datasets: MedOnc (8,630 rows) and Texas (25,000 rows). Through a metric selection process, the results show that 7 out of 21 metrics (1) work as expected to some degree and (2) capture different aspects of anonymisation. These metrics are therefore deemed sufficient for evaluating the privacy of anonymised data in the context of tabular data.
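The second experiment's analysis can be illustrated with a minimal sketch in Python; this is not the authors' code, and the metric names, random scores, and the 0.5 distance threshold below are illustrative assumptions. Given privacy-metric scores for several anonymised variants of a dataset, pairwise correlations between the metrics are computed, and metrics that correlate strongly are grouped together, suggesting they measure overlapping aspects of privacy.

import numpy as np
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical scores: rows = anonymised dataset variants, columns = privacy metrics.
rng = np.random.default_rng(0)
metric_names = ["singling_out", "linkability", "inference", "dcr"]  # illustrative names
scores = rng.random((10, len(metric_names)))

# Pairwise Spearman correlation between the metrics' score vectors.
corr, _ = spearmanr(scores)

# Turn correlation into a distance and cluster hierarchically: metrics that
# land in the same cluster behave similarly across the anonymised variants.
dist = 1.0 - np.abs(corr)
condensed = dist[np.triu_indices(len(metric_names), k=1)]
linked = linkage(condensed, method="average")
labels = fcluster(linked, t=0.5, criterion="distance")
for name, lab in zip(metric_names, labels):
    print(f"{name}: cluster {lab}")

Metrics falling into distinct clusters would support the thesis's conclusion that they capture different aspects of anonymisation and are all worth retaining in an evaluation suite.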