Bayesian Estimation of Attribute Disclosure Risk of PrivBayes
Authors
Trudslev, Frederik Marinus ; Bachmann, Silas Oliver Torup
Term
4th term
Education
Publication year
2024
Submitted on
2024-06-13
Pages
42
Abstract
Synthetic data can help share data while protecting privacy. However, ensuring that no person can be identified or have hidden information inferred remains difficult. Differential privacy (DP) is a common framework: it limits how much the overall distribution can change when one person’s record is added or removed. In settings where all attributes must stay secret, DP can be a concern because attackers may attempt attribute inference—guessing a missing value about a specific person. To assess this threat, privacy metrics should reflect what an attacker might already know. Many metrics focus only on what is in the released synthetic data, even though an attacker could also know how the data were generated or have outside information about people. Bayesian statistics naturally represent and update such degrees of knowledge. Hornby & Hu proposed a Bayesian method to estimate the risk of attribute inference that incorporates auxiliary information and knowledge of the synthesis method. In this thesis, we connect that risk model with PrivBayes, a differentially private, Bayesian synthetic data generator. We investigate how well the Hornby & Hu method estimates the risk of disclosing continuous (numerical) attributes in PrivBayes‑generated datasets under two scenarios using different datasets: (1) varying the DP parameter ε (which controls the privacy–utility trade‑off), and (2) injecting outliers (extreme values) into the real data. Even when we allowed the attacker extra knowledge, our experiments showed low risk of disclosing continuous attributes across all ε values. Moreover, the risk did not vary directly with the amount of DP noise added by PrivBayes. These findings suggest that Bayesian modeling is useful because it lets us tune assumptions about an attacker’s knowledge and produce more grounded privacy risk estimates for protecting sensitive attributes.
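The abstract notes that ε controls the privacy–utility trade-off: PrivBayes satisfies differential privacy by adding Laplace-distributed noise whose scale grows as ε shrinks. The following is a minimal, illustrative sketch of that relationship using the standard Laplace mechanism on a single count query; it is not PrivBayes itself, and the function names are our own.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a zero-mean Laplace distribution via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Laplace mechanism for an epsilon-DP count release.

    The noise scale is sensitivity / epsilon, so a smaller epsilon
    (stronger privacy) yields larger expected noise and lower utility.
    """
    return true_count + laplace_noise(sensitivity / epsilon)
```

For example, releasing the same count with ε = 0.1 produces noise roughly 100 times larger in magnitude, on average, than with ε = 10, which is the trade-off the first experimental scenario varies.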
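The abstract also states that Bayesian statistics can represent and update an attacker's degrees of knowledge. A toy sketch of that idea, with entirely hypothetical numbers (not the Hornby & Hu estimator), is a discrete posterior update over a hidden attribute: the attacker's prior belief is combined with the likelihood of the observed synthetic release under each candidate value.

```python
def posterior(prior: dict, likelihood: dict) -> dict:
    """Bayes' theorem over a discrete attribute: P(v | obs) ∝ P(obs | v) * P(v)."""
    unnorm = {v: prior[v] * likelihood[v] for v in prior}
    z = sum(unnorm.values())  # normalizing constant
    return {v: p / z for v, p in unnorm.items()}

# Hypothetical attacker prior over a target's hidden income bracket.
prior = {"low": 0.5, "mid": 0.3, "high": 0.2}
# Hypothetical likelihood of the observed synthetic data under each value.
likelihood = {"low": 0.1, "mid": 0.4, "high": 0.5}
post = posterior(prior, likelihood)
```

Here the observation shifts belief from "low" toward "mid"; an attribute-disclosure risk metric in this spirit asks how concentrated such a posterior becomes on the true value.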
[This abstract has been rewritten with the help of AI based on the project's original abstract]
Keywords
EHR ; Synthetic data ; SDG ; Differential Privacy ; DP ; Bayesian Statistics ; Bayes Theorem
