Predicting HFE C282Y Homozygosity from Blood Biomarkers: A Cost-Efficient Machine Learning Approach
Authors
Eg, Aron Reinholdt ; Bredvig, Magnus
Term
4th term
Education
Publication year
2026
Submitted on
2026-05-21
Pages
52
Abstract
This study explores whether blood-based biomarkers can support pre-screening for HFE C282Y homozygosity (having two copies of the C282Y variant in the HFE gene) by building and evaluating a modeling framework. Because UK Biobank (UKB) data were temporarily unavailable, analyses used synthetic data designed to mimic UKB distributions across 63 variables, and missing values were handled with multiple imputation by chained equations (MICE), which fills in missing entries multiple times to reflect uncertainty. The pipeline combined imputation, bagging to address class imbalance, and predictive modeling, using logistic regression as a baseline and XGBoost, a tree-based machine-learning method, as the primary model. Models were evaluated on separate validation and test sets. A cost-sensitive framework selected classification thresholds under specified sensitivity (true-positive rate) and cost constraints, using heuristic optimization and a mixed-integer linear programming formulation. On the synthetic data, model performance was close to random guessing, likely because the dataset preserved marginal distributions but not the predictive relationships between biomarkers and genotype.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
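
The cost-sensitive threshold selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the project's heuristic or mixed-integer linear programming formulation; it simply enumerates candidate thresholds and keeps the cheapest one that meets a minimum sensitivity. The function name `select_threshold`, the two cost parameters, and the toy scores are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def select_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    min_sensitivity: float,
    cost_follow_up: float,    # assumed cost of a confirmatory test per flagged person
    cost_missed_case: float,  # assumed cost of each homozygote the model misses
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, sensitivity, total_cost), or None if no threshold is feasible.

    Exhaustive search over the observed scores: a person is flagged when
    score >= threshold. This is a stand-in for the optimization described
    in the abstract, not a reproduction of it.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")

    best = None
    for t in sorted(set(scores)):
        flagged = [s >= t for s in scores]
        true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        false_neg = positives - true_pos
        sensitivity = true_pos / positives
        if sensitivity < min_sensitivity:
            continue  # violates the sensitivity constraint
        total_cost = cost_follow_up * sum(flagged) + cost_missed_case * false_neg
        if best is None or total_cost < best[2]:
            best = (t, sensitivity, total_cost)
    return best


# Toy example: six people, three of them true homozygotes (label 1).
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]
result = select_threshold(scores, labels, min_sensitivity=0.6,
                          cost_follow_up=1, cost_missed_case=5)
print(result)  # (0.35, 1.0, 4)
```

In this toy case a missed case costs five times a follow-up test, so the cheapest feasible threshold is the one that still flags every true homozygote. With a model that performs close to random guessing, as reported for the synthetic data, any threshold meeting a high sensitivity requirement would flag most of the population.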
