A master's thesis from Aalborg University


Predicting HFE C282Y Homozygosity from Blood Biomarkers: A Cost-Efficient Machine Learning Approach

Authors


Term

4th term

Publication year

2026

Submitted on

Abstract

This study develops a framework to assess whether blood-based biomarkers can be used to pre-screen for HFE C282Y homozygosity (carrying two copies of the C282Y variant of the HFE gene). Because the UK Biobank (UKB) was temporarily inaccessible, the analyses used synthetic data with 63 features designed to mimic UKB distributions. Missing data were handled with multiple imputation by chained equations, a method that fills in missing values multiple times to produce more stable estimates. The modeling pipeline combined imputation, bagging to address class imbalance (where positive cases are rare), and predictive modeling. Logistic regression served as the baseline, and XGBoost (a tree-based machine learning method) was the primary model. Models were evaluated on separate validation and test sets. A cost-sensitive approach was used to choose classification thresholds under constraints on sensitivity (the ability to detect true positives) and on error costs, using both heuristic optimization and a mixed-integer linear programming formulation. Model performance was near random because the synthetic data preserve marginal distributions but do not capture real relationships between biomarkers and genotype.
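
To make the threshold-selection step concrete, the following is a minimal sketch of choosing a classification threshold under a sensitivity constraint while minimizing total error cost. It is not the thesis's heuristic or its mixed-integer linear programming formulation: it simply tries every observed score as a candidate threshold. The function name choose_threshold, its parameters, the cost weights, and the toy scores and labels are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def choose_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    min_sensitivity: float,
    cost_fn: float,
    cost_fp: float,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, sensitivity, total_cost) for the cheapest feasible threshold.

    A case is classified positive when its score is >= threshold.
    Returns None if `labels` contains no positive cases or if no
    candidate threshold reaches `min_sensitivity`.
    """
    positives = sum(labels)
    if positives == 0:
        return None  # sensitivity is undefined without positive cases

    best = None
    # Candidate thresholds: every distinct observed score, in ascending order.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = positives - tp
        sensitivity = tp / positives
        if sensitivity < min_sensitivity:
            continue  # violates the sensitivity constraint
        cost = cost_fn * fn + cost_fp * fp
        # Strict comparison: on a cost tie, the lower threshold is kept.
        if best is None or cost < best[2]:
            best = (t, sensitivity, cost)
    return best


# Toy data (hypothetical): predicted scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0, 0, 0]

# Assumed weights: a missed positive costs five times as much as a false alarm.
result = choose_threshold(scores, labels, min_sensitivity=0.6, cost_fn=5.0, cost_fp=1.0)
print(result)
```

Raising cost_fp relative to cost_fn moves the selected threshold upward, trading some sensitivity for fewer false positives, as long as the sensitivity floor is still met.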


[This abstract has been rewritten with the help of AI based on the project's original abstract]