Regression models and feature selection for high-dimensional genomics data

Student thesis: Master thesis (including HD thesis)

  • Regitze Kuhr Skals
4. term, Mathematics, Master (Master Programme)
DNA methylation is a process that occurs in connection with gene expression. It has been shown to be a promising predictor of age. This relation is of interest in forensic science: if the age of a suspect could be predicted from DNA, a group of suspects could be narrowed down, or the prediction could give the police a lead when they have no other.
In this thesis, regression models suitable for high-dimensional genomics data on DNA methylation have been studied. The purpose was to find a few good predictors of age among hundreds of thousands, and to determine how consistent those predictors are.
The methods studied for this purpose were Ridge regression, Elastic net and Lasso. Elastic net and Lasso were especially relevant, as they perform variable selection. The consistency of the predictors was determined for Lasso and Elastic net by Stability selection. Moreover, Partial least squares was applied to the data.
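For orientation, the three penalised methods can be written as one optimisation problem. The notation below is an assumption for illustration (a common parametrisation with $n$ samples, response $y_i$, predictor vector $x_i$, penalty strength $\lambda$ and mixing parameter $\alpha$); the thesis itself may use a different scaling.

```latex
\hat{\beta}(\lambda,\alpha)
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda\Bigl(\alpha\,\lVert\beta\rVert_1
    + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr),
\qquad \lambda \ge 0,\; 0 \le \alpha \le 1 .
```

In this form $\alpha = 1$ gives Lasso, $\alpha = 0$ gives Ridge regression, and values in between give Elastic net. The $\ell_1$ term can set coefficients exactly to zero, which is why Lasso and Elastic net perform variable selection, whereas Ridge regression only shrinks coefficients.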
The final result was a Ridge regression model found by Elastic net combined with Stability selection. It contained $18$ stable predictors and achieved an RMSE of 2.43 on the validation data.
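The workflow summarised in the abstract can be sketched on simulated data as follows. This is a minimal, dependency-free illustration and not the thesis's implementation: the function names, the penalty values `lam` and `alpha`, the 0.8 selection threshold, the half-sample subsampling scheme and the toy data are all assumptions. Reading "Ridge regression model found by Elastic net" as an Elastic net fit with the mixing parameter set to zero is likewise an interpretation.

```python
import math
import random


def soft_threshold(z, t):
    """Soft-thresholding operator produced by the L1 part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=50):
    """Coordinate descent for
    (1/2n)*||y - b0 - X b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2).
    alpha = 1 is Lasso, alpha = 0 is Ridge. Columns must not be constant."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]  # centred columns
    col_ss = [sum(v * v for v in c) / n for c in cols]
    resid = [yi - y_mean for yi in y]  # residual when all coefficients are zero
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            c = cols[j]
            rho = sum(ci * ri for ci, ri in zip(c, resid)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1 - alpha))
            if new != beta[j]:
                delta = new - beta[j]
                resid = [ri - delta * ci for ri, ci in zip(resid, c)]
                beta[j] = new
    intercept = y_mean - sum(m * b for m, b in zip(x_mean, beta))
    return intercept, beta


def stability_selection(X, y, lam, alpha, n_subsamples=40, threshold=0.8, seed=0):
    """Refit on random half-samples and keep predictors that are selected
    (non-zero coefficient) in at least `threshold` of the fits."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)
        _, beta = elastic_net([X[i] for i in idx], [y[i] for i in idx], lam, alpha)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    freq = [c / n_subsamples for c in counts]
    return [j for j, f in enumerate(freq) if f >= threshold], freq


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def simulate(n, p, rng):
    """Toy data (assumption): only the first three columns carry signal."""
    true_beta = [8.0, -6.0, 5.0] + [0.0] * (p - 3)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [40.0 + sum(b * v for b, v in zip(true_beta, row)) + rng.gauss(0.0, 1.0)
         for row in X]
    return X, y


rng = random.Random(1)
X_train, y_train = simulate(120, 20, rng)
X_val, y_val = simulate(60, 20, rng)

# Step 1: find predictors that are selected consistently across subsamples.
stable, freq = stability_selection(X_train, y_train, lam=1.0, alpha=0.9)

# Step 2: refit with a pure ridge penalty (alpha = 0) on the stable predictors only.
X_sel = [[row[j] for j in stable] for row in X_train]
intercept, beta = elastic_net(X_sel, y_train, lam=0.01, alpha=0.0)

# Step 3: evaluate on held-out validation data.
pred = [intercept + sum(b * row[j] for b, j in zip(beta, stable)) for row in X_val]
val_rmse = rmse(y_val, pred)
print("stable predictors:", stable)
print("validation RMSE:", round(val_rmse, 2))
```

With hundreds of thousands of candidate predictors, as in the thesis, a pure-Python loop like this would be far too slow; a vectorised implementation from an established library would be the practical choice, and the penalty parameters would normally be tuned, for example by cross-validation.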
Language: English
Publication date: 10 Jun 2015
Number of pages: 79
Publishing institution: Dept. of Mathematical Sciences, Aalborg University
External collaborator: Retsgenetisk Afdeling, Københavns Universitet
Head of Department Niels Morling, Niels.Morling@sund.ku.dk
Other
ID: 213876578