Author(s)
Semester
4th semester
Programme
Publication year
2025
Submitted
2025-08-04
Number of pages
85 pages
Abstract
This master's thesis, at the intersection of sociology and social data science, investigates the potential of transformer-based prediction models to identify rare but significant life course outcomes from childhood trajectories. Using an extensive volume of Danish register data, we model the life histories of 140,708 individuals in a 3-year birth cohort from birth to age 17, tracking ~60 background variables with yearly observations that capture a wide range of dynamic, static and irregular family-related, socio-economic and socio-spatial indicators. Our primary objective is to explore how recent developments in transformer architecture, adapted to a social science context, can be used to represent tabular data as text sequences that simulate language in order to predict low-prevalence outcomes such as elite attainment in adulthood at age ~40. We operationalize six elite attainment outcomes for binary prediction: ‘income’, being among the highest five percentiles of annual income; ‘area’, living in one of the top five most elitist micro areas in Denmark; ‘educational’, having attained a high-prestige education with the highest average income levels; ‘city’, living in one of four major cities in combination with a master’s degree and an annual income level 1.5x higher than the median; ‘managerial’, holding a high-level managerial position in the labour market; and ‘self-employed’, being self-employed in combination with an annual income level 2x higher than the median. We tokenize all background variables and childhood events together with their corresponding annual time steps and convert them into one long text sequence per childhood trajectory. We then apply label encoding, sequence padding, attention masks, embedding layers and learnable position embeddings, and train the model with supervised deep learning, similar to language models, using dynamic instance weighting and focal loss to address class imbalance.
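The preprocessing steps named above (tokenization with annual time steps, label encoding, padding, attention masks) and the focal loss can be illustrated with a minimal sketch. This is not the thesis implementation: the variable names (`sex`, `parents_cohabiting`, `moved`), the `AGE_n` time-step marker, the `name=value` token format and the `gamma`/`alpha` defaults are all assumptions made for illustration, and dynamic instance weighting is not shown.

```python
import math

PAD, UNK = "[PAD]", "[UNK]"  # special tokens (assumed naming)

def trajectory_to_tokens(static, yearly):
    """Flatten one childhood trajectory into a list of string tokens.

    static: dict of time-invariant variables, e.g. {"sex": "F"}
    yearly: dict mapping age -> dict of variables observed at that age
    """
    tokens = [f"{name}={value}" for name, value in sorted(static.items())]
    for age in sorted(yearly):
        tokens.append(f"AGE_{age}")  # marker for the annual time step
        tokens.extend(f"{name}={value}"
                      for name, value in sorted(yearly[age].items()))
    return tokens

def build_vocab(token_lists):
    """Label encoding: assign each distinct token an integer id."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode_and_pad(tokens, vocab, max_len):
    """Return (input_ids, attention_mask), truncated or right-padded to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    mask = [1] * len(ids)            # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [vocab[PAD]] * n_pad, mask + [0] * n_pad  # 0 = padding

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example; p is the predicted probability of y == 1.

    gamma down-weights easy examples; alpha weights the positive class.
    Both defaults are illustrative, not values taken from the thesis.
    """
    p = min(max(p, 1e-7), 1 - 1e-7)  # avoid log(0)
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# Hypothetical two-year trajectory.
tokens = trajectory_to_tokens(
    {"sex": "F"},
    {0: {"parents_cohabiting": "yes"},
     1: {"parents_cohabiting": "yes", "moved": "yes"}},
)
vocab = build_vocab([tokens])
input_ids, attention_mask = encode_and_pad(tokens, vocab, max_len=8)
print(input_ids)       # [2, 3, 4, 5, 6, 4, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

In a full model, `input_ids` would index an embedding layer and `attention_mask` would keep the padded positions out of the attention computation. With a rare positive class, `alpha` would likely need tuning rather than being left at the default used here.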
The model for ‘income elite’ achieves a ROC AUC of 0.757, while the model for ‘city elite’ achieves a ROC AUC of 0.755. The models for ‘educational elite’ and ‘area elite’ achieve ROC AUCs of 0.713 and 0.720 respectively, while the models for ‘managerial elite’ and ‘self-employed elite’, which were trained on less data due to very small positive outcome subgroups, achieve lower ROC AUCs of 0.648 and 0.647 respectively. Threshold tuning supports robust out-of-sample evaluation and lets predictions prioritize higher recall for positive outcomes, which is useful for potential further analytical exploration of clusters and subgroups. Our findings suggest that transformer-based models that analyze life trajectories as language offer a powerful, domain-flexible approach for detecting latent life course patterns in register-based population data. At the same time, data volume proves important, and extreme class imbalance, with positives making up less than 5% of the total data volume, can present challenges for training and prediction results.
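The abstract does not say how the decision threshold is tuned. As one possible reading, the sketch below picks, on a validation set, the highest threshold that still reaches a target recall for the positive class; that threshold would then be applied unchanged to held-out data. The function names, the `target_recall` parameter and the toy scores are assumptions for illustration only.

```python
def recall_at(scores, labels, threshold):
    """Recall of the positive class when predicting 1 for score >= threshold."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("recall is undefined without positive examples")
    true_pos = sum(1 for s, y in zip(scores, labels)
                   if y == 1 and s >= threshold)
    return true_pos / positives

def tune_threshold(scores, labels, target_recall=0.8):
    """Return the highest observed score whose recall meets target_recall.

    Choosing the highest qualifying threshold keeps the number of
    predicted positives (and so false positives) as small as possible.
    """
    for t in sorted(set(scores), reverse=True):
        if recall_at(scores, labels, t) >= target_recall:
            return t
    raise ValueError("target_recall cannot be reached")

# Toy validation set: predicted probabilities and true labels (made up).
val_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
val_labels = [0, 0, 0, 1, 0, 1, 0, 1]

print(tune_threshold(val_scores, val_labels, target_recall=0.6))  # 0.6
print(tune_threshold(val_scores, val_labels, target_recall=1.0))  # 0.35
```

Lowering the threshold raises recall at the cost of more false positives, a trade-off that matters most when positives are below 5% of the data.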
Keywords
Colophon: This page is part of AAU Studenterprojekter, Aalborg University's student project portal. Here you can find and download publicly available master's theses and master's projects from across the university from 2008 onward. Student projects from before 2008 are available in print at Aalborg University Library.