Vejen til eliten: Kan dit tidlige liv prædiktere din vej til toppen af det sociale hierarki - en transformerbaseret prædiktion af eliteudfald med dansk registerdata
Translated title
The Path to the Elite: Can your childhood trajectory predict your path to the top of the social hierarchy - a transformer-based prediction of elite outcomes with Danish Register Data
Authors
Semester
4th semester
Degree programme
Publication year
2025
Submitted
2025-08-04
Number of pages
85
Abstract
This master's thesis, at the intersection of sociology and social data science, investigates the potential of transformer-based prediction models to identify rare but significant life-course outcomes from childhood trajectories. Using extensive Danish register data, we model the life histories of 140,708 individuals in a three-year birth cohort from birth to age 17, tracking ~60 background variables with yearly observations that capture a wide range of dynamic, static and irregular family-related, socio-economic and socio-spatial indicators. Our primary objective is to explore how recent developments in transformer architectures, adapted to a social science context, can represent tabular data as language-like text sequences in order to predict low-prevalence outcomes such as elite attainment in adulthood at age ~40. We operationalize six elite attainment outcomes for binary prediction: ‘income’, being among the highest five percentiles of annual income; ‘area’, living in the top five most elitist micro areas in Denmark; ‘educational’, having attained a high-prestige education with the highest average income levels; ‘city’, living in one of four major cities in combination with a master’s degree and an annual income level 1.5x higher than the median; ‘managerial’, holding a high-level managerial position in the labour market; and ‘self-employed’, being self-employed in combination with an annual income level 2x higher than the median. We tokenize all background variables and childhood events together with their corresponding annual time steps, convert them into one long text sequence per childhood trajectory, apply label encoding, sequence padding, attention masks, embedding layers and learnable position embeddings, and train the model with supervised deep learning in the manner of language models, using dynamic instance weighting and focal loss to address class imbalance.
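The preprocessing steps named above (tokenization with annual time steps, label encoding, padding, attention masks) and the focal loss can be illustrated with a minimal sketch. It is not the thesis implementation: the variable names (`parent_income`, `municipality`, `moved`), the `AGE_n` time-step tokens, the helper names and the `gamma`/`alpha` defaults are all assumptions chosen for illustration.

```python
import math

PAD = "[PAD]"  # hypothetical padding token, mapped to id 0


def tokenize_trajectory(yearly_records):
    """Flatten one childhood (a list of per-year dicts) into a token sequence.

    Each year contributes a time-step token followed by one token per
    observed variable; years may have different variables (irregular events).
    """
    tokens = []
    for age, record in enumerate(yearly_records):
        tokens.append(f"AGE_{age}")
        for name, value in sorted(record.items()):
            tokens.append(f"{name}={value}")
    return tokens


def build_vocab(sequences):
    """Label-encode tokens: assign each distinct token an integer id."""
    vocab = {PAD: 0}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode_and_pad(seq, vocab, max_len):
    """Return (ids, attention_mask), truncated or padded to max_len."""
    ids = [vocab[tok] for tok in seq][:max_len]
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [vocab[PAD]] * n_pad, mask + [0] * n_pad


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one predicted probability p = P(y = 1).

    The (1 - p_t) ** gamma factor down-weights easy examples so that
    rare positives contribute more to the total loss.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Two made-up trajectories (hypothetical variables and values).
child_a = [{"parent_income": "Q5", "municipality": "101"},
           {"parent_income": "Q5", "moved": "yes"}]
child_b = [{"parent_income": "Q1"}]

sequences = [tokenize_trajectory(child_a), tokenize_trajectory(child_b)]
vocab = build_vocab(sequences)
ids_b, mask_b = encode_and_pad(sequences[1], vocab, max_len=6)
```

In this sketch the attention mask marks real tokens with 1 and padding with 0, so a transformer can ignore the padded positions; the embedding layers and learnable position embeddings mentioned above would then operate on the integer ids.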
The ‘income elite’ model achieves a ROC AUC of 0.757 and the ‘city elite’ model a ROC AUC of 0.755. The models for ‘educational elite’ and ‘area elite’ achieve ROC AUCs of 0.713 and 0.720 respectively, while the models for ‘managerial elite’ and ‘self-employed elite’, which were trained on less data due to very small positive outcome subgroups, achieve lower ROC AUCs of 0.648 and 0.647 respectively. Threshold tuning ensures robust out-of-sample evaluation and allows predictions to prioritize higher recall for positive outcomes, supporting further analytical exploration of clusters and subgroups. Our findings suggest that transformer-based models that analyze life trajectories as language offer a powerful, domain-flexible approach for detecting latent life-course patterns in register-based population data. Data volume proves important, however, and extreme class imbalance, with positives making up less than 5% of the total data volume, can present challenges for training and prediction results.
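To make the evaluation terms concrete, the following sketch computes a ROC AUC and picks a decision threshold that reaches a target recall on the positive class. It is an illustration under assumptions, not the thesis procedure: the labels and scores are made up, the function names are hypothetical, and in practice the threshold would be chosen on a validation split and then applied to held-out data.

```python
import math


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_for_recall(labels, scores, target_recall):
    """Highest threshold t such that predicting positive when score >= t
    recovers at least target_recall of the positives in this data."""
    pos_scores = sorted((s for y, s in zip(labels, scores) if y == 1),
                        reverse=True)
    k = max(math.ceil(target_recall * len(pos_scores)), 1)
    return pos_scores[k - 1]


def recall_and_precision(labels, scores, threshold):
    """Recall and precision of the rule 'score >= threshold'."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    return tp / (tp + fn), tp / (tp + fp)


# Made-up imbalanced example: 2 positives among 10 cases.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.15, 0.40, 0.08, 0.12, 0.35, 0.60]

auc = roc_auc(labels, scores)
threshold = threshold_for_recall(labels, scores, target_recall=1.0)
recall, precision = recall_and_precision(labels, scores, threshold)
```

Lowering the threshold to secure higher recall admits more false positives, which is the trade-off accepted when the predicted positives are meant as candidates for further exploratory analysis.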
Keywords
