Monte Carlo Dropout for Weather Forecasting: Limitations and Lessons from Aurora
Author
Thorbensen, Frederik Bode
Term
4th term
Education
Publication year
2025
Abstract
This thesis examines Monte Carlo Dropout (MC-Dropout) as a practical Bayesian approximation for estimating uncertainty in machine-learning weather prediction (MLWP). Recent transformer-based foundation models such as Aurora achieve strong point forecasts, but they reveal little about how confident those forecasts are unless large ensembles of perturbed or fine-tuned models are run, an approach that is computationally expensive. MC-Dropout offers an alternative: by keeping dropout active at prediction time and sampling the model repeatedly, it produces a spread of outcomes that approximates Bayesian inference and effectively mimics a diverse ensemble using a single trained model. We evaluate the calibration of these probabilistic forecasts (whether predicted intervals match observed outcomes), their reliability, and their overall predictive skill relative to deterministic baselines. The study focuses on near-surface variables, namely 2-meter air temperature (2t) and the 10-meter wind components (10u and 10v), and uses established reanalysis datasets as benchmarks. The results highlight the value of probabilistic modeling for medium- to long-range weather forecasting and clarify the trade-offs between ensemble diversity and computational cost.
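
As a minimal sketch of the sampling procedure the abstract describes (not Aurora's actual interface), the following PyTorch snippet keeps standard dropout layers stochastic at inference time and aggregates repeated forward passes into an ensemble mean and spread. Here model, x, and the sample count of 50 are placeholder assumptions standing in for any dropout-equipped forecasting network and its gridded input.

import torch
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> None:
    # Switch the whole model to inference mode, then re-activate only
    # the standard dropout layers so their masks are re-sampled on
    # every forward pass (batch norm and the like stay in eval mode).
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_forecast(model: nn.Module, x: torch.Tensor, n_samples: int = 50):
    # Run n_samples stochastic forward passes on the same input and
    # summarize the resulting spread as a per-grid-point mean and
    # standard deviation, mimicking a diverse ensemble with one model.
    enable_mc_dropout(model)
    samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

The standard-deviation field from such a sketch is the raw material for the calibration and reliability diagnostics discussed above, compared against deterministic baselines.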
