Multimodal Learning for Understanding and Classifying Stress
Authors
Hansen, Andreas Peter Juhl ; Münster, Jeppe Roden ; Segaard, Simon Bock
Term
4. semester
Education
Publication year
2026
Submitted on
2026-05-27
Pages
92
Abstract
Measuring and reliably classifying human stress is difficult because people’s baseline levels vary widely, self-reported stress does not always match sensor-based measures, and studies use different ways to elicit stress. To address this, the thesis introduces two multimodal models that combine multiple data types (audio and video) to identify stress: TempFeat, which relies on predefined features, and 3SI, which uses self-supervised learning (it learns useful patterns from the data without manual labels). It also proposes Synergy Identification via Network Clustering (SINC), a model-agnostic explainable AI framework that helps interpret these complex systems by showing which inputs and interactions drive decisions. Evaluations show that 3SI generalizes best to new people: on unseen subjects it outperforms the state-of-the-art ADAPT model by at least 22.2% in balanced accuracy and 13.6% in weighted F1 score, two common measures of classification performance. Applying SINC to these models gives insight into their decision processes: it reveals that 3SI’s strong performance comes from capturing complex cross-modal audio-visual interactions, whereas TempFeat mainly relies on dependencies within each modality.
Measuring and reliably classifying human stress is difficult because people’s normal stress levels vary widely, self-reported stress does not always agree with objective measurements, and research protocols induce stress in different ways. To handle this, the thesis presents two multimodal models that combine multiple data types (audio and video) to identify stress: TempFeat, which builds on predefined features, and 3SI, which uses self-supervised learning (it learns useful patterns from data without manual labels). In addition, it introduces Synergy Identification via Network Clustering (SINC), a new, model-agnostic framework for explainable artificial intelligence that helps to understand such complex systems by showing which inputs and interactions drive the decisions. The evaluations show that 3SI generalizes best to new people and surpasses the leading ADAPT model by at least 22.2% in balanced accuracy and 13.6% in weighted F1 score (two widely used measures of classification performance) on unseen subjects. When SINC is applied to the models, it gives insight into why they make particular decisions: SINC reveals that 3SI’s strength comes from capturing complex interplay across audio and video, whereas TempFeat primarily builds on dependencies within each individual data type.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
Keywords
