An executive master's programme thesis from Aalborg University

Trusting Gut Instincts: Transformer-Based Extraction of Structured Data from Gut-Brain Axis Publications: A Model Ensembling and Weighted Training Approach for GutBrainIE

Authors

Term

4th term

Education

Publication year

2025

Submitted on

Pages

24

Abstract

We present our team’s (Gut-Instincts) solution to the GutBrainIE challenge, which asks systems to recognize key terms (named entities) and extract relations between them in biomedical articles about the gut–brain axis. To handle domain-specific language, we use transformer-based language models pretrained on biomedical text. For named-entity recognition (NER), we test three classification heads: (1) a dense (fully connected) layer, (2) a dense layer followed by a Conditional Random Field (CRF) layer, which models dependencies between neighboring labels, and (3) a bidirectional long short-term memory (BiLSTM) layer followed by a CRF. For relation extraction (RE), we add negative samples (pairs without a true relation) and vary the ratio of positive to negative examples. Across all tasks, we ensemble multiple models to reduce variability and improve robustness. Because the dataset mixes sources of different quality, we use weighted training so the models learn from all available data while giving higher weight to high-quality sources. Our experiments suggest that using a high negative-to-positive ratio, model ensembling, and weighted training improves performance on both NER and RE. In the GutBrainIE challenge, we placed second in NER task 6.1 with a micro F1 score of 0.8382, and first in all three RE tasks (6.2.1, 6.2.2, 6.2.3) with micro F1 scores of 0.6864, 0.6866, and 0.4635, respectively.
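
To make the RE negative-sampling setup concrete, here is a minimal Python sketch of how candidate pairs could be generated. The function name, the NO_RELATION label, and the sampling details are illustrative assumptions; the abstract only states that the ratio of positive to negative examples was varied.

import random

def build_re_examples(entities, gold_relations, neg_ratio=5, seed=13):
    # gold_relations: list of (head, tail, label) triples from the annotations.
    # Every other ordered entity pair becomes a candidate negative.
    rng = random.Random(seed)
    gold_pairs = {(h, t) for (h, t, _) in gold_relations}
    negatives = [(h, t, "NO_RELATION")
                 for h in entities for t in entities
                 if h != t and (h, t) not in gold_pairs]
    rng.shuffle(negatives)
    # Keep neg_ratio negatives per positive; the abstract reports that a
    # high negative-to-positive ratio improved results.
    negatives = negatives[: neg_ratio * max(1, len(gold_relations))]
    return list(gold_relations) + negatives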
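Ensembling for NER can be as simple as a per-token majority vote over the label sequences predicted by several models. The abstract does not specify the combination rule, so the voting scheme below is one plausible reading, not necessarily the one used.

from collections import Counter

def majority_vote(label_sequences):
    # label_sequences: one predicted label sequence per model, all the same
    # length; ties are broken by Counter's insertion order.
    return [Counter(token_labels).most_common(1)[0][0]
            for token_labels in zip(*label_sequences)]

# Example: three models vote on a two-token sentence.
# majority_vote([["B-X", "O"], ["B-X", "B-X"], ["B-X", "O"]]) -> ["B-X", "O"]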
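Finally, the weighted-training scheme amounts to scaling each example's loss by the quality of its annotation source. The PyTorch sketch below illustrates the idea for token classification; the tier names and weight values are assumptions for illustration, since the abstract only says that higher-quality sources get greater influence during optimization.

import torch
import torch.nn.functional as F

# Assumed quality tiers and weights -- illustrative only; the abstract does
# not give the actual values used by the team.
SOURCE_WEIGHTS = {"platinum": 1.0, "gold": 0.8, "silver": 0.5, "bronze": 0.25}

def weighted_token_loss(logits, labels, sources, ignore_index=-100):
    # logits: (batch, seq_len, num_labels); labels: (batch, seq_len),
    # with ignore_index marking padding/special tokens.
    batch, seq_len, num_labels = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(-1, num_labels), labels.reshape(-1),
        ignore_index=ignore_index, reduction="none",
    ).reshape(batch, seq_len)
    # One weight per example, broadcast over its tokens.
    weights = torch.tensor([SOURCE_WEIGHTS[s] for s in sources],
                           device=logits.device).unsqueeze(1)
    mask = (labels != ignore_index).float()
    return (per_token * weights * mask).sum() / mask.sum().clamp(min=1.0)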
