GetGut: A Multi-Stage Pipeline for Gut-Brain Axis Information Extraction
Authors
Jørgensen, Sophus ; Garavito Molina, Ernesto
Term
4th term
Education
Publication year
2026
Submitted on
2026-06-24
Abstract
This thesis presents GetGut@AAU, a multi-stage pipeline that automatically extracts knowledge from scientific texts in the gut–brain domain. The system was developed for the GutBrainIE task in the BioASQ Lab at CLEF 2026 and addresses four subtasks: (1) finding relevant technical terms in text (Named Entity Recognition, NER), (2) linking these terms to the correct concepts in scientific databases (Named Entity Recognition and Disambiguation, NERD), (3) identifying relations between specific mentions of concepts in the text (Mention-Level Relation Extraction, M-RE), and (4) identifying relations between the underlying concepts themselves (Concept-Level Relation Extraction, C-RE). A key challenge is that the available training datasets differ greatly in how detailed and reliable their annotations are. To make the best use of them, we introduce a method called Weighted Funnel Fine-Tuning for our NER and M-RE models: we first train on large amounts of weakly supervised or automatically annotated data and then gradually fine-tune on smaller datasets that have been carefully annotated by domain experts. This lets the models first learn broad biomedical terminology and then become more precise in detecting and delimiting individual entities. We also improve data quality for each subtask separately. In M-RE, we reduce the bias toward predicting too many positive relations by systematically adding selected negative examples through controlled negative sampling. For NERD, we improve accuracy by removing noisy, distantly supervised entries from the reference dictionary that the system relies on. In the official evaluation, our system ranked 3rd in both the M-RE and C-RE subtasks, with Micro F1 scores of 0.4054 and 0.2020, respectively. It ranked 6th and 5th in the NER and NERD subtasks, with Micro F1 scores of 0.8014 and 0.5517, respectively. All implementation code is publicly available at: https://github.com/SophusJ/GetGut-AAU.
This thesis presents GetGut@AAU, a multi-stage pipeline that automatically extracts knowledge from scientific texts in the gut-brain domain. The system was developed for the GutBrainIE task in the BioASQ Lab at CLEF 2026 and solves four subtasks: 1) finding relevant technical terms in the text (Named Entity Recognition, NER), 2) linking these terms to the correct concepts in scientific databases (Named Entity Recognition and Disambiguation, NERD), 3) finding relations between specific mentions of concepts in the text (Mention-Level Relation Extraction, M-RE), and 4) finding relations between the overarching concepts (Concept-Level Relation Extraction, C-RE). A central challenge is that the available training data vary greatly in quality and level of detail. To make the best possible use of these data, we introduce a method called Weighted Funnel Fine-Tuning for our NER and M-RE models. We start with large amounts of automatically or weakly annotated data and then gradually fine-tune on smaller datasets that have been manually reviewed by experts. In this way, the model first learns a broad biomedical vocabulary and then becomes more precise at delimiting individual technical terms. We also improve data quality specifically for each subtask. In M-RE, for example, we reduce the bias toward positive relations by systematically adding selected negative examples (controlled negative sampling). For NERD, we increase accuracy by removing noisy, distantly annotated entries from the reference dictionary that the system uses. In the official evaluation, our system placed 3rd in both the M-RE and C-RE subtasks, with Micro F1 scores of 0.4054 and 0.2020, respectively. In the NER and NERD subtasks, the system placed 6th and 5th, with Micro F1 scores of 0.8014 and 0.5517. All source code for the solution is freely available at: https://github.com/SophusJ/GetGut-AAU.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
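The abstract mentions controlled negative sampling for M-RE without saying how the negatives are chosen. The following Python sketch illustrates the general idea only and is not taken from the GetGut@AAU code. It assumes that mentions are identified by string IDs, that annotated relations are (subject, object, label) triples, and that negatives are drawn uniformly at random from unannotated ordered pairs. The function name `sample_negatives`, the `ratio` parameter, and the `no_relation` label are all hypothetical.

```python
import random
from itertools import permutations


def sample_negatives(mentions, positives, ratio=1.0, seed=0):
    """Draw mention pairs with no annotated relation as negative examples.

    mentions:  list of mention IDs found in one document (assumed format)
    positives: list of (subject, object, label) triples annotated as related
    ratio:     target number of negatives per positive (hypothetical knob)
    """
    positive_pairs = {(subj, obj) for subj, obj, _ in positives}

    # Every ordered pair of distinct mentions that is not annotated
    # as a relation is a candidate negative.
    candidates = [
        pair for pair in permutations(mentions, 2)
        if pair not in positive_pairs
    ]

    # Cap the number of negatives so the positive/negative balance
    # is controlled rather than set by the number of mentions.
    k = min(len(candidates), round(ratio * len(positives)))
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    return [(subj, obj, "no_relation") for subj, obj in rng.sample(candidates, k)]


# Toy document with three mentions and one annotated relation.
mentions = ["m1", "m2", "m3"]
positives = [("m1", "m2", "influence")]
negatives = sample_negatives(mentions, positives, ratio=2.0)
```

With `ratio=2.0` the toy call yields two negative pairs for the single positive. A real system would likely restrict candidates further, for example to entity-type combinations that can legally hold a relation, but the abstract does not state which criteria the authors used.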
