Towards Reliable and Deployable LLM-Based Patient Question Answering
Author
Maniatis, Alexandros
Term
4th semester
Publication year
2026
Submitted on
2026-06-01
Pages
74
Abstract
Large Language Models (LLMs)—advanced AI systems that can understand and generate text—offer promise for healthcare, but real clinical use requires more than good test scores. Systems must be reliable, grounded in evidence, compliant with regulations, and feasible to operate. This thesis examines an LLM-based clinical question answering system through participation in the ArchEHR-QA 2026 shared task, combining standardized benchmark results with a deployment-focused analysis to answer a question that technical metrics alone cannot: under what conditions should Danish healthcare institutions deploy these systems, and which deployment architecture fits their context. More than 34 LLM configurations were evaluated across four patient-centered subtasks: query reformulation, evidence retrieval, answer generation, and evidence alignment. Prompting strategies included zero-shot (no examples), few-shot (a few examples), constraint-based (explicit rules and formats), and ensemble-based (combined setups). The main technical finding is that model capability matters more than prompt engineering: prompting reliably improved structural consistency but did not overcome reasoning limits in weaker models. Deployment readiness varied more across subtasks than across models; evidence alignment was comparatively stable, while evidence retrieval remained fragile even with ensembles. A simplified deployment framework further showed that economic viability depends strongly on institutional scale, and that regulatory constraints from the GDPR (EU, 2016) and the EU AI Act (EU, 2024) create a structural asymmetry that smaller institutions cannot resolve through technical choices alone. Overall, current LLM-based patient QA systems are best positioned as assistive tools within human-in-the-loop workflows, and deployment feasibility should be addressed early in the design process—not only after technical performance looks good.
Large language models (LLMs), advanced AI that can understand and generate text, could potentially help the healthcare sector, but actual clinical use requires more than high accuracy in tests. The systems must also be operationally reliable, build on documented knowledge, comply with regulations, and fit into everyday practice. This thesis examines an LLM-based solution for clinical question answering through participation in the ArchEHR-QA 2026 shared task. The work combines results from a standardized benchmark with a practical analysis of deployment to answer a question that technical measurements alone cannot: under what conditions should Danish healthcare institutions adopt these systems, and which deployment architecture fits their context. More than 34 LLM configurations were tested on four patient-centered subtasks: question reformulation, evidence retrieval, answer generation, and evidence alignment. Different prompting strategies were used: zero-shot (no examples), few-shot (a few examples), constraint-based (clear rules and formats), and ensemble-based (combined setups). The central technical conclusion is that model capability matters more than prompting techniques: prompting improved structural consistency but could not compensate for lacking reasoning in weaker models. Deployment readiness varied more between subtasks than between models; evidence alignment was relatively stable, while evidence retrieval remained fragile, even in ensemble configurations. A simplified deployment assessment further showed that economic sustainability depends heavily on the institution's size, and that the GDPR (EU, 2016) and the EU AI Act (EU, 2024) create a structural asymmetry that smaller institutions cannot resolve through technology alone.
Overall, the results indicate that LLM-based patient QA systems are currently best used as assistive tools in workflows with human professional oversight, and that questions of deployability should be addressed early in the design, not only after the technical targets have been met.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
Keywords
