A master's thesis from Aalborg University


When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich: From Local Setups to Cloud APIs: Observable Quest for Reliable EU Legal AI MLOps

Translated title

When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich

Author

Term

4. term

Education

Publication year

2025

Submitted on

Pages

54

Abstract


This thesis investigates how to make AI systems for legal text analysis both observable and reliable. Using EU legislative documents from the MultiEURLEX dataset, it builds and compares three approaches: (1) direct use of a large language model (LLM), (2) a retrieval-augmented generation (RAG) pipeline that first fetches relevant documents and then generates answers, and (3) an agentic multi-step variant that breaks the task into several stages. The systems are evaluated on a curated gold-standard dataset with quantitative metrics (F1, precision, and recall, which indicate how often the model's outputs are correct and how completely it finds the relevant information) and with qualitative review using LLM-as-a-Judge, an LLM that scores outputs. Tools such as Langfuse and LiteLLM provide end-to-end observability, including tracing and metric logging across local open-source, open-weights, and cloud-hosted proprietary model configurations. The main result is that direct LLM use outperforms the RAG variants because retrieval has low recall; in this setting, retrieval is the bottleneck. The work also demonstrates a full-stack MLOps deployment on AAU's uCloud high-performance computing (HPC) GPU platform and underscores the importance of traceability and human-centered evaluation for trustworthy AI. Overall, the thesis offers a practical blueprint and critical lessons for operating generative AI in high-stakes domains.
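The metrics named in the abstract can be illustrated with a minimal sketch. The function and the example label sets below are illustrative assumptions, not taken from the thesis; they show set-based precision, recall, and F1 of the kind used to score predictions against a gold standard, and why a low-recall retriever caps what any downstream generation step can achieve.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: the retriever returns only 2 of 4 relevant items.
# Everything it returns is relevant (precision 1.0), but recall is 0.5,
# so F1 is capped at 2/3 -- the retrieval-bottleneck effect the thesis reports.
p, r, f1 = precision_recall_f1(
    predicted={"environment", "energy"},
    gold={"environment", "energy", "transport", "taxation"},
)
```

Note the asymmetry the abstract points to: a RAG pipeline can only answer from what retrieval surfaces, so low retrieval recall degrades the final answer even when generation itself is strong.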

[This summary has been rewritten with the help of AI based on the project's original abstract]