When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich: From Local Setups to Cloud APIs
Translated title
When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich
Author
Do, Tony Thai
Term
4th term
Education
Publication year
2025
Submitted on
2025-05-26
Pages
54
Abstract
This thesis investigates how to make AI systems for legal text analysis both observable and reliable. Using EU legislative documents from the MultiEURLEX dataset, it builds and compares three approaches: (1) a straightforward large language model (LLM), (2) a retrieval-augmented generation (RAG) pipeline that first fetches relevant documents and then generates answers, and (3) an agentic multi-step variant that breaks the task into several steps. The systems are evaluated on a curated gold-standard dataset with quantitative metrics - F1, precision, and recall, which indicate how often the model is correct and how completely it finds relevant information - and with qualitative review using LLM-as-a-Judge (an LLM that scores outputs). Tools such as Langfuse and LiteLLM provide end-to-end observability, including tracing and metric logging across local open-source, open-weights, and cloud-hosted proprietary model configurations. The main result is that direct LLM use outperforms the RAG variants because retrieval has low recall; in this setting, retrieval is the bottleneck. The work also demonstrates a full-stack MLOps deployment on AAU's uCloud high-performance computing (HPC) GPU platform, and it underscores the importance of traceability and human-centered evaluation for trustworthy AI. Overall, the thesis offers a practical blueprint and critical lessons for operating generative AI in high-stakes domains.
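The set-based metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual evaluation code: the function name `prf1` and the example labels are hypothetical, and it assumes each document is scored by comparing a predicted label set against a gold label set (MultiEURLEX documents carry EUROVOC descriptors).

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 for one example's label sets."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # labels that are both predicted and in the gold set
    precision = tp / len(pred) if pred else 0.0   # how often predictions are correct
    recall = tp / len(ref) if ref else 0.0        # how completely gold labels are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)         # harmonic mean of the two
    return precision, recall, f1

# Hypothetical example: three predicted labels, two of which are correct.
p, r, f = prf1({"agriculture", "trade", "energy"}, {"agriculture", "trade"})
```

With these toy labels, precision is 2/3, recall is 1.0, and F1 is 0.8, which mirrors the abstract's point: a pipeline whose retrieval step misses relevant documents is capped by low recall no matter how precise its answers are.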
[This abstract has been rewritten with the help of AI based on the project's original abstract]
Keywords
