A master's thesis from Aalborg University

When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich: From Local Setups to Cloud APIs: Observable Quest for Reliable EU Legal AI MLOps

Translated title

When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich

Term

4. term

Publication year

2025

Pages

54

Abstract

This thesis explores how to build observable and reliable AI systems for legal text analysis using Large Language Models (LLMs). Focusing on EU legislative documents from MultiEURLEX, a baseline LLM system, a Retrieval-Augmented Generation (RAG) pipeline, and an agentic multi-step variant are developed and compared. The systems are evaluated using a curated gold-standard dataset, quantitative metrics (F1, precision, recall), and qualitative assessments (LLM-as-a-Judge). Tools such as Langfuse and LiteLLM provide full observability, tracing, and metric logging across locally hosted free, open-source and open-weights configurations as well as cloud-based proprietary LLMs. Key findings reveal that direct LLM access outperforms the RAG variants due to low retrieval recall, highlighting retrieval as the current bottleneck in domain-specific RAG applications. The work demonstrates a full-stack MLOps deployment on AAU's uCloud HPC GPU platform and highlights the importance of traceability and human-centered evaluation in trustworthy AI. This thesis and related research contribute both a methodological blueprint and critical insights for operationalizing GenAI in high-stakes domains.
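
As an illustration of the quantitative evaluation described in the abstract, the following minimal Python sketch computes micro-averaged precision, recall, and F1 for multi-label predictions against a gold-standard annotation set. It is not the thesis's actual evaluation code; the document identifiers, labels, and the micro-averaging choice are illustrative assumptions.

# Minimal sketch (illustrative, not the thesis code): micro-averaged
# precision, recall, and F1 for multi-label predictions against a curated
# gold-standard set. Document IDs and labels below are hypothetical.

from typing import Dict, Set, Tuple

def micro_prf(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all documents."""
    tp = fp = fn = 0
    for doc_id, gold_labels in gold.items():
        pred_labels = pred.get(doc_id, set())
        tp += len(gold_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - gold_labels)   # predicted but not in gold
        fn += len(gold_labels - pred_labels)   # in gold but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    # Hypothetical example: labels produced by an LLM/RAG pipeline vs. gold annotations.
    gold = {"celex:32019R0001": {"data protection", "internal market"}}
    pred = {"celex:32019R0001": {"data protection", "consumer policy"}}
    p, r, f1 = micro_prf(gold, pred)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

In practice, scores like these can also be logged per trace to an observability backend such as Langfuse, so that quantitative metrics sit alongside the qualitative LLM-as-a-Judge assessments mentioned above.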