A master's thesis from Aalborg University


When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich: From Local Setups to Cloud APIs: Observable Quest for Reliable EU Legal AI MLOps

Translated title

When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich

Author

Term

4. term

Education

Publication year

2025

Submitted on

Pages

54

Abstract


This thesis investigates how to make AI systems for legal text analysis both observable and reliable. Using EU legislative documents from the MultiEURLEX dataset, it builds and compares three approaches: (1) direct use of a large language model (LLM), (2) a retrieval-augmented generation (RAG) pipeline that first fetches relevant documents and then generates answers, and (3) an agentic multi-step variant that breaks the task into several stages. The systems are evaluated on a curated gold-standard dataset with quantitative metrics (F1, precision, and recall, which indicate how often the model's outputs are correct and how completely it finds the relevant information) and with qualitative review using LLM-as-a-Judge, an LLM that scores outputs. Tools such as Langfuse and LiteLLM provide end-to-end observability, including tracing and metric logging across local open-source, open-weights, and cloud-hosted proprietary model configurations. The main result is that direct LLM use outperforms the RAG variants because retrieval has low recall; in this setting, retrieval is the bottleneck. The work also demonstrates a full-stack MLOps deployment on AAU's uCloud high-performance computing (HPC) GPU platform and underscores the importance of traceability and human-centered evaluation for trustworthy AI. Overall, the thesis offers a practical blueprint and critical lessons for operating generative AI in high-stakes domains.
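The metrics named in the abstract can be illustrated with a minimal sketch. The function and the example label sets below are illustrative assumptions, not taken from the thesis; they show set-based precision, recall, and F1 of the kind used to score predictions against a gold standard, and why a low-recall retriever caps what any downstream generation step can achieve.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: the retriever returns only 2 of 4 relevant items.
# Everything it returns is relevant (precision 1.0), but recall is 0.5,
# so F1 is capped at 2/3 -- the retrieval-bottleneck effect the thesis reports.
p, r, f1 = precision_recall_f1(
    predicted={"environment", "energy"},
    gold={"environment", "energy", "transport", "taxation"},
)
```

Note the asymmetry the abstract points to: a RAG pipeline can only answer from what retrieval surfaces, so low retrieval recall degrades the final answer even when generation itself is strong.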

[This summary has been rewritten with the help of AI based on the project's original abstract]