When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich: An Observable Quest for Reliable EU Legal AI MLOps, from Local Setups to Cloud APIs
Translated title
When LLMs Lie: Building Observable LLMOps for Evaluating EU Legislative RAG - From GPU-Poor to GPU-Rich
Author
Term
4. term
Education
Publication year
2025
Submitted on
2025-05-26
Pages
54
Abstract
This thesis explores how to build observable and reliable AI systems for legal text analysis using Large Language Models (LLMs). Focusing on EU legislative documents from MultiEURLEX, a baseline LLM system, a Retrieval-Augmented Generation (RAG) pipeline, and an agentic multi-step variant are developed and compared. The systems are evaluated using a curated gold-standard dataset, quantitative metrics (F1, precision, recall), and qualitative assessments (LLM-as-a-Judge). Tools such as Langfuse and LiteLLM provide full observability, tracing, and metric logging across local open-source, open-weights, and cloud-based proprietary LLM configurations. Key findings reveal that direct LLM access outperforms the RAG variants due to low retrieval recall, highlighting retrieval as the current bottleneck in domain-specific RAG applications. The work demonstrates a full-stack MLOps deployment on AAU's uCloud HPC GPU platform and highlights the importance of traceability and human-centered evaluation in trustworthy AI. This thesis and the related research contribute both a methodological blueprint and critical insights for operationalizing GenAI in high-stakes domains.
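The abstract's central finding — that low retrieval recall drags down the RAG variants despite otherwise capable models — follows directly from how F1 combines precision and recall. A minimal sketch (illustrative only; not code from the thesis) shows why even a high-precision retriever scores poorly when recall is the bottleneck:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical numbers for illustration: a retriever that is precise
# but misses most relevant passages still yields a low F1, mirroring
# the retrieval bottleneck the thesis identifies.
high_precision_low_recall = f1_score(0.9, 0.2)
balanced = f1_score(0.6, 0.6)
print(high_precision_low_recall, balanced)
```

Because the harmonic mean is dominated by the smaller of the two values, improving recall (rather than precision) is the higher-leverage fix in this setting.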
Keywords
Documents
