A master's thesis from Aalborg University

LLMEnsembleEval: A Modular Framework for Large Language Model Ensemble Evaluation

Term

4th term

Publication year

2025

Pages

25

Abstract

Large Language Models (LLMs) achieve remarkable performance across diverse NLP tasks, yet suffer from critical reliability issues, including hallucinations and inconsistent outputs. Ensemble methods have emerged as a promising solution, combining predictions from multiple models to improve robustness and performance. However, current ensemble evaluation practices lack standardization, hindering method comparison and reproducibility. This work addresses two key challenges in LLM ensemble research. First, we validate the Generation of Each token by LLMs as a Classification (GAC) strategy by reproducing its core results and extending the evaluation to additional models and benchmarks. Our experiments across MMLU, PIQA, ARC Challenge, and Winogrande reveal that GAC's effectiveness depends critically on the performance similarity between ensemble members, with uniform weighting working best when models have comparable capabilities. Second, we develop LLMEnsembleEval, the first standardized framework for LLM ensemble evaluation that integrates with lm-evaluation-harness. Its modular architecture supports multi-GPU deployment and enables systematic comparison of ensemble strategies under reproducible protocols. Our findings show that GAC consistently improves performance on knowledge-intensive tasks such as MMLU (gains of 0.1% to 3.6%) but yields mixed results on complex reasoning tasks, highlighting the need for task-specific strategies. These results support the performance-similarity hypothesis: ensembles work best when their members have comparable capability. LLMEnsembleEval provides a foundation for the systematic evaluation of emerging ensemble strategies, potentially accelerating progress toward more reliable and effective LLM systems.
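
To make the GAC idea concrete, the sketch below illustrates uniform-weight, token-level ensembling in the spirit described above: each decoding step is treated as a classification over the vocabulary, and the per-model probability distributions are averaged before the next token is chosen. This is only an illustrative sketch, not the thesis's implementation; the function name, the assumption of a shared vocabulary across models, and the greedy argmax decoding step are all assumptions introduced here.

# A minimal sketch of uniform-weight, token-level ensembling in the spirit of
# GAC: each decoding step is treated as a classification over the vocabulary,
# and the per-model probability distributions are averaged. Assumptions not
# taken from the abstract: all models share a tokenizer/vocabulary, and
# greedy decoding is used.
import torch

def ensemble_next_token(logits_per_model, weights=None):
    """Choose the next token from averaged per-model probability distributions.

    logits_per_model: one [vocab_size] logit tensor per ensemble member.
    weights: optional per-model weights; defaults to uniform weighting, the
             setting reported to work best for similarly capable models.
    """
    if weights is None:
        weights = [1.0 / len(logits_per_model)] * len(logits_per_model)
    probs = torch.zeros_like(logits_per_model[0])
    for w, logits in zip(weights, logits_per_model):
        probs += w * torch.softmax(logits, dim=-1)  # per-model classification view
    return int(torch.argmax(probs).item())          # greedy pick over the ensemble

# Toy usage: two "models" over a five-token vocabulary.
if __name__ == "__main__":
    model_a = torch.tensor([2.0, 0.5, 0.1, 0.0, -1.0])
    model_b = torch.tensor([0.1, 2.5, 0.0, 0.0, -0.5])
    print(ensemble_next_token([model_a, model_b]))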