A master's thesis from Aalborg University

LLMEnsembleEval: A Modular Framework for Large Language Model Ensemble Evaluation

Author(s)

Term

4th term

Education

Publication year

2025

Submitted on

2025-06-08

Pages

25 pages

Abstract

Large Language Models (LLMs) achieve remarkable performance across diverse NLP tasks, yet suffer from critical reliability issues, including hallucinations and inconsistent outputs. Ensemble methods are a promising remedy: they combine predictions from multiple models to improve robustness and performance. However, current ensemble evaluation practices lack standardization, which hinders method comparison and reproducibility. This work addresses two key challenges in LLM ensemble research. First, we validate the Generation of Each token by LLMs as a Classification (GAC) strategy by reproducing core results and extending the evaluation to additional models and benchmarks. Our experiments across MMLU, PIQA, ARC Challenge, and Winogrande reveal that GAC's effectiveness depends critically on performance similarity between ensemble members, with uniform weighting working best when models have comparable capabilities. Second, we develop LLMEnsembleEval, the first standardized framework for LLM ensemble evaluation that integrates with lm-evaluation-harness. Its modular architecture supports multi-GPU deployment and enables systematic comparison of ensemble strategies while maintaining reproducible protocols. Our findings demonstrate that GAC consistently improves performance on knowledge-intensive tasks like MMLU (gains of 0.1% to 3.6%) but shows mixed results on complex reasoning tasks, highlighting the need for task-specific strategies. The results support the performance similarity hypothesis: ensembles work best with models of comparable capability. LLMEnsembleEval provides the foundation for systematic evaluation of emerging ensemble strategies, potentially accelerating progress toward more reliable and effective LLM systems.
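
The abstract describes GAC only at a high level, so the following is a minimal sketch of the general idea of token-level ensembling, not the thesis's implementation or the LLMEnsembleEval API. It assumes each model exposes its next-token probabilities as a mapping keyed by token string and that the models' vocabularies can be compared directly. The function name `ensemble_next_token` and the example distributions are hypothetical.

```python
from typing import Dict, List, Optional


def ensemble_next_token(
    distributions: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Combine per-model next-token distributions by weighted averaging.

    Assumption: every distribution maps token strings to probabilities.
    Tokens missing from a model's distribution count as probability 0.
    """
    if not distributions:
        raise ValueError("need at least one distribution")
    if weights is None:
        # Uniform weighting: every ensemble member contributes equally.
        weights = [1.0] * len(distributions)
    if len(weights) != len(distributions):
        raise ValueError("one weight per distribution is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Union of all tokens that any model assigns probability to.
    vocab = set().union(*distributions)
    return {
        tok: sum(w * d.get(tok, 0.0) for w, d in zip(weights, distributions)) / total
        for tok in vocab
    }


# Hypothetical next-token distributions from two models for one step.
model_a = {"A": 0.6, "B": 0.3, "C": 0.1}
model_b = {"A": 0.2, "B": 0.7, "C": 0.1}

uniform = ensemble_next_token([model_a, model_b])
best_uniform = max(uniform, key=uniform.get)

skewed = ensemble_next_token([model_a, model_b], weights=[0.9, 0.1])
best_skewed = max(skewed, key=skewed.get)
```

With uniform weights the two models' disagreement resolves toward the token with the highest average probability. With skewed weights the ensemble largely follows the heavier model, which gives some intuition for why weighting matters when members differ in capability.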
