A master's thesis from Aalborg University


LLMEnsembleEval: A Modular Framework for Large Language Model Ensemble Evaluation

Term: 4th
Publication year: 2025
Pages: 25

Abstract

Large language models (LLMs) perform remarkably well on many natural language processing tasks, but they can be unreliable: they may hallucinate (produce confident but incorrect answers) and give inconsistent outputs. Ensembles, which combine the outputs of multiple models, are a promising way to improve robustness and performance, yet current evaluation practices are not standardized, making comparison and reproduction difficult. This thesis addresses two key challenges in LLM ensemble research. First, we validate the GAC strategy (Generation of Each token by LLMs as a Classification) by reproducing its core results and extending the evaluation to additional models and standard benchmarks such as MMLU, PIQA, ARC Challenge, and Winogrande. Our experiments show that GAC's effectiveness depends critically on how similarly the ensemble members perform: simple uniform weighting works best when the models have comparable capabilities. Second, we develop LLMEnsembleEval, the first standardized framework for evaluating LLM ensembles, integrated with lm-evaluation-harness. Its modular design supports multi-GPU deployment and enables systematic, reproducible comparison of ensemble strategies. We find that GAC consistently improves performance on knowledge-intensive tasks such as MMLU (gains of 0.1% to 3.6%) but yields mixed results on complex reasoning tasks, underscoring the need for task-specific strategies. Our performance-similarity hypothesis is supported: ensembles work best when the member models have comparable capability. LLMEnsembleEval provides a foundation for the systematic evaluation of emerging ensemble methods and may accelerate progress toward more reliable and effective LLM systems.
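
To make the GAC idea concrete, here is a minimal sketch of token-level ensembling in the spirit the abstract describes: each member model's next-token distribution is treated as a classification over the vocabulary, and the members' distributions are averaged with uniform weights before decoding. The model names, greedy decoding, and the assumption that all members share one tokenizer are illustrative simplifications, not the thesis implementation.

```python
# Minimal sketch of GAC-style token-level ensembling (illustrative only).
# Assumes all ensemble members share the same tokenizer and vocabulary;
# the checkpoint names below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAMES = ["org/model-a", "org/model-b"]  # hypothetical checkpoints

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[0])
models = [AutoModelForCausalLM.from_pretrained(n).eval() for n in MODEL_NAMES]

@torch.no_grad()
def ensemble_generate(prompt: str, max_new_tokens: int = 32) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Each model emits a distribution over the next token: a
        # "classification" over the shared vocabulary.
        member_probs = torch.stack(
            [torch.softmax(m(ids).logits[:, -1, :], dim=-1) for m in models]
        )
        # Uniform weighting: a plain average of the member distributions.
        avg_probs = member_probs.mean(dim=0)
        next_id = avg_probs.argmax(dim=-1, keepdim=True)  # greedy decoding
        if next_id.item() == tokenizer.eos_token_id:
            break
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

The uniform average is the point of contact with the abstract's finding: when the members' accuracies are comparable, no single member dominates the averaged distribution, which is exactly the regime where equal weighting is reported to work best.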

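As for the lm-evaluation-harness integration, the skeleton below shows the kind of extension point such a framework can use: the harness (in its v0.4-style API) lets a custom model class be registered and then evaluated like any built-in backend. The class name, registry key, constructor arguments, and method bodies here are hypothetical assumptions; this is not the published LLMEnsembleEval code.

```python
# Hypothetical skeleton of exposing an ensemble to lm-evaluation-harness
# (v0.4-style API). Only the interface is real; all names and bodies
# below are illustrative assumptions.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model

@register_model("gac_ensemble")  # hypothetical registry key
class GACEnsembleLM(LM):
    def __init__(self, model_names: str = "", **kwargs):
        super().__init__()
        # Comma-separated member checkpoints, e.g. "org/model-a,org/model-b".
        self.members = [n for n in model_names.split(",") if n]

    def loglikelihood(self, requests):
        # Score (context, continuation) pairs under the combined next-token
        # distribution; used by multiple-choice tasks such as MMLU.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Full-sequence log-likelihoods, e.g. for perplexity-style tasks.
        raise NotImplementedError

    def generate_until(self, requests):
        # Free-form generation with ensemble decoding, as sketched above.
        raise NotImplementedError
```

Once registered, such a class can in principle be selected from the harness CLI (e.g. `lm_eval --model gac_ensemble --tasks mmlu`); treat the exact invocation as an assumption, since the thesis framework's actual entry points are not described in this abstract.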