A master's thesis from Aalborg University

EnergyBench: A Holistic and Systematic Benchmark for Measuring the Correctness and Energy-Efficiency of LLM-Generated Code

Term

4th term

Publication year

2025

Pages

22

Abstract

The growing use of artificial intelligence is increasing data center energy demand and raising environmental concerns across the ICT sector. Over the past five years, advances in areas such as language processing and computer vision have accelerated AI adoption. This creates an important question at the intersection of large language models (LLMs) and modern software development: the industry is moving toward automatic code generation with LLM assistants and editors, but how sustainable is the code these models produce? Existing measurements show that LLMs can generate more energy-efficient code when given explicit optimization prompts (instructions), yet it remains unclear which problem types, programming languages, and prompting strategies yield the best balance between correctness and energy use. To address this, the thesis introduces EnergyBench—a benchmarking framework that systematically analyzes the factors that affect LLMs’ ability to produce correct and energy-efficient code. Experiments show that prompts aimed at reducing two key performance measures can make code up to 91.9% more energy-efficient in 5 of the 7 tested LLMs, at the cost of lower overall accuracy (how often the code is correct). Additional experiments reveal that LLMs are highly sensitive to prompt content: two small tweaks to the way a problem is stated led to starkly different, even opposite effects on both accuracy and energy efficiency. In one case, efficiency increased by more than four times; in another, accuracy dropped to zero. EnergyBench is publicly available and invites community contributions to build a broader, more detailed picture of the sustainability of LLM-based coding.

[This summary has been rewritten with the help of AI based on the project's original abstract]
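
The abstract does not say how EnergyBench checks correctness or meters energy, so the sketch below shows the general shape such a benchmark typically takes on Linux: run each LLM-generated solution against a problem's test cases, and read the CPU package's cumulative RAPL energy counter around each run. This is a minimal illustration under stated assumptions, not the thesis's implementation; the sysfs paths, the `measure` and `benchmark` helpers, and the toy `solve` function are all hypothetical.

```python
"""Minimal sketch of a correctness + energy benchmark loop.

Assumptions (not from the thesis): Linux with Intel RAPL exposed via
the powercap sysfs interface, and read access to the counter files
(root-only on recent kernels).
"""
import time

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"         # package 0, cumulative microjoules
RAPL_MAX = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"  # value at which the counter wraps


def read_uj(path: str) -> int:
    """Read one RAPL counter value (an integer in microjoules)."""
    with open(path) as f:
        return int(f.read())


def measure(fn, *args):
    """Run fn(*args) once; return (result, joules, seconds)."""
    start_uj = read_uj(RAPL_ENERGY)
    start_s = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start_s
    end_uj = read_uj(RAPL_ENERGY)
    delta = end_uj - start_uj
    if delta < 0:  # cumulative counter wrapped around during the run
        delta += read_uj(RAPL_MAX)
    return result, delta / 1e6, elapsed


def benchmark(candidate, tests, repeats=5):
    """Score one generated solution: (pass rate, mean joules per test).

    `candidate` is the generated function under test; `tests` is a list
    of (args, expected) pairs, as a benchmark problem would define them.
    """
    passed = sum(candidate(*args) == expected for args, expected in tests)
    joules = []
    for _ in range(repeats):  # repeat to average out measurement noise
        for args, _ in tests:
            _, j, _ = measure(candidate, *args)
            joules.append(j)
    return passed / len(tests), sum(joules) / len(joules)


if __name__ == "__main__":
    # Hypothetical LLM-generated solution to a toy problem.
    def solve(n):
        return sum(range(n))

    acc, j = benchmark(solve, [((10,), 45), ((100,), 4950)])
    print(f"pass rate {acc:.0%}, mean energy {j:.6f} J per test")
```

A real harness would go further than this sketch: it would measure energy only for solutions that pass the tests, pin CPU frequency, and isolate the measured process, since RAPL counters capture everything running on the package during the interval.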