A master's thesis from Aalborg University

EnergyBench: A Holistic and Systematic Benchmark for Measuring the Correctness and Energy-Efficiency of LLM-Generated Code

Term

4th term

Publication year

2025

Pages

22

Abstract

The growing use of artificial intelligence is increasing data center energy demand and raising environmental concerns across the ICT sector. Over the past five years, advances in areas such as language processing and computer vision have accelerated AI adoption. This creates an important question at the intersection of large language models (LLMs) and modern software development: the industry is moving toward automatic code generation with LLM assistants and editors, but how sustainable is the code these models produce? Existing measurements show that LLMs can generate more energy-efficient code when given explicit optimization prompts (instructions), yet it remains unclear which problem types, programming languages, and prompting strategies yield the best balance between correctness and energy use. To address this, the thesis introduces EnergyBench—a benchmarking framework that systematically analyzes the factors that affect LLMs’ ability to produce correct and energy-efficient code. Experiments show that prompts aimed at reducing two key performance measures can make code up to 91.9% more energy-efficient in 5 of the 7 tested LLMs, at the cost of lower overall accuracy (how often the code is correct). Additional experiments reveal that LLMs are highly sensitive to prompt content: two small tweaks to the way a problem is stated led to starkly different, even opposite effects on both accuracy and energy efficiency. In one case, efficiency increased by more than four times; in another, accuracy dropped to zero. EnergyBench is publicly available and invites community contributions to build a broader, more detailed picture of the sustainability of LLM-based coding.

[This summary has been rewritten with the help of AI based on the project's original abstract]
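
The abstract does not say how EnergyBench checks correctness or meters energy, so the sketch below shows the general shape such a benchmark typically takes on Linux: run each LLM-generated solution against a problem's test cases, and read the CPU package's cumulative RAPL energy counter around each run. This is a minimal illustration under stated assumptions, not the thesis's implementation; the sysfs paths, the `measure` and `benchmark` helpers, and the toy `solve` function are all hypothetical.

```python
"""Minimal sketch of a correctness + energy benchmark loop.

Assumptions (not from the thesis): Linux with Intel RAPL exposed via
the powercap sysfs interface, and read access to the counter files
(root-only on recent kernels).
"""
import time

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"         # package 0, cumulative microjoules
RAPL_MAX = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"  # value at which the counter wraps


def read_uj(path: str) -> int:
    """Read one RAPL counter value (an integer in microjoules)."""
    with open(path) as f:
        return int(f.read())


def measure(fn, *args):
    """Run fn(*args) once; return (result, joules, seconds)."""
    start_uj = read_uj(RAPL_ENERGY)
    start_s = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start_s
    end_uj = read_uj(RAPL_ENERGY)
    delta = end_uj - start_uj
    if delta < 0:  # cumulative counter wrapped around during the run
        delta += read_uj(RAPL_MAX)
    return result, delta / 1e6, elapsed


def benchmark(candidate, tests, repeats=5):
    """Score one generated solution: (pass rate, mean joules per test).

    `candidate` is the generated function under test; `tests` is a list
    of (args, expected) pairs, as a benchmark problem would define them.
    """
    passed = sum(candidate(*args) == expected for args, expected in tests)
    joules = []
    for _ in range(repeats):  # repeat to average out measurement noise
        for args, _ in tests:
            _, j, _ = measure(candidate, *args)
            joules.append(j)
    return passed / len(tests), sum(joules) / len(joules)


if __name__ == "__main__":
    # Hypothetical LLM-generated solution to a toy problem.
    def solve(n):
        return sum(range(n))

    acc, j = benchmark(solve, [((10,), 45), ((100,), 4950)])
    print(f"pass rate {acc:.0%}, mean energy {j:.6f} J per test")
```

A real harness would go further than this sketch: it would measure energy only for solutions that pass the tests, pin CPU frequency, and isolate the measured process, since RAPL counters capture everything running on the package during the interval.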