Question-Answer Generation-based Evaluation Framework
Author
Baks, Simon Loi
Term
4th semester
Publication year
2025
Submitted on
2025-06-04
Pages
82
Abstract
Assessing the quality of automatically generated summaries is challenging. Common metrics like ROUGE tally how many words overlap with a reference summary, but they miss whether the summary is factually correct, easy to read, or actually covers what matters. Newer methods assess quality by asking and answering questions about the text (e.g., UniEval), but they often act as black boxes and return only yes/no judgments without explanations. This thesis presents QAG-Eval, a modular evaluation framework that links three steps: generating questions, reasoning about answers, and turning the results into numerical scores. It scores summaries along four dimensions—coherence (does it flow logically), consistency (are facts accurate), fluency (is the language natural), and relevance (does it include the important points)—and, crucially, produces natural-language justifications for each score. That makes the evaluation transparent and easier to interpret. We test QAG-Eval on human-annotated datasets and compare it with ROUGE and retrained UniEval models. QAG-Eval aligns well with human judgments and is especially good at distinguishing mid-range quality (e.g., between “okay” and “good”). The thesis also analyzes how scores are distributed, the quality of the explanations, and performance trade-offs between multi-task learning (training several skills at once) and continual learning (updating skills over time). Overall, QAG-Eval points to more interpretable, modular, and human-aligned evaluation methods across diverse summarization tasks.
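To make the three-step pipeline concrete, the sketch below illustrates one way such a framework could be wired together. It is a hypothetical illustration under assumed names (generate_questions, answer_with_rationale, a 1-5 score scale), not the thesis's actual implementation; the model calls are replaced by placeholder returns so the example runs as-is.

from dataclasses import dataclass

# Hypothetical sketch of the three-step pipeline described above:
# (1) generate probe questions, (2) answer each with a rationale,
# (3) convert the answers into a numeric score plus a justification.
# generate_questions, answer_with_rationale, and the 1-5 scale are
# illustrative assumptions, not the thesis's actual implementation.

DIMENSIONS = ("coherence", "consistency", "fluency", "relevance")

@dataclass
class Judgment:
    score: float        # numeric score, here on an assumed 1-5 scale
    justification: str  # natural-language explanation behind the score

def generate_questions(source: str, summary: str, dimension: str) -> list[str]:
    # Stand-in for a learned question generator (e.g. an LLM prompted per dimension).
    return [f"Does the summary preserve the {dimension} of the source text?"]

def answer_with_rationale(question: str, source: str, summary: str) -> tuple[float, str]:
    # Stand-in for a reasoning answerer: returns (confidence in [0, 1], rationale).
    return 0.75, f"For '{question}': the summary largely supports a positive answer."

def evaluate(source: str, summary: str) -> dict[str, Judgment]:
    # Step 3: aggregate per-question answers into one score per dimension.
    results: dict[str, Judgment] = {}
    for dim in DIMENSIONS:
        answers = [
            answer_with_rationale(q, source, summary)
            for q in generate_questions(source, summary, dim)
        ]
        mean_conf = sum(conf for conf, _ in answers) / len(answers)
        results[dim] = Judgment(
            score=1 + 4 * mean_conf,                       # map mean confidence to 1-5
            justification=" ".join(r for _, r in answers),
        )
    return results

scores = evaluate("full source article...", "candidate summary...")
print(scores["consistency"])  # Judgment(score=4.0, justification='...')

The design point the abstract emphasizes carries through in the sketch: each dimension's score is returned together with the rationales that produced it, rather than as a bare number.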
[This abstract has been rewritten with the help of AI based on the project's original abstract]
