Question-Answer Generation-based Evaluation Framework
Author
Baks, Simon Loi
Term
4th semester
Publication year
2025
Submitted on
2025-06-04
Pages
82
Abstract
Assessing the quality of automatically generated summaries is challenging. Common metrics like ROUGE tally how many words overlap with a reference summary, but they miss whether the summary is factually correct, easy to read, or actually covers what matters. Newer methods assess quality by asking and answering questions about the text (e.g., UniEval), but they often act as black boxes and return only yes/no judgments without explanations. This thesis presents QAG-Eval, a modular evaluation framework that links three steps: generating questions, reasoning about answers, and turning the results into numerical scores. It scores summaries along four dimensions—coherence (does it flow logically), consistency (are facts accurate), fluency (is the language natural), and relevance (does it include the important points)—and, crucially, produces natural-language justifications for each score. That makes the evaluation transparent and easier to interpret. We test QAG-Eval on human-annotated datasets and compare it with ROUGE and retrained UniEval models. QAG-Eval aligns well with human judgments and is especially good at distinguishing mid-range quality (e.g., between “okay” and “good”). The thesis also analyzes how scores are distributed, the quality of the explanations, and performance trade-offs between multi-task learning (training several skills at once) and continual learning (updating skills over time). Overall, QAG-Eval points to more interpretable, modular, and human-aligned evaluation methods across diverse summarization tasks.
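To make the three-step pipeline concrete, the sketch below illustrates one way such a framework could be wired together. It is a hypothetical illustration under assumed names (generate_questions, answer_with_rationale, a 1-5 score scale), not the thesis's actual implementation; the model calls are replaced by placeholder returns so the example runs as-is.

from dataclasses import dataclass

# Hypothetical sketch of the three-step pipeline described above:
# (1) generate probe questions, (2) answer each with a rationale,
# (3) convert the answers into a numeric score plus a justification.
# generate_questions, answer_with_rationale, and the 1-5 scale are
# illustrative assumptions, not the thesis's actual implementation.

DIMENSIONS = ("coherence", "consistency", "fluency", "relevance")

@dataclass
class Judgment:
    score: float        # numeric score, here on an assumed 1-5 scale
    justification: str  # natural-language explanation behind the score

def generate_questions(source: str, summary: str, dimension: str) -> list[str]:
    # Stand-in for a learned question generator (e.g. an LLM prompted per dimension).
    return [f"Does the summary preserve the {dimension} of the source text?"]

def answer_with_rationale(question: str, source: str, summary: str) -> tuple[float, str]:
    # Stand-in for a reasoning answerer: returns (confidence in [0, 1], rationale).
    return 0.75, f"For '{question}': the summary largely supports a positive answer."

def evaluate(source: str, summary: str) -> dict[str, Judgment]:
    # Step 3: aggregate per-question answers into one score per dimension.
    results: dict[str, Judgment] = {}
    for dim in DIMENSIONS:
        answers = [
            answer_with_rationale(q, source, summary)
            for q in generate_questions(source, summary, dim)
        ]
        mean_conf = sum(conf for conf, _ in answers) / len(answers)
        results[dim] = Judgment(
            score=1 + 4 * mean_conf,                       # map mean confidence to 1-5
            justification=" ".join(r for _, r in answers),
        )
    return results

scores = evaluate("full source article...", "candidate summary...")
print(scores["consistency"])  # Judgment(score=4.0, justification='...')

The design point the abstract emphasizes carries through in the sketch: each dimension's score is returned together with the rationales that produced it, rather than as a bare number.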
[This abstract has been rewritten with the help of AI based on the project's original abstract]
