Question-Answer Generation-based Evaluation Framework
Author
Term
4th semester
Publication year
2025
Submitted on
2025-06-04
Pages
82
Abstract
Evaluating the quality of automatically generated summaries remains a central challenge in natural language processing. Standard metrics like ROUGE focus on lexical overlap and often fail to capture deeper qualities such as factual consistency, fluency, or relevance. Recent QA-based approaches, such as UniEval, offer multi-dimensional evaluation but often act as black boxes, providing binary outputs without interpretability or reasoning. This thesis introduces QAG-Eval, a modular framework that combines question generation, answer reasoning, and scalar scoring to evaluate summaries across four quality dimensions: coherence, consistency, fluency, and relevance. By generating natural language justifications, QAG-Eval provides transparent, interpretable evaluations instead of opaque scalar scores. The framework is evaluated against ROUGE and retrained UniEval models on human-annotated datasets. Results show that QAG-Eval offers strong alignment with human judgments and captures subtle mid-range quality distinctions more effectively. The thesis also analyzes score distributions, justification quality, and performance trade-offs between multi-task and continual learning setups. By integrating reasoning and scoring in a transparent pipeline, QAG-Eval contributes toward more interpretable, modular, and human-aligned evaluation methods applicable across diverse summarization tasks.
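The abstract describes QAG-Eval as a pipeline of question generation, answer reasoning, and scalar scoring over four quality dimensions. The sketch below illustrates how such a pipeline could be wired together; it is not the thesis's implementation, and every name in it (QAGEval, Judgment, question_gen, answer_reasoner, scorer) is an illustrative placeholder with toy stand-ins instead of model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of a QG -> answer-reasoning -> scoring pipeline;
# names and signatures are assumptions, not the thesis's actual code.

DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]


@dataclass
class Judgment:
    dimension: str
    score: float        # scalar score, e.g. in [0, 1]
    justification: str  # natural-language reasoning behind the score


class QAGEval:
    """Illustrative pipeline: generate questions, reason about answers, score each dimension."""

    def __init__(
        self,
        question_gen: Callable[[str, str], List[str]],      # (source, dimension) -> questions
        answer_reasoner: Callable[[str, str, str], str],    # (summary, question, dimension) -> justification
        scorer: Callable[[str, List[str]], float],          # (dimension, justifications) -> scalar score
    ):
        self.question_gen = question_gen
        self.answer_reasoner = answer_reasoner
        self.scorer = scorer

    def evaluate(self, source: str, summary: str) -> Dict[str, Judgment]:
        results: Dict[str, Judgment] = {}
        for dim in DIMENSIONS:
            questions = self.question_gen(source, dim)
            justifications = [self.answer_reasoner(summary, q, dim) for q in questions]
            score = self.scorer(dim, justifications)
            results[dim] = Judgment(dim, score, " ".join(justifications))
        return results


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any model calls.
    evaluator = QAGEval(
        question_gen=lambda src, dim: [f"Does the summary preserve the {dim} of the source?"],
        answer_reasoner=lambda summ, q, dim: f"For '{q}', the summary appears adequate on {dim}.",
        scorer=lambda dim, justifications: 0.5,
    )
    for dim, judgment in evaluator.evaluate("Source document ...", "Candidate summary ...").items():
        print(f"{dim}: {judgment.score:.2f} -- {judgment.justification}")
```

In an actual system, the three callables would be backed by trained components (e.g. a question-generation model, an answering/reasoning model, and a regression head), with the justification text retained alongside the scalar score to keep the evaluation interpretable.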