GATekeeper: Detecting Schema-Induced Ambiguity in Natural Language Interfaces to Databases

Authors

Skadborg, Jacob Johannes Sigurd ; Mortensen, Martin Keck Søndersø

Term

4th term

Education

Software, Master

Publication year

2025

Submitted on

2025-06-12

Abstract

Natural Language to SQL (NL2SQL) systems translate natural language questions into executable SQL queries, enabling non-technical users to interact with databases. While recent advances in large language models (LLMs) and schema-aware techniques have driven performance on benchmarks such as Spider and BIRD, existing systems continue to struggle with ambiguity, particularly when queries admit multiple valid interpretations due to overlapping schema elements. This issue, termed Schema-Induced Ambiguity (SIA), arises when natural language tokens ambiguously refer to multiple tables, columns, or relations. SIA is especially common in real-world databases, where evolving and denormalised schemas diverge from the clean structure typically found in academic benchmarks. Current approaches address ambiguity only implicitly or partially. LLMs can reduce lexical ambiguity, but fail to reliably detect structural ambiguities without explicit schema reasoning. Moreover, few systems are designed to proactively identify SIA before generating a query, leading to silent failures and misinterpretations. To address this gap, we propose a two-step detection framework: a fine-tuned BERT cross-encoder identifies schema elements likely to be involved in the intended query, followed by a Graph Attention Network (GAT) operating over the induced subgraph to predict the presence of ambiguity. Our method outperforms baseline approaches in-domain, yet generalisation to unseen schemas remains limited, as evidenced by performance drops on BIRD-bench and Trial-Bench. Nonetheless, in-context training demonstrates strong potential for scaling ambiguity detection. While this work focuses exclusively on schema-induced sources, future extensions must address other forms of ambiguity to ensure reliability in production deployments. Code available at: https://github.com/P10-NLIDB.
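To make the notion of schema-induced ambiguity concrete, the sketch below flags question tokens that match a column name in more than one table. It is a toy illustration only, not the thesis's method: exact string matching stands in for the learned BERT cross-encoder and GAT, and the schema, `SCHEMA`, `candidate_columns`, and `ambiguous_tokens` are all hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical toy schema (table name -> column names); not taken from the thesis.
SCHEMA = {
    "customers": ["id", "name", "city"],
    "suppliers": ["id", "name", "country"],
    "orders": ["id", "customer_id", "total"],
}


def candidate_columns(question, schema):
    """Map each question token to the (table, column) pairs whose column name equals it."""
    # Naive tokenisation: split on whitespace, strip basic punctuation, lowercase.
    tokens = {tok.strip("?.,").lower() for tok in question.split()}
    matches = defaultdict(list)
    for table, columns in schema.items():
        for column in columns:
            if column in tokens:
                matches[column].append((table, column))
    return dict(matches)


def ambiguous_tokens(question, schema):
    """Return the tokens that match columns in more than one table."""
    return {
        token: refs
        for token, refs in candidate_columns(question, schema).items()
        if len({table for table, _ in refs}) > 1
    }


if __name__ == "__main__":
    # "name" exists in both customers and suppliers, so it is flagged;
    # "city" exists only in customers, so it is not.
    print(ambiguous_tokens("What is the name of each city?", SCHEMA))
```

In this toy setting, a question mentioning "name" could be answered from either `customers` or `suppliers`, which is the kind of case the abstract describes as admitting multiple valid interpretations. A learned detector would presumably need to go well beyond exact matches, for example to handle synonyms and relations between tables.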

Keywords

NL2SQL ; Ambiguity ; Schema Induced ; Ambiguity Detection

A master's thesis from Aalborg University
