A master's thesis from Aalborg University


Graph Neural Networks for Semantic Entity suggestion: Viability study of the use of Graph Neural Networks for Entity Suggestion via Dense Retrieval

Translated title

Graph Neural Networks for Semantic Entity suggestion

Author

Term

4. term

Publication year

2023

Submitted on

Pages

51

Abstract

This thesis explores how to link mentions in scientific tables more accurately to the correct entries in a knowledge base (entity linking). Instead of relying on lexicographic cues (string matching), it learns multimodal embeddings that capture meaning from both the tables and the text in the corpus. The approach uses a dual-encoder architecture: a mention encoder and an entity encoder map mentions and knowledge-graph entities into the same vector space. A lightweight projection head aligns the embedding dimensions so that the two can be compared efficiently. The mention encoder is BERT-based with tailored input formats: mentions in text are wrapped following an established scheme, and table inputs use a simplified structure that focuses on column type annotation. The entity encoder takes its input from ontologies/knowledge graphs and can incorporate graph neural network (GNN) layers. Similarity between mentions and entities is measured with cosine similarity or the dot product, and training uses three objectives (cosine embedding loss, triplet margin loss, and a cross-entropy-style objective) that pull matched pairs together and push non-matches apart. For practical use, entity embeddings are computed offline; at inference, a mention is embedded and its nearest neighbors are retrieved quickly, even in very large knowledge bases such as DBpedia. Due to resource constraints, each configuration was run only once, and some configurations with a specific text encoder did not finish. The results show that the system often fails to retrieve the exact gold entity, but it more often suggests semantically related candidates than merely string-similar ones. This demonstrates that dense (vector-based) retrieval is feasible for entity suggestion, though further fine-tuning is needed to reach state-of-the-art performance. The thesis documents the datasets and their conversion to mention datasets, the model design and scoring/loss functions, the experimental setup and metrics, the results, and directions for future work.
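The retrieval step described in the abstract (entity embeddings computed offline, then a nearest-neighbor lookup for each embedded mention) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `cosine` and `top_k`, the entity identifiers, and the three-dimensional vectors are all hypothetical, and the brute-force scan stands in for whatever index a real system over a knowledge base the size of DBpedia would need.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(mention_vec, entity_index, k=3):
    """Return the ids of the k entities whose embeddings are most similar to the mention."""
    scored = [(cosine(mention_vec, vec), eid) for eid, vec in entity_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [eid for _, eid in scored[:k]]


# Assumption: hand-made toy vectors standing in for entity embeddings
# that an entity encoder would have produced offline.
entity_index = {
    "dbr:Aalborg": [0.9, 0.1, 0.0],
    "dbr:Aarhus": [0.8, 0.3, 0.1],
    "dbr:BERT_(language_model)": [0.0, 0.2, 0.9],
}

# Assumption: a toy vector standing in for the mention encoder's output.
mention_vec = [1.0, 0.1, 0.0]

suggestions = top_k(mention_vec, entity_index, k=2)
print(suggestions)
```

Because the entity side is fixed before any query arrives, only the mention has to be encoded at inference time, which is what makes suggesting a ranked list of candidates (rather than a single link) cheap.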

[This summary has been rewritten with the help of AI based on the project's original abstract]