AAU Student Projects
A master's thesis from Aalborg University

Graph Neural Networks for Semantic Entity Suggestion: Viability Study of the Use of Graph Neural Networks for Entity Suggestion via Dense Retrieval


Author(s)

Term

4th term

Education

Publication year

2023

Submitted on

2023-10-18

Pages

51 pages

Abstract

The primary objective of this research project is to enhance the accuracy of entity linking in scientific table data by leveraging a knowledge base. This is achieved by investigating the feasibility of employing machine learning techniques to generate multimodal embeddings for both entity linking and corpus embedding. In contrast to current state-of-the-art methods, which depend heavily on lexicographical features, this project exploits a multimodal embedding approach to improve the suggestion of candidates for entity linking. The main focus is to understand how multimodal embeddings can be used to extract relevant entities, taking the contextual data within the corpus into account, for entity linking against a knowledge base.

The thesis is organized into seven chapters. Chapter 1 introduces the subject, providing the necessary background and discussing related work that has influenced this project. Chapter 2 describes the datasets used, focusing on their conversion into mention datasets covering both tabular data and text data, and also covers the target knowledge graphs. Chapter 3 presents the model architecture, including the projection head, mention encoder, entity encoder, and the dual encoder architecture, together with the scoring functions used. Chapter 4 outlines the experiments conducted and the evaluation metrics used to assess the results. Chapter 5 presents the results of the experiments and evaluates them thoroughly. Finally, Chapters 6 and 7 conclude the project, summarizing the findings and suggesting avenues for future work.

This research project provides an in-depth exploration of the methodologies employed, focusing on the concepts of text embedding and mention embedding.
The input data for the project is categorized into text data and tabular data, each with its own input structure for the BERT tokenizer. The text data input structure follows the methodology proposed by Wu et al., in which each mention in the corpus is encapsulated within a specific string format. The tabular data input structure is inspired by the work of Trabelsi et al., but uses a simplified format owing to the specific requirements and constraints of the project.

The projection head is used to transform input data, specifically embeddings, into a different space, producing projected embeddings. This transformation is accomplished through a series of operations collectively known as projection layers. The projection head is essentially a simple feed-forward neural network with a ReLU activation function; it projects the entity and mention embeddings to the same dimensionality, reducing the dimensionality of the embeddings and making them directly comparable.

The Mention Encoder combines the BERT model with an additional projection head. Its input is divided into two categories: text data and tabular data. Each mention in the corpus is encapsulated within a string structured according to the methodology of Wu et al. For tabular data, the project performs Column Type Annotation, with the input to the BERT tokenizer inspired by the work of Trabelsi et al. The projection head allows the model to project the output of the BERT model into a lower-dimensional space, enabling efficient computation and storage. The Entity Embedding model architecture is a crucial component of the research.
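The projection head described above can be sketched in a few lines. This is a minimal NumPy illustration, not the thesis implementation (which sits on top of BERT in a deep learning framework); the dimensions (768 for a BERT-base hidden state, 128 for the shared space) and class name are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class ProjectionHead:
    """Single feed-forward projection layer with ReLU: maps an encoder
    output of size dim_in into a shared lower-dimensional space of size
    dim_out, so mention and entity embeddings become comparable."""

    def __init__(self, dim_in, dim_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, size=(dim_in, dim_out))
        self.b = np.zeros(dim_out)

    def __call__(self, x):
        return relu(x @ self.W + self.b)

# 768 is the BERT-base hidden size; 128 is an assumed shared dimension.
head = ProjectionHead(dim_in=768, dim_out=128)
projected = head(np.ones(768))
print(projected.shape)  # (128,)
```

Both the mention encoder and the entity encoder can own such a head, so that embeddings from different modalities land in the same space.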
The entity embedding architecture is divided into two main subsections: input encoding and the ontology embedding model. The input format for the ontology embedding process is derived from various ontologies and knowledge graphs, as detailed in Section 2, and the model architecture is based on the work of Wu et al. and Louis et al. The model's flexibility, particularly in its use of Graph Neural Network (GNN) layers, is a key feature that allows it to be tailored to specific requirements and scenarios.

The dual encoder architecture is another significant aspect of the research. As underscored by Dong et al., there exists a variety of dual encoder architectures, including the Siamese Dual Encoder (SDE), the Asymmetric Dual Encoder (ADE), ADE with Shared Token Embeddings (ADE-STE), ADE with Frozen Token Embeddings (ADE-FTE), and ADE with Shared Projection Layers (ADE-SPL). The primary focus of this project is the ADE architecture, which consists of two main components: the entity encoder and the mention encoder. This architecture allows the components of the dual encoder to be modified independently, so that each encoder stack can be adapted to better fit its input. The loss function plays a critical role in steering both encoders toward aligned representations, and the score function must be able to evaluate the similarity between mention and entity embeddings. For practical applications it is essential that entity embeddings can be computed offline: inference should then only require computing the mention embedding and retrieving its k nearest neighbors, in under a minute even for extensive knowledge bases such as DBpedia.

This study presents three subsections, each focusing on a function relevant to the thesis. The first function, the cosine similarity (dot product), introduces the cosine embedding loss.
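The offline-entity/online-mention retrieval scheme described above can be sketched as follows. This is a minimal NumPy illustration of cosine-similarity dense retrieval, not the thesis implementation; a production system over a knowledge base the size of DBpedia would use an approximate nearest-neighbor index rather than a full matrix product, and the toy 2-dimensional embeddings here are invented for the example.

```python
import numpy as np

def cosine_scores(mention, entity_matrix):
    """Cosine similarity between one mention embedding and every row of a
    precomputed (offline) entity embedding matrix."""
    m = mention / np.linalg.norm(mention)
    e = entity_matrix / np.linalg.norm(entity_matrix, axis=1, keepdims=True)
    return e @ m

def top_k_entities(mention, entity_matrix, k=2):
    """Indices of the k candidate entities most similar to the mention."""
    scores = cosine_scores(mention, entity_matrix)
    return np.argsort(-scores)[:k]

# Toy entity embeddings, computed once offline and stored.
entities = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.7, 0.7]])

# At query time only the mention embedding is computed.
mention = np.array([0.9, 0.1])
print(top_k_entities(mention, entities))  # indices ordered by similarity
```

Because the entity matrix is fixed after training, only the mention side of the dual encoder runs at inference time, which is what makes sub-minute retrieval feasible.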
This is a common scoring function that enables dense retrieval based on the angular similarity of the embeddings representing entities. The second function, the triplet margin loss, is a common loss function that enables dense retrieval based on the Euclidean similarity of the embeddings. Viewing embeddings as maps from a higher-dimensional space onto a lower-dimensional manifold, the idea behind the triplet loss is to move similar entities closer together and dissimilar entities farther apart. The third function, referred to (with a slight abuse of notation) as cross-entropy, is a scoring function aimed at classification: it maximizes the score of the correct "class" while minimizing the scores of the negative anchors. It can be seen as a generalized triplet margin loss in which the positive anchor is pulled closer and all remaining candidates are pushed farther away. These three functions are presented because they provide a comprehensive understanding of the scoring and loss functions used in the thesis.

The experimental setup is detailed in Table 4.2. The subsequent sections provide an in-depth analysis of the results derived from these experiments, with particular emphasis on the influence of the configuration on the final outcomes. Owing to resource constraints, each configuration was executed only once; additional runs would be necessary for a statistically robust understanding of the significance of each configuration. Furthermore, due to time limitations, not all configurations of the model with bert-base-uncased as the text encoder were executed, leaving 4 runs incomplete. The results of the executed runs are presented in Table 5.1. Table 7.1 indicates that while the model is unable to retrieve the correct entities, the suggested entities are semantic in nature rather than merely lexicographically similar.
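The triplet margin and cross-entropy objectives discussed earlier can be sketched as follows. This is a minimal NumPy illustration under simplified assumptions (single triplets rather than batches, and plain score vectors); the margin value and array shapes are illustrative, not taken from the thesis.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Pull the positive entity toward the mention (anchor) and push the
    negative away, in Euclidean distance, until the margin is satisfied."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def cross_entropy_over_candidates(scores, positive_index):
    """Softmax cross-entropy over one positive and many negative candidates:
    raises the correct entity's score relative to all negatives at once."""
    shifted = scores - scores.max()  # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[positive_index]

anchor = np.array([0.0, 0.0])
# Positive already well inside the margin relative to the negative: loss is 0.
print(triplet_margin_loss(anchor, np.array([0.1, 0.0]), np.array([2.0, 0.0])))
# One positive (index 0) against two negatives.
print(cross_entropy_over_candidates(np.array([2.0, 1.0, 0.0]), 0))
```

The cross-entropy form generalizes the triplet objective in the sense noted above: instead of one negative per update, every other candidate in the score vector acts as a negative simultaneously.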
The model is capable of making suggestions based on semantic similarities, demonstrating the feasibility of entity suggestion based on dense retrieval. However, the model requires further fine-tuning to achieve state-of-the-art performance.

Keywords

Documents

