Exploring the Efficacy of Specially-Trained Transformers on Geospatial Entity Matching of Historic Toponyms
Author
Term
4th term
Education
Publication year
2024
Submitted on
2024-06-06
Pages
10
Abstract
Substantial effort has been put into digitizing and extracting information from historical and ancient manuscripts. These efforts often focus on a single civilization, its language, and its culture, which isolates them and makes it harder to collaborate and share knowledge between them. Some works have tried to connect these efforts and their data through toponym matches, using traditional methods such as transliteration. However, results have been uneven. The advent of transformer-based language models such as BERT has brought improved performance in many language-related tasks, including toponym matching. However, these language models are typically trained over large corpora of modern English text; even multilingual models are often trained on modern texts collected from the web. Here, we examine whether specially-trained multilingual models built over ancient texts matching the toponym languages can be beneficial for this task. In this paper, we examine several methods that use ancient manuscripts to adapt BERT-based models to identify matching toponyms in Arabic and Hebrew, two related Semitic languages with historical dialects and sizeable corpora of ancient texts. We evaluated our methods on a historical toponym matching task comprising several datasets of toponyms extracted from the works of Middle East scholars. The evaluation results were surprising: the models presented in this work were outperformed by a multilingual model (mBERT) that was pre-trained on modern data.
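As a point of reference for the "traditional methods" the abstract contrasts with transformer-based matching, the sketch below shows a minimal character-similarity baseline for toponym matching. It is purely illustrative and is not the method evaluated in this work; the toponym spellings, the `match_toponyms` helper, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Normalized character-level similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()


def match_toponyms(candidates, query, threshold=0.8):
    """Return candidate toponyms whose similarity to the query meets a
    (hypothetical) cutoff -- a stand-in for traditional string matching."""
    return [c for c in candidates if similarity(query, c) >= threshold]


# Illustrative transliterated spellings of the same place name.
candidates = ["Yerushalayim", "Jerusalem", "al-Quds"]
print(match_toponyms(candidates, "Yerushalaim"))  # close transliterated variant matches
```

String-similarity baselines like this handle minor spelling variation but struggle when the same place is named in different scripts or languages (e.g. Hebrew vs. Arabic forms), which is the gap that language-model-based matching aims to close.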