Rule Extraction from Pharmaceutical Documents for Automated Consistency Checks on Clinical Trial Databases

Authors

Nielsen, Christian Fillip Pinderup ; Olesen, Magnus

Term

4th semester

Education

Data Science and Machine Learning, Msc.

Publication year

2024

Submitted on

2024-06-10

Abstract

For pharmaceutical companies to bring new drugs to market, they must first get clinical studies approved. This entails following rigid rules defined in large regulatory documents, a process that is both costly and time-intensive when done manually. The field of automated consistency checking (ACC) can help automate this process. Because regulatory documents are large, complex, and written in rich natural language, implementing ACC solutions is difficult. However, natural language processing (NLP) methods have become increasingly powerful in recent years, making ACC more practical. This paper therefore investigates ACC in the pharmaceutical domain in collaboration with Novo Nordisk. It divides the problem of ACC into multiple NLP subproblems and presents a pipeline for ACC. The pipeline identifies sentences that represent rules in regulatory documents and extracts from these rules the data needed to serialize them into CDISC Core rules. The paper demonstrates how to construct the in-domain dataset needed to train machine learning models. Using this dataset, we train multiple machine learning models to solve each subproblem. For the first subproblem, identifying rules, an SVM classifier using TF-IDF embeddings obtains an F2 score of 0.79, outperforming other baselines and fine-tuned BERT models. To assign operators to the classified rules, an MLkNN classifier, also using TF-IDF embeddings, obtains an F2-micro score of 0.71. Lastly, to extract elements such as columns and values from the rule sentences, a fine-tuned version of LegalBERT can be used, obtaining an F2 score of 0.69. Using the output of these three models, we show that it is possible to generate simple rules that can be used to implement ACC on clinical trial study databases.
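As a reading aid for the scores reported in the abstract, the following is a minimal sketch of how an F-beta score and its micro-averaged variant can be computed from raw counts. It assumes the reported metrics are F-beta measures with beta = 2, which weight recall above precision. The function names `f_beta` and `micro_f_beta` and the example counts are hypothetical and are not taken from the thesis.

```python
from typing import Iterable, Tuple


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from true-positive, false-positive and false-negative counts.

    Equivalent to (1 + beta^2) * P * R / (beta^2 * P + R), where P is
    precision and R is recall. Returns 0.0 when all counts are zero.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def micro_f_beta(counts: Iterable[Tuple[int, int, int]], beta: float = 2.0) -> float:
    """Micro-averaged F-beta: pool the (tp, fp, fn) counts of every label
    first, then compute a single F-beta on the pooled totals."""
    total_tp = total_fp = total_fn = 0
    for tp, fp, fn in counts:
        total_tp += tp
        total_fp += fp
        total_fn += fn
    return f_beta(total_tp, total_fp, total_fn, beta)


if __name__ == "__main__":
    # Hypothetical per-label counts for a two-label, multi-label classifier.
    per_label = [(6, 4, 1), (2, 0, 3)]
    print(f_beta(6, 4, 1))          # F2 for the first label alone
    print(micro_f_beta(per_label))  # F2 pooled over both labels
```

Micro-averaging pools the counts before scoring, so frequent labels influence the result more than rare ones. This is a common choice for multi-label tasks such as assigning operators to rules.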

A master's thesis from Aalborg University
