A master's thesis from Aalborg University

Palamut - An Expansion of the Bonito basecaller using language models

Authors

; ;

Term

4th term

Education

Publication year

2020

Submitted on

Pages

17

Abstract

This thesis explores how techniques from Automatic Speech Recognition (ASR) can improve nanopore basecalling, the step that turns raw sensor signals into nucleotide letters. We focus on Bonito, a modern end-to-end basecaller, and extend its architecture with a decoder that uses language model probabilities to refine basecalls. We train and compare two character-level language models: an n-gram model, which captures short patterns, and a recurrent neural network (RNN) model, which can learn longer-range dependencies. Our results show a small increase in consensus accuracy (the accuracy after combining multiple reads) but a corresponding decrease in single-read accuracy. We attribute this decrease to suboptimally tuned decoder hyperparameters rather than to the language models themselves, and we outline potential adjustments to address the issue.

[This abstract was generated with the help of AI]
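
To make the idea of refining basecalls with language model probabilities concrete, here is a minimal sketch, not the thesis's implementation. It assumes a common ASR-style scheme (often called shallow fusion) in which each candidate sequence's basecaller log-probability is combined with a weighted language model log-probability and a length bonus. The class `CharNGramLM`, the function `rescore`, and the weights `alpha` and `beta` are hypothetical names chosen for illustration; the abstract does not state which decoder hyperparameters were used.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # nucleotide characters the model predicts


class CharNGramLM:
    """Character-level n-gram model with add-k smoothing (illustrative only)."""

    def __init__(self, order=3, k=1.0):
        self.order = order            # n in "n-gram"
        self.k = k                    # smoothing constant
        self.context_counts = Counter()
        self.ngram_counts = Counter()

    def train(self, sequences):
        n = self.order - 1
        for seq in sequences:
            padded = "^" * n + seq    # "^" marks the start of a read
            for i in range(n, len(padded)):
                context = padded[i - n:i]
                self.context_counts[context] += 1
                self.ngram_counts[context + padded[i]] += 1

    def log_prob(self, context, char):
        """log P(char | last n-1 characters of context)."""
        n = self.order - 1
        # Left-pad with "^" and keep exactly the last n characters.
        context = ("^" * n + context)[len(context):] if n else ""
        num = self.ngram_counts[context + char] + self.k
        den = self.context_counts[context] + self.k * len(ALPHABET)
        return math.log(num / den)

    def sequence_log_prob(self, seq):
        return sum(self.log_prob(seq[:i], seq[i]) for i in range(len(seq)))


def rescore(candidates, lm, alpha=0.5, beta=0.1):
    """Pick the best candidate under an assumed fused score.

    candidates: list of (sequence, basecaller_log_prob) pairs.
    alpha: weight of the language model score (assumed hyperparameter).
    beta: per-base length bonus (assumed hyperparameter).
    """
    def fused_score(candidate):
        seq, basecaller_log_prob = candidate
        return (basecaller_log_prob
                + alpha * lm.sequence_log_prob(seq)
                + beta * len(seq))

    return max(candidates, key=fused_score)[0]
```

As a usage example, one could train `CharNGramLM(order=3)` on reference-like sequences and call `rescore` on a handful of candidate basecalls for the same signal chunk. With `alpha=0` the basecaller's own ranking is kept, while a larger `alpha` lets the language model overrule it. This is one way to see how poorly chosen weights could change single-read results even when the language model itself is sound.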