Palamut - An Expansion of the Bonito basecaller using language models
Authors
Larsen, Andreas Christian Meyer ; Hansen, Magnus Nørhave ; Knudsen, Christian Aae
Term
4th term
Education
Publication year
2020
Submitted on
2020-06-11
Pages
17
Abstract
This thesis explores how techniques from Automatic Speech Recognition (ASR) can improve nanopore basecalling—the step that turns raw sensor signals into nucleotide letters. We focus on Bonito, a modern end-to-end basecaller, and extend its architecture with a decoder that uses language model probabilities to refine basecalls. We train and compare two character-level language models: an n-gram model, which captures short patterns, and a recurrent neural network (RNN) model, which can learn longer-range dependencies. Our results show a small increase in consensus accuracy (the accuracy after combining multiple reads), accompanied by a matching decrease in single-read accuracy. We attribute this drop to suboptimally tuned decoder hyperparameters rather than the language models themselves, and we outline potential adjustments to address the issue.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
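As a rough illustration of the idea summarised in the abstract, the sketch below trains a character-level n-gram model on nucleotide strings and uses it to rescore candidate basecalls. It is not the thesis's decoder or any part of Bonito: the class `NGramLM`, the function `rescore`, the add-one smoothing, and the weight `alpha` are all hypothetical choices made for this example, and picking among whole candidate sequences is a simplification of what a language-model-aware decoder might do.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class NGramLM:
    """Character-level n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        # counts[context][base] = times `base` followed `context` in training data
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            for i, base in enumerate(seq):
                # Use up to n-1 preceding characters as context.
                context = seq[max(0, i - self.n + 1):i]
                self.counts[context][base] += 1

    def log_prob(self, context, base):
        # Truncate the history to the last n-1 characters.
        context = context[-(self.n - 1):] if self.n > 1 else ""
        seen = self.counts.get(context, {})
        total = sum(seen.values())
        # Add-one smoothing keeps unseen continuations at non-zero probability.
        return math.log((seen.get(base, 0) + 1) / (total + len(ALPHABET)))

    def score(self, seq):
        """Log-probability of the whole sequence under the model."""
        return sum(self.log_prob(seq[:i], seq[i]) for i in range(len(seq)))


def rescore(candidates, lm, alpha=0.5):
    """Return the candidate with the best combined score.

    candidates: list of (sequence, basecaller_log_prob) pairs.
    alpha: assumed weight on the language model term (a decoder hyperparameter).
    """
    return max(candidates, key=lambda c: c[1] + alpha * lm.score(c[0]))[0]


if __name__ == "__main__":
    lm = NGramLM(n=3)
    lm.train(["ACGTACGTACGT"])
    # Made-up basecaller scores for two candidate reads.
    candidates = [("AAAA", -1.0), ("ACGT", -1.2)]
    print(rescore(candidates, lm, alpha=1.0))
```

The weight `alpha` stands in for the kind of decoder hyperparameter the abstract refers to: at `alpha=0` the language model is ignored, and larger values let it override the basecaller's own ranking, so a poorly chosen value can lower accuracy even when the language model itself is sound.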
