A master's thesis from Aalborg University


Statistical Modelling of Next-generation Sequencing Data from Forensic Genetics

Translated title

Statistisk Modellering af Anden Generations Sekvenserings Data fra Retsgenetik

Author

Term

4th term

Publication year

2015

Submitted on

Pages

166

Abstract

This thesis examines the statistical variation in STR data generated by NGS for use in forensic genetics. STRs (short tandem repeats) are key markers in DNA profiles, and NGS (next-generation sequencing) enables high-throughput reading of these markers. We introduce simple methods for DNA profiling in single-contributor samples and assess the quality of the underlying sequencing reads. We analyze error sources, beginning with systematic artifacts—stutter (repeat-related byproducts) and shoulders (small side peaks)—and then address broader background noise. General noise is managed with a noise threshold, which can filter out weak signals and lead to allele drop-out. We therefore examine heterozygote imbalance, where the two alleles at a locus are not equally represented, and present a model aimed at full coverage. Finally, we predict the probability of allele drop-out.

[This abstract was generated with the help of AI]
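
The abstract describes how a noise threshold can filter out a weak allele and so cause drop-out, and how heterozygote imbalance makes this more likely. A minimal sketch of that interaction follows. It is an illustration only, not the thesis's method: the function names, the choice of a threshold expressed as a fraction of locus coverage, the default of 0.1, and the read counts are all assumptions made for the example.

```python
def call_alleles(read_counts, noise_fraction=0.1):
    """Keep alleles whose read count reaches a noise threshold.

    Assumption: the threshold is a fixed fraction of the total reads
    (coverage) at the locus. The thesis may define it differently.
    """
    coverage = sum(read_counts.values())
    if coverage == 0:
        return {}
    threshold = noise_fraction * coverage
    return {allele: n for allele, n in read_counts.items() if n >= threshold}


def heterozygote_balance(called):
    """Ratio of the weaker to the stronger allele count.

    Returns None unless exactly two alleles were called.
    """
    if len(called) != 2:
        return None
    low, high = sorted(called.values())
    return low / high


# Hypothetical read counts for one locus, keyed by allele (repeat number).
balanced = {"12": 480, "13": 450, "11": 40, "14": 30}
imbalanced = {"12": 900, "13": 60, "11": 40}

balanced_call = call_alleles(balanced)      # both true alleles survive
imbalanced_call = call_alleles(imbalanced)  # allele "13" falls below the threshold

print(balanced_call, heterozygote_balance(balanced_call))
print(imbalanced_call, heterozygote_balance(imbalanced_call))
```

In the second sample the weaker allele is removed together with the noise, so a heterozygous locus would appear to carry a single allele. This is the drop-out scenario whose probability the thesis sets out to predict.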