A master's thesis from Aalborg University

Y-STR: Haplotype Frequency Estimation and Evidence Calculation

Translated title

Y-STR: Estimation af haplotypefrekvens og evidensberegning

Author

Term

4th term

Publication year

2010

Pages

138

Abstract

Estimating how often particular Y‑STR haplotypes occur in a population is essential for calculating the probabilities used in evidence evaluation. A Y‑STR haplotype is the combined pattern of short tandem repeat (STR) markers on the Y chromosome. Unlike STR markers on the autosomal (non‑sex) chromosomes, markers on the Y chromosome are inherited together and cannot be treated as independent. As a result, the joint probability of a haplotype cannot be obtained by multiplying the individual marker probabilities, and statistical models must account for dependence between markers. This thesis first describes an existing method for estimating haplotype frequencies, the "frequency surveying approach". It then develops several new models: a method called "ancestral awareness"; adaptations of kernel smoothing and model‑based clustering; and a class of classification models, including classification trees, support vector machines, and ordered logistic regression. It also develops methods for evaluating how well each approach performs and uses them to compare the models. Classification trees perform well overall, but they have the drawback of not incorporating prior biological knowledge, such as the single‑step mutation model (the idea that STRs typically change by one repeat unit at a time). In addition to frequency estimation, the thesis also considers evidence calculations.

[This abstract was generated with the help of AI]
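
The abstract's point that a haplotype's joint probability cannot be obtained by multiplying marker probabilities can be illustrated with a small sketch. Everything below is an assumption made for illustration: the two-marker toy database, the repeat counts, and the helper names `empirical_frequency` and `product_of_marginals` are hypothetical and are not taken from the thesis, and the plain count estimate is not one of the models the thesis develops.

```python
from collections import Counter

# Hypothetical toy database: each haplotype is a tuple of repeat counts
# at two Y-STR markers. The values are made up for illustration only.
haplotypes = [
    (14, 30), (14, 30), (14, 30), (14, 30),
    (15, 31), (15, 31), (15, 31), (15, 31),
]


def empirical_frequency(db, h):
    """Fraction of database entries equal to haplotype h (plain count estimate)."""
    return Counter(db)[h] / len(db)


def product_of_marginals(db, h):
    """Naive estimate that (wrongly) treats the markers as independent."""
    n = len(db)
    p = 1.0
    for locus, allele in enumerate(h):
        # Marginal frequency of this allele at this marker.
        p *= sum(1 for x in db if x[locus] == allele) / n
    return p


seen = (14, 30)    # haplotype present in the toy database
unseen = (14, 31)  # allele combination that never occurs together

joint_seen = empirical_frequency(haplotypes, seen)      # 4/8
naive_seen = product_of_marginals(haplotypes, seen)     # (4/8) * (4/8)
joint_unseen = empirical_frequency(haplotypes, unseen)  # 0/8
naive_unseen = product_of_marginals(haplotypes, unseen)

print(joint_seen, naive_seen)      # 0.5 0.25
print(joint_unseen, naive_unseen)  # 0.0 0.25
```

In this toy data the two markers are perfectly linked, so the product of marginals halves the frequency of the observed haplotype and assigns a quarter of the probability mass to a combination that never occurs. The plain count estimate has its own weakness, since it gives zero to any haplotype absent from the database, which is one reason model-based estimators such as those compared in the thesis are of interest.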