Amplification of DNA mixtures - Missing data approach

Author

Tvedebrink, Torben

Term

4th term

Education

Mathematics, Master

Publication year

2007

Abstract

DNA evidence often contains mixtures from several people. This thesis introduces a statistical model to interpret Short Tandem Repeat (STR) typing results from mixtures by treating the measured peak areas as jointly normally distributed. Drawing on controlled experiments, we use the linear relation between peak heights and areas, and between the means and variances of these measurements. Measurements from alleles unique to one person are used to estimate that person’s contribution to alleles shared in the mixture. Because shared alleles only show combined peaks, we treat the unobserved individual contributions as missing data and recover them with the Expectation–Maximization (EM) algorithm under a compound symmetry correlation model. This setup allows correlations within and between measurement systems and is not tied to specific alleles. Thanks to the factorization of the likelihood and properties of the normal distribution, a standard EM implementation is sufficient. We estimate model parameters on a training dataset and then apply the model to STR data from real crime cases to assess the weight of evidence it provides. The model has important limitations: during estimation we exclude cases with drop-outs (alleles that fail to be detected). These limitations must be addressed before the model can be used routinely in casework and are the focus of ongoing investigation.
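The missing-data step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it handles a single shared peak with two contributors, and the function names, parameter names, and numbers are hypothetical. It assumes the two unobserved contributions are jointly normal and that only their sum is measured, so the expected contribution of each person follows from the standard conditional mean of a normal distribution.

```python
def compound_symmetry_cov(var1, var2, rho):
    """Covariance of two contributions sharing one correlation rho.

    Hypothetical helper: under compound symmetry every pair of
    measurements is assumed to have the same correlation.
    """
    return rho * (var1 * var2) ** 0.5


def split_shared_peak(total, mu1, mu2, var1, var2, cov12):
    """Expected individual contributions given only their observed sum.

    Assumes X1 and X2 are jointly normal with means mu1, mu2, variances
    var1, var2 and covariance cov12, and that total = X1 + X2 is observed.
    """
    var_total = var1 + var2 + 2.0 * cov12  # Var(X1 + X2)
    if var_total <= 0.0:
        raise ValueError("variance of the observed sum must be positive")
    residual = total - (mu1 + mu2)  # how far the sum is from its mean
    weight1 = (var1 + cov12) / var_total  # Cov(X1, X1 + X2) / Var(X1 + X2)
    x1 = mu1 + weight1 * residual  # E[X1 | X1 + X2 = total]
    x2 = total - x1  # the two conditional means add up to the total
    return x1, x2


# Illustrative numbers only: variances taken proportional to the means,
# in line with the mean-variance relation mentioned in the abstract.
cov = compound_symmetry_cov(600.0, 300.0, 0.5)
x1, x2 = split_shared_peak(1000.0, 600.0, 300.0, 600.0, 300.0, cov)
```

In a full EM procedure, expectations like these (together with the corresponding second moments) would be recomputed at every iteration and fed into the parameter updates. The sketch covers only the expectation for one peak.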

[This abstract has been rewritten with the help of AI based on the project's original abstract]

A master's thesis from Aalborg University
