CLASSIFICATION OF GENOTYPES WITH POSTERIOR PROBABILITIES IN PROBLEMS OF FORENSIC GENETICS
Student thesis: Master programme thesis
- Sofie Lovise Uhrbrand
- Micklas Visby Christiansen
- Sarah Elise Steen
2. Term (Master), Mathematics and Statistics (Minor subject) (Elective Study or Minor Subject)
The trained model classified future observations into genotypes based on the signal intensities from A and B. In this report, the trained model was based on multinomial logistic regression (MLR), and the accuracy of the model is defined as the percentage of correctly classified observations. To increase the precision of the model, a group called no-call was introduced for data points whose maximum posterior probability fell below a given threshold.
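The no-call rule described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the genotype labels `AA`, `AB`, `BB`, the threshold value of 0.9, and the choice to compute accuracy only over called observations are all assumptions made for the example.

```python
import numpy as np

NO_CALL = "no-call"  # label for observations below the threshold (name taken from the abstract)


def call_with_no_call(posteriors, labels, threshold):
    """Assign each row to its most probable genotype, or to no-call.

    posteriors: (n, k) array of posterior probabilities, one row per observation.
    labels:     k genotype names, in the same order as the columns.
    threshold:  minimum posterior probability required to make a call.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    best = posteriors.argmax(axis=1)      # index of the most probable genotype
    max_prob = posteriors.max(axis=1)     # its posterior probability
    calls = np.array(labels, dtype=object)[best]
    calls[max_prob < threshold] = NO_CALL
    return calls


def accuracy_on_called(calls, truth):
    """Share of correct calls among observations that were not no-call.

    Assumption: no-call observations are excluded from the denominator.
    """
    calls = np.asarray(calls, dtype=object)
    truth = np.asarray(truth, dtype=object)
    called = calls != NO_CALL
    if not called.any():
        return float("nan")
    return float((calls[called] == truth[called]).mean())


# Hypothetical posterior probabilities for three observations.
example_posteriors = [
    [0.98, 0.01, 0.01],
    [0.50, 0.45, 0.05],
    [0.02, 0.03, 0.95],
]
example_calls = call_with_no_call(example_posteriors, ["AA", "AB", "BB"], threshold=0.9)
```

In this example the second observation has a maximum posterior probability of 0.50, so it is placed in no-call rather than assigned to a genotype.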
When visualizing the clusters corresponding to the three genotypes, it was observed that the clusters might be symmetrical around the identity line. If the genotypes were found to be symmetrical around the identity line, a simplified classification model could be constructed and then tested to investigate whether it gave a classification similar to that of the MLR model. Symmetry of the homozygous genotypes was tested using Box's M-test for the covariance matrices and Hotelling's T^2-test for the means.
These tests showed that the covariance matrices were not symmetrical, but the means were. To test whether the heterozygous genotype was symmetrical around the identity line, a correlation test and a t-test were conducted. These tests showed that the heterozygous genotype was not symmetrical around the identity line. Because the genotypes were shown not to be symmetrical around the identity line, a guide to constructing a simplified classification model was written instead of constructing an actual model.
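One way to picture the symmetry check for the homozygous means: reflecting a point across the identity line swaps its two coordinates, so one homozygous cluster can be mirrored and compared with the other. The sketch below computes a two-sample Hotelling's T^2 statistic for that comparison. It is an illustration under assumptions, not the procedure used in the thesis: the function names and the toy data are hypothetical, and the pooled-covariance form used here assumes equal covariance matrices in the two groups.

```python
import numpy as np


def reflect_across_identity(points):
    """Mirror (A, B) signal pairs across the line A = B by swapping the columns."""
    points = np.asarray(points, dtype=float)
    return points[:, ::-1]


def hotelling_t2_two_sample(x, y):
    """Two-sample Hotelling's T^2 with pooled covariance.

    Returns (t2, f_stat, df1, df2), where f_stat follows an F(df1, df2)
    distribution under the null hypothesis of equal means, assuming
    multivariate normality and equal covariance matrices.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, p = x.shape
    n2 = y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # Pooled sample covariance (np.cov uses the n - 1 denominator by default).
    pooled = ((n1 - 1) * np.cov(x, rowvar=False)
              + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * float(diff @ np.linalg.solve(pooled, diff))
    df1, df2 = p, n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))
    return t2, f_stat, df1, df2


# Toy data (hypothetical): the second cluster is an exact mirror image of the first.
cluster_1 = np.array([[10.0, 1.0], [12.0, 2.0], [11.0, 1.5], [13.0, 1.0]])
cluster_2 = cluster_1[:, ::-1]
t2_mirror, f_mirror, df1, df2 = hotelling_t2_two_sample(
    cluster_1, reflect_across_identity(cluster_2)
)
```

For an exact mirror image the mean difference after reflection is zero, so the statistic is zero; larger values indicate that the mirrored means differ.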
It can be concluded that a model constructed with multinomial logistic regression classified observations from the signals A and B with an accuracy of 96.3% compared to WGS, and that an accuracy of 97.7% was achieved after the introduction of no-call. The MLR model obtained an acceptable accuracy; however, it still misclassified some observations after the introduction of no-call.
As a further development, it could be investigated whether the genotypes would become symmetrical around the identity line if the sizes of the groups were adjusted.
Language | Danish |
---|---|
Publication date | 22 Dec 2022 |
Number of pages | 44 |