Enriching Clinical Sample Analysis With Pathway Knowledge Graphs and GNNs
Author
Shad Bakhsh, Fatemeh
Term
4th term
Education
Publication year
2024
Submitted on
2024-06-10
Abstract
Biological research often works with small datasets when analyzing proteins, which limits the use of traditional statistical methods. Large, curated graph databases such as Reactome and UniProt map known relationships between proteins, but turning this knowledge into insight requires efficient analysis. This thesis introduces Cluster-GAE, a method that combines graph sampling with Graph Neural Networks (GNNs) to learn informative representations from large biological networks. By adapting the Cluster-GCN algorithm to graph representation learning, Cluster-GAE reduces computational demands while preserving important network structure. In evaluations comparing sampling strategies (Random Walk, Forest Fire, and no sampling), Cluster-GAE performs better at preserving network structure and at producing meaningful protein embeddings (compact numerical summaries). Using t-SNE (a visualization technique) and functional enrichment analysis (a test for over-represented biological functions), we show that the method uncovers clear protein clusters and highlights over-represented pathways, which may point to new biological mechanisms. Overall, this work provides a robust framework for analyzing biological samples with limited data and improves the interpretability of protein data analysis.
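As an illustration of the "test for over-represented biological functions" mentioned above, the following is a minimal sketch of an over-representation test based on the hypergeometric distribution. The abstract does not state which statistical test the thesis uses, so the choice of test, the function name enrichment_p_value, and the example numbers are all assumptions made for illustration only.

```python
from math import comb


def enrichment_p_value(background_size, pathway_size, cluster_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    Hypothetical helper, not taken from the thesis.
    background_size: number of proteins in the background set
    pathway_size:    background proteins annotated to the pathway
    cluster_size:    proteins in the cluster being tested
    overlap:         cluster proteins annotated to the pathway
    """
    if not 0 <= pathway_size <= background_size:
        raise ValueError("pathway_size must lie in [0, background_size]")
    if not 0 <= cluster_size <= background_size:
        raise ValueError("cluster_size must lie in [0, background_size]")

    max_overlap = min(pathway_size, cluster_size)
    if overlap > max_overlap:
        return 0.0  # such an overlap cannot occur

    # Count the ways to draw a cluster with at least `overlap` pathway members.
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, cluster_size - i)
        for i in range(max(overlap, 0), max_overlap + 1)
    )
    return favourable / comb(background_size, cluster_size)


# Made-up numbers: 100 background proteins, 10 in the pathway,
# a cluster of 20 proteins of which 5 belong to the pathway.
p = enrichment_p_value(100, 10, 20, 5)
print(f"enrichment p-value: {p:.4g}")
```

A small p-value suggests that the pathway appears in the cluster more often than chance alone would explain. When many pathways are tested, a multiple-testing correction would normally be applied on top of this.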
[This abstract has been rewritten with the help of AI based on the project's original abstract]
