A network clustering approach to diagnosis codes in electronic medical records for unsupervised learning of disease heterogeneity and patient subgroups
Yaomin Xu, PhD, Vanderbilt University Medical Center
Unsupervised clustering of patients using high-dimensional electronic medical record (EMR) data could improve our understanding of disease heterogeneity and identify new disease subtypes. The International Statistical Classification of Diseases and Related Health Problems (ICD) is the most commonly used categorization of diseases and is routinely recorded in EMRs to classify diagnoses and describe patient visits. We hypothesize that the low-dimensional latent structure of multivariate ICD patterns in patients could provide useful information about patient characteristics and disease heterogeneity. In this talk, I will present a network-based community detection approach for unsupervised learning of the topological structure of patients based on the co-occurrence patterns they share in the ICD codes recorded in a large-scale EMR. We aimed to build a statistically principled approach that is highly robust when applied to real-world data. We pursued this with a two-step strategy: (1) we estimated a consensus graph from an ensemble of stochastic block model estimations of the bipartite patient-ICD relationships; (2) we then constructed a hierarchical topological structure of the consensus graph using top-down recursive partitioning. I will demonstrate a functional interpretation of our approach by applying it to a genetic study of a cancer driver mutation in MPN patients, and I will illustrate findings that recapitulate existing knowledge as well as findings that are potentially novel. This work was developed in collaboration with several VUMC clinical scientists and geneticists on real data problems. It is still a work in progress, and we eagerly look forward to hearing your feedback.
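To make the two-step strategy concrete, here is a minimal Python sketch under stated assumptions; it is not the implementation presented in the talk. It assumes the stochastic block model fits have already been run and that each fit returns one block label per patient (the `runs` list below is invented toy data). Step 1 is illustrated as a co-assignment frequency matrix, and step 2 uses a simple stand-in for recursive partitioning: splitting a group into connected components at successively stricter consensus thresholds. The function names `consensus_matrix`, `components`, and `split`, and the threshold values, are hypothetical.

```python
def consensus_matrix(labelings):
    """Fraction of ensemble runs in which each pair of patients shares a block."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(labelings)
    return [[counts[i][j] / m for j in range(n)] for i in range(n)]


def components(nodes, c, threshold):
    """Connected components among `nodes`, keeping edges with weight >= threshold."""
    remaining = set(nodes)
    groups = []
    while remaining:
        seed = min(remaining)
        remaining.discard(seed)
        stack, group = [seed], [seed]
        while stack:
            u = stack.pop()
            linked = [v for v in remaining if c[u][v] >= threshold]
            for v in linked:
                remaining.discard(v)
                group.append(v)
                stack.append(v)
        groups.append(sorted(group))
    return sorted(groups)


def split(nodes, c, thresholds):
    """Top-down recursion: split at the first threshold that separates `nodes`."""
    for k, t in enumerate(thresholds):
        parts = components(nodes, c, t)
        if len(parts) > 1:
            rest = thresholds[k + 1:]
            return {"members": nodes,
                    "children": [split(p, c, rest) for p in parts]}
    return {"members": nodes, "children": []}


# Toy ensemble (assumed): three block labelings of six patients.
runs = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 2, 2, 2],
    [0, 0, 0, 1, 1, 2],
]
c = consensus_matrix(runs)
tree = split(list(range(len(c))), c, thresholds=[0.5, 1.0])
```

In this toy example, patients 0 and 1 share a block in every run, while patient 2 joins them in only two of three runs, so the hierarchy first separates patients {0, 1, 2} from {3, 4, 5} and then peels patients 2 and 5 off at the stricter threshold. A real analysis would replace `runs` with labels from repeated stochastic block model fits on the bipartite patient-ICD graph.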