Department of Biostatistics Seminar/Workshop Series

A Semi-parametric Method for Clustering Mixed Data

Marianthi Markatou, PhD, Associate Chair for Research and Healthcare Informatics, Department of Biostatistics; Assistant Director, Institute for Healthcare Informatics, University at Buffalo

Despite the existence of a large number of clustering algorithms, clustering remains a challenging problem. As large datasets become increasingly common across many domains, clustering algorithms must often be applied to heterogeneous sets of variables, creating an acute need for robust and scalable clustering methods for mixed continuous and categorical data. We show that current clustering methods for mixed-type data suffer from at least one of two central challenges: (1) they are unable to equitably balance the contributions of continuous and categorical variables without strong parametric assumptions; or (2) they are unable to properly handle datasets in which only a subset of the variables is related to the underlying cluster structure of interest. We first develop KAMILA (KAy-means for MIxed LArge data), a clustering method that addresses (1), and in many situations (2), without requiring strong parametric assumptions. We next develop MEDEA (Multivariate Eigenvalue Decomposition Error Adjustment), a weighting system that addresses (2) even in the face of a large number of uninformative variables. We study theoretical aspects of our methods and demonstrate their superiority in a series of Monte Carlo simulation studies and a set of real-world applications.

Joint work with A. Foss, B. Ray, and A. Heching.

Topic revision: r1 - 27 Oct 2016, AshleeBartley
 
