Publication

Model-based clustering with sparse covariance matrices

2019, Fop, Michael, Murphy, Thomas Brendan, Scrucca, Luca

Finite Gaussian mixture models are widely used for model-based clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models can be easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or assuming local independence. However, these remedies do not allow for direct estimation of sparse covariance matrices nor do they take into account that the structure of association among the variables can vary from one cluster to the other. To this end, we introduce mixtures of Gaussian covariance graph models for model-based clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation and a general penalty term on the graph configurations can be used to induce different levels of sparsity and incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameter and graph-structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are directly obtained. The framework results in a parsimonious model-based clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and application to illustrative datasets show that the method attains good classification performance and model quality. The general methodology for model-based clustering with sparse covariance matrices is implemented in the R package mixggm, available on CRAN.
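
The abstract points to the R package mixggm on CRAN. As a rough illustration of how such a model might be fitted, the sketch below assumes the package exposes a main fitting function mixGGM() taking the data, the number of mixture components K, and a model type; these names are assumptions and should be checked against the package documentation.

# A minimal sketch, assuming mixggm provides a fitting function mixGGM();
# the argument names K and model are assumptions, not a confirmed interface.
# install.packages("mixggm")
library(mixggm)

X <- iris[, 1:4]

# Fit a mixture of Gaussian covariance graph models with 3 components, so that
# each cluster is given its own sparse covariance (association) structure.
fit <- mixGGM(X, K = 3, model = "covariance")

# Inspect the fitted object; the names of the elements holding the cluster
# assignments and the sparse component covariance matrices are package-specific.
str(fit, max.level = 1)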

Publication

Variable selection methods for model-based clustering

2018-04-26, Fop, Michael, Murphy, Thomas Brendan

Model-based clustering is a popular approach for clustering multivariate data which has seen applications in numerous fields. Nowadays, high-dimensional data are more and more common and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small size problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.
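
One of the R packages typically covered in this line of work is clustvarsel, which performs stepwise variable selection for Gaussian model-based clustering. The sketch below assumes its main function is clustvarsel(), taking the data matrix and a range of component numbers G, and that the returned object contains the selected subset and the fitted mixture; the exact interface may differ.

# A hedged sketch using the clustvarsel package; argument and element names
# are assumptions to be checked against the package documentation.
# install.packages("clustvarsel")
library(clustvarsel)

X <- iris[, 1:4]

# Greedy stepwise search for the variables that carry clustering information
out <- clustvarsel(X, G = 1:5)

out$subset   # variables retained as clustering variables (assumed element name)
out$model    # Gaussian mixture fitted on the selected subset (assumed element name)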

Publication

Variable Selection for Latent Class Analysis with Application to Low Back Pain Diagnosis

2017-12-28, Fop, Michael, Smart, Keith, Murphy, Thomas Brendan

The identification of the most relevant clinical criteria related to low back pain disorders is a crucial task for a quick and correct diagnosis of the nature of pain and its treatment. Data concerning low back pain can be of categorical nature, in the form of a check-list in which each item denotes the presence or absence of a clinical condition. Latent class analysis is a model-based clustering method for multivariate categorical responses which can be applied to such data for a preliminary diagnosis of the type of pain. In this work we propose a variable selection method for latent class analysis applied to the selection of the most useful variables in detecting the group structure in the data. The method is based on the comparison of two different models and allows the discarding of those variables with no group information and those variables carrying the same information as the already selected ones. We consider a swap-stepwise algorithm where at each step the models are compared through an approximation to their Bayes factor. The method is applied to the selection of the clinical criteria most useful for the clustering of patients in different classes of pain. It is shown to perform a parsimonious variable selection and to give a good clustering performance. The quality of the approach is also assessed on simulated data.
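
The model comparison driving this kind of selection can be illustrated with a simplified sketch: for a candidate variable, an approximate Bayes factor is obtained from the BIC difference between a model in which the variable helps define the latent classes and a model in which it is unrelated to them. The example below uses the poLCA package (not the authors' own software) and omits the swap steps and the possibility that the candidate depends on already selected variables.

# Simplified illustration: 2*log Bayes factor approximated by a BIC difference
# (poLCA reports BIC on the smaller-is-better scale). The swap-stepwise search
# and the regression of the candidate on selected variables are omitted here.
library(poLCA)

data(carcinoma, package = "poLCA")   # 7 binary pathology ratings, coded 1/2
selected  <- c("A", "B", "C")
candidate <- "D"
G <- 2

# Model 1: the candidate is a clustering variable together with the selected ones
f1 <- as.formula(paste("cbind(", paste(c(selected, candidate), collapse = ","), ") ~ 1"))
m1 <- poLCA(f1, carcinoma, nclass = G, verbose = FALSE)

# Model 2: latent classes from the selected variables only, candidate unrelated
f2 <- as.formula(paste("cbind(", paste(selected, collapse = ","), ") ~ 1"))
m2 <- poLCA(f2, carcinoma, nclass = G, verbose = FALSE)

# Independence model for the candidate: multinomial proportions fitted by hand
tabD <- table(carcinoma[[candidate]])
llD  <- sum(tabD * log(tabD / sum(tabD)))
bicD <- -2 * llD + (length(tabD) - 1) * log(sum(tabD))

# Positive values favour keeping the candidate among the clustering variables
bf_approx <- (m2$bic + bicD) - m1$bic
bf_approx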

Publication

mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models

2016-08-01, Scrucca, Luca, Fop, Michael, Murphy, Thomas Brendan, Raftery, Adrian E.

Finite mixture models are being used increasingly to model a wide variety of random phenomena for clustering, classification and density estimation. mclust is a powerful and popular package which allows modelling of data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a variety of purposes of analysis. Recently, version 5 of the package has been made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
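
The three tasks named in the title map onto functions the mclust package provides; a brief, minimal sketch of how they are typically invoked is given below (see the package documentation for the full range of options).

# Clustering, classification and density estimation with mclust (version 5 or later)
library(mclust)

X <- iris[, 1:4]

# Clustering: covariance structure and number of components selected by BIC
clus <- Mclust(X)
summary(clus)

# Classification: Gaussian-mixture discriminant analysis with known labels
da <- MclustDA(X, class = iris$Species)
summary(da)

# Density estimation via a Gaussian finite mixture
dens <- densityMclust(iris$Sepal.Length)

# Dimension reduction for visualising the clustering, one of the version-5 additions
dr <- MclustDR(clus)
plot(dr, what = "scatterplot")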