  • Publication
    mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models
    (R Foundation for Statistical Computing, 2016-08-01)
    Finite mixture models are being used increasingly to model a wide variety of random phenomena for clustering, classification and density estimation. mclust is a powerful and popular package which allows modelling of data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a variety of purposes of analysis. Recently, version 5 of the package has been made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
  • Publication
    Model-based clustering with sparse covariance matrices
    Finite Gaussian mixture models are widely used for model-based clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models can be easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or assuming local independence. However, these remedies do not allow for direct estimation of sparse covariance matrices nor do they take into account that the structure of association among the variables can vary from one cluster to the other. To this end, we introduce mixtures of Gaussian covariance graph models for model-based clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation and a general penalty term on the graph configurations can be used to induce different levels of sparsity and incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameters and graph structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are directly obtained. The framework results in a parsimonious model-based clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and application to illustrative datasets show that the method attains good classification performance and model quality. The general methodology for model-based clustering with sparse covariance matrices is implemented in the R package mixggm, available on CRAN.
      318  Scopus© Citations 12
  • Publication
    Investigation of parameter uncertainty in clustering using a Gaussian mixture model via jackknife, bootstrap and weighted likelihood bootstrap
    (Springer Science and Business Media LLC, 2019-05-28)
    Mixture models with (multivariate) Gaussian components are a popular tool in model-based clustering. Such models are often fitted by a procedure that maximizes the likelihood, such as the EM algorithm. At convergence, the maximum likelihood parameter estimates are typically reported, but in most cases little emphasis is placed on the variability associated with these estimates. In part, this may be because standard errors are not directly calculated in the model-fitting algorithm, either because they are not required to fit the model, or because they are difficult to compute. The examination of standard errors in model-based clustering is therefore typically neglected. Sampling-based methods, such as the jackknife (JK), bootstrap (BS) and parametric bootstrap (PB), are intuitive, generalizable approaches to assessing parameter uncertainty in model-based clustering using a Gaussian mixture model. This paper provides a review and empirical comparison of the jackknife, bootstrap and parametric bootstrap methods for producing standard errors and confidence intervals for mixture parameters. The performance of such sampling methods in the presence of small and/or overlapping clusters requires consideration, however; here the weighted likelihood bootstrap (WLBS) approach is demonstrated to be effective in addressing this concern in a model-based clustering framework. The JK, BS, PB and WLBS methods are illustrated and contrasted through simulation studies and through the traditional Old Faithful data set and also the Thyroid data set. The MclustBootstrap function, available in the most recent release of the popular R package mclust, facilitates the implementation of the JK, BS, PB and WLBS approaches to estimating parameter uncertainty in the context of model-based clustering.
    The JK, WLBS and PB approaches to variance estimation are shown to be robust and to provide good coverage across a range of real and simulated data sets when performing model-based clustering, but care is advised when using the BS in such settings. In the case of poor model fit (for example, for data with small and/or overlapping clusters), JK and BS are found to suffer from not being able to fit the specified model in many of the sub-samples formed. The PB also suffers when model fit is poor, since it relies on data sets simulated from the fitted model as the basis for the variance estimation calculations. However, the WLBS will generally provide a robust solution, driven by the fact that all observations are represented with some weight in each of the sub-samples formed under this approach.
      480  Scopus© Citations 15
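
The contrast between the BS and the WLBS drawn in the last abstract can be sketched in a few lines. This is a minimal illustration in plain Python, not the mclust implementation: it assumes a two-component univariate Gaussian mixture, a weighted EM fit, and WLBS weights drawn as normalised exponentials (a uniform Dirichlet draw, which is one common choice and an assumption here). The function names `fit_gmm2`, `wlbs_weights`, `bootstrap_counts` and `wlbs_standard_errors` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm2(xs, weights, n_iter=50):
    """Weighted EM for a two-component 1-D Gaussian mixture (illustrative only).

    Each observation's contribution to the log-likelihood is scaled by its weight.
    Returns (proportions, means, variances) with the means in ascending order.
    """
    total = sum(weights)
    ordered = sorted(xs)
    # Crude start: lower and upper quartiles as means, pooled weighted variance.
    means = [ordered[len(ordered) // 4], ordered[(3 * len(ordered)) // 4]]
    centre = sum(w * x for w, x in zip(weights, xs)) / total
    pooled = sum(w * (x - centre) ** 2 for w, x in zip(weights, xs)) / total
    variances = [max(pooled, 1e-6)] * 2
    props = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior membership probabilities for every observation.
        resp = []
        for x in xs:
            dens = [props[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            s = dens[0] + dens[1]
            resp.append((dens[0] / s, dens[1] / s) if s > 0 else (0.5, 0.5))
        # M-step: weighted updates of proportions, means and variances.
        for k in range(2):
            nk = sum(w * r[k] for w, r in zip(weights, resp))
            if nk <= 0:
                continue  # component received no weight; keep previous values
            props[k] = nk / total
            means[k] = sum(w * r[k] * x for w, r, x in zip(weights, resp, xs)) / nk
            spread = sum(w * r[k] * (x - means[k]) ** 2 for w, r, x in zip(weights, resp, xs))
            variances[k] = max(spread / nk, 1e-6)  # floor guards against collapse
    order = sorted(range(2), key=lambda k: means[k])
    return ([props[k] for k in order],
            [means[k] for k in order],
            [variances[k] for k in order])


def wlbs_weights(n, rng):
    """WLBS-style weights: normalised exponentials, rescaled to sum to n.

    Every observation keeps a strictly positive weight (assumed weighting scheme).
    """
    draws = [rng.expovariate(1.0) for _ in range(n)]
    scale = n / sum(draws)
    return [d * scale for d in draws]


def bootstrap_counts(n, rng):
    """Nonparametric bootstrap expressed as integer weights (resampling counts).

    Some observations typically receive weight zero, i.e. drop out of the sub-sample.
    """
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return counts


def wlbs_standard_errors(xs, n_boot=50, seed=0):
    """Standard errors of the two component means from repeated weighted refits."""
    rng = random.Random(seed)
    fits = [fit_gmm2(xs, wlbs_weights(len(xs), rng))[1] for _ in range(n_boot)]
    errors = []
    for k in range(2):
        vals = [f[k] for f in fits]
        centre = sum(vals) / n_boot
        errors.append(math.sqrt(sum((v - centre) ** 2 for v in vals) / (n_boot - 1)))
    return errors


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(0.0, 1.0) for _ in range(60)] + [rng.gauss(6.0, 1.0) for _ in range(60)]
    print("zero-weight points under BS:", bootstrap_counts(len(data), rng).count(0))
    print("smallest WLBS weight:", min(wlbs_weights(len(data), rng)))
    print("WLBS standard errors of the means:", wlbs_standard_errors(data))
```

Passing the bootstrap counts as weights to `fit_gmm2` corresponds to refitting on a resampled data set (apart from the starting values used here), which makes the difference visible: under the BS a small cluster can lose most of its points in a given sub-sample, whereas under the WLBS every point still contributes.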