- Model Based Clustering for Mixed Data: clustMD
  A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
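The generative idea in the abstract above (a latent Gaussian mixture that produces observed data of mixed type) can be illustrated with a minimal sketch. It is not the clustMD implementation: the function name, the equal mixing weights, the independent latent dimensions, the cut points and the omission of nominal variables are all assumptions made for illustration.

```python
import random


def simulate_mixed(n, means, sd=1.0, thresholds=(-0.5, 0.5), seed=0):
    """Draw n rows of (cluster, continuous, binary, ordinal) from a latent Gaussian mixture.

    means: one latent mean vector of length 3 per cluster (hypothetical values).
    thresholds: hypothetical cut points that turn the third latent value into levels 0..2.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = rng.randrange(len(means))                 # cluster label, equal weights assumed
        z = [rng.gauss(m, sd) for m in means[g]]      # latent Gaussian vector
        continuous = z[0]                             # continuous variable observed directly
        binary = int(z[1] > 0.0)                      # binary variable: latent value cut at zero
        ordinal = sum(z[2] > t for t in thresholds)   # ordinal level: number of cut points exceeded
        rows.append((g, continuous, binary, ordinal))
    return rows


rows = simulate_mixed(200, means=[[-2.0, -2.0, -2.0], [2.0, 2.0, 2.0]])
```

Clustering then amounts to recovering the cluster label from the observed columns alone, which the abstract says is done with an EM algorithm.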
- Clustering South African households based on their asset status using latent variable models
  The Agincourt Health and Demographic Surveillance System has since 2001 conducted a biannual household asset survey in order to quantify household socio-economic status (SES) in a rural population living in northeast South Africa. The survey contains binary, ordinal and nominal items. In the absence of income or expenditure data, the SES landscape in the study population is explored and described by clustering the households into homogeneous groups based on their asset status. A model-based approach to clustering the Agincourt households, based on latent variable models, is proposed. In the case of modeling binary or ordinal items, item response theory models are employed. For nominal survey items, a factor analysis model, similar in nature to a multinomial probit model, is used. Both model types have an underlying latent variable structure; this similarity is exploited and the models are combined to produce a hybrid model capable of handling mixed data types. Further, a mixture of the hybrid models is considered to provide clustering capabilities within the context of mixed binary, ordinal and nominal response data. The proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD). The MFA-MD model is applied to the survey data to cluster the Agincourt households into homogeneous groups. The model is estimated within the Bayesian paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings result, providing insight into the different socio-economic strata within the Agincourt region.
- Clustering Ordinal Data via Latent Variable Models
  Item response modelling is a well established method for analysing ordinal response data. Ordinal data are typically collected as responses to a number of questions or items. The observed data can be viewed as discrete versions of an underlying latent Gaussian variable. Item response models assume that this latent variable (and therefore the observed ordinal response) is a function of both respondent specific and item specific parameters. However, item response models assume a homogeneous population in that the item specific parameters are assumed to be the same for all respondents. Often a population is heterogeneous and clusters of respondents exist; members of different clusters may view the items differently. A mixture of item response models is developed to provide clustering capabilities in the context of ordinal response data. The model is estimated within the Bayesian paradigm and is illustrated through an application to an ordinal response data set resulting from a clinical trial involving self-assessment of arthritis.
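As a rough illustration of the item response idea in the abstract above, the sketch below simulates ordinal responses from a latent Gaussian variable that depends on a respondent specific trait and on item specific parameters. The linear slope and intercept form, the shared cut points and all names are assumptions for illustration, not the paper's parameterisation. A mixture version would let the item parameters differ between clusters of respondents.

```python
import random


def ordinal_response(theta, slope, intercept, cutpoints, rng):
    """Discretise a latent Gaussian variable into an ordinal level 0..len(cutpoints)."""
    z = slope * theta + intercept + rng.gauss(0.0, 1.0)  # latent response with unit-variance noise
    return sum(z > c for c in cutpoints)                 # number of cut points exceeded


def simulate_responses(n_respondents, items, cutpoints, seed=0):
    """items: list of (slope, intercept) pairs, one per item (hypothetical values)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_respondents):
        theta = rng.gauss(0.0, 1.0)  # respondent specific latent trait
        data.append([ordinal_response(theta, a, b, cutpoints, rng) for a, b in items])
    return data


responses = simulate_responses(50, items=[(1.0, 0.0), (2.0, -1.0)], cutpoints=[-1.0, 0.0, 1.0])
```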
- Clustering high‐dimensional mixed data to uncover sub‐phenotypes: joint analysis of phenotypic and genotypic data
  The LIPGENE-SU.VI.MAX study, like many others, recorded high-dimensional continuous phenotypic data and categorical genotypic data. LIPGENE-SU.VI.MAX focuses on the need to account for both phenotypic and genetic factors when studying the metabolic syndrome (MetS), a complex disorder that can lead to higher risk of type 2 diabetes and cardiovascular disease. Interest lies in clustering the LIPGENE-SU.VI.MAX participants into homogeneous groups or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high dimensional, mixed data is developed to cluster LIPGENE-SU.VI.MAX participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated, estimation is via a Gibbs sampling algorithm and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes ('healthy' and 'at risk') are uncovered. A small subset of variables is deemed discriminatory, which notably includes phenotypic and genotypic variables, highlighting the need to jointly consider both factors. Further, 7 years after the LIPGENE-SU.VI.MAX data were collected, participants underwent further analysis to diagnose presence or absence of the MetS. The two uncovered sub-phenotypes strongly correspond to the 7-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to define the uncertainty in sub-phenotype membership at the participant level is synonymous with the concepts of precision medicine and nutrition.
- Prediction of tool-wear in turning of medical grade cobalt chromium molybdenum alloy (ASTM F75) using non-parametric Bayesian models
  We present a novel approach to estimating the effect of control parameters on tool wear rates and related changes in the three force components in turning of medical grade Co-Cr-Mo (ASTM F75) alloy. Co-Cr-Mo is known to be a difficult-to-cut material which, due to a combination of mechanical and physical properties, is used for the critical structural components of implantable medical prosthetics. We run a designed experiment which enables us to estimate tool wear from feed rate and cutting speed, and constrain them using a Bayesian hierarchical Gaussian Process model which enables prediction of tool wear rates for untried experimental settings. The predicted tool wear rates are non-linear and, using our models, we can identify experimental settings which optimise the life of the tool. This approach has potential in the future for real-time application of data analytics to machining processes.
- Prediction of tool-wear in turning of medical grade cobalt chromium molybdenum alloy (ASTM F75) using non-parametric Bayesian models
  We present a novel approach to estimating the effect of control parameters on tool wear rates and related changes in the three force components in turning of medical grade Co-Cr-Mo (ASTM F75) alloy. Co-Cr-Mo is known to be a difficult-to-cut material which, due to a combination of mechanical and physical properties, is used for the critical structural components of implantable medical prosthetics. We run a designed experiment which enables us to estimate tool wear from feed rate and cutting speed, and constrain them using a Bayesian hierarchical Gaussian Process model which enables prediction of tool wear rates for untried experimental settings. The predicted tool wear rates are non-linear and, using our models, we can identify experimental settings which optimise the life of the tool. This approach has potential in the future for real-time application of data analytics to machining processes.
- A Protocol for Improved Precision and Increased Confidence in Nanoparticle Tracking Analysis Concentration Measurements between 50 and 120 nm in Biological Fluids
  Nanoparticle tracking analysis (NTA) can be used to quantitate extracellular vesicles (EVs) in biological samples and is widely considered a useful diagnostic tool to detect disease. However, accurately profiling EVs can be challenging due to their small size and heterogeneity. Here, we aimed to provide a protocol to facilitate high-precision particle quantitation by NTA in plasma, the supernatant of activated purified platelets [the platelet releasate (PR)] and in serum, to increase confidence in NTA particle enumeration. The overall variance and the precision of NTA measurements were quantified by root mean square error and relative standard error. Using a bootstrapping approach, we found that increasing video replicates from 5 s × 60 s to 25 s × 60 s captures led to a reduction in overall variance and a reproducible increase in the precision of NTA particle-concentration quantitation for all three biofluids. We then validated our approach in an extended cohort of 32 healthy donors. Our results indicate that for vesicles sized between 50 and 120 nm, the precision of routine NTA measurements in serum, plasma, and PR can be significantly improved by increasing the number of video replicates captured. Our protocol provides a common platform to statistically compare particle size distribution profiles in the exosomal-vesicle size range across a variety of biofluids and in both healthy donor and patient groups.
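The relationship between replicate count and precision described in the abstract above can be sketched with a simple bootstrap. The abstract does not specify the exact resampling scheme, so the procedure here, the function names and the simulated concentration values are assumptions for illustration only.

```python
import random
import statistics


def relative_standard_error(values):
    """Standard error of the mean divided by the mean."""
    return statistics.stdev(values) / (len(values) ** 0.5) / statistics.mean(values)


def bootstrap_rse(measurements, n_replicates, n_boot=2000, seed=0):
    """Mean relative standard error over bootstrap resamples of size n_replicates."""
    rng = random.Random(seed)
    rses = []
    for _ in range(n_boot):
        # Resample with replacement to mimic capturing n_replicates videos.
        sample = [rng.choice(measurements) for _ in range(n_replicates)]
        rses.append(relative_standard_error(sample))
    return statistics.mean(rses)


# Hypothetical per-video concentration readings (arbitrary units), not data from the study.
data_rng = random.Random(1)
measurements = [data_rng.gauss(100.0, 10.0) for _ in range(25)]

rse_5 = bootstrap_rse(measurements, n_replicates=5)
rse_25 = bootstrap_rse(measurements, n_replicates=25)
```

On these simulated readings, `rse_25` comes out smaller than `rse_5`, since the standard error shrinks roughly with the square root of the number of replicates.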