  • Publication
    Variable selection methods for model-based clustering
    (The American Statistical Association, the Bernoulli Society, the Institute of Mathematical Statistics, and the Statistical Society of Canada, 2018-04-26)
    Model-based clustering is a popular approach for clustering multivariate data which has seen applications in numerous fields. Nowadays, high-dimensional data are more and more common and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small size problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.
    Scopus© Citations 59  496
  • Publication
    Model-based clustering of longitudinal data
    A new family of mixture models for the model-based clustering of longitudinal data is introduced. The covariance structures of eight members of this new family of models are given and the associated maximum likelihood estimates for the parameters are derived via expectation-maximization (EM) algorithms. The Bayesian information criterion is used for model selection and a convergence criterion based on Aitken’s acceleration is used to determine convergence of these EM algorithms. This new family of models is applied to yeast sporulation time course data, where the models give good clustering performance. Further constraints are then imposed on the decomposition to allow a deeper investigation of the correlation structure of the yeast data. These constraints greatly extend this new family of models, with the addition of many parsimonious models.
    Scopus© Citations 79  1393
  • Publication
    Mixed-Membership of Experts Stochastic Blockmodel
    (Cambridge University Press, 2016-03)
    Social network analysis is the study of how links between a set of actors are formed. Typically, it is believed that links are formed in a structured manner, which may be due to, for example, political or material incentives, and which often may not be directly observable. The stochastic blockmodel represents this structure using latent groups which exhibit different connective properties, so that conditional on the group membership of two actors, the probability of a link being formed between them is represented by a connectivity matrix. The mixed membership stochastic blockmodel extends this model to allow actors to belong to different groups depending on the interaction in question, providing further flexibility. Attribute information can also play an important role in explaining network formation. Network models which do not explicitly incorporate covariate information require the analyst to compare fitted network models to additional attributes in a post-hoc manner. We introduce the mixed membership of experts stochastic blockmodel, an extension to the mixed membership stochastic blockmodel which incorporates covariate actor information into the existing model. The method is illustrated with application to the Lazega Lawyers dataset. Model and variable selection methods are also discussed.
    Scopus© Citations 9  300
  • Publication
    Mixtures of biased sentiment analysers
    Modelling bias is an important consideration when dealing with inexpert annotations. We are concerned with training a classifier to perform sentiment analysis on news media articles, some of which have been manually annotated by volunteers. The classifier is trained on the words in the articles and then applied to non-annotated articles. In previous work we found that a joint estimation of the annotator biases and the classifier parameters performed better than estimation of the biases followed by training of the classifier. An important question follows from this result: can the annotators be usefully clustered into either predetermined or data-driven clusters, based on their biases? If so, such a clustering could be used to select, drop or otherwise categorise the annotators in a crowdsourcing task. This paper presents work on fitting a finite mixture model to the annotators’ bias. We develop a model and an algorithm and demonstrate its properties on simulated data. We then demonstrate the clustering that exists in our motivating dataset, namely the analysis of potentially economically relevant news articles from Irish online news sources.
    Scopus© Citations 3  303
  • Publication
    A Mixture of Experts Latent Position Cluster Model for Social Network Data
    Social network data represent the interactions between a group of social actors. Interactions between colleagues and friendship networks are typical examples of such data. The latent space model for social network data locates each actor in a network in a latent (social) space and models the probability of an interaction between two actors as a function of their locations. The latent position cluster model extends the latent space model to deal with network data in which clusters of actors exist — actor locations are drawn from a finite mixture model, each component of which represents a cluster of actors. A mixture of experts model builds on the structure of a mixture model by taking account of both observations and associated covariates when modeling a heterogeneous population. Herein, a mixture of experts extension of the latent position cluster model is developed. The mixture of experts framework allows covariates to enter the latent position cluster model in a number of ways, yielding different model interpretations. Estimates of the model parameters are derived in a Bayesian framework using a Markov Chain Monte Carlo algorithm. The algorithm is generally computationally expensive — surrogate proposal distributions which shadow the target distributions are derived, reducing the computational burden. The methodology is demonstrated through an illustrative example detailing relationships between a group of lawyers in the USA.
    Scopus© Citations 29  610
  • Publication
    Review of Statistical Network Analysis: Models, Algorithms, and Software
    The analysis of network data is an area that is rapidly growing, both within and outside of the discipline of statistics. This review provides a concise summary of methods and models used in the statistical analysis of network data, including the Erdős–Rényi model, the exponential family class of network models, and recently developed latent variable models. Many of the methods and models are illustrated by application to the well-known Zachary karate dataset. Software routines available for implementing methods are emphasized throughout. The aim of this paper is to provide a review with enough detail about many common classes of network models to whet the appetite and to point the way to further reading.
    Scopus© Citations 83  9441
  • Publication
    Exploring Voting Blocs Within the Irish Electorate: A Mixture Modeling Approach
    (Taylor and Francis, 2008-09)
    Irish elections use a voting system called proportional representation by means of a single transferable vote (PR-STV). Under this system, voters express their vote by ranking some (or all) of the candidates in order of preference. Which candidates are elected is determined through a series of counts where candidates are eliminated and surplus votes are distributed. The electorate in any election forms a heterogeneous population: that is, voters with different political and ideological persuasions would be expected to have different preferences for the candidates. The purpose of this article is to establish the presence of voting blocs in the Irish electorate, to characterize these blocs and to estimate their size. A mixture modeling approach is used to explore the heterogeneity of the Irish electorate and to establish the existence of clearly defined voting blocs. The voting blocs are characterized by their voting preferences, which are described using a ranking data model. In addition, the care with which voters choose lower tier preferences is estimated in the model. The methodology is used to explore data from two Irish elections. Data from eight opinion polls taken during the six weeks prior to the 1997 Irish presidential election are analyzed. These data reveal the evolution of the structure of the electorate during the election campaign. In addition, data that record the votes from the Dublin West constituency of the 2002 Irish general election are analyzed to reveal distinct voting blocs within the electorate; these blocs are characterized by party politics, candidate profile and political ideology.
    Scopus© Citations 72  545
  • Publication
    Analysis of Irish third-level college applications data
    The Irish college admissions system involves prospective students listing up to 10 courses in order of preference on their application. Places in third-level educational institutions are subsequently offered to the applicants on the basis of both their preferences and their final second-level examination results. The college applications system is a large area of public debate in Ireland. Detractors suggest that the process creates artificial demand for 'high profile' courses, causing applicants to ignore their vocational callings. Supporters argue that the system is impartial and transparent. The Irish college degree applications data from the year 2000 are analysed by using mixture models based on ranked data models to investigate the types of application behaviour that are exhibited by college applicants. The results of this analysis show that applicants form groups according to both the discipline and the geographical location of their course choices. In addition, there is evidence of the suggested 'points race' for high profile courses. Finally, gender emerges as an influential factor when studying course choice behaviour.
    Scopus© Citations 47  599
  • Publication
    Role of serum response factor expression in prostate cancer biochemical recurrence
    Background: Up to a third of prostate cancer patients fail curative treatment strategies such as surgery and radiation therapy in the form of biochemical recurrence (BCR), which can be predictive of poor outcome. Recent clinical trials have shown that men experiencing BCR might benefit from earlier intervention post-radical prostatectomy (RP). Therefore, there is an urgent need to identify earlier prognostic biomarkers which will guide clinicians in making accurate diagnosis and timely decisions on the next appropriate treatment. The objective of this study was to evaluate Serum Response Factor (SRF) protein expression following RP and to investigate its association with BCR. Materials and Methods: SRF nuclear expression was evaluated by immunohistochemistry (IHC) in TMAs across three international radical prostatectomy cohorts for a total of 615 patients. Log-rank test and Kaplan-Meier analyses were used for BCR comparisons. Stepwise backwards elimination proportional hazard regression analysis was used to explore the significance of SRF in predicting BCR in the context of other clinical pathological variables. Area under the curve (AUC) values were generated by simulating repeated random sub-samples. Results: Analysis of the immunohistochemical staining of benign versus cancer cores showed higher nuclear SRF protein expression in cancer cores compared with benign cores for all three TMAs analysed (P < 0.001, n = 615). Kaplan-Meier curves of the three TMAs combined showed that patients with higher SRF nuclear expression had a shorter time to BCR compared with patients with lower SRF expression (P < 0.001, n = 215). Together with pathological T stage T3, SRF was identified as a predictor of BCR using stepwise backwards elimination proportional hazard regression analysis (P = 0.0521). Moreover, ROC curves and AUC values showed that SRF was better than T stage in predicting BCR at years 3 and 5 following radical prostatectomy; the combination of SRF and T stage had a higher AUC value than the two taken separately. Conclusions: SRF assessment by IHC following RP could be useful in guiding clinicians to better identify patients for appropriate follow-up and timely treatment.
    Scopus© Citations 8  432
  • Publication
    Standardizing interestingness measures for association rules
    Interestingness measures provide information about association rules. The value of an interestingness measure is often interpreted relative to the overall range of the interestingness measure. However, properties of individual association rules can further restrict what value an interestingness measure can achieve. These additional constraints are not typically taken into account in analysis, potentially misleading the investigator. Considering the value of an interestingness measure relative to this further constrained range provides greater insight than the original range alone and can even alter researchers' impressions of the data. Standardizing interestingness measures takes these additional restrictions into account, resulting in values that provide a relative measure of the attainable values. We explore the impacts of standardizing interestingness measures on real and simulated data.
    Scopus© Citations 13  539