  • Publication
    Mixed-Membership of Experts Stochastic Blockmodel
    (Cambridge University Press, 2016-03)
    Social network analysis is the study of how links between a set of actors are formed. Typically, it is believed that links are formed in a structured manner, which may be due to, for example, political or material incentives, and which often may not be directly observable. The stochastic blockmodel represents this structure using latent groups which exhibit different connective properties, so that conditional on the group membership of two actors, the probability of a link being formed between them is represented by a connectivity matrix. The mixed membership stochastic blockmodel extends this model to allow actors to belong to different groups, depending on the interaction in question, providing further flexibility. Attribute information can also play an important role in explaining network formation. Network models which do not explicitly incorporate covariate information require the analyst to compare fitted network models to additional attributes in a post-hoc manner. We introduce the mixed membership of experts stochastic blockmodel, an extension to the mixed membership stochastic blockmodel which incorporates covariate actor information into the existing model. The method is illustrated with an application to the Lazega Lawyers dataset. Model and variable selection methods are also discussed.
  • Publication
    Overlapping Stochastic Community Finding
    Community finding in social network analysis is the task of identifying groups of people within a larger population who are more likely to connect to each other than to others in the population. Much existing research has focussed on non-overlapping clustering. However, communities in real-world social networks do overlap. This paper introduces a new community finding method based on overlapping clustering. A Bayesian statistical model is presented, together with a Markov chain Monte Carlo (MCMC) algorithm, which is evaluated in comparison with two existing overlapping community finding methods that are applicable to large networks. We evaluate our algorithm on networks with thousands of nodes and tens of thousands of edges.
  • Publication
    A finite mixture latent trajectory model for modeling ultrarunners' behavior in a 24-hour race
    A finite mixture latent trajectory model is developed to study the performance and strategy of runners in a 24-h long ultra running race. The model facilitates clustering of runners based on their speed and propensity to rest and thus reveals the strategies used in the race. Inference for the adopted latent trajectory model is achieved using an expectation-maximization algorithm. Fitting the model to data from the 2013 World Championships reveals three clearly separated clusters of runners who exhibit different strategies throughout the race. The strategies show that runners can be grouped in terms of their average moving speed and their propensity to rest during the race. The effect of age and gender on the probability of belonging to each cluster is also investigated.
  • Publication
    Joint Modelling of Multiple Network Views
    (Taylor and Francis, 2014-11-17)
    Latent space models (LSM) for network data were introduced by Hoff et al. (2002) under the basic assumption that each node of the network has an unknown position in a D-dimensional Euclidean latent space: generally the smaller the distance between two nodes in the latent space, the greater their probability of being connected. In this paper we propose a variational inference approach to estimate the intractable posterior of the LSM. In many cases, different network views on the same set of nodes are available. It can therefore be useful to build a model able to jointly summarise the information given by all the network views. For this purpose, we introduce the latent space joint model (LSJM) that merges the information given by multiple network views assuming that the probability of a node being connected with other nodes in each network view is explained by a unique latent variable. This model is demonstrated on the analysis of two datasets: an excerpt of 50 girls from 'Teenage Friends and Lifestyle Study' data at three time points and the Saccharomyces cerevisiae genetic and physical protein-protein interactions.
  • Publication
    Exponential family mixed membership models for soft clustering of multivariate data
    For several years, model-based clustering methods have successfully tackled many of the challenges presented by data-analysts. However, as the scope of data analysis has evolved, some problems may be beyond the standard mixture model framework. One such problem is when observations in a dataset come from overlapping clusters, whereby different clusters will possess similar parameters for multiple variables. In this setting, mixed membership models, a soft clustering approach whereby observations are not restricted to single cluster membership, have proved to be an effective tool. In this paper, a method for fitting mixed membership models to data generated by a member of an exponential family is outlined. The method is applied to count data obtained from an ultra running competition, and compared with a standard mixture model approach.
  • Publication
    Role Analysis in Networks Using Mixtures of Exponential Random Graph Models
    This article introduces a novel and flexible framework for investigating the roles of actors within a network. Particular interest is in roles as defined by local network connectivity patterns, identified using the ego-networks extracted from the network. A mixture of exponential-family random graph models (ERGM) is developed for these ego-networks to cluster the nodes into roles. We refer to this model as the ego-ERGM. An expectation-maximization algorithm is developed to infer the unobserved cluster assignments and to estimate the mixture model parameters using a maximum pseudo-likelihood approximation. We demonstrate the flexibility and utility of the method using examples of simulated and real networks.
  • Publication
    Bayesian Nonparametric Plackett-Luce Models for the Analysis of Preferences for College Degree Programmes
    (Institute of Mathematical Statistics, 2014)
    In this paper we propose a Bayesian nonparametric model for clustering partial ranking data. We start by developing a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with prior specified by a completely random measure. We characterise the posterior distribution given data, and derive a simple and effective Gibbs sampler for posterior simulation. We then develop a Dirichlet process mixture extension of our model and apply it to investigate the clustering of preferences for college degree programmes amongst Irish secondary school graduates. The existence of clusters of applicants who have similar preferences for degree programmes is established and we determine that subject matter and geographical location of the third level institution characterise these clusters.
  • Publication
    BayesLCA : An R Package for Bayesian Latent Class Analysis
    (Foundation for Open Access Statistics, 2014-11-25)
    The BayesLCA package for R provides tools for performing latent class analysis within a Bayesian setting. Three methods for fitting the model are provided, incorporating an expectation-maximization algorithm, Gibbs sampling and a variational Bayes approximation. The article briefly outlines the methodology behind each of these techniques and discusses some of the technical difficulties associated with them. Methods to remedy these problems are also described. Visualization methods for each of these techniques are included, as well as criteria to aid model selection.
  • Publication
    mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models
    (R Foundation for Statistical Computing, 2016-08-01)
    Finite mixture models are being used increasingly to model a wide variety of random phenomena for clustering, classification and density estimation. mclust is a powerful and popular package which allows modelling of data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a variety of purposes of analysis. Recently, version 5 of the package has been made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
  • Publication
    Evaluation of prediction models for the staging of prostate cancer
    Background: There are dilemmas associated with the diagnosis and prognosis of prostate cancer which have led to overdiagnosis and overtreatment. Prediction tools have been developed to assist the treatment of the disease. Methods: A retrospective review was performed of the Irish Prostate Cancer Research Consortium database and 603 patients were used in the study. Statistical models based on routinely used clinical variables were built using logistic regression, random forests and k nearest neighbours to predict prostate cancer stage. The predictive ability of the models was examined using discrimination metrics and calibration curves, and their clinical relevance was explored using decision curve analysis. The 2007 Partin table was then applied to the N=603 patients to compare the predictions from the current gold standard in staging prediction with those of the models developed in this study. Results: 30% of the study cohort had non-organ-confined disease. The model built using logistic regression achieved the highest discrimination metrics (AUC=0.622, Sens=0.647, Spec=0.601), best calibration and the most clinical relevance based on decision curve analysis. This model also achieved higher discrimination than the 2007 Partin table (ECE AUC=0.572 & 0.509 for T1c and T2a respectively). However, even the best statistical model does not accurately predict prostate cancer stage. Conclusions: This study has illustrated the inability of the current clinical variables and the 2007 Partin table to accurately predict prostate cancer stage. New biomarker features are urgently required to address the problem clinicians face in identifying the most appropriate treatment for their patients. This paper also demonstrated a concise methodological approach to evaluate novel features or prediction models.