
Variational Bayesian inference for the Latent Position Cluster Model

2009-12, Salter-Townshend, Michael, Murphy, Thomas Brendan

Many recent approaches to modelling social networks have focused on embedding the actors in a latent “social space”. Links are more likely for actors that are close in social space than for actors that are distant. In particular, the Latent Position Cluster Model (LPCM) [1] allows for explicit modelling of the clustering that is exhibited in many network datasets. However, inference for the LPCM via MCMC is cumbersome, and scaling this model to large or even medium-sized networks with many interacting nodes is a challenge. Variational Bayesian methods offer one solution to this problem. An approximate, closed-form posterior is formed, with unknown variational parameters. These parameters are tuned to minimize the Kullback-Leibler divergence between the approximate variational posterior and the true posterior, which is known only up to proportionality. The variational Bayesian approach is shown to give a computationally efficient way of fitting the LPCM. The approach is demonstrated on a number of data sets and is shown to give a good fit.
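
To make the tuning step concrete, the standard variational-Bayes identity behind it can be written as follows (a generic sketch, not the paper's specific derivation; z collects the latent positions, cluster labels and cluster parameters, y is the observed network):

```latex
\mathrm{KL}\!\left(q_{\theta}(z)\,\middle\|\,p(z \mid y)\right)
  = \log p(y)
  - \underbrace{\mathbb{E}_{q_{\theta}}\!\left[\log p(z, y) - \log q_{\theta}(z)\right]}_{\mathrm{ELBO}(\theta)}
```

Since log p(y) does not depend on the variational parameters, minimising the KL divergence over θ is equivalent to maximising the ELBO, which avoids the intractable normalising constant.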


Standardizing interestingness measures for association rules

2018-12, Shaikh, Mateen, McNicholas, Paul D., Antonie, M. Luiza, Murphy, Thomas Brendan

Interestingness measures provide information about association rules. The value of an interestingness measure is often interpreted relative to its overall range. However, properties of individual association rules can further restrict the values an interestingness measure can achieve. These additional constraints are not typically taken into account in analysis, potentially misleading the investigator. Considering the value of an interestingness measure relative to this further constrained range provides greater insight than the original range alone and can even alter researchers' impressions of the data. Standardizing interestingness measures takes these additional restrictions into account, expressing each value relative to the range it can actually attain. We explore the impacts of standardizing interestingness measures on real and simulated data.
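
As a minimal illustration of the idea, the sketch below standardises the lift of a single rule using only the Fréchet bounds that the marginal supports impose on the joint support; the paper's standardisation can fold in further constraints (e.g. minimum support or confidence thresholds), and the function name and example numbers here are illustrative.

```python
def standardised_lift(s_a, s_b, s_ab):
    """Rescale lift to the range it can actually attain given the marginal
    supports of the antecedent (s_a) and consequent (s_b).
    Sketch using only the Fréchet bounds on the joint support."""
    lift = s_ab / (s_a * s_b)
    lower = max(s_a + s_b - 1.0, 0.0) / (s_a * s_b)   # smallest attainable lift
    upper = min(s_a, s_b) / (s_a * s_b)               # largest attainable lift
    return (lift - lower) / (upper - lower)

# A rule with s_a = 0.6, s_b = 0.7, s_ab = 0.45 has raw lift of about 1.07,
# but sits exactly halfway through its attainable range (standardised value 0.5).
print(standardised_lift(0.6, 0.7, 0.45))
```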


Preferences in college applications - a nonparametric Bayesian analysis of top-10 rankings

2010-12-10, Ali, Alnur, Murphy, Thomas Brendan, Meila, Marina, Chen, Harr

Applicants to degree courses in Irish colleges and universities rank up to ten degree courses from a list of over five hundred. These data provide a wealth of information concerning applicant degree choices. A Dirichlet process mixture of generalized Mallows models is used to explore data from a cohort of applicants. We find strong and diverse clusters, which in turn give important insights into the workings of the system. No previously tried model or analysis technique is able to model the data with comparable accuracy.
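
For readers unfamiliar with Mallows-type models, the sketch below evaluates a single-parameter Mallows distribution over permutations of a small item set; the paper uses a Dirichlet process mixture of generalized Mallows models for top-10 rankings, so this is only the basic building block, with hypothetical course names.

```python
import numpy as np
from itertools import permutations

def kendall_distance(pi, sigma):
    """Number of pairwise disagreements between two rankings of the same items."""
    pos_pi = {item: i for i, item in enumerate(pi)}
    pos_sig = {item: i for i, item in enumerate(sigma)}
    items = list(pi)
    d = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if (pos_pi[a] - pos_pi[b]) * (pos_sig[a] - pos_sig[b]) < 0:
                d += 1
    return d

def mallows_pmf(items, centre, theta):
    """P(pi) proportional to exp(-theta * d(pi, centre)); normalised by brute
    force over all permutations, so only feasible for small item sets."""
    perms = list(permutations(items))
    weights = np.array([np.exp(-theta * kendall_distance(p, centre)) for p in perms])
    return dict(zip(perms, weights / weights.sum()))

pmf = mallows_pmf(("CS", "Law", "Medicine", "Arts"),
                  centre=("Medicine", "Law", "CS", "Arts"), theta=0.8)
print(max(pmf, key=pmf.get))   # the central ranking is the modal ranking
```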


Semi-supervised linear discriminant analysis

2011-12, Toher, Deirdre, Downey, Gerard, Murphy, Thomas Brendan

Fisher's linear discriminant analysis is one of the most commonly used and studied classification methods in chemometrics. The method finds a projection of multivariate data into a lower dimensional space so that the groups in the data are well separated. The resulting projected values are subsequently used to classify unlabeled observations into the groups. A semi-supervised version of Fisher's linear discriminant analysis is developed, so that the unlabeled observations are also used in the model fitting procedure. This approach is advantageous when few labeled and many unlabeled observations are available. The semi-supervised linear discriminant analysis method is demonstrated on a number of data sets where it is shown to yield better separation of the groups and improved classification over Fisher's linear discriminant analysis.
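
A compact numerical sketch of the underlying idea (model-based discriminant analysis with a common covariance, where labelled rows keep their class and unlabelled rows contribute through soft responsibilities) is given below; it is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def semi_supervised_lda(X_lab, y_lab, X_unlab, n_iter=50):
    """EM-style sketch: labelled rows keep their class; unlabelled rows enter
    the mean and pooled-covariance updates through soft responsibilities."""
    X_lab, X_unlab = np.asarray(X_lab, float), np.asarray(X_unlab, float)
    y_lab = np.asarray(y_lab)
    classes = np.unique(y_lab)
    G = len(classes)
    X = np.vstack([X_lab, X_unlab])
    n_lab = len(X_lab)
    z = np.zeros((len(X), G))                    # membership weights
    for g, c in enumerate(classes):
        z[:n_lab, g] = (y_lab == c)              # fixed for labelled rows
    z[n_lab:] = 1.0 / G                          # uniform start for unlabelled rows
    for _ in range(n_iter):
        # M-step: proportions, class means and pooled covariance from soft counts
        Nk = z.sum(axis=0)
        pi = Nk / Nk.sum()
        mu = (z.T @ X) / Nk[:, None]
        Sigma = sum((z[:, g, None] * (X - mu[g])).T @ (X - mu[g]) for g in range(G))
        Sigma /= Nk.sum()
        # E-step: update responsibilities of the unlabelled rows only
        inv_S = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        log_r = np.empty((len(X_unlab), G))
        for g in range(G):
            d = X_unlab - mu[g]
            log_r[:, g] = np.log(pi[g]) - 0.5 * ((d @ inv_S) * d).sum(axis=1) - 0.5 * logdet
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        z[n_lab:] = r / r.sum(axis=1, keepdims=True)
    return pi, mu, Sigma, z[n_lab:]
```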


A robust approach to model-based classification based on trimming and constraints

2019-08-14, Cappozzo, Andrea, Greselin, Francesca, Murphy, Thomas Brendan

In a standard classification framework a set of trustworthy learning data are employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Therefore, unreliable labelled observations, namely outliers and data with incorrect labels, can strongly undermine the classifier's performance, especially if the training set is small. The present work introduces a robust modification to the model-based classification framework, employing impartial trimming and constraints on the ratio between the maximum and the minimum eigenvalue of the group scatter matrices. The proposed method effectively handles the presence of noise in both the response and explanatory variables, providing reliable classification even when dealing with contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, are provided to underline the benefits of the proposed method.
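
The eigenvalue-ratio restriction itself is simple to state; the sketch below truncates the eigenvalues of a set of group scatter matrices so that their overall ratio is at most c. It is a simplified illustration: the truncation level here is chosen naively, whereas the actual estimator chooses it optimally and embeds the constraint in an algorithm that also performs impartial trimming.

```python
import numpy as np

def constrain_eigenvalue_ratio(scatter_matrices, c=10.0):
    """Truncate the eigenvalues of each group scatter matrix so that the ratio
    of the largest to the smallest eigenvalue across all groups is at most c.
    Naive sketch: the truncation interval is anchored at the overall largest
    eigenvalue rather than chosen by optimising the likelihood."""
    decomps = [np.linalg.eigh(S) for S in scatter_matrices]
    upper = max(vals.max() for vals, _ in decomps)
    lower = upper / c
    constrained = []
    for vals, vecs in decomps:
        clipped = np.clip(vals, lower, upper)
        constrained.append(vecs @ np.diag(clipped) @ vecs.T)
    return constrained
```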


A Mixture of Experts Latent Position Cluster Model for Social Network Data

2010-05, Gormley, Isobel Claire, Murphy, Thomas Brendan

Social network data represent the interactions between a group of social actors. Interactions between colleagues and friendship networks are typical examples of such data. The latent space model for social network data locates each actor in a network in a latent (social) space and models the probability of an interaction between two actors as a function of their locations. The latent position cluster model extends the latent space model to deal with network data in which clusters of actors exist — actor locations are drawn from a finite mixture model, each component of which represents a cluster of actors. A mixture of experts model builds on the structure of a mixture model by taking account of both observations and associated covariates when modeling a heterogeneous population. Herein, a mixture of experts extension of the latent position cluster model is developed. The mixture of experts framework allows covariates to enter the latent position cluster model in a number of ways, yielding different model interpretations. Estimates of the model parameters are derived in a Bayesian framework using a Markov Chain Monte Carlo algorithm. The algorithm is generally computationally expensive — surrogate proposal distributions which shadow the target distributions are derived, reducing the computational burden. The methodology is demonstrated through an illustrative example detailing relationships between a group of lawyers in the USA.
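
One of the ways covariates can enter the model is through the mixing proportions, via a multinomial-logistic gating function; a minimal sketch of that gating step (variable names illustrative) is:

```python
import numpy as np

def gating_probabilities(X, W):
    """Covariate-dependent mixing proportions: p(z_i = g | x_i) is a softmax of
    the linear predictors X @ W, where X is n x p (including an intercept
    column) and W is p x G with one coefficient vector per cluster."""
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)
```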


Model-Based clustering of microarray expression data via latent Gaussian mixture models

2010-11-01, McNicholas, Paul D., Murphy, Thomas Brendan

In recent years, work has been carried out on clustering gene expression microarray data. Some approaches are developed from an algorithmic viewpoint, whereas others are developed via the application of mixture models. In this article, a family of eight mixture models which utilizes the factor analysis covariance structure is extended to twelve models and applied to gene expression microarray data. This modelling approach builds on previous work by introducing a modified factor analysis covariance structure, leading to the family of twelve mixture models, including parsimonious models. This family of models allows for the modelling of the correlation between gene expression levels even when the number of samples is small. Parameter estimation is carried out using a variant of the expectation–maximization algorithm and model selection is achieved using the Bayesian information criterion. This expanded family of Gaussian mixture models, known as the expanded parsimonious Gaussian mixture model (EPGMM) family, is then applied to two well-known gene expression data sets.
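
The factor analysis covariance structure at the heart of the family can be sketched as follows; the parsimonious members arise by constraining the loadings and/or uniquenesses across components (a generic illustration, not the EPGMM code).

```python
import numpy as np

def factor_analysis_covariance(Lambda, psi):
    """Component covariance Sigma_g = Lambda_g Lambda_g' + Psi_g, with Lambda_g
    a p x q loading matrix (q << p) and Psi_g a diagonal matrix of uniquenesses.
    Parsimonious members share Lambda and/or Psi across components, or force
    Psi to be isotropic, reducing the number of covariance parameters."""
    return Lambda @ Lambda.T + np.diag(psi)

# e.g. p = 50 observed variables explained through q = 3 latent factors
rng = np.random.default_rng(1)
Sigma = factor_analysis_covariance(rng.normal(size=(50, 3)), np.ones(50))
```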


A grade of membership model for rank data

2009-06, Gormley, Isobel Claire, Murphy, Thomas Brendan

A grade of membership (GoM) model is an individual-level mixture model which allows individuals to have partial membership of the groups that characterize a population. A GoM model for rank data is developed to model the particular case in which the response data are rankings. A Metropolis-within-Gibbs sampler provides the framework for model fitting, but the intricate nature of the rank data models makes the selection of suitable proposal distributions difficult. 'Surrogate' proposal distributions are constructed using ideas from optimization transfer algorithms. Model fitting issues such as label switching and model selection are also addressed. The GoM model for rank data is illustrated through an analysis of Irish election data where voters rank some or all of the candidates in order of preference. Interest lies in highlighting distinct groups of voters with similar preferences (i.e. 'voting blocs') within the electorate, taking into account the rank nature of the response data, and in examining individuals' voting bloc memberships. The GoM model for rank data is fitted to data from an opinion poll conducted during the Irish presidential election campaign in 1997.
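
As a sketch of how the partial-membership structure combines with Plackett-Luce-type choice probabilities, one plausible way to write the likelihood contribution of voter i with membership vector g_i on the simplex is (notation illustrative rather than the paper's):

```latex
p(x_i \mid g_i)
  = \prod_{t=1}^{n_i} \sum_{k=1}^{K} g_{ik}\,
    \frac{p_{k,\,x_i(t)}}{\sum_{c \in R_{it}} p_{k,c}}
```

Here x_i(t) is the candidate ranked in position t by voter i, R_{it} is the set of candidates not yet ranked before position t, and p_k is the support-parameter vector of the k-th component.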


Bayesian Nonparametric Plackett-Luce Models for the Analysis of Preferences for College Degree Programmes

2014, Caron, François, Whye Teh, Yee, Murphy, Thomas Brendan

In this paper we propose a Bayesian nonparametric model for clustering partial ranking data. We start by developing a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a completely random measure. We characterise the posterior distribution given data, and derive a simple and effective Gibbs sampler for posterior simulation. We then develop a Dirichlet process mixture extension of our model and apply it to investigate the clustering of preferences for college degree programmes amongst Irish secondary school graduates. The existence of clusters of applicants who have similar preferences for degree programmes is established and we determine that subject matter and geographical location of the third-level institution characterise these clusters.
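
For reference, the finite Plackett-Luce model that the paper extends assigns a (possibly partial) ranking c_1 ≻ c_2 ≻ … ≻ c_m the probability

```latex
P(c_1 \succ c_2 \succ \dots \succ c_m \mid w)
  = \prod_{j=1}^{m} \frac{w_{c_j}}{\sum_{k \in R_j} w_k}
```

where w contains the positive item weights and R_j is the set of items still available at stage j; the nonparametric construction replaces the finite weight vector with the atoms of a completely random measure.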


Motor insurance claim modelling with factor collapsing and Bayesian model averaging

2018-03-26, Hu, Sen, O'Hagan, Adrian, Murphy, Thomas Brendan

Accidental damage is a typical component of a motor insurance claim. Modelling of this nature generally involves analysis of past claim history and different characteristics of the insured objects and the policyholders. Generalized linear models (GLMs) have become the industry's standard approach for pricing and modelling risks of this nature. However, the GLM approach utilizes a single best model on which loss predictions are based, which ignores the uncertainty associated with model and variable selection. An additional characteristic of motor insurance datasets is the presence of many categorical variables within which the number of levels is high. In particular, not all levels of such variables may be statistically significant; rather, some subsets of the levels may be merged to give a smaller overall number of levels for improved model parsimony and interpretability. A method is proposed for assessing the optimal manner in which to collapse a factor with many levels into one with a smaller number of levels; Bayesian model averaging (BMA) is then used to blend model predictions from all reasonable models to account for factor-collapsing uncertainty. This method is computationally intensive due to the number of factors being collapsed as well as the possibly large number of levels within factors. Hence a stochastic optimisation approach is proposed to quickly find the best collapsing cases across the model space.
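
The averaging step itself can be sketched in a few lines: given the BIC of each candidate collapsing and its predictions, approximate posterior model weights are proportional to exp(-ΔBIC/2) and the predictions are blended accordingly. This illustrates only the BMA step; the search over level mergings is where the stochastic optimisation comes in, and the function name is hypothetical.

```python
import numpy as np

def bma_blend(bics, predictions):
    """Approximate posterior model weights from BIC, w_m proportional to
    exp(-0.5 * (BIC_m - min BIC)), followed by a weighted average of the
    per-model predictions. One entry per candidate factor collapsing."""
    bics = np.asarray(bics, dtype=float)
    w = np.exp(-0.5 * (bics - bics.min()))
    w /= w.sum()
    blended = sum(wi * np.asarray(p, dtype=float) for wi, p in zip(w, predictions))
    return w, blended
```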