  • Publication
    Online Trans-dimensional von Mises-Fisher Mixture Models for User Profiles
    (Journal of Machine Learning Research, 2016)
    The proliferation of online communities has attracted much attention to modelling user behaviour in terms of social interaction, language adoption and contribution activity. Nevertheless, when applied to large-scale and cross-platform behavioural data, existing approaches generally suffer from expressiveness, scalability and generality issues. This paper proposes trans-dimensional von Mises-Fisher (TvMF) mixture models for L2 normalised behavioural data, which encapsulate: (1) a Bayesian framework for vMF mixtures that enables prior knowledge and information sharing among clusters, (2) an extended version of the reversible jump MCMC algorithm that allows adaptive changes in the number of clusters for vMF mixtures when the model parameters are updated, and (3) an online TvMF mixture model that accommodates the dynamics of clusters for time-varying user behavioural data. We develop efficient collapsed Gibbs sampling techniques for posterior inference, which facilitate parallelism for parameter updates. Empirical results on simulated and real-world data show that the proposed TvMF mixture models can discover more interpretable and intuitive clusters than other widely-used models, such as k-means, non-negative matrix factorization (NMF), Dirichlet process Gaussian mixture models (DP-GMM), and dynamic topic models (DTM). We further evaluate the performance of the proposed models in real-world applications, such as churn prediction, which shows the usefulness of the features generated.
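    As a minimal illustration of the data representation this abstract assumes, behavioural count vectors can be L2-normalised onto the unit hypersphere, where a vMF distribution models direction rather than magnitude. This sketch is illustrative only; the function names are not from the paper:

    ```python
    import math

    def l2_normalise(v):
        """Project a behavioural count vector onto the unit hypersphere."""
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            return list(v)  # leave the zero vector unchanged
        return [x / norm for x in v]

    def cosine(u, v):
        """Cosine similarity: the natural closeness measure for vMF clusters."""
        return sum(a * b for a, b in zip(l2_normalise(u), l2_normalise(v)))
    ```

    After normalisation, two users with proportionally similar activity profiles are close regardless of their absolute activity volumes.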
  • Publication
    All or Nothing: protein complexes flip essentiality between distantly related eukaryotes
    In the budding yeast Saccharomyces cerevisiae, the subunits of any given protein complex are either mostly essential or mostly nonessential, suggesting that essentiality is a property of molecular machines rather than individual components. There are exceptions to this rule, however, that is, nonessential genes in largely essential complexes and essential genes in largely nonessential complexes. Here, we provide explanations for these exceptions, showing that redundancy within complexes, as revealed by genetic interactions, can explain many of the former cases, whereas "moonlighting," as revealed by membership of multiple complexes, can explain the latter. Surprisingly, we find that redundancy within complexes cannot usually be explained by gene duplication, suggesting alternate buffering mechanisms. In the distantly related Schizosaccharomyces pombe, we observe the same phenomenon of modular essentiality, suggesting that it may be a general feature of eukaryotes. Furthermore, we show that complexes flip essentiality in a cohesive fashion between the two species, that is, they tend to change from mostly essential to mostly nonessential, or vice versa, but not to mixed patterns. We show that these flips in essentiality can be explained by differing lifestyles of the two yeasts. Collectively, our results support a previously proposed model where proteins are essential because of their involvement in essential functional modules rather than because of specific topological features such as degree or centrality.
  • Publication
    Hierarchical Modularity and the Evolution of Genetic Interactomes across Species
    To date, cross-species comparisons of genetic interactomes have been restricted to small or functionally related gene sets, limiting our ability to infer evolutionary trends. To facilitate a more comprehensive analysis, we constructed a genome-scale epistasis map (E-MAP) for the fission yeast Schizosaccharomyces pombe, providing phenotypic signatures for ~60% of the nonessential genome. Using these signatures, we generated a catalog of 297 functional modules, and we assigned function to 144 previously uncharacterized genes, including mRNA splicing and DNA damage checkpoint factors. Comparison with an integrated genetic interactome from the budding yeast Saccharomyces cerevisiae revealed a hierarchical model for the evolution of genetic interactions, with conservation highest within protein complexes, lower within biological processes, and lowest between distinct biological processes. Despite the large evolutionary distance and extensive rewiring of individual interactions, both networks retain conserved features and display similar levels of functional crosstalk between biological processes, suggesting general design principles of genetic interactomes.
  • Publication
    Detecting Voids in 3D Printing Using Melt Pool Time Series Data
    Powder Bed Fusion (PBF) has emerged as an important process in the additive manufacture of metals. However, PBF is sensitive to process parameters and careful management is required to ensure the high quality of parts produced. In PBF, a laser or electron beam is used to fuse powder to the part. It is recognised that the temperature of the melt pool is an important signal representing the health of the process. In this paper, Machine Learning (ML) methods on time-series data are used to monitor melt pool temperature to detect anomalies. In line with other ML research on time-series classification, Dynamic Time Warping and k-Nearest Neighbour classifiers are used. The presented process is effective in detecting voids in PBF. A strategy is then proposed to speed up classification time, an important consideration given the volume of data involved.
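    The pipeline the abstract describes, a Dynamic Time Warping distance combined with a k-Nearest Neighbour classifier over melt pool temperature series, can be sketched roughly as follows. This is a pure-Python sketch with k=1 and toy series, not the paper's actual implementation:

    ```python
    def dtw(a, b):
        """Dynamic Time Warping distance between two numeric sequences."""
        n, m = len(a), len(b)
        inf = float("inf")
        # D[i][j] = minimal cost of aligning a[:i] with b[:j]
        D = [[inf] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i][j] = cost + min(D[i - 1][j],      # insertion
                                     D[i][j - 1],      # deletion
                                     D[i - 1][j - 1])  # match
        return D[n][m]

    def classify_1nn(train, query):
        """Label a query series with the label of its DTW-nearest training series."""
        series, label = min(train, key=lambda t: dtw(t[0], query))
        return label
    ```

    Intuitively, an anomaly such as a void might appear as a transient excursion in melt pool temperature; 1-NN with DTW labels a new window by its closest aligned training example, tolerating small timing shifts.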
  • Publication
    An empirical evaluation of kernels for time series classification
    There exist a variety of distance measures for time series that can be used within kernels. The objective of this article is to compare those distance measures in a support vector machine setting. A support vector machine is a state-of-the-art classifier for static (non-time series) datasets and usually outperforms k-Nearest Neighbour, however it is often noted that 1-NN DTW is a robust baseline for time-series classification. Through a collection of experiments we determine that the most effective distance measure is Dynamic Time Warping and the most effective classifier is kNN. However, a surprising result is that the pairing of kNN and DTW is not the most effective model. Instead we have discovered via experimentation that Dynamic Time Warping paired with the Gaussian Support Vector Machine is the most accurate time series classifier. Finally, with good reason, we recommend a slightly less accurate model, Time Warp Edit Distance paired with the Gaussian Support Vector Machine, as it has a better theoretical basis. We also discuss the reduction in computational cost achieved by using a Support Vector Machine, finding that the Negative Kernel paired with the Dynamic Time Warping distance produces the greatest reduction in computational cost.
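    The winning combination reported here, the DTW distance plugged into a Gaussian (RBF-style) kernel for an SVM, can be sketched as below. Substituting DTW into exp(-d²/2σ²) does not in general yield a positive semi-definite kernel, which is the kind of theoretical caveat that motivates the authors' TWED recommendation. The function names and σ are illustrative assumptions:

    ```python
    import math

    def dtw(a, b):
        """Dynamic Time Warping distance (simple O(nm) recurrence)."""
        n, m = len(a), len(b)
        D = [[float("inf")] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
        return D[n][m]

    def gaussian_dtw_kernel(a, b, sigma=1.0):
        """Gaussian kernel with DTW substituted for the Euclidean distance."""
        d = dtw(a, b)
        return math.exp(-(d * d) / (2.0 * sigma * sigma))

    def gram_matrix(series, sigma=1.0):
        """Kernel matrix, e.g. for an SVM solver that accepts a precomputed kernel."""
        return [[gaussian_dtw_kernel(x, y, sigma) for y in series] for x in series]
    ```

    In practice such a Gram matrix can be passed to an SVM implementation that supports precomputed kernels, keeping the distance computation separate from the classifier.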
  • Publication
    Integration of multiple network views in Wikipedia
    One of the challenges in network data analysis is the determination of the most informative perspective on the network to use in analysis. This is particularly an issue when the network is dynamic and is defined by events that occur over time. We present an example of such a scenario in the analysis of edit networks in Wikipedia, the networks of editors interacting on Wikipedia pages. We propose the prediction of article quality as a task that allows us to quantify the informativeness of alternative network views. We present three fundamentally different views on the data that attempt to capture structural and temporal aspects of the edit networks. We demonstrate that each view captures information that is unique to that view and propose a strategy for integrating the different sources of information.
  • Publication
    A quantitative evaluation of the relative status of journal and conference publications in computer science
    While it is universally held by computer scientists that conference publications have a higher status in computer science than in other disciplines, there is little quantitative evidence in support of this position. The importance of journal publications in academic promotion makes this a significant issue, since an exclusive focus on journal papers will miss many significant papers published at conferences in computer science. In this paper we set out to quantify the relative importance of journal and conference papers in computer science. We show that computer science papers in leading conferences match the impact of papers in mid-ranking journals and surpass the impact of papers in journals in the bottom half of the ISI rankings, when impact is measured by citations in Google Scholar. We also show that there is a poor correlation between this measure of impact and conference acceptance rates. This indicates that conference publication is an inefficient market where venues that are equally challenging in terms of rejection rates offer quite different returns in terms of citations.
  • Publication
    Community detection: effective evaluation on large social networks
    (Oxford University Press, 2014)
    While many recently proposed methods aim to detect network communities in large datasets, such as those generated by social media and telecommunications services, most evaluation (i.e. benchmarking) of this research is based on small, hand-curated datasets. We argue that these two types of networks differ so significantly that, by evaluating algorithms solely on the smaller networks, we know little about how well they perform on the larger datasets. Recent work addresses this problem by introducing social network datasets annotated with meta-data that is believed to approximately indicate a 'ground truth' set of network communities. While such efforts are a step in the right direction, we find this meta-data problematic for two reasons. First, in practice, the groups contained in such meta-data may only be a subset of a network’s communities. Second, while it is often reasonable to assume that meta-data is related to network communities in some way, we must be cautious about assuming that these groups correspond closely to network communities. Here, we consider these difficulties and propose an evaluation scheme based on a classification task that is tailored to deal with them.
  • Publication
    Down the (White) Rabbit Hole: The Extreme Right and Online Recommender Systems
    In addition to hosting user-generated video content, YouTube provides recommendation services, where sets of related and recommended videos are presented to users, based on factors such as covisitation count and prior viewing history. This article is specifically concerned with extreme right (ER) video content, portions of which contravene hate laws and are thus illegal in certain countries, which are recommended by YouTube to some users. We develop a categorization of this content based on various schema found in a selection of academic literature on the ER, which is then used to demonstrate the political articulations of YouTube's recommender system, particularly the narrowing of the range of content to which users are exposed and the potential impacts of this. For this purpose, we use two data sets of English and German language ER YouTube channels, along with channels suggested by YouTube's related video service. A process is observable whereby users accessing an ER YouTube video are likely to be recommended further ER content, leading to immersion in an ideological bubble in just a few short clicks. The evidence presented in this article supports a shift of the almost exclusive focus on users as content creators and protagonists in extremist cyberspaces to also consider online platform providers as important actors in these same spaces.
  • Publication
    The influence of network structures of Wikipedia discussion pages on the efficiency of WikiProjects