  • Publication
    Model-Based and Nonparametric Approaches to Clustering for Data Compression in Actuarial Applications
    (Taylor and Francis, 2016)
    Clustering is used by actuaries in a data compression process to make massive or nested stochastic simulations practical to run. A large data set of assets or liabilities is partitioned into a user-defined number of clusters, each of which is compressed to a single representative policy. The representative policies can then simulate the behavior of the entire portfolio over a large range of stochastic scenarios. Such processes are becoming increasingly important in understanding product behavior and assessing reserving requirements in a big-data environment. This article proposes a variety of clustering techniques that can be used for this purpose. Initialization methods for performing clustering compression are also compared, including principal components, factor analysis and segmentation. A variety of methods for choosing a cluster's representative policy are considered. A real data set composed of variable annuity policies, provided by Milliman, is used to test the proposed methods. It is found that the compressed data sets produced by the new methods, namely model-based clustering, Ward's minimum variance hierarchical clustering and k-medoids clustering, can replicate the behavior of the uncompressed (seriatim) data more accurately than those obtained by the existing Milliman method. This is verified within sample by examining location variable totals of the representative policies versus the uncompressed data at the five levels of compression of interest. More crucially, it is also verified out of sample by comparing the distributions of the present values of several variables after 20 years across 1,000 simulated scenarios based on the compressed and seriatim data, using Kolmogorov-Smirnov goodness-of-fit tests and weighted sums of squared differences. (A brief R sketch of the compression workflow follows this entry.)
    Scopus© Citations 6
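    The following is a minimal sketch of the compression-and-validation idea described above, not the paper's exact pipeline or Milliman's method: the portfolio is simulated and the feature names (account_value, age, rider_charge) are hypothetical. It compresses the portfolio with k-medoids (cluster::pam), takes the medoids as representative policies weighted by cluster size, and checks in-sample totals against the seriatim data.
      ## Minimal sketch (simulated data, hypothetical feature names): compress a
      ## portfolio to k representative policies via k-medoids, then compare
      ## size-weighted totals of the representatives against the seriatim data.
      library(cluster)                         # pam() for k-medoids

      set.seed(1)
      n <- 5000
      portfolio <- data.frame(
        account_value = rlnorm(n, 11, 0.6),
        age           = round(runif(n, 35, 80)),
        rider_charge  = runif(n, 0.005, 0.02)
      )

      k   <- 50                                # user-defined compression level
      fit <- pam(scale(portfolio), k = k)      # k-medoids clustering
      # Ward's hierarchical clustering (hclust, method = "ward.D2") or
      # model-based clustering (mclust) could be substituted here.

      reps    <- portfolio[fit$id.med, ]       # representative policies (the medoids)
      weights <- as.numeric(table(fit$clustering))

      # In-sample check: size-weighted totals of the representatives vs. seriatim totals
      compressed_totals <- colSums(as.matrix(reps) * weights)
      seriatim_totals   <- colSums(portfolio)
      round(compressed_totals / seriatim_totals, 3)   # ratios near 1 indicate good replication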
  • Publication
    Motor insurance claim modelling with factor collapsing and Bayesian model averaging
    Accidental damage is a typical component of a motor insurance claim. Modeling of this nature generally involves analysis of past claim history and of various characteristics of the insured objects and the policyholders. Generalized linear models (GLMs) have become the industry's standard approach for pricing and modeling risks of this nature. However, the GLM approach bases loss predictions on a single best model, which ignores the uncertainty arising from the choice among competing models and from variable selection. An additional characteristic of motor insurance data sets is the presence of many categorical variables with a large number of levels. In particular, not all levels of such variables may be statistically significant; rather, some subsets of levels may be merged to give a smaller overall number of levels, improving model parsimony and interpretability. A method is proposed for assessing the optimal manner in which to collapse a factor with many levels into one with fewer levels; Bayesian model averaging (BMA) is then used to blend predictions from all reasonable models, accounting for the uncertainty introduced by factor collapsing. This method is computationally intensive, owing to the number of factors being collapsed as well as the possibly large number of levels within each factor. A stochastic optimization is therefore proposed to quickly find the best collapsing cases across the model space. (A brief R sketch of the collapsing-and-averaging idea follows this entry.)
    Scopus© Citations 5
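    As a rough illustration of the factor-collapsing idea, the sketch below fits a Poisson claim-frequency GLM under a few candidate collapsings of an eight-level rating factor and weights them with the standard BIC approximation to posterior model probabilities. The data, the candidate groupings and the variable names (region, exposure, claims) are all invented for illustration; the paper's full BMA and stochastic-search machinery is not reproduced here.
      ## Sketch (invented data and groupings): collapse an eight-level rating factor
      ## in a Poisson claim-frequency GLM and weight the candidate collapsings by the
      ## BIC approximation to posterior model probabilities.
      set.seed(2)
      n        <- 4000
      region   <- factor(sample(LETTERS[1:8], n, replace = TRUE))
      exposure <- runif(n, 0.2, 1)
      lambda   <- exposure * exp(-2 + 0.4 * (region %in% c("A", "B")) + 0.2 * (region == "C"))
      d        <- data.frame(claims = rpois(n, lambda), region, exposure)

      # Candidate collapsings: each vector maps the 8 original levels to merged levels
      collapsings <- list(
        full      = levels(region),
        three_grp = c("AB", "AB", "C", rep("rest", 5)),
        two_grp   = c("AB", "AB", rep("rest", 6))
      )

      fits <- lapply(collapsings, function(map) {
        d$region_c <- factor(map[as.integer(d$region)])
        glm(claims ~ region_c + offset(log(exposure)), family = poisson, data = d)
      })

      bic <- sapply(fits, BIC)
      w   <- exp(-0.5 * (bic - min(bic)))   # approximate posterior model probabilities
      round(w / sum(w), 3)                  # BMA weights across the candidate collapsings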
  • Publication
    Computational Aspects of Fitting Mixture Models via the Expectation-Maximization Algorithm
    The Expectation–Maximization (EM) algorithm is a popular tool in a wide variety of statistical settings, in particular in the maximum likelihood estimation of parameters when clustering using mixture models. A serious pitfall is that, when the likelihood function is multimodal, the algorithm may become trapped at a local maximum, resulting in an inferior clustering solution. In addition, convergence to an optimal solution can be very slow. Methods are proposed to address these issues: optimizing starting values for the algorithm and targeting maximization steps efficiently. It is demonstrated that these approaches can produce outcomes superior to initialization via random starts or hierarchical clustering, and that the rate of convergence to an optimal solution can be greatly improved. (An R sketch of starting-value sensitivity follows this entry.)
    Scopus© Citations 34
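    The sketch below illustrates the sensitivity of EM to its starting values, using the mclust package purely as a convenient Gaussian-mixture implementation (it is not the paper's own code): the default model-based hierarchical initialization is compared with EM runs started from random agglomeration pairs via hcRandomPairs, by recording the log-likelihoods attained.
      ## Sketch: sensitivity of EM to its starting values, using mclust. The default
      ## fit starts EM from model-based hierarchical clustering; the loop restarts EM
      ## from random agglomeration pairs and records the attained log-likelihoods.
      library(mclust)

      x <- scale(faithful)                # classic two-variable eruption data (base R)

      fit_default <- Mclust(x, G = 3)     # default hierarchical-clustering initialization

      set.seed(3)
      loglik_random <- replicate(20, {
        init <- list(hcPairs = hcRandomPairs(x))
        Mclust(x, G = 3, initialization = init)$loglik
      })

      fit_default$loglik                  # log-likelihood from the default start
      summary(loglik_random)              # spread of log-likelihoods across random starts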
  • Publication
    AI-based modeling and data-driven evaluation for smart manufacturing processes
    Smart manufacturing refers to optimization techniques that are implemented in production operations by utilizing advanced analytics approaches. With the widespread deployment of Industrial Internet of Things (IIoT) sensors in manufacturing processes, there is a growing need for effective approaches to data management. Embracing machine learning and artificial intelligence to take advantage of manufacturing data can lead to efficient and intelligent automation. In this paper, we conduct a comprehensive analysis based on evolutionary computing and neural network algorithms toward making semiconductor manufacturing smart. We propose a dynamic algorithm for gaining useful insights into semiconductor manufacturing processes and addressing various challenges. We elaborate on the use of a genetic algorithm and a neural network to propose an intelligent feature-selection algorithm. Our objective is to provide an advanced solution for controlling manufacturing processes and to offer perspective on the dimensions that enable manufacturers to access effective predictive technologies. (A generic R sketch of genetic-algorithm feature selection follows this entry.)
    Scopus© Citations 168
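    The paper's algorithm is not reproduced here; the following is a generic sketch of genetic-algorithm feature selection with a small single-hidden-layer network (nnet) as the fitness model, run on simulated data with invented variable names. It only shows the shape of such a procedure: binary chromosomes encode feature subsets, fitness is holdout RMSE, and truncation selection, a crude crossover and bit-flip mutation drive the search.
      ## Generic sketch (simulated data, invented names): genetic-algorithm feature
      ## selection with a small neural network (nnet) as the fitness model.
      library(nnet)

      set.seed(4)
      n <- 400; p <- 10
      X <- as.data.frame(matrix(rnorm(n * p), n, p))
      names(X) <- paste0("x", 1:p)
      y <- 2 * X$x1 - 1.5 * X$x3 + X$x5 + rnorm(n, sd = 0.5)   # only x1, x3, x5 matter
      train <- 1:300; test <- 301:n

      # Fitness: holdout RMSE of a single-hidden-layer network on the selected features
      fitness <- function(mask) {
        if (sum(mask) == 0) return(Inf)
        d <- data.frame(y = y, X[, mask == 1, drop = FALSE])
        fit <- nnet(y ~ ., data = d[train, ], size = 3, linout = TRUE,
                    trace = FALSE, maxit = 200)
        sqrt(mean((d$y[test] - predict(fit, d[test, ]))^2))
      }

      pop_size <- 20; generations <- 15
      pop <- matrix(rbinom(pop_size * p, 1, 0.5), pop_size, p)   # binary chromosomes

      for (g in 1:generations) {
        scores  <- apply(pop, 1, fitness)
        parents <- pop[order(scores)[1:(pop_size / 2)], ]        # truncation selection
        kids    <- parents[sample(nrow(parents), pop_size - nrow(parents), replace = TRUE), ]
        cut     <- sample(1:(p - 1), 1)
        kids[, (cut + 1):p] <- kids[sample(nrow(kids)), (cut + 1):p]   # shuffle tail segments (crude crossover)
        flips   <- matrix(rbinom(length(kids), 1, 0.05), nrow = nrow(kids))
        kids    <- abs(kids - flips)                             # bit-flip mutation
        pop     <- rbind(parents, kids)
      }

      best <- pop[which.min(apply(pop, 1, fitness)), ]
      names(X)[best == 1]                                        # selected feature subset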
  • Publication
    Investigation of parameter uncertainty in clustering using a Gaussian mixture model via jackknife, bootstrap and weighted likelihood bootstrap
    (Springer Science and Business Media LLC, 2019-05-28)
    Mixture models with (multivariate) Gaussian components are a popular tool in model-based clustering. Such models are often fitted by a procedure that maximizes the likelihood, such as the EM algorithm. At convergence, the maximum likelihood parameter estimates are typically reported, but in most cases little emphasis is placed on the variability associated with these estimates. In part this may be due to the fact that standard errors are not directly calculated in the model-fitting algorithm, either because they are not required to fit the model or because they are difficult to compute. The examination of standard errors in model-based clustering is therefore typically neglected. Sampling-based methods, such as the jackknife (JK), bootstrap (BS) and parametric bootstrap (PB), are intuitive, generalizable approaches to assessing parameter uncertainty in model-based clustering using a Gaussian mixture model. This paper provides a review and empirical comparison of the jackknife, bootstrap and parametric bootstrap methods for producing standard errors and confidence intervals for mixture parameters. The performance of such sampling methods in the presence of small and/or overlapping clusters requires consideration, however; here, the weighted likelihood bootstrap (WLBS) approach is demonstrated to be effective in addressing this concern within a model-based clustering framework. The JK, BS, PB and WLBS methods are illustrated and contrasted through simulation studies and through the classic Old Faithful data set and the Thyroid data set. The MclustBootstrap function, available in the most recent release of the popular R package mclust, facilitates the implementation of the JK, BS, PB and WLBS approaches to estimating parameter uncertainty in the context of model-based clustering. The JK, WLBS and PB approaches to variance estimation are shown to be robust and to provide good coverage across a range of real and simulated data sets when performing model-based clustering, but care is advised when using the BS in such settings. In the case of poor model fit (for example, for data with small and/or overlapping clusters), the JK and BS are found to suffer from being unable to fit the specified model in many of the sub-samples formed. The PB also suffers when model fit is poor, since its variance estimates are based on data sets simulated from the fitted model. The WLBS, however, generally provides a robust solution, because all observations are represented with some weight in each of the sub-samples formed under this approach. (A short R sketch using MclustBootstrap follows this entry.)
    Scopus© Citations 21
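    A short sketch of the workflow described above, assuming the interface of recent mclust releases: fit a Gaussian mixture to the Old Faithful data, then call MclustBootstrap with type = "wlbs" (the weighted likelihood bootstrap); the other resampling schemes discussed in the abstract are selected with type = "bs", "pb" or "jk".
      ## Sketch, assuming the interface of recent mclust releases: bootstrap
      ## inference for Gaussian mixture parameters via MclustBootstrap.
      library(mclust)

      fit <- Mclust(faithful, G = 2)          # two-component mixture on Old Faithful

      set.seed(5)
      boot <- MclustBootstrap(fit, nboot = 200, type = "wlbs")   # weighted likelihood bootstrap
      # type = "bs", "pb" or "jk" selects the nonparametric bootstrap, the parametric
      # bootstrap or the jackknife instead.

      summary(boot, what = "se")              # bootstrap standard errors
      summary(boot, what = "ci")              # percentile confidence intervals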
  • Publication
    Clustering with the multivariate normal inverse Gaussian distribution
    Many model-based clustering methods are based on a finite Gaussian mixture model. The Gaussian mixture model implies that the data scatter within each group is elliptically shaped. Hence non-elliptical groups are often modeled by more than one component, resulting in model over-fitting. An alternative is to use a mean–variance mixture of multivariate normal distributions with an inverse Gaussian mixing distribution (MNIG) in place of the Gaussian distribution, to yield a more flexible family of distributions. Under this model the component distributions may be skewed and have fatter tails than the Gaussian distribution. The MNIG-based approach is extended to include a broad range of eigendecomposed covariance structures. Furthermore, MNIG models in which the other distributional parameters are constrained are considered. The Bayesian Information Criterion is used to identify the optimal model and number of mixture components. The method is demonstrated on three sample data sets, and a novel variation on the univariate Kolmogorov–Smirnov test is used to assess goodness of fit. (A Gaussian-mixture stand-in sketch in R follows this entry.)
    Scopus© Citations 60
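    No MNIG implementation is cited in this entry, so the sketch below runs the Gaussian-mixture baseline that the abstract compares against (a stand-in, not the MNIG method itself): BIC-based selection over the eigendecomposed covariance structures and the number of components with mclust, followed by a crude univariate Kolmogorov–Smirnov check of one margin against data simulated from the fitted model. The diabetes data shipped with mclust is used for illustration.
      ## Stand-in sketch: the Gaussian-mixture baseline with mclust (not the MNIG
      ## model itself), selecting the covariance structure and number of components
      ## by BIC, then a crude univariate KS goodness-of-fit check on one margin.
      library(mclust)

      data(diabetes)                          # 3-group clinical data shipped with mclust
      x <- diabetes[, c("glucose", "insulin", "sspg")]

      fit <- Mclust(x, G = 1:5)               # BIC chooses the eigendecomposed structure and G
      summary(fit$BIC)                        # top models ranked by BIC

      # Compare the observed glucose values with draws simulated from the fitted
      # mixture (ties in the recorded data may trigger a ks.test warning).
      set.seed(6)
      sim_dat <- sim(fit$modelName, fit$parameters, n = nrow(x))   # first column = group label
      ks.test(x$glucose, sim_dat[, 2])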
  • Publication
    What are you grouping for? Insurance claims forecasting with cluster analysis
    (Institute and Faculty of Actuaries, 2020-07-08)
    Machine learning has increasingly become a tool for actuaries in the era of big data, and the idea of actuaries teaming up with data scientists has been continuously debated by industry leaders.