  • Publication
    A Systematic Comparison and Evaluation of k-Anonymization Algorithms for Practitioners
    The vast amount of data being collected about individuals has brought new challenges in protecting their privacy when this data is disseminated. As a result, Privacy-Preserving Data Publishing has become an active research area, in which multiple anonymization algorithms have been proposed. However, given the large number of algorithms available and the limited information regarding their performance, it is difficult to identify and select the most appropriate algorithm for a particular publishing scenario, especially for practitioners. In this paper, we perform a systematic comparison of three well-known k-anonymization algorithms to measure their efficiency (in terms of resource usage) and their effectiveness (in terms of data utility). We extend the scope of their original evaluation by employing a more comprehensive set of scenarios: different parameters, metrics and datasets. Using publicly available implementations of those algorithms, we conduct a series of experiments and a comprehensive analysis to identify the factors that influence their performance, in order to guide practitioners in the selection of an algorithm. We demonstrate, through experimental evaluation, the conditions under which one algorithm outperforms the others for a particular metric, depending on the input dataset and privacy requirements. Our findings motivate the necessity of creating methodologies that provide recommendations about the best algorithm for a particular publishing scenario.
  • Publication
    Enhancing the Utility of Anonymized Data by Improving the Quality of Generalization Hierarchies
    The dissemination of textual personal information has become an important driver of innovation. However, because this data may contain sensitive information, it must be anonymized. A commonly used technique to anonymize data is generalization. Nevertheless, its effectiveness can be hampered by the Value Generalization Hierarchies (VGHs) used, as poorly specified VGHs can decrease the usefulness of the resulting data. To tackle this problem, in our previous work we presented the Generalization Semantic Loss (GSL), a metric that captures the quality of categorical VGHs in terms of semantic consistency and taxonomic organization. We validated the accuracy of GSL using an intrinsic evaluation with respect to a gold standard ontology. In this paper, we extend our previous work by conducting an extrinsic evaluation of GSL with respect to the performance that VGHs have in anonymization (using data utility metrics). We show how GSL can be used to perform an a priori assessment of the VGHs' effectiveness for anonymization. In this manner, data publishers can quantitatively compare the quality of various VGHs and identify (before anonymization) those that better retain the semantics of the original data. Consequently, the utility of the anonymized datasets can be improved without sacrificing the privacy goal. Our results demonstrate the accuracy of GSL, as the quality of VGHs measured with GSL strongly correlates with the utility of the anonymized data. Results also show the benefits that an a priori VGH assessment strategy brings to the anonymization process in terms of time savings and a reduction in the dependency on expert knowledge. Finally, GSL also proved to be lightweight in terms of computational resources.
  • Publication
    Ontology-Based Quality Evaluation of Value Generalization Hierarchies for Data Anonymization
    In privacy-preserving data publishing, approaches using Value Generalization Hierarchies (VGHs) form an important class of anonymization algorithms. VGHs play a key role in the utility of published datasets as they dictate how the anonymization of the data occurs. For categorical attributes, it is imperative to preserve the semantics of the original data in order to achieve a higher utility. Despite this, semantics have not been formally considered in the specification of VGHs. Moreover, there are no methods that allow users to assess the quality of their VGHs. In this paper, we propose a measurement scheme, based on ontologies, to quantitatively evaluate the quality of VGHs, in terms of semantic consistency and taxonomic organization, with the aim of producing higher-quality anonymizations. We demonstrate, through a case study, how our evaluation scheme can be used to compare the quality of multiple VGHs and can help to identify faulty VGHs.
  • Publication
    Synthetic Data Generation using Benerator Tool
    (University College Dublin. School of Computer Science and Informatics, 2013-10-29)
    Datasets with different characteristics are needed by the research community for experimental purposes. However, real data may be difficult to obtain due to privacy concerns. Moreover, real data may not have the specific characteristics needed to verify new approaches under certain conditions. Given these limitations, the use of synthetic data is a viable alternative to complement the real data. In this report, we describe the process followed to generate synthetic data using Benerator, a publicly available tool. The results show that the synthetic data preserves a high level of accuracy compared to the original data. The generated datasets correspond to microdata containing records with social, economic and demographic data that mimic the distribution of aggregated statistics from the 2011 Irish Census data.