- Publication: Sentence-Level Event Classification in Unstructured Texts
  The ability to correctly classify sentences that describe events is an important task for many natural language applications such as Question Answering (QA) and Text Summarisation. In this paper, we treat event detection as a sentence-level text classification problem. We compare the performance of two approaches to this task: a Support Vector Machine (SVM) classifier and a Language Modeling (LM) approach. We also investigate a rule-based method that uses hand-crafted lists of ‘trigger’ terms derived from WordNet. We use two datasets in our experiments and test each approach using six different event types, i.e., Die, Attack, Injure, Meet, Transport and Charge-Indict. Our experimental results indicate that although the trained SVM classifier consistently outperforms the language modeling approach, our rule-based system marginally outperforms the trained SVM classifier on three of our six event types. We also observe that overall performance is greatly affected by the type of corpus used to train the algorithms. Specifically, we have found that a homogeneous training corpus that contains many instances of a specific event type (e.g., Die events in the recent Iraqi war) produces a poorer performing classifier than one trained on a heterogeneous dataset containing more diverse instances of the event (e.g., Die events in many different settings, such as traffic accidents and natural disasters). Our heterogeneous dataset is provided by the ACE (Automatic Content Extraction) initiative, while our novel homogeneous dataset consists of news articles and annotated Die events from the Iraq Body Count (IBC) database. Overall, our results show that the techniques presented here are effective solutions to the event classification task described in this paper, where F1 scores of over 90% are achieved.
- Publication: Forensic analysis of Exfat Artefacts
  Although it keeps some basic concepts inherited from FAT32, the exFAT file system introduces many differences, such as the new mapping scheme of directory entries. The combination of the exFAT mapping scheme with the allocation of bitmap files and the use of the FAT leads to new forensic possibilities. The recovery of deleted files, including fragmented ones, and carving become more accurate than with earlier forensic processes. Accurate and sound forensic analysis is needed more than ever, as there is a high risk of erroneous interpretation. Indeed, most of the related work in the literature on exFAT structure and forensics is based mainly on reverse engineering research, and only a few studies cover forensic interpretation. In this paper, we propose a new methodology that uses exFAT file system features to improve the interpretation of inactive entries through bitmap file analysis and to recover file system metadata information for carved files. Experimental results show how our approach improves the accuracy of forensic interpretation.
- Publication: ADMIRE framework: Distributed Data Mining on Data Grid platforms
  In this paper, we present the ADMIRE architecture, a new framework for developing novel and innovative data mining techniques to deal with very large and distributed heterogeneous datasets in both commercial and academic applications. We detail the main ADMIRE components and its interfaces, which allow users to efficiently develop and implement their data mining applications and techniques on a Grid platform such as Globus ToolKit or DGET.
- Publication: Electronic Evidence Discovery, Identification and Preservation: Role of the First Responder and related capacity building challenges
  The integrity of electronic evidence is essential for judicial proceedings. In this context, the role of the First Responder in discovery, identification and preservation is considered one of the most critical short-term challenges. In the past, the number of devices to be collected was reasonably small and the items were easily identifiable; this is no longer the case. Many initiatives aim at harmonising technical and legal standards to facilitate electronic evidence exchange, although a consistent approach to the basic equipment and training of field police officers is still missing. Hence, in this paper, we study how synergies between different international organisations create and deploy an innovative and sustainable approach to address capacity building challenges related to the tasks assigned to the First Responder.
- Publication: Online Social Media in the Syria Conflict: Encompassing the Extremes and the In-Betweens
  The Syria conflict has been described as the most socially mediated in history, with online social media playing a particularly important role. At the same time, the ever-changing landscape of the conflict leads to difficulties in applying analysis approaches taken by other studies of online political activism. In this paper, we propose an approach motivated by the Grounded Theory method, which is used within the social sciences to perform analysis in situations where key prior assumptions or the proposal of an advance hypothesis may not be possible. We apply this method to analyze Twitter and YouTube activity of a range of protagonists to the conflict in an attempt to reveal additional insights into the relationships between them. By means of a network representation that combines multiple data views, we uncover communities of accounts falling into four categories that broadly reflect the situation on the ground in Syria. A detailed analysis of selected communities within the anti-regime categories is provided, focusing on their central actors, preferred online platforms, and activity surrounding real world events. Our findings indicate that social media activity in Syria is considerably more convoluted than reported in many other studies of online political activism, suggesting that alternative analysis approaches can play an important role in this type of scenario.
- Publication: Down the (White) Rabbit Hole: The Extreme Right and Online Recommender Systems
  In addition to hosting user-generated video content, YouTube provides recommendation services, where sets of related and recommended videos are presented to users, based on factors such as covisitation count and prior viewing history. This article is specifically concerned with extreme right (ER) video content, portions of which contravene hate laws and are thus illegal in certain countries, which are recommended by YouTube to some users. We develop a categorization of this content based on various schema found in a selection of academic literature on the ER, which is then used to demonstrate the political articulations of YouTube's recommender system, particularly the narrowing of the range of content to which users are exposed and the potential impacts of this. For this purpose, we use two data sets of English and German language ER YouTube channels, along with channels suggested by YouTube's related video service. A process is observable whereby users accessing an ER YouTube video are likely to be recommended further ER content, leading to immersion in an ideological bubble in just a few short clicks. The evidence presented in this article supports a shift of the almost exclusive focus on users as content creators and protagonists in extremist cyberspaces to also consider online platform providers as important actors in these same spaces.
- Publication: An Analysis of the Coherence of Descriptors in Topic Modeling
  In recent years, topic modeling has become an established method in the analysis of text corpora, with probabilistic techniques such as latent Dirichlet allocation (LDA) commonly employed for this purpose. However, it might be argued that adequate attention is often not paid to the issue of topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics. Nevertheless, a number of studies have proposed measures for analyzing such coherence, where these have been largely focused on topics found by LDA, with matrix decomposition techniques such as Non-negative Matrix Factorization (NMF) being somewhat overlooked in comparison. This motivates the current work, where we compare and analyze topics found by popular variants of both NMF and LDA in multiple corpora in terms of both their coherence and associated generality, using a combination of existing and new measures, including one based on distributional semantics. Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors. In all cases, we observe that the associated term weighting strategy plays a major role. The results observed with NMF suggest that this may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.