OD-seq: outlier detection in multiple sequence alignments

Files in This Item:
File: s12859-015-0702-1.pdf; Size: 2.65 MB; Format: Adobe PDF
Title: OD-seq: outlier detection in multiple sequence alignments
Authors: Jehl, Peter
Sievers, Fabian
Higgins, D. (Des)
Permanent link: http://hdl.handle.net/10197/7310
Date: 25-Aug-2015
Abstract: Background: Multiple sequence alignments (MSA) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers and accompanying software called OD-seq. It is based on finding sequences whose average distance to the rest of the sequences in a dataset is anomalous. Results: The software can take an MSA, a distance matrix or a set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least O(N^2) for N sequences. This is prohibitive for large N but is reduced here by using the mBed algorithm from Clustal Omega. This reduces the complexity to O(N log(N)), which makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases of sequences from Pfam families, seeded with sequences from other Pfam families. Using an MSA as input, OD-seq is able to detect outliers with very high sensitivity and specificity. Conclusion: OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium-sized alignments, of a few thousand sequences, it can detect outliers in a few seconds.
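The idea in the abstract (compute each sequence's average distance to the rest, then flag anomalous averages using the interquartile range) can be illustrated with a minimal Python sketch. This is not the OD-seq implementation. The function names, the mismatch-fraction distance, and the 1.5 x IQR upper fence are assumptions chosen for illustration, and the sketch computes the full O(N^2) set of pairwise distances rather than using the mBed reduction mentioned above.

```python
from statistics import quantiles


def mismatch_distance(a: str, b: str) -> float:
    """Fraction of differing columns between two aligned sequences.

    Hypothetical metric for illustration; columns where both
    sequences have a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def average_distances(seqs: list[str]) -> list[float]:
    """Average distance of each sequence to all the others (O(N^2))."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    totals = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            d = mismatch_distance(seqs[i], seqs[j])
            totals[i] += d
            totals[j] += d
    return [t / (n - 1) for t in totals]


def iqr_outliers(avg: list[float], k: float = 1.5) -> list[int]:
    """Indices whose average distance exceeds Q3 + k * IQR.

    The multiplier k = 1.5 is an assumed default (Tukey's fence),
    not a value taken from the paper.
    """
    q1, _, q3 = quantiles(avg, n=4)
    threshold = q3 + k * (q3 - q1)
    return [i for i, d in enumerate(avg) if d > threshold]
```

Calling `iqr_outliers(average_distances(alignment))` on a list of equal-length aligned strings returns the positions of sequences whose average distance lies above the upper fence. Only the upper tail is tested, since an unusually small average distance does not indicate an outlier.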
Funding Details: Science Foundation Ireland
Type of material: Journal Article
Publisher: BioMed Central
Journal: BMC Bioinformatics
Volume: 16
Issue: 269
Start page: 1
End page: 11
Copyright (published version): 2015 the Authors
Keywords: Outlier; Multiple sequence alignment
DOI: 10.1186/s12859-015-0702-1
Language: en
Status of Item: Peer reviewed
Appears in Collections: Conway Institute Research Collection
Medicine Research Collection


Citations: 50 (checked on Nov 12, 2018)


This item is available under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Ireland licence. No item may be reproduced for commercial purposes. For other possible restrictions on use, please refer to the publisher's URL where this item is made available, or to notes contained in the item itself. Other terms may apply.