University College Dublin

OD-seq: outlier detection in multiple sequence alignments

Author(s)
Jehl, Peter  
Sievers, Fabian  
Higgins, Desmond G  
URI
http://hdl.handle.net/10197/7310
Date Issued
2015-08-25
Date Available
2015-12-16T10:36:35Z
Abstract
Background: Multiple sequence alignments (MSAs) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers and accompanying software called OD-seq. It is based on finding sequences whose average distance to the rest of the sequences in a dataset is anomalous.

Results: The software can take an MSA, a distance matrix or a set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least O(N^2) for N sequences. This is prohibitive for large N but is reduced here by using the mBed algorithm from Clustal Omega. This reduces the complexity to O(N log(N)), which makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases of sequences from Pfam families, seeded with sequences from other Pfam families. Using an MSA as input, OD-seq is able to detect outliers with very high sensitivity and specificity.

Conclusion: OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium-sized alignments of a few thousand sequences, it can detect outliers in a few seconds.
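
The interquartile-range step described in the abstract can be illustrated with a minimal Python sketch. This is not the OD-seq implementation: the function names (`mean_distances`, `iqr_outliers`), the Tukey-style multiplier `k=1.5`, and the choice to flag only unusually large average distances are all assumptions made for illustration. The sketch works on a full distance matrix, so it is O(N^2), and it covers neither the mBed speed-up nor the bootstrap variant.

```python
from statistics import quantiles


def mean_distances(dist):
    """Average distance of each sequence to all other sequences.

    `dist` is a square, symmetric distance matrix given as a list of lists.
    """
    n = len(dist)
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(dist)
    ]


def iqr_outliers(dist, k=1.5):
    """Indices of sequences whose average distance lies above Q3 + k * IQR.

    The multiplier k=1.5 is a hypothetical default (Tukey's fence),
    not a value taken from the paper.
    """
    means = mean_distances(dist)
    q1, _, q3 = quantiles(means, n=4)  # quartiles of the average distances
    upper = q3 + k * (q3 - q1)
    return [i for i, m in enumerate(means) if m > upper]


if __name__ == "__main__":
    # Toy data: six closely spaced points and one distant point on a line,
    # with pairwise distance defined as the absolute difference.
    points = [0, 1, 2, 3, 4, 5, 50]
    dist = [[abs(a - b) for b in points] for a in points]
    print(iqr_outliers(dist))
```

In the toy data the last point is far from all the others, so its average distance stands well apart from the rest of the distribution. Real use would start from distances between sequences rather than points on a line.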
Sponsorship
Science Foundation Ireland
Type of Material
Journal Article
Publisher
BioMed Central
Journal
BMC Bioinformatics
Volume
16
Issue
269
Start Page
1
End Page
11
Copyright (Published Version)
2015 the Authors
Subjects

Outlier

Multiple sequence ali...

DOI
10.1186/s12859-015-0702-1
Language
English
Status of Item
Peer reviewed
This item is made available under a Creative Commons License
https://creativecommons.org/licenses/by-nc-nd/3.0/ie/
File(s)
Name

s12859-015-0702-1.pdf

Size

2.59 MB

Format

Adobe PDF

Checksum (MD5)

06ec96636bba65410c82e333824021d4

Owning collection
Medicine Research Collection
Mapped collections
Conway Institute Research Collection

Item descriptive metadata is released under a CC-0 (public domain) license: https://creativecommons.org/public-domain/cc0/.
All other content is subject to copyright.

For all queries please contact research.repository@ucd.ie.
