OD-seq: outlier detection in multiple sequence alignments

DC Field | Value | Language
dc.contributor.author | Jehl, Peter | -
dc.contributor.author | Sievers, Fabian | -
dc.contributor.author | Higgins, Desmond G | -
dc.date.accessioned | 2015-12-16T10:36:35Z | -
dc.date.available | 2015-12-16T10:36:35Z | -
dc.date.copyright | 2015 the Authors | en_US
dc.date.issued | 2015-08-25 | -
dc.identifier.citation | BMC Bioinformatics | en_US
dc.identifier.uri | http://hdl.handle.net/10197/7310 | -
dc.description.abstract | Background: Multiple sequence alignments (MSAs) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers and accompanying software called OD-seq. It is based on finding sequences whose average distance to the rest of the sequences in a dataset is anomalous. Results: The software can take a MSA, distance matrix or set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least O(N^2) for N sequences. This is prohibitive for large N but is reduced here by using the mBed algorithm from Clustal Omega. This reduces the complexity to O(N log(N)), which makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases of sequences from Pfam families, seeded with sequences from other Pfam families. Using a MSA as input, OD-seq is able to detect outliers with very high sensitivity and specificity. Conclusion: OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium-sized alignments of a few thousand sequences, it can detect outliers in a few seconds. | en_US
dc.description.sponsorship | Science Foundation Ireland | en_US
dc.language.iso | en | en_US
dc.publisher | BMC Informatics | en_US
dc.subject | Outlier | en_US
dc.subject | Multiple sequence alignment | en_US
dc.title | OD-seq: outlier detection in multiple sequence alignments | en_US
dc.type | Journal Article | en_US
dc.internal.authorcontactother | fabian.sievers@ucd.ie | en_US
dc.status | Peer reviewed | en_US
dc.identifier.volume | 16 | en_US
dc.identifier.issue | 269 | en_US
dc.identifier.startpage | 1 | en_US
dc.identifier.endpage | 11 | en_US
dc.identifier.doi | 10.1186/s12859-015-0702-1 | -
dc.neeo.contributor | Jehl|Peter|aut| | -
dc.neeo.contributor | Sievers|Fabian|aut| | -
dc.neeo.contributor | Higgins|Desmond G|aut| | -
dc.internal.rmsid | 538835266 | -
dc.date.updated | 2015-11-17T15:32:39Z | -
dc.rights.license | https://creativecommons.org/licenses/by-nc-nd/3.0/ie/ | en
item.fulltext | With Fulltext | -
item.grantfulltext | open | -
Appears in Collections:
Conway Institute Research Collection
Medicine Research Collection

Files in This Item:
File | Size | Format
s12859-015-0702-1.pdf | 2.65 MB | Adobe PDF
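
The abstract describes flagging sequences whose average distance to the rest of the dataset is anomalous, using the interquartile range of those averages. The following is a minimal Python sketch of that idea, not the OD-seq implementation. The function names, the toy data, and the cutoff rule (Q3 + 1.5 × IQR, the common Tukey fence) are assumptions made for illustration; this record does not state which threshold OD-seq applies. The sketch also works on a full distance matrix, so it does not reflect the mBed speed-up mentioned in the abstract.

```python
from statistics import quantiles


def mean_distances(dist):
    """Average distance from each sequence to all other sequences."""
    n = len(dist)
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(dist)
    ]


def iqr_outliers(dist, k=1.5):
    """Indices whose mean distance exceeds Q3 + k * IQR.

    The cutoff rule and the default k are assumptions for this sketch.
    """
    means = mean_distances(dist)
    q1, _, q3 = quantiles(means, n=4, method="inclusive")
    cutoff = q3 + k * (q3 - q1)
    return [i for i, m in enumerate(means) if m > cutoff]


# Hypothetical toy data: points on a line stand in for sequences, and the
# absolute difference stands in for a pairwise sequence distance.
positions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0]
toy = [[abs(a - b) for b in positions] for a in positions]

print(iqr_outliers(toy))  # → [6]
```

In this toy matrix the last point lies far from the other six, so its mean distance (4.75) is well above the cutoff (about 1.18) and it is the only index reported.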
