Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression

Files in This Item:
insight_publication.pdf (182.96 kB, Adobe PDF)
Authors: Pedersen, Bjørn P.; Ifrim, Georgiana; Liboriussen, Poul; et al.
Permanent link: http://hdl.handle.net/10197/8371
Date: 20-Jan-2014
Abstract:
Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. The current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences, and SLR seems well suited to this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps that transport essential cations, was selected as a test case that would generate important biological information and provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem.
Results: Using SLR, we have built classifiers to identify P-type ATPases and automatically categorize them into one of 11 pre-defined classes. The SLR classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. We analysed 9.3 million sequences in the UniProtKB, representing the bulk of currently known sequences, and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps across organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases.
Conclusions: Using the SLR-based classification tool, we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem, and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
Funding Details: Science Foundation Ireland
Type of material: Journal Article
Publisher: Public Library of Science
Copyright (published version): 2014 the Authors
Keywords: Semantic web; Genome analysis; Sequence analysis
DOI: 10.1371/journal.pone.0085139
Language: en
Status of Item: Peer reviewed
Appears in Collections: Insight Research Collection

This item is available under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Ireland licence. No item may be reproduced for commercial purposes. For other possible restrictions on use, please refer to the publisher's URL where this item is made available, or to notes contained in the item itself. Other terms may apply.