Wikiomics:Protein function prediction

From OpenWetWare

A large number of proteins still have no known function. Automated function prediction is an active research field with a growing community of bioinformaticians, as seen at the AFP-SIG meetings held at the ISMB 2005 conference and at the University of California San Diego in 2006.

Most often, only the sequence of the protein is known, but the structural genomics centers have also provided hundreds of protein structures of unknown function. Some proteins come from prokaryotes, where operons make it possible to infer the function of a protein from its genomic context; this is more complicated in eukaryotes. More generally, a correct prediction is easier when a protein has well-described homologs than when it belongs to a family of unknown biological role.
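
As a rough illustration of inference from genomic context in prokaryotes, the sketch below groups adjacent genes on the same strand into putative operons and suggests, for an unannotated gene, the annotations of its neighbours. The gene names, coordinates, annotations, and the 50-base gap cutoff are all invented for the example; real operon predictors use more evidence than intergenic distance.

```python
from collections import namedtuple

# Hypothetical gene record: coordinates are positions on one chromosome.
Gene = namedtuple("Gene", "name start end strand")

def group_putative_operons(genes, max_gap=50):
    """Group adjacent same-strand genes separated by at most max_gap bases."""
    groups = []
    for gene in sorted(genes, key=lambda g: g.start):
        prev = groups[-1][-1] if groups else None
        if (prev is not None and gene.strand == prev.strand
                and gene.start - prev.end <= max_gap):
            groups[-1].append(gene)
        else:
            groups.append([gene])
    return groups

def context_hints(groups, known):
    """For each unannotated gene, list the annotations of its group neighbours."""
    hints = {}
    for group in groups:
        for gene in group:
            if gene.name not in known:
                hints[gene.name] = sorted(
                    {known[g.name] for g in group if g.name in known})
    return hints

# Invented example data.
genes = [
    Gene("geneA", 100, 1000, "+"),
    Gene("geneX", 1020, 1800, "+"),   # unknown function
    Gene("geneB", 1830, 2700, "+"),
    Gene("geneC", 3500, 4200, "-"),
]
known = {
    "geneA": "tryptophan biosynthesis",
    "geneB": "tryptophan biosynthesis",
    "geneC": "transport",
}
groups = group_putative_operons(genes)
hints = context_hints(groups, known)
```

Here `geneX` sits between two genes annotated with the same process, so that process is offered as a hint. It is a hypothesis to check, not a prediction with a confidence value.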

The notion of protein function is broad and cannot easily be encoded without relying on a complex vocabulary. To address this, the Gene Ontology (GO) provides a hierarchical set of keywords, called GO terms, which describe different aspects of protein function at different levels of precision. GO is becoming a standard for proteome annotation and protein function prediction.
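
To make the idea of a hierarchy with different levels of precision concrete, here is a minimal sketch in Python. The term names and parent links form a tiny hand-written toy hierarchy, simplified for illustration; they are not read from the actual GO files, and the function names are hypothetical.

```python
# Toy hierarchy: each term maps to its more general parent terms.
# Simplified for illustration; the real GO is a much larger graph.
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "ATP binding": ["binding"],
    "binding": ["molecular_function"],
    "molecular_function": [],
}

def with_ancestors(terms, parents=PARENTS):
    """Return the given terms plus every more general term above them."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen

def depth(term, parents=PARENTS):
    """Longest path from a term to the root, a rough measure of precision."""
    above = parents.get(term, [])
    return 0 if not above else 1 + max(depth(p, parents) for p in above)

implied = with_ancestors(["protein kinase activity"])
```

A protein annotated with a precise term is implicitly annotated with all of its ancestors, which is why two predictions can agree at a general level while differing at a specific one.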

Among current software tools, several main strategies can be distinguished:

  • homology search and transfer of annotations:
    • sequence alignment
    • structure alignment
  • function inference by genomic context
  • phylogenomic approaches
  • prediction from structure using similarities that are not homology-based:
    • local sequence patterns
    • physico-chemical sequence features
    • 3D local sites
    • 3D physico-chemical features
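
The first strategy, homology search followed by transfer of annotations, can be sketched as follows. This is only an illustration under stated assumptions: the hit records, protein identifiers, and annotations are invented, and the E-value and identity cutoffs are arbitrary values chosen for the example, not recommended thresholds.

```python
from collections import namedtuple

# Hypothetical summary of one alignment hit from a homology search.
Hit = namedtuple("Hit", "subject evalue identity")

def transfer_annotations(hits, known, max_evalue=1e-10, min_identity=40.0):
    """Collect terms from annotated hits that pass both (arbitrary) cutoffs.

    Returns a dict mapping each term to the subject ids that support it.
    """
    support = {}
    for hit in hits:
        if hit.evalue > max_evalue or hit.identity < min_identity:
            continue  # hit too weak to trust for annotation transfer
        for term in known.get(hit.subject, ()):
            support.setdefault(term, []).append(hit.subject)
    return support

# Invented example data.
hits = [
    Hit("protA", 1e-50, 62.0),
    Hit("protB", 1e-20, 45.0),
    Hit("protC", 0.5, 25.0),   # fails both cutoffs, ignored
]
known = {
    "protA": {"kinase activity"},
    "protB": {"kinase activity", "ATP binding"},
    "protC": {"DNA binding"},
}
result = transfer_annotations(hits, known)
```

Keeping the list of supporting hits for each term, rather than only the terms, makes it easier to see which annotations rest on a single homolog.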

Servers which competed at the AFP-SIG 2005

See also the short summaries by the authors themselves at the official site of AFP-SIG 2005.

These servers transfer function annotations based on homology:

The other servers are:

  • SpearMint and RuleBase (not public yet) [6]
  • PhydBac [7, 8, 9, 10, 11] analyzes bacterial proteins using genomic context.
  • ProKnow [12] searches for known 3D folds, sequences, motifs, and functional linkages.

Basic tools

  • Wikiomics:BLAST and Wikiomics:PSI-BLAST [13, 14] are commonly used to search for homologous protein sequences by sequence alignment.
  • Prosite [15, 16] is a searchable database of sequence patterns that are associated with some biological functions.
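
To show what searching a sequence for a pattern involves, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It assumes the usual pattern syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repetition, "<" and ">" for the termini) and does not handle terminus markers placed inside square brackets. The function names and the test sequence are made up for the example.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {P} = anything but P
        element = element.replace("(", "{").replace(")", "}")    # (2,4) = repetition
        element = element.replace("<", "^").replace(">", "$")    # sequence termini
        parts.append(element)
    return "".join(parts)

def find_sites(sequence, pattern):
    """Return (0-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]

# N-glycosylation site pattern, scanned against an invented sequence.
sites = find_sites("MKNGSAANPTQ", "N-{P}-[ST]-{P}.")
```

Short patterns like this one match by chance in many proteins, so a hit is a hint to examine rather than evidence of function on its own.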

Other protein function prediction servers

JAFA is a meta-server for protein function prediction: it produces a prediction by aggregating the results of other servers. It is a good starting point, since it queries five servers (GoFigure, GOblet, InterProScan, GOtcha, PhydBac) and shows where their results agree and differ.
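
The general idea of comparing the output of several predictors can be sketched as a simple vote count. This is not JAFA's actual aggregation method, which is not described here; the server names and predicted terms below are placeholders invented for the example.

```python
from collections import Counter

def compare_predictions(predictions):
    """Count how many servers support each predicted term.

    predictions maps a server name to the set of terms it predicted.
    Returns (term, votes, supporting servers) tuples, most supported first.
    """
    votes = Counter()
    for terms in predictions.values():
        votes.update(set(terms))
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [
        (term, n, sorted(s for s, terms in predictions.items() if term in terms))
        for term, n in ranked
    ]

# Placeholder server names and invented outputs.
predictions = {
    "server_1": {"kinase activity", "ATP binding"},
    "server_2": {"kinase activity"},
    "server_3": {"transport"},
}
summary = compare_predictions(predictions)
```

Terms supported by several independent methods appear first, while terms reported by a single server remain visible so that disagreements are not hidden.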

Miscellaneous servers:

  • Protein Function Prediction Server [17] predicts protein function from PDB structures. An enzyme/non-enzyme predictor and an enzyme class predictor are available.
  • GoFigure [18] predicts the function of a gene or protein.
  • ProFunc [19, 20] predicts function from a protein structure.

See also Wikiomics:Searching for 3D functional sites in a protein structure.

Methods using non-sequential sequence features:

These methods are based on function transfer after homology searches:

Phylogenomic approaches:

See also


  1. Martí-Renom MA, Ilyin VA, and Sali A. DBAli: a database of protein structure alignments. Bioinformatics. 2001 Aug;17(8):746-7. DOI:10.1093/bioinformatics/17.8.746 | PubMed ID:11524379 | HubMed [dbali]
  2. Hawkins T, Luban S, and Kihara D. Enhanced automated function prediction using distantly related sequences and contextual association by PFP. Protein Sci. 2006 Jun;15(6):1550-6. DOI:10.1110/ps.062153506 | PubMed ID:16672240 | HubMed [pfp]
  3. Szafron D, Lu P, Greiner R, Wishart DS, Poulin B, Eisner R, Lu Z, Anvik J, Macdonell C, Fyshe A, and Meeuwis D. Proteome Analyst: custom predictions with explanations in a web-based tool for high-throughput proteome annotations. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W365-71. DOI:10.1093/nar/gkh485 | PubMed ID:15215412 | HubMed [pa]
  4. Lu P, Szafron D, Greiner R, Wishart DS, Fyshe A, Pearcy B, Poulin B, Eisner R, Ngo D, and Lamb N. PA-GOSUB: a searchable database of model organism protein sequences with their predicted Gene Ontology molecular function and subcellular localization. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D147-53. DOI:10.1093/nar/gki120 | PubMed ID:15608166 | HubMed [pa-gosub]
  5. Vinayagam A, König R, Moormann J, Schubert F, Eils R, Glatting KH, and Suhai S. Applying Support Vector Machines for Gene Ontology based gene function prediction. BMC Bioinformatics. 2004 Aug 26;5:116. DOI:10.1186/1471-2105-5-116 | PubMed ID:15333146 | HubMed [gopet]
  6. Wieser D, Kretschmann E, and Apweiler R. Filtering erroneous protein annotation. Bioinformatics. 2004 Aug 4;20 Suppl 1:i342-7. DOI:10.1093/bioinformatics/bth938 | PubMed ID:15262818 | HubMed [wieser2004]
  7. Enault F, Suhre K, Abergel C, Poirot O, and Claverie JM. Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics. 2003;19 Suppl 1:i105-7. DOI:10.1093/bioinformatics/btg1013 | PubMed ID:12855445 | HubMed [enault2003]
  8. Enault F, Suhre K, Poirot O, Abergel C, and Claverie JM. Phydbac (phylogenomic display of bacterial genes): An interactive resource for the annotation of bacterial genomes. Nucleic Acids Res. 2003 Jul 1;31(13):3720-2. DOI:10.1093/nar/gkg603 | PubMed ID:12824402 | HubMed [phydbac]
  9. Enault F, Suhre K, Poirot O, Abergel C, and Claverie JM. Phydbac2: improved inference of gene function using interactive phylogenomic profiling and chromosomal location analysis. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W336-9. DOI:10.1093/nar/gkh365 | PubMed ID:15215406 | HubMed [phydbac2]
  10. Suhre K and Claverie JM. FusionDB: a database for in-depth analysis of prokaryotic gene fusion events. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D273-6. DOI:10.1093/nar/gkh053 | PubMed ID:14681411 | HubMed [fusiondb]
  11. Enault F, Suhre K, and Claverie JM. Phydbac "Gene Function Predictor": a gene annotation tool based on genomic context analysis. BMC Bioinformatics. 2005 Oct 12;6:247. DOI:10.1186/1471-2105-6-247 | PubMed ID:16221304 | HubMed [phydbac2005]
  12. Pal D and Eisenberg D. Inference of protein function from protein structure. Structure. 2005 Jan;13(1):121-30. DOI:10.1016/j.str.2004.10.015 | PubMed ID:15642267 | HubMed [proknow]
  13. Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. DOI:10.1016/S0022-2836(05)80360-2 | PubMed ID:2231712 | HubMed [blast]
  14. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. DOI:10.1093/nar/25.17.3389 | PubMed ID:9254694 | HubMed [psiblast]
  15. Bucher P and Bairoch A. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. Proc Int Conf Intell Syst Mol Biol. 1994;2:53-61. PubMed ID:7584418 | HubMed [prosite_first]
  16. Hulo N, Sigrist CJ, Le Saux V, Langendijk-Genevaux PS, Bordoli L, Gattiker A, De Castro E, Bucher P, and Bairoch A. Recent improvements to the PROSITE database. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D134-7. DOI:10.1093/nar/gkh044 | PubMed ID:14681377 | HubMed [prosite_last]
  17. Dobson PD and Doig AJ. Distinguishing enzyme structures from non-enzymes without alignments. J Mol Biol. 2003 Jul 18;330(4):771-83. DOI:10.1016/s0022-2836(03)00628-4 | PubMed ID:12850146 | HubMed [dobson-doig]
  18. Khan S, Situ G, Decker K, and Schmidt CJ. GoFigure: automated Gene Ontology annotation. Bioinformatics. 2003 Dec 12;19(18):2484-5. DOI:10.1093/bioinformatics/btg338 | PubMed ID:14668239 | HubMed [gofigure]
  19. Laskowski RA, Watson JD, and Thornton JM. ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W89-93. DOI:10.1093/nar/gki414 | PubMed ID:15980588 | HubMed [profunc_a]
  20. Laskowski RA, Watson JD, and Thornton JM. Protein function prediction using local 3D templates. J Mol Biol. 2005 Aug 19;351(3):614-26. DOI:10.1016/j.jmb.2005.05.067 | PubMed ID:16019027 | HubMed [profunc_b]
  21. Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C, Andersen CA, Knudsen S, Krogh A, Valencia A, and Brunak S. Prediction of human protein function from post-translational modifications and localization features. J Mol Biol. 2002 Jun 21;319(5):1257-65. DOI:10.1016/S0022-2836(02)00379-0 | PubMed ID:12079362 | HubMed [protfun2002]
  22. Jensen LJ, Gupta R, Staerfeldt HH, and Brunak S. Prediction of human protein function according to Gene Ontology categories. Bioinformatics. 2003 Mar 22;19(5):635-42. DOI:10.1093/bioinformatics/btg036 | PubMed ID:12651722 | HubMed [protfun2003]
  23. Hobohm U and Sander C. A sequence property approach to searching protein databases. J Mol Biol. 1995 Aug 18;251(3):390-9. DOI:10.1006/jmbi.1995.0442 | PubMed ID:7650738 | HubMed [propsearch]
  24. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, and Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005 Sep 15;21(18):3674-6. DOI:10.1093/bioinformatics/bti610 | PubMed ID:16081474 | HubMed [blast2go]
  25. Zehetner G. OntoBlast function: From sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Res. 2003 Jul 1;31(13):3799-803. DOI:10.1093/nar/gkg555 | PubMed ID:12824422 | HubMed [ontoblast]
  26. Hennig S, Groth D, and Lehrach H. Automated Gene Ontology annotation for anonymous sequence data. Nucleic Acids Res. 2003 Jul 1;31(13):3712-5. DOI:10.1093/nar/gkg582 | PubMed ID:12824400 | HubMed [goblet2003]
  27. Groth D, Lehrach H, and Hennig S. GOblet: a platform for Gene Ontology annotation of anonymous sequence data. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W313-7. DOI:10.1093/nar/gkh406 | PubMed ID:15215401 | HubMed [goblet2004]
  28. Martin DM, Berriman M, and Barton GJ. GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics. 2004 Nov 18;5:178. DOI:10.1186/1471-2105-5-178 | PubMed ID:15550167 | HubMed [gotcha]
  29. Pazos F and Sternberg MJ. Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci U S A. 2004 Oct 12;101(41):14754-9. DOI:10.1073/pnas.0404569101 | PubMed ID:15456910 | HubMed [phunctioner]
  30. Storm CE and Sonnhammer EL. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics. 2002 Jan;18(1):92-9. DOI:10.1093/bioinformatics/18.1.92 | PubMed ID:11836216 | HubMed [orthostrapper]
  31. Zmasek CM and Eddy SR. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics. 2002 May 16;3:14. DOI:10.1186/1471-2105-3-14 | PubMed ID:12028595 | HubMed [rio]
  32. Engelhardt BE, Jordan MI, Muratore KE, and Brenner SE. Protein molecular function prediction by Bayesian phylogenomics. PLoS Comput Biol. 2005 Oct;1(5):e45. DOI:10.1371/journal.pcbi.0010045 | PubMed ID:16217548 | HubMed [sifter]
  33. Gouret P, Vitiello V, Balandraud N, Gilles A, Pontarotti P, and Danchin EG. FIGENIX: intelligent automation of genomic annotation: expertise integration in a new software platform. BMC Bioinformatics. 2005 Aug 5;6:198. DOI:10.1186/1471-2105-6-198 | PubMed ID:16083500 | HubMed [figenix]
  34. Friedberg I. Automated protein function prediction--the genomic challenge. Brief Bioinform. 2006 Sep;7(3):225-42. DOI:10.1093/bib/bbl004 | PubMed ID:16772267 | HubMed [afpreview]




  • Martin Jambon: introduction plus the initial list of tools and papers, put together after the AFP-SIG 2005 conference (at ISMB 2005)
  • other Wikiomics authors