BioSysBio:abstracts/2007/Manuel Corpas

From OpenWetWare

PFF – An integrated database of residues and fragments critical for protein folding

Author(s): Corpas M., Sinnott J., Thorne D., Pettifer S., and Attwood T., on behalf of the PFF consortium
Affiliations: Faculty of Life Sciences and Computer Science, University of Manchester
Contact: email: corpas at bioinf point man point ac point uk
Keywords: 'Protein folding' 'Protein Data Bank' 'Folding Nucleus' 'Residue Stability'


Despite decades of work, understanding how proteins fold remains a major research challenge. This effort has produced (i) methods for predicting the structures that protein sequences are likely to adopt, or for simulating the folding process itself; and (ii) databases of structural information (e.g., 3D coordinates, fold classifications and structure summary data). As part of the ongoing endeavour to understand the principles of protein folding, we have been involved in the development of a new, integrated structure information resource, based on a small subset of the PDB (1). The resource contains information derived from a combination of sequence analysis tools, structure analysis software and fold simulation algorithms; to make the contents more accessible to the wider community, we have also developed a user-friendly front-end for visualising the integrated data. The motivation for combining data from these approaches is to offer insights into the role of particular types of residues and fragments in protein folding, and hence to improve our understanding of the factors that are critical to the folding process in general.


To initiate the study, we created a test-set of 116 representative folds from the PDB. For each fold, we calculated information such as the locations of tightened-end fragments (2), foldons (3), most interacting residues (4), topohydrophobic residues (5) and fingerprints (6), together with stability data derived from PoPMusic (7) and Fold-X (8). Results were stored in an integrated resource, which was later augmented by tools from the UTOPIA project; these were adapted to interactively visualise the PFF annotations on their respective 3D structures. The visual toolkit includes features for searching and browsing the dataset, and for displaying the relationships between annotated 3D structures and multiple sequence alignments.


Using results from the methods mentioned above, we created an integrated database of structural information (PFF). In addition, each entry includes both the sequence from Swiss-Prot (9) and its corresponding nucleotide sequence; secondary structure assignments derived from DSSP, and atomic and internal coordinates (including pseudo dihedral and valence angles), are also provided.
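As an illustration of the internal coordinates mentioned above, the following is a minimal sketch of how a valence angle and a pseudo dihedral angle can be computed from atomic coordinates. The abstract does not define these angles; the sketch assumes a common convention in which the valence angle is taken over three consecutive Cα atoms and the pseudo dihedral over four. The function names are hypothetical and are not part of PFF.

```python
import math

# Small 3D vector helpers on (x, y, z) tuples.
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def valence_angle(p1, p2, p3):
    """Angle in degrees at p2 between the directions p2->p1 and p2->p3."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def pseudo_dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees defined by four points.

    Assumption: the points are four consecutive C-alpha positions.
    """
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    x = _dot(n1, n2)
    y = _norm(b2) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))
```

For four points lying in one plane, the pseudo dihedral is 0 degrees when the outer points are on the same side of the central bond and 180 degrees (in magnitude) when they are on opposite sides.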

From an initial analysis of the data, we found, not surprisingly, that certain results were strongly correlated: e.g., residue accessibility values (denoting the degree of internal constraint on flexibility), Fold-X scores (denoting the stabilising contributions to the fold), PoPMusic values (denoting destabilising contributions), and lattice simulations (denoting the number of close neighbours or interaction partners within the fold). We used these values to synthesise a ‘folding score’ for each residue of each of the representative folds in the data-set.
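The abstract does not state how the individual values are combined into the folding score. Purely as an illustration, the sketch below standardises each per-residue metric to z-scores and takes a weighted mean per residue. The function names, the metric names and the weights are all hypothetical, and the sign given to each metric is an assumption that would need to follow the actual PFF definition.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardise a list of numbers to zero mean and unit variance."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]

def folding_score(metrics, weights=None):
    """Combine per-residue metrics into one score per residue.

    metrics: dict mapping a metric name to a list of per-residue values.
    weights: optional dict of signed weights (hypothetical; default 1.0 each).
    """
    names = sorted(metrics)
    lengths = {len(metrics[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all metrics must cover the same residues")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(abs(weights[name]) for name in names)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    standardised = {name: zscores(metrics[name]) for name in names}
    n_residues = lengths.pop()
    return [
        sum(weights[name] * standardised[name][i] for name in names) / total
        for i in range(n_residues)
    ]
```

Standardising first keeps a metric with a large numeric range from dominating the combined value; a negative weight can be used for a metric whose contribution should count in the opposite direction.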

Coupled with the degree of conservation of residues, determined from their parent sequence alignments, we used the folding score to delineate regions that are likely to contribute to (i) the stability of the fold (and hence may contribute to the folding nucleus), and (ii) the function of the protein. We present here a simple case-study to illustrate how the combined data can be used to pinpoint such motifs with potential structural and functional roles (see Figure 1).

Fig. 1 Chloramphenicol Acetyltransferase Type III (PDB code: 3cla). The folding score is shown in purple. Folding score troughs (highlighted by blue bars) pinpoint regions that are likely to be important in stabilising the fold, and correlate well with regions rich in topohydrophobic residues ('T'). Conservation scores (red line) are calculated from the parent alignment using the Scorecons server (10). Folding score peaks (highlighted by green bars) that coincide with regions of high conservation indicate potential functional regions. Manually selected motifs are denoted by black bars; other highly conserved regions are denoted by grey bars. The conservation scores are normalised and expressed as percentages ranging from 0 to 100. The horizontal axis denotes position along the sequence.
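The reading of Figure 1 described in the caption can be sketched in code: rescale both profiles to a 0 to 100 range, then report contiguous runs of residues where the folding score is low (candidate stabilising regions) or where the folding score and conservation are both high (candidate functional regions). This is only an illustrative sketch; the min-max rescaling, the cut-off values and the minimum run length are assumptions, not values taken from PFF.

```python
def to_percent(values):
    """Min-max rescale a list of numbers to the range 0 to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [100.0 * (v - lo) / (hi - lo) for v in values]

def runs(flags, min_len=1):
    """Return inclusive, 0-based (start, end) pairs for runs of True values."""
    found, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                found.append((start, i - 1))
            start = None
    if start is not None and len(flags) - start >= min_len:
        found.append((start, len(flags) - 1))
    return found

def candidate_regions(folding, conservation,
                      low=25.0, high=75.0, cons_cutoff=70.0, min_len=2):
    """Find trough and conserved-peak regions (all thresholds hypothetical)."""
    if len(folding) != len(conservation):
        raise ValueError("profiles must have the same length")
    f, c = to_percent(folding), to_percent(conservation)
    troughs = runs([x <= low for x in f], min_len)
    peaks = runs([x >= high and y >= cons_cutoff for x, y in zip(f, c)], min_len)
    return {"stability": troughs, "function": peaks}
```

Requiring a minimum run length suppresses single-residue fluctuations, so that only extended troughs or peaks are reported as candidate motifs.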


A goal of the PFF consortium was to create a consensus "prediction" tool combining the strengths of different sequence and structure analysis methods. We found that integrating different methods did add value over individual approaches. Generating a combined ‘folding score’ that can pinpoint likely folding regions (troughs) and likely functional regions (peaks) offers a means of automatic motif detection; this can be used for protein family characterisation and for functional/structural annotation of evolutionarily conserved regions.


Version 1.0 of the PFF dataset is accessible in a DSSP-flat-file format from; it is also available in an XML format through the UTOPIA toolkit, together with the UTOPIA visualisation tools for OS X, Windows and Linux at The Web resource for calculating combined folding scores is accessible at For a more detailed explanation on the meaning and biological implications of the folding score please refer to


1. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) Nucleic Acids Res, 28, 235-242.

2. Lamarine, M., Mornon, J.P., Berezovsky, N. and Chomilier, J. (2001) Cell Mol Life Sci, 58, 492-498.

3. Maity, H., Maity, M., Krishna, M.M., Mayne, L. and Englander, S.W. (2005) Proc Natl Acad Sci U S A, 102, 4741-4746.

4. Papandreou, N., Berezovsky, I.N., Lopes, A., Eliopoulos, E. and Chomilier, J. (2004) Eur J Biochem, 271, 4762-4768.

5. Poupon, A. and Mornon, J.P. (1998) Proteins, 33, 329-342.

6. Attwood, T.K., Bradley, P., Flower, D.R., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G., Nordle, A., Paine, K., Taylor, P. et al. (2003) Nucleic Acids Res, 31, 400-402.

7. Gilis, D. and Rooman, M. (2000) Protein Eng, 13, 849-856.

8. Schymkowitz, J.W., Rousseau, F., Martins, I.C., Ferkinghoff-Borg, J., Stricher, F. and Serrano, L. (2005) Proc Natl Acad Sci U S A, 102, 10147-10152.

9. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. et al. (2005) Nucleic Acids Res, 33 Database Issue, D154-159.

10. Valdar, W.S. (2002) Proteins, 48, 227-241.