Wikiomics:Percentage identity

How to compute the percentage identity between a pair of sequences?
The percentage identity for two sequences may take many different values. It is dependent on:
 * 1) The method used to align the sequences, e.g. BLAST, FASTA, Smith-Waterman (implemented in different programs), global alignment (implemented in different programs), or structural alignment from 3D comparison.
 * 2) The parameters used by the alignment method: local vs global alignment and all variations on this; the pair-score matrix used, e.g. BLOSUM62 or PET91; the gap penalty, e.g. its functional form and constants.
 * 3) The way percentage identity (PID) is calculated once the alignment has been obtained by some method above. There are many different ways of doing this. For example, divide the number of identities by:
   * a) the length of the shortest sequence.
   * b) the length of the alignment.
   * c) the mean length of the two sequences.
   * d) the number of non-gap positions.
   * e) the number of equivalenced positions excluding overhangs.
 * 4) The length of the sequences. PID is strongly length dependent: the shorter a pair of sequences is, the higher the PID you might expect by chance.

Clearly, factors 1-3 can affect the final number reported as "percentage identity", so it is very important that anyone who quotes a percentage identity says how it is calculated. Unfortunately, this is rarely done.
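To make the effect of the denominator concrete, here is a minimal Python sketch that takes one fixed pairwise alignment and reports PID under each of the five denominators listed in point 3. The function name `pid_variants`, the dictionary keys and the toy alignment are invented for illustration. The reading of "excluding overhangs" used here (all columns between the first and last column where both sequences have a residue, internal gaps included) is an assumption, since the text does not define it precisely.

```python
def _percent(identities: int, denominator: float) -> float:
    """Return 100 * identities / denominator, or 0.0 for an empty denominator."""
    return 100.0 * identities / denominator if denominator else 0.0


def pid_variants(aln_a: str, aln_b: str, gap: str = "-") -> dict:
    """Compute PID for one pairwise alignment using several denominators."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")

    # Ignore columns that are gaps in both rows; they carry no information.
    columns = [(x, y) for x, y in zip(aln_a, aln_b)
               if not (x == gap and y == gap)]

    identities = sum(1 for x, y in columns if x == y and x != gap)
    len_a = sum(1 for x, _ in columns if x != gap)   # ungapped length of A
    len_b = sum(1 for _, y in columns if y != gap)   # ungapped length of B

    # Columns in which both sequences have a residue.
    paired_idx = [i for i, (x, y) in enumerate(columns)
                  if x != gap and y != gap]
    paired = len(paired_idx)

    # Assumed reading of "excluding overhangs": trim terminal gap columns,
    # keep everything between the first and last paired column.
    core = paired_idx[-1] - paired_idx[0] + 1 if paired_idx else 0

    return {
        "shortest_sequence": _percent(identities, min(len_a, len_b)),
        "alignment_length": _percent(identities, len(columns)),
        "mean_length": _percent(identities, (len_a + len_b) / 2),
        "non_gap": _percent(identities, paired),
        "no_overhangs": _percent(identities, core),
    }


if __name__ == "__main__":
    # Toy alignment: 7 identities, a 2-residue overhang and 1 internal gap.
    for name, value in pid_variants("--ACDEFGHIK", "MKACD-FGHLK").items():
        print(f"{name:18s} {value:6.2f}")
```

For this toy alignment the same seven identities give values ranging from about 63.6% (alignment length) to 87.5% (non-gap positions), which is why the denominator needs to be stated alongside any quoted PID.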

A few years ago (1997), G.P.S. Raghava and I looked systematically at the effect of calculating PID in different ways (some of the options shown in 3) for a large set of structurally aligned protein pairs. We found that the reported PID could differ by up to 11.5% depending on the method used to calculate it, and by up to 14.6% depending on the algorithm used to calculate the alignment. Combining these two effects gave a PID variation of up to 22%. We also looked at the difference in PID between structural alignments and sequence alignments of the same pair of sequences. PID for structural alignments is almost always lower than for sequence alignments, because a sequence alignment is optimised against a score (the BLOSUM matrix) that rewards aligning identical residues.

In ASTRAL [astral2004], the sequences will have been aligned pair-wise, PID calculated, then some form of clustering applied to group together sequences that share PID above some threshold. Representative sequences from each group are then provided as a set. This is a way of removing obvious redundancy from a large set of sequences, but redundancy at some level will always remain. Whether the redundancy filtering in ASTRAL is good enough for what you are doing will depend on the use you plan for the set of sequences. We use the ASTRAL sets for some things and also those from Roland Dunbrack; both are very useful resources.
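The general idea of threshold-based redundancy filtering can be sketched in a few lines of Python. This is not ASTRAL's actual procedure, which is not described here; it is a simple greedy selection assumed purely for illustration. The function name `greedy_representatives`, the `pid` callback and the toy PID table are all hypothetical.

```python
from typing import Callable, List, Sequence


def greedy_representatives(
    names: Sequence[str],
    pid: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Keep a sequence only if its PID to every kept sequence is below threshold.

    `pid` is any function returning the percentage identity for a pair of
    sequence names; how that PID was calculated is deliberately left open.
    """
    representatives: List[str] = []
    for name in names:
        if all(pid(name, kept) < threshold for kept in representatives):
            representatives.append(name)
    return representatives


if __name__ == "__main__":
    # Toy pairwise PID table (made-up numbers).
    table = {
        frozenset(("a", "b")): 95.0,
        frozenset(("a", "c")): 30.0,
        frozenset(("b", "c")): 28.0,
    }

    def lookup(x: str, y: str) -> float:
        return table[frozenset((x, y))]

    print(greedy_representatives(["a", "b", "c"], lookup, threshold=40.0))
```

Note that the result depends on the input order as well as on how PID was computed, which is one reason why some redundancy always remains, or is removed differently, from one filtered set to another.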

Overall, the message about PID is that it is a very crude method for scoring sequence similarity. It is much better to use a method that takes account of the length and composition of the sequences, and that includes scores for non-identical amino acids. I normally use Z-scores as calculated by my old AMPS package of programs for pair-wise clustering of sequences. In my experience on hundreds of protein families, this approach appears quite robust. If necessary, the Z-scores can be converted to probabilities by following the work of Webber and Barton [webber2001], though for clustering this is not necessary.
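As a rough illustration of what a Z-score for a pair of sequences means, the Python sketch below compares the alignment score of the real pair with the scores obtained after shuffling one of the sequences, which keeps its length and composition fixed. This is not the AMPS implementation. The identity-based scoring, the linear gap penalty, the number of shuffles and the function names `nw_score` and `shuffle_z_score` are all assumptions chosen to keep the example short; in practice a substitution matrix such as BLOSUM62 would replace the match/mismatch scores so that non-identical amino acids contribute.

```python
import random
import statistics


def nw_score(a: str, b: str, match: float = 1.0,
             mismatch: float = 0.0, gap: float = -1.0) -> float:
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a: str, b: str, n_shuffles: int = 100,
                    seed: int = 0) -> float:
    """Z-score of the real alignment score against shuffled versions of b."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = nw_score(a, b)
    residues = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)          # same length and composition as b
        background.append(nw_score(a, "".join(residues)))
    sd = statistics.pstdev(background)
    if sd == 0:
        return float("nan")            # background has no spread
    return (observed - statistics.mean(background)) / sd


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"     # arbitrary example sequence
    print(shuffle_z_score(seq, seq))
```

Because the background is built from shuffles of the actual sequence, a short or compositionally biased pair has to beat its own chance level, which is the property a raw PID lacks.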

Credits

 * Geoff Barton wrote most of the text as a message to the PDB mailing-list
 * the original formatting for the wiki was done by Martin Jambon