# Difference between revisions of "User:Marshall Hampton/ScoringMatrices"

## Revision as of 10:47, 9 September 2009

This will eventually be an article about constructing custom amino-acid scoring matrices using biopython. At the moment it is far from done.

## Introduction

A log-odds scoring matrix is constructed from some empirically found frequencies of single letter alignments $f_{ij}$ from the formula

$$s_{ij} = \lambda \log\left(\frac{f_{ij}}{p_i q_j}\right)$$

where $p_i$ and $q_i$ are the background frequencies of the amino acids from two sets of proteins. In the vast majority of current treatments, the evolution of amino acid frequencies is considered symmetric in time and it is assumed that $p_i = q_i$ for each amino acid i. In this article we will not make that assumption in order to develop scoring matrices for organisms in which the amino acid frequencies seem biased in time.
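The log-odds formula can be illustrated with a minimal Python sketch. It uses only the standard library rather than biopython, and the function name `log_odds`, the two-letter alphabet, and all frequencies below are made-up assumptions for illustration, not data from real proteins. Because the two background distributions are kept separate, the resulting scores need not be symmetric.

```python
import math

def log_odds(f, p, q, lam=1.0):
    """Return s[(i, j)] = lam * log(f[(i, j)] / (p[i] * q[j])).

    f: aligned-pair frequencies keyed by (i, j)
    p: background frequencies for the first set of proteins
    q: background frequencies for the second set of proteins
    Pairs with zero frequency are skipped because their log is undefined.
    """
    return {
        (i, j): lam * math.log(freq / (p[i] * q[j]))
        for (i, j), freq in f.items()
        if freq > 0
    }

# Toy two-letter alphabet with hypothetical frequencies (illustration only).
f = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.2, ("B", "B"): 0.3}
p = {"A": 0.5, "B": 0.5}  # background of the first set
q = {"A": 0.6, "B": 0.4}  # background of the second set

scores = log_odds(f, p, q)
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

In this toy case the score for ("A", "B") differs from the score for ("B", "A"), which is the asymmetry that arises once $p_i = q_i$ is no longer assumed.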

As an example we will construct an asymmetric scoring matrix for the organisms *Plasmodium falciparum* and *Saccharomyces cerevisiae*. The former organism, which is responsible for the most dangerous forms of malaria, has an extremely unusual genome: it is very AT-rich, which skews the amino acid composition of its proteins.

A good reference for the mathematics and statistics involved here is the article by S. F. Altschul, "Amino Acid Substitution Matrices from an Information Theoretic Perspective", J. Mol. Biol. 219, 555-565, 1991.