# User:Jarle Pahr/Bioinformatics

Links and notes regarding bioinformatics.

**Learning resources:**

Open Bioinformatics foundation: http://www.open-bio.org/wiki/News

BTI Plant bioinformatics course: http://btiplantbioinfocourse.wordpress.com/2012-course/core-program/

http://www.molecularevolution.org/resources

http://www.bio.brandeis.edu/InterpGenes/Project/menu.htm

Bioinformatics tools:

http://ipython.org/notebook.html

DNA stability and secondary structure prediction:

RNAfold: http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi

DINAmelt - quickfold: http://mfold.rna.albany.edu/?q=DINAMelt/Quickfold

Mfold: http://mfold.rna.albany.edu/?q=mfold

Expressed Sequence Tags (EST):

Serial Analysis of Gene Expression (SAGE):

Expression quantitative trait loci (eQTLs):

# Sequence alignment

## Scoring matrices

See also http://en.wikipedia.org/wiki/Substitution_matrix

PAM (MDM) substitution matrices:

Point Accepted Mutation (PAM) matrices, also called Mutation Data Matrices (MDM), were developed by Margaret Dayhoff et al. from analysis of multiple alignments within protein families.

A mutation probability matrix, M, is defined where each element M(a,b) gives the probability that a residue of type b will have been replaced by one of type a after a given amount of evolutionary time [Zvelebil & Baum].

The PAM unit (Point Accepted Mutations) measures evolutionary distance as the number of accepted (retained) point mutations per 100 residues.

The PAM matrix number thus indicates evolutionary distance. PAM250 corresponds to 250 point accepted mutations per 100 residues (an average of more than one mutation per residue, meaning that many residues have changed more than once). 250 PAM is at the limit of detection of evolutionary relationships [Zvelebil & Baum]. A PAM250 matrix is obtained by raising the PAM-1 matrix to the 250th power [Zvelebil & Baum]. This is based on a model of evolution as a Markov process [Zvelebil & Baum].
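The matrix-power step can be sketched in a few lines of Python. This is only an illustration: the 3-letter alphabet and the values in `TOY_PAM1` are invented for the example (a real PAM-1 matrix is 20 x 20 and derived from alignment data), and the helper names are hypothetical. Following the M(a,b) convention above, each column sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power (repeated squaring)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical 1-PAM mutation probability matrix for a toy 3-letter alphabet.
# TOY_PAM1[a][b] = probability that residue b is replaced by residue a,
# so every column sums to 1. The numbers are made up for illustration.
TOY_PAM1 = [
    [0.990, 0.004, 0.002],
    [0.006, 0.990, 0.008],
    [0.004, 0.006, 0.990],
]

# Markov model: 250 PAM of divergence corresponds to the 250th matrix power.
toy_pam250 = mat_pow(TOY_PAM1, 250)

for row in toy_pam250:
    print(" ".join(f"{p:.3f}" for p in row))
```

At this distance the diagonal entries are far below 0.99, which reflects how little of the original sequence identity is expected to remain at 250 PAM.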

PAM250: http://rosalind.info/glossary/pam250/

Block Substitution Matrices (BLOSUM):

Developed in the 1990s using local multiple alignments. A set of aligned, highly conserved short regions is generated and clustered into groups according to similarity. Sequences are grouped together if they exceed a specified percentage similarity threshold. Substitution frequencies for all possible pairs of amino acids are then calculated between the clustered groups [Zvelebil & Baum].

The BLOSUM matrices are based on data from the BLOCKS database published in 1991. The BLOCKS database contains ungapped multiple local alignments of conserved protein regions. Various BLOSUM matrices can be generated by varying the percentage cutoff for similarity group clustering [Zvelebil & Baum]. The BLOSUM-62 matrix was generated by using a threshold of 62 % identity. For the sequences used to produce the original Dayhoff PAM matrices, the threshold giving a single cluster is 85 %, indicating that those sequences were more similar.
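The clustering step can be illustrated with a small Python sketch. It assumes single-linkage grouping (a segment joins a cluster if it reaches the threshold against any member) and uses percent identity over ungapped, equal-length segments. The segments in `TOY_BLOCK` and the function names are hypothetical and only serve the example; this is not the original BLOSUM software.

```python
def percent_identity(a, b):
    """Percent identical positions between two ungapped, equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_by_identity(segments, threshold):
    """Group segments by single linkage at the given percent identity threshold."""
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, seg in enumerate(segments):
        clusters.setdefault(find(i), []).append(seg)
    return list(clusters.values())


# Hypothetical block of four aligned segments (invented for illustration).
TOY_BLOCK = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEYGHLKV", "WWPPQQRRSS"]

for threshold in (62, 85):
    groups = cluster_by_identity(TOY_BLOCK, threshold)
    print(threshold, len(groups), groups)
```

A lower threshold merges more segments into each cluster, so fewer, more divergent groups contribute to the substitution counts; this is why lower-numbered BLOSUM matrices suit more distant relationships.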

BLOSUM62 http://rosalind.info/glossary/blosum62/

http://en.wikipedia.org/wiki/BLOSUM

Article: S Henikoff and J G Henikoff, 1992. Amino acid substitution matrices from protein blocks: http://www.pnas.org/content/89/22/10915.abstract

## Gap scoring

Linear gap penalty:

The simplest method for scoring gaps is to assign a penalty g for every residue aligned to a gap.

g = - E (E a positive number)

g(n_gap) = -n_gap * E

To better account for the observed pattern of fewer, longer gaps, a combination of a high gap opening penalty and a lower gap extension penalty can be used:

Gap opening penalty (GOP): The gap opening penalty, designated I, is the score penalty (the amount by which the score is reduced) associated with introducing a gap in the alignment.

Gap extension penalty (GEP): The GEP, designated E, is the score penalty for each residue aligned to a gap after the initial residue. (That is, a GEP is not assigned for a single-residue gap.)

Using the combination of a gap opening penalty and gap extension penalty gives the affine gap penalty formula:

g(n_gap) = -I - (n_gap - 1) * E

Typical values for I and E in protein alignment applications are 7-15 and 0.5-2, respectively [Zvelebil & Baum].
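As a quick check of how the two schemes behave, here is a minimal Python sketch of both penalty functions, using g(n_gap) = -n_gap * E for the linear case and g(n_gap) = -I - (n_gap - 1) * E for the affine case. The function names are hypothetical, and the values I = 10 and E = 1 are simply picked from within the typical ranges quoted above.

```python
def linear_gap_penalty(n_gap, extend):
    """Linear scheme: every residue aligned to a gap costs the same amount E."""
    return -n_gap * extend


def affine_gap_penalty(n_gap, gap_open, extend):
    """Affine scheme: I for the first gap position, E for each further position."""
    if n_gap <= 0:
        return 0
    return -gap_open - (n_gap - 1) * extend


I, E = 10, 1  # example values only

# One gap of length 5 versus five separate gaps of length 1.
one_long_linear = linear_gap_penalty(5, E)
five_short_linear = 5 * linear_gap_penalty(1, E)
one_long_affine = affine_gap_penalty(5, I, E)
five_short_affine = 5 * affine_gap_penalty(1, I, E)

print("linear:", one_long_linear, five_short_linear)
print("affine:", one_long_affine, five_short_affine)
```

Under the linear scheme both arrangements cost the same, whereas the affine scheme penalises five separate gaps much more heavily than one long gap, which matches the observed pattern of fewer, longer gaps.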

## Log odds ratios

A log odds value is the logarithm of an odds ratio.
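In the alignment setting, the ratio typically compares how often a pair of residues is observed aligned with how often the pair would be expected by chance from the background frequencies of the two residues. A minimal Python sketch follows; the frequencies are hypothetical numbers chosen for the example, and base-2 logarithms (scores in bits) are an assumption.

```python
import math


def log_odds(observed, expected):
    """Base-2 logarithm of the ratio observed / expected (score in bits)."""
    return math.log2(observed / expected)


# Hypothetical background frequencies of two residues a and b.
p_a, p_b = 0.1, 0.1
expected_by_chance = p_a * p_b  # 0.01 if the residues pair independently

# Hypothetical observed pair frequencies in trusted alignments.
favoured = log_odds(0.02, expected_by_chance)     # seen more often than chance
disfavoured = log_odds(0.005, expected_by_chance)  # seen less often than chance

print(round(favoured, 3), round(disfavoured, 3))
```

Positive values mark pairs that occur more often than chance, negative values mark pairs that occur less often, and because logarithms are used, the scores of individual aligned positions can be summed.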

See also http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2904766/

See also http://www.biostars.org/p/14855/

http://www.bio.brandeis.edu/InterpGenes/Project/align16.htm

# Gene prediction

GLIMMER: http://www.cbcb.umd.edu/software/glimmer/

GeneMark: http://exon.gatech.edu/

# Promoter prediction

Bprom: http://linux1.softberry.com/berry.phtml?topic=bprom&group=programs&subgroup=gfindb

# Protein structure prediction

PHYRE2 protein fold recognition server: http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index

Fold recognition (threading): Protein fold recognition (threading) is a method for modelling proteins which have one or more folds in common with proteins of known structure. Threading is distinct from homology modelling, although there is no clear boundary between the two, as both are template-based methods. Homology modelling can be used when the structure of a protein homologous to the modelling target is known, while threading is used if only protein structures with fold-level similarity are known.

Structure predictions are made by "threading" (aligning) each amino acid in the target sequence to one of several templates and evaluating the fit of each template. The structure model is then based on the alignment with the best-fitting template.

See also http://en.wikipedia.org/wiki/Threading_%28protein_sequence%29

PHYRE/PHYRE2 - threading server: http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index

# Hidden Markov Models

http://en.wikipedia.org/wiki/Hidden_Markov_model

A Hidden Markov Model (HMM) is a probabilistic method that can be used to analyze biological sequences and other sequential data [Zvelebil & Baum].

Profile HMMs: A profile HMM represents the common features of a set of sequences and is used to perform alignments of further sequences to that set [Zvelebil & Baum].

Null model:

In the context of profile HMMs, the null model represents sequences which are not related to the profile sequences [Zvelebil & Baum]. The choice of null model will affect the results gained by using a profile HMM.
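The effect of the null model can be illustrated with a deliberately simplified Python sketch. It is not a full profile HMM: it assumes match states only (no insert or delete states, no transition probabilities), so the profile reduces to one emission distribution per position. All probabilities, the DNA alphabet, and the names `PROFILE`, `UNIFORM_NULL` and `AT_RICH_NULL` are hypothetical.

```python
import math


def log_odds_score(seq, match_emissions, null_freqs):
    """Sum over positions of log2(P(residue | profile) / P(residue | null))."""
    if len(seq) != len(match_emissions):
        raise ValueError("sequence and profile must have the same length")
    return sum(math.log2(column[res] / null_freqs[res])
               for res, column in zip(seq, match_emissions))


# Hypothetical 3-position profile: emission probabilities per match state.
PROFILE = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

# Two hypothetical null models for unrelated sequences.
UNIFORM_NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
AT_RICH_NULL = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}

uniform_score = log_odds_score("AGT", PROFILE, UNIFORM_NULL)
at_rich_score = log_odds_score("AGT", PROFILE, AT_RICH_NULL)

print(round(uniform_score, 3), round(at_rich_score, 3))
```

The same sequence and the same profile give different scores under the two null models, which is the sense in which the choice of null model affects the results.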

# Annotation

BASYS bacterial annotation system: http://basys.ca/

# Challenges

Critical Assessment of Genome Interpretation (CAGI): https://genomeinterpretation.org/

Critical assessment of methods of protein structure prediction (CASP):

http://predictioncenter.org/ http://www.ncbi.nlm.nih.gov/pubmed/14579322

# Software

FASTA package: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

http://www.petercollingridge.co.uk/python-bioinformatics-tools/

BWA:

Bowtie:

Tophat:

Cufflinks, RNAseq analysis tool: http://cufflinks.cbcb.umd.edu/manual.html

# Databases

Sequence databases:

NCBI Sequence Read Archive (SRA): http://www.ncbi.nlm.nih.gov/sra
Stores raw sequencing data from high-throughput sequencing.

Structural databases:

http://www.rcsb.org/pdb/home/home.do

http://www.brenda-enzymes.info/

# File formats

NCBI file format guide: http://www.ncbi.nlm.nih.gov/books/NBK47537/

SRA