User:Jarle Pahr/Bioinformatics

From OpenWetWare

Links and notes regarding bioinformatics.

The elements of bioinformatics:

Bioinformatics for newbies


Tips and advice




Education and training

CUBELP - Cranfield University Bioinformatics Electronic Learning Platform :

4273π: Bioinformatics education on low cost ARM hardware:

An online bioinformatics curriculum:

Best practices for bioinformatics training:

Titus Brown's list of bioinformatics courses:

UC Davis Bioinformatics Training Program:

UC Riverside Bioinformatics Manuals:

Bioconductor Course Materials:

Bioinformatic Training links by Stephen Turner:

Conference on Bioinformatics education:

Bioplanet Bioinformatics FAQ:


Open Bioinformatics foundation:

BTI Plant bioinformatics course:

Bioinformatics tools:

DNA stability and secondary structure prediction:


DINAmelt - quickfold:


Expressed Sequence Tags (EST):

Serial Analysis of Gene Expression (SAGE):

Expression quantitative trait loci (eQTLs)


Generic Model Organism Database project.


Sequence ontology:

Gene ontology:

Sequence alignment

Scoring matrices

See also

PAM (MDM) substitution matrices:

Point Accepted Mutation (PAM) matrices, also called Mutation Data Matrices (MDM), were developed by Margaret Dayhoff et al. from analysis of multiple alignments within protein families.

A mutation probability matrix, M, is defined where each element M(a,b) gives the probability that a residue of type b will have been replaced by one of type a after a given amount of evolutionary time [Zvelebil & Baum].

The unit PAM (Point Accepted Mutations) measures the number of accepted (retained) point mutations per 100 residues of a sequence.

The PAM matrix number thus indicates evolutionary distance. PAM250 indicates 250 Point Accepted Mutations per 100 residues (an average of more than one mutation per residue, indicating that many residues have changed more than once). 250 PAM is at the limit of detection of evolutionary relationships [Zvelebil & Baum]. A PAM250 matrix is obtained by raising the PAM-1 matrix to the 250th power [Zvelebil & Baum]. This is based on a model of evolution as a Markov process [Zvelebil & Baum].
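The matrix-power step can be illustrated with a minimal sketch. The 3-letter alphabet and the values in toy_pam1 are made up for illustration (they are not Dayhoff's data), and the helper names mat_mul and mat_pow are hypothetical. The matrix follows the convention above: entry [a][b] is the probability that residue b is replaced by residue a, so each column sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical PAM-1-like matrix for a toy 3-letter alphabet (assumption, not real data).
# toy_pam1[a][b] = probability that residue b has been replaced by residue a.
toy_pam1 = [
    [0.990, 0.004, 0.003],
    [0.006, 0.990, 0.007],
    [0.004, 0.006, 0.990],
]

# Under the Markov model, 250 PAM units correspond to the 250th matrix power.
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print(["%.3f" % value for value in row])
```

The columns of the resulting matrix still sum to 1, while the diagonal entries (the probability that a residue is unchanged) are much smaller than in the one-step matrix.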


Block Substitution Matrices (BLOSUM):

Developed in the 1990s using local multiple alignments. A set of aligned, highly conserved short regions is generated, and the sequences are clustered into groups according to similarity. Sequences are grouped together if they exceed a specified percentage similarity threshold. Substitution frequencies for all possible pairs of amino acids are then calculated between the clustered groups [Zvelebil & Baum].

The BLOSUM matrices are based on data from the BLOCKS database published in 1991. The BLOCKS database contains ungapped multiple local alignments of conserved protein regions. Various BLOSUM matrices can be generated by varying the percentage cutoff for similarity group clustering [Zvelebil & Baum]. The BLOSUM-62 matrix was generated by using a threshold of 62 % identity. For the sequences used to produce the original Dayhoff PAM matrices, the threshold giving a single cluster is 85 %, indicating that those sequences were more similar.
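The clustering step can be sketched as follows. This is an illustration only, not the original BLOSUM procedure: it assumes that a segment joins a cluster when it reaches the identity threshold against at least one member (single linkage), and the function names and toy segments are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two ungapped, equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length (ungapped block)")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_by_identity(segments, threshold):
    """Group segments; a segment joins every cluster containing a member
    at or above the threshold, and those clusters are merged."""
    clusters = []
    for seg in segments:
        merged = [seg]
        unlinked = []
        for cluster in clusters:
            if any(percent_identity(seg, member) >= threshold for member in cluster):
                merged.extend(cluster)
            else:
                unlinked.append(cluster)
        clusters = unlinked + [merged]
    return clusters


# Made-up ungapped block of four segments (assumption, for illustration only).
toy_block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHAAV", "WWWWWWWWWW"]

print(len(cluster_by_identity(toy_block, 80)))  # lower threshold: fewer, larger clusters
print(len(cluster_by_identity(toy_block, 95)))  # higher threshold: more, smaller clusters
```

Raising the threshold leaves more sequences in separate clusters, so substitutions between more closely related sequences contribute to the counts.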


Article: S Henikoff and J G Henikoff, 1992. Amino acid substitution matrices from protein blocks:

Gap scoring

Linear gap penalty:

The simplest method for scoring gaps is to assign a penalty g for every residue aligned to a gap.

g = - E (E a positive number)

g(n_gap) = -n_gap * E

To better account for the observed pattern of fewer, longer gaps, a combination of a high gap opening penalty and a lower gap extension penalty can be used:

Gap opening penalty (GOP): The gap opening penalty, designated I, is the score penalty (amount of score reduction) associated with introducing a gap in the alignment.

Gap extension penalty (GEP): The GEP, designated E, is the score penalty for each residue aligned to a gap after the initial residue. (That is, a GEP is not assigned for a single-residue gap).

Using the combination of a gap opening penalty and gap extension penalty gives the affine gap penalty formula:

g(n_gap) = -I - (n_gap - 1) * E

Typical values for I and E in protein alignment applications are 7-15 and 0.5-2, respectively [Zvelebil & Baum].
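A minimal sketch of the linear and affine gap penalties defined above. The function names are hypothetical, and the values I = 10 and E = 1 are simply picked from the typical ranges quoted above.

```python
def linear_gap_penalty(n_gap, E):
    """Linear penalty: every residue aligned to a gap costs E."""
    return -n_gap * E


def affine_gap_penalty(n_gap, I, E):
    """Affine penalty: I for opening the gap, E for each further gap position."""
    if n_gap <= 0:
        return 0.0
    return -I - (n_gap - 1) * E


I, E = 10.0, 1.0  # assumed example values

one_long_gap = affine_gap_penalty(4, I, E)         # a single gap of length 4
four_short_gaps = 4 * affine_gap_penalty(1, I, E)  # four separate gaps of length 1

print(one_long_gap, four_short_gaps)
```

With these values, one gap of length 4 is penalised far less than four separate single-residue gaps, which is how the affine scheme favours fewer, longer gaps. Under the linear scheme both cases would score the same.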

Log odds ratios

A log odds value is the logarithm of an odds ratio.
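As a small illustration in the context of substitution scores: the odds ratio compares how often a pair of residues is observed aligned with how often it would be expected by chance. The function name and the frequencies below are made up for the example, and base 2 (bits) is an arbitrary choice.

```python
import math


def log_odds(p_observed, p_expected):
    """Base-2 logarithm of the odds ratio p_observed / p_expected."""
    if p_observed <= 0 or p_expected <= 0:
        raise ValueError("probabilities must be positive")
    return math.log2(p_observed / p_expected)


# Hypothetical numbers: residues a and b each have background frequency 0.1,
# so a chance alignment of the pair is expected with probability 0.1 * 0.1.
p_chance = 0.1 * 0.1

print(log_odds(0.02, p_chance))   # pair seen more often than chance: positive
print(log_odds(0.005, p_chance))  # pair seen less often than chance: negative
```

A positive value means the pair occurs more often than expected by chance, a negative value less often, and zero means no difference.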

See also


Gene prediction



Transcription factor binding prediction

Promoter prediction


Protein structure prediction

PHYRE2 protein fold recognition server:

Fold recognition (threading): Protein fold recognition (threading) is a method for modelling proteins that share one or more folds with proteins of known structure. Threading is distinct from homology modelling, although the boundary is not sharp, since both are template-based methods. Homology modelling can be used when the structure of a protein homologous to the modelling target is known, while threading is used when only structures with fold-level similarity are known.

Structure predictions are made by "threading" (aligning) each amino acid in the target sequence to one of several templates and evaluating the fit of each template. The structure model is then based on the alignment with the best-fitting template.

See also

PHYRE/PHYRE2 - threading server:

Hidden Markov Models

A Hidden Markov Model (HMM) is a probabilistic method that can be used to analyze biological sequences and other sequential data [Zvelebil & Baum].

Profile HMMs: A profile HMM represents the common features of a set of sequences and is used to perform alignments of further sequences to that set [Zvelebil & Baum].

Null model:

In the context of profile HMMs, the null model represents sequences which are not related to the profile sequences [Zvelebil & Baum]. The choice of null model will affect the results gained by using a profile HMM.
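The effect of the null model can be illustrated with a heavily simplified sketch. It is not a full profile HMM: it assumes match states only (no insert or delete states and no transition probabilities), scores a sequence of the same length as the profile, and uses a null model of independent background frequencies. All names and numbers are made up for illustration.

```python
import math


def log_odds_score(seq, match_emissions, null_freqs):
    """Sum over positions of log2(P(residue | profile position) / P(residue | null))."""
    if len(seq) != len(match_emissions):
        raise ValueError("sequence and profile must have the same length")
    score = 0.0
    for residue, emissions in zip(seq, match_emissions):
        score += math.log2(emissions[residue] / null_freqs[residue])
    return score


# Hypothetical 3-position DNA profile (emission probabilities per match state).
toy_profile = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

# Two candidate null models (assumed background frequencies).
uniform_null = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich_null = {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}

score_uniform = log_odds_score("AGT", toy_profile, uniform_null)
score_at_rich = log_odds_score("AGT", toy_profile, at_rich_null)
print(score_uniform, score_at_rich)
```

The same sequence and profile give different scores under the two null models, so any score threshold has to be interpreted relative to the null model used.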



BASYS bacterial annotation system:


Critical Assessment of Genome Interpretation (CAGI):

Critical assessment of methods of protein structure prediction (CASP):


FASTA package:




Cufflinks, RNAseq analysis tool:




Bioinformatics software on SourceForge:

Courses and conferences

Bioinformatics Open Source Conference (BOSC):



Sequence databases:

NCBI Sequence Read Archive (SRA): Stores raw sequencing data from high-throughput sequencing.

Structural databases:

File formats

NCBI file format guide:




Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform:

Blogs & Websites



See also

Discussion sites:

Bioinformatics in Norway


Briefings in Bioinformatics:


See also

Journal of Computational Biology: