BioMicroCenter:IlluminaDataPipeline: Difference between revisions

Revision as of 13:10, 3 February 2009

Basics

We are currently using the Illumina Pipeline version 1.0. We will be upgrading to version 1.3 as soon as it is released (expected early 2009).

The full pipeline manual is available HERE

The Genome Analyzer Pipeline Software (Pipeline) is a set of utilities designed to perform a complete offline data analysis of a sequencing run. It is supplied as source code and scripts. Data analysis consists of three steps: image analysis, base calling, and sequence analysis.

1. Image analysis (Firecrest)—Uses the raw TIF files to locate clusters on the image, and outputs the cluster intensity, X,Y positions, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling.

2. Base calling (Bustard)—Uses cluster intensities and noise estimate to output the sequence of bases read from each cluster, along with a confidence level for each base.

3. Sequence analysis (Gerald and Eland)—Allows for alignment to a reference sequence, filtering of data based on predefined criteria, and visualization of the result.

Base Calling

Firecrest is the module used for image analysis. Firecrest identifies cluster positions and extracts intensities. Through image filtering, it sharpens and enhances clusters, removes background noise, and detects clusters based on morphological features on the image. Firecrest also adjusts the scale and registration of an image. Firecrest is currently performed in real time with the sequencing process on a dedicated IPARR server as part of the 1.0 pipeline.

Bustard is the module used for base calling. Bustard deconvolves the signal from the clusters and applies correction for cross-talk, phasing, and prephasing.

Frequency cross-talk—The Genome Analyzer uses two lasers and four filters to detect four dyes attached to the four types of nucleotide, respectively. The frequency emission of these four dyes overlaps so that the four images are not independent. The frequency cross-talk is deconvolved using a frequency cross-talk matrix.

Phasing/Prephasing—Depending on the efficiency of the fluidics and the sequencing reactions, a small number of molecules in each cluster may run ahead (prephasing) or fall behind (phasing) of the current incorporation cycle. This effect is mitigated by applying corrections during the base calling step.

All of these corrections are based on an assumption of equal base frequency. For this reason, Illumina recommends the inclusion of the PhiX control sample in all runs (7+1). Work is currently under way at the [Broad Institute] to create defined spiked in DNA templates with equal base frequency that can act as control reads, however, this has not yet been incorporated into the pipeline.

Alignment

Generation of Recursive Analyses Linked by Dependency (GERALD) is the module used for sequence alignment, data visualization, filtering, and alignment. The following two alignment programs work within the GERALD module:

Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) is very fast and aligns for up to two errors from a reference for the first 32 bases. This algorithm is used for any reference larger than 100 kb.

PhageAlign does an exhaustive alignment (all possible alignments up to arbitrary edit distances), but is slow.

Software Options

While ELAND is extremely fast, it suffers from some significant deficiencies. The largest is the lack of tolerance of errors. Read failure is typically a function of length and it is likely that many nucleotides will have been successfully read before the phasing/pre-phasing or some other error becomes large enough to cause read failure. However, because ELAND is an all or nothing algorithm, it is incapable of handling 'short' reads. Numerous researches have made significant efforts to create improved versions of ELAND. Some of the options:

MAQ http://maq.sourceforge.net/ - Maq maps short reads to a reference genome and calls the genotypes from the alignment. It is specificially designed for Illumina-Solexa/AB-SOLiD reads, not for 454 or capillary ones. Key facts about Maq: 1) Maq maps a repeat read randomly, and 2) it gives a probability score (mapping quality) to each alignment. More information is available on the MAQ page.

Iterative ELAND - Created as a collaboration between Sumeet Gupta in WICMT and Stuart Levine, this algorithm iteratively checks shorter reads to attempt to recover non-matching reads. This is a brute force technique but can increase mapped read counts by 50% or more. Repeat reads are annotated and NOT placed in the genome (as with normal ELAND).

Bowtie - The Burge Lab is currently analyzing the quality of reads mapped using Bowtie. Bowtie is believed to be MUCH faster then ELAND.

Output Files

_sequence.txt
Fastq format sequence information Example File:

@HWI-EAS413_4:1:5:1004:56            <= Sequence name - lane:read(1|2):tile:X:Y
GTTTCATTTCTAAACCTGTTTCATTACAAAATGCC  <= Sequence
+HWI-EAS413_4:1:5:1004:56            <= repeat of name
ZVZZVZZZZZVZSZZLZFZZZVVZLSZSZZQVLVT  <= ASCII quality score. A = low, Z(or other symbol) = high.
@HWI-EAS413_4:1:5:1350:118
GACCTTTTGCTTGTTTTGAGAGTGAGGGAAAAGGA
+HWI-EAS413_4:1:5:1350:118
ZZZZZZZZZZZZZZZZZYZZZUZZZYZUZUVVVSV

_eland_results.txt
Alignment information for reads.

Sequence ID                     Sequence Read                           SCORE   #0mm    #1mm    #2mm   CHR HIT    POS HIT         STRND     unk.   MISMATCH

>HWI-EAS413_4:1:100:825:1989    CTAGAAGCAGAAGCAGGTATTTGGGGGGAGGGTTG     R0      3       0       0
>HWI-EAS413_4:1:100:1076:1671   AACTGCTTTGAGATAGGGTCTCTCTTGTTCACTTT     NM      0       0       0
>HWI-EAS413_4:1:100:573:1957    TCGAGACGTAAACTAGCTAACCTACATTATCCCCT     NM      0       0       0
>HWI-EAS413_4:1:100:1784:660    AATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     R0      204     255     255
>HWI-EAS413_4:1:100:133:987     CGCGATGATGTCTCAATACACCCCCCCGCTACCAG     NM      0       0       0
>HWI-EAS413_4:1:100:1361:1636   CATGTCATGCGCTCTAATCTCTGGGCATCTTGAGA     NM      0       0       0
>HWI-EAS413_4:1:100:1733:932    CCGAACTTCTGACAGGTTTGAGCCTTCTGCTCAAG     U1      0       1       0       chr9.fa     110761807       F       DD      13A
>HWI-EAS413_4:1:100:992:1902    CAATTAAATAATAATAAACTAACACACAATACAAA     NM      0       0       0
>HWI-EAS413_4:1:100:1230:1718   TCAGCAAACAAACCCCCAACATAAAATCCATTATG     NM      0       0       0
>HWI-EAS413_4:1:100:324:130     TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC     U0      1       0       0       chr14.fa    33191761        F       DD

Sequence ID: From the sequence.txt file
Sequence Read: self explanatory
Score: 8 possible scores:

U0,U1,U2 - unique hit in the genome with 0, 1, or 2 mismatches
R0,R1,R2 - multiple hit in the genome with 0, 1, or 2 mismatches
NM - No match
QC - failed quality score

0,1,2mm: Number of times the sequence was found in the genome with 0, 1 or 2 mismatches. max=255
Chr/Pos Hit: Location of the read in the genome (low number based on 36nt). Only reported for U0, U1, U2.
Strand: Strand hit (F or R) Note: a hit a pos 1000 with a R actually has its 5' end at 1035!
Mismatch: Position of the mismatch (eg 13A = pos13 should be an A). listed as in sequence.

_realign.txt
Final quality-filtered sequence alignments

sequence read                       score #hit  Chr:Pos      Strd  best match                         next best score
CCACAAAGTTCAGATGAGGGTGGGTGTTGCTTGTT 17500 1     chrX:116488401  F CCACAAAGTTCAGATGAGGGTGGGTGTTGCTTGTT 14359
AGACTGGCCTGCCTCTGCCTCCTGAGTACTGAGAC 17500 1     chr8:73831045   R GTCTCAGTACTCAGGAGGCAGAGGCAGGCCAGTCT 16453
CTTGGGCTACCAGGCCTGGTTAGCCCACTCCCGCT 17500 1     chrX:95524949   R AGCGGGAGTGGGCTAACCAGGCCTGGTAGCCCAAG 14359
CAGAAAGGAGCAAGAATCCGCTAAGCAGGGCCGGG 17500 1     chr11:119057429 R CCCGGCCCTGCTTAGCGGATTCTTGCTCCTTTCTG 14359
ACTATTTCACAAAATAGAAGTAGAAGGTACTCTAC 17500 255
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGC 0     0
CTATTGTGAATCATACTTCAATGAACATGAGTATG 0     0
CTAGAAGCAGAAGCAGGTATTTGGGGGGAGGGTTG 17500 3
CCGAACTTCTGACAGGTTTGAGCCTTCTGCTCAAG 16453 1     chr9:110761807  F CCGAACTTCTGAAAGGTTTGAGCCTTCTGCTCAAG 14359
TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC 17500 1     chr14:33191761  F TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC 14359

Executing the Pipeline

Parameter File

The parameter file contains information about the pipeline for GERALD. This is an example file:

 12345678:ANALYSIS eland
 123:ELAND_GENOME /data/Genome/Eland/RefGenome_Eland_sacCersgd
 5:ELAND_GENOME /data/Genome/Eland/RefGenome_Eland_PhiX
 4678:ELAND_GENOME /data/Genome/Eland/RefGenome_Eland_mm9
 USE_BASES Y*n
 ELAND_MULTIPLE_INSTANCES 8

ANALYSIS TYPES = eland, eland_extended (>32nt), eland_paired, eland_tag, sequence (no alignment), or none (failed lanes only)
USE_BASE = Y*n is the standard, using every base except the last but is mutable. For example, if you must omit the first two bases, you could insert nnY*n instead.

Pipeline Execution

The following commands will execute the pipeline on the BMC servers.

 $ cd /mnt/bmc-store/current/run<1-4>/<DATE_HWI->/
 $ /usr/local/pkg/GAPipeline/GAPipeline-1.0/Goat/goat_pipeline.py --GERALD=<parameter file> --control-lane=<phiX> Images/ or IPARR/
      or 
 $ /usr/local/pkg/GAPipeline/GAPipeline-1.3.2/bin/goat_pipeline.py --GERALD=<parameter file> --control-lane=<phiX> Images/ or IPARR/

Use "--control-lane=" to select a lane <n> that is to be used to estimate phasing and matrix correction for all other lanes. Control lanes are necessary for samples with skewed base compositions.

if you execute without errors, proced to make the pipeline

 $ /usr/local/pkg/GAPipeline/GAPipeline-1.0/Goat/goat_pipeline.py --GERALD=<parameter file> --control-lane=<phiX> --make Images/ or IPARR/

Move to the Firecrest Directory:

 $ cd Data/C1-36_Firecrest....
 $ make recursive -j4 >& 081113_Analysis_Logfile.txt &

.

BioMicroCenter:IlluminaDataPipeline: Difference between revisions

Revision as of 13:10, 3 February 2009

Contents

Basics

Base Calling

Alignment

Software Options

Output Files

Executing the Pipeline

Parameter File

Pipeline Execution

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools