BioMicroCenter:IlluminaDataPipeline

Basics
We are currently using the Illumina Pipeline version RTA 2.8.0 / OLB 1.8.0 / CASAVA 1.7.0

An older version of the pipeline manual is available (v1.0)[[Media:Pipeline v1.0 User Guide.pdf | HERE]].

The Genome Analyzer Pipeline Software (Pipeline) is a set of utilities designed to perform a complete data analysis of a sequencing run. It is supplied as source code and scripts. Data analysis consists of three steps: image analysis, base calling, and sequence analysis.

1. Image analysis (Firecrest)—Uses the raw TIF files to locate clusters on the image, and outputs the cluster intensity, X,Y positions, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling.

2. Base calling (Bustard)—Uses cluster intensities and noise estimate to output the sequence of bases read from each cluster, along with a confidence level for each base.

3. Sequence analysis (Gerald and Eland)—Allows for alignment to a reference sequence, filtering of data based on predefined criteria, and visualization of the result.

Base Calling
Firecrest is the module used for image analysis. Firecrest identifies cluster positions and extracts intensities. Through image filtering, it sharpens and enhances clusters, removes background noise, and detects clusters based on morphological features on the image. Firecrest also adjusts the scale and registration of an image. Firecrest is currently performed in real time with the sequencing process on a dedicated IPARR server as part of the 1.0 pipeline.

Bustard is the module used for base calling. Bustard deconvolves the signal from the clusters and applies correction for cross-talk, phasing, and prephasing.


 * Frequency cross-talk—The Genome Analyzer uses two lasers and four filters to detect four dyes attached to the four types of nucleotide, respectively. The frequency emission of these four dyes overlaps so that the four images are not independent. The frequency cross-talk is deconvolved using a frequency cross-talk matrix.


 * Phasing/Prephasing—Depending on the efficiency of the fluidics and the sequencing reactions, a small number of molecules in each cluster may run ahead (prephasing) or fall behind (phasing) of the current incorporation cycle. This effect is mitigated by applying corrections during the base calling step.


 * All of these corrections are based on an assumption of equal base frequency. For this reason, Illumina recommends the inclusion of the PhiX control sample in all runs (7+1). Work is currently under way at the [Broad Institute] to create defined spiked in DNA templates with equal base frequency that can act as control reads, however, this has not yet been incorporated into the pipeline.

Alignment
Generation of Recursive Analyses Linked by Dependency (GERALD) is the module used for sequence alignment, data visualization, filtering, and alignment. The following two alignment programs work within the GERALD module:


 * Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) is very fast and aligns for up to two errors from a reference for the first 32 bases. This algorithm is used for any reference larger than 100 kb.


 * PhageAlign does an exhaustive alignment (all possible alignments up to arbitrary edit distances), but is slow.

Software Options
While ELAND is extremely fast, it suffers from some significant deficiencies. The largest is the lack of tolerance of errors. Read failure is typically a function of length and it is likely that many nucleotides will have been successfully read before the phasing/pre-phasing or some other error becomes large enough to cause read failure. However, because ELAND is an all or nothing algorithm, it is incapable of handling 'short' reads. Numerous researchers have made significant efforts to create improved versions of ELAND. Some of the options:


 * MAQ http://maq.sourceforge.net/ - Maq maps short reads to a reference genome and calls the genotypes from the alignment. It is specificially designed for Illumina-Solexa/AB-SOLiD reads, not for 454 or capillary ones. Key facts about Maq: 1) Maq maps a repeat read randomly, and 2) it gives a probability score (mapping quality) to each alignment. More information is available on the MAQ page.


 * Bowtie - The Burge Lab is currently analyzing the quality of reads mapped using Bowtie. Bowtie is believed to be MUCH faster then ELAND.

Summary.htm
The Summary.htm file is the basic quality control file associated with your run. The information in the lane results summary section is the best place to begin. The key pieces of informaiton are:
 * Clusters (raw): The number of clusters detected by the image analysis module of the Pipeline software
 * Clusters (PF): The number of clusters detected that meet the filtering parameters. These filtering parameters are adjustable *** add info on filtering parameters ***
 * % Align (PF): The percentage of filtered reads that were uniquely aligned to the reference genome.
 * % Error Rate (PF): The percentage of called bases in aligned reads that do not match the reference. This does not mean the base calls are incorrect, simply that they are not aligned.

Note that the intensity scores have lost most of their utility with the 1.3.2 Pipeline software.w

If you are running a paired end run, some additional parameters are useful further down.
 * Relative orientation statistics. Here, you want to look at the > and < signs. "><" reads are pointing to each other while all other configurations (<<,>>,<>) suggest some type of rearangement has caused the reads to point either in parallel or become divergent.
 * Insert statistics: The software calculates the median insert size (of the >< reads) and the std. deviation. The software then calculates out reads that are statistically too large or too small to be expected from the distribution. these reads have the possibility of representing IN/DEL regions.
 * Reads that fail these thresholds are reported in the s_N_anomaly.txt file

We will be working on additional QC metrics that can be reported in the coming months.

Output Files
_sequence.txt Fastq format sequence information Example File:

@HWI-EAS413_4:1:5:1004:56           <= Sequence name - lane:read(1|2):tile:X:Y GTTTCATTTCTAAACCTGTTTCATTACAAAATGCC <= Sequence +HWI-EAS413_4:1:5:1004:56           <= repeat of name ZVZZVZZZZZVZSZZLZFZZZVVZLSZSZZQVLVT <= ASCII quality score. A = low, Z(or other symbol) = high. @HWI-EAS413_4:1:5:1350:118 GACCTTTTGCTTGTTTTGAGAGTGAGGGAAAAGGA +HWI-EAS413_4:1:5:1350:118 ZZZZZZZZZZZZZZZZZYZZZUZZZYZUZUVVVSV

_export.txt and _sorted.txt Export.txt files include all clusters. Sorted.txt file includes only reads passing filter and mapping uniquely to the genome.

HWI-EAS413     24      8       82      1319    1073    0       1       AACTGCCAGTTGTGCCAGCTCCATTTTGAAAAAGG     a`a`_``W]Y^_Y]a][\_]__^\^_\^VY^Z\^Y     chr10.fa3004659 F       35      5 HWI-EAS413     24      8       30      1221    382     0       1       TCCACAAAATAGAAGCAGAAGGGACTCTACCCTTC     abbbbbabbb_b^`bbab`_a`aabaaabaaaaaa     chr10.fa3005954 R       35      11 HWI-EAS413     24      8       66      1593    1942    0       1       TCCAGAGCCTGATGCTCGTCACAACTCTGCTCACT     abab[`[aa_aa[Ua[_PTaaaa\\YZTRXU`U     chr10.fa3012198 R       35      112 HWI-EAS413      24      8       74      745     561     0       1       AGTCTCTGCTCCATTTTCATCCCTGCATTTCCTTT     abbbbbbbbbbaZabbba^aaababaaabaaaaab     chr10.fa3020132 R       35      15 HWI-EAS413      24      8       65      1587    440     0       1       AGTCATCCATCTGAGAACACCTTCCAGGTCCTGGA     `]`baabba`ba_a^_Za_ba_aaa][XUba_TZY     chr10.fa3023984 R       35      67 HWI-EAS413      24      8       99      224     363     0       1       TGTGTAATGCCCAAATATCAGTTCTAAATAGGTCA     aa`a``_aaaaa^^^`Z``^a_`_`YWW_W``]_X     chr10.fa3026498 F       35      113 HWI-EAS413      24      8       15      1636    1926    0       1       AAACATTTATGAAATTGTCAAAGAACAAAAAACAA     aaaaaaaaaaaaaaaaa_a__^a_a``^_a````X     chr10.fa3026790 F       35      52 HWI-EAS413      24      8       33      622     1685    0       1       AAACATTTATGAAATTGTCAAAGAACAAAAAACAA     aababaaabaaaaaaaa^aaaaa`a`a`aaa``aa     chr10.fa3026790 F       35      53 HWI-EAS413      24      8       84      237     1877    0       1       AAACATTTATGAAATTGTCAAAGAACAAAAAACAA     ^aaababa_aabaaaaaaaa^R^`aaaaaaaa_aa     chr10.fa3026790 F       35      39 HWI-EAS413      24      8       60      1470    635     0       1       TCGTGACGAGTAAAGCAGTCTCTCTTAGCTCTGTC     aba^`bb``a`_P`^]XX[aaa`a`U``a\TT_     chr10.fa3031823 F       35      113

FIELDS
 * 1) Machine (Parsed from Run Folder name)
 * 2) Run Number (Parsed from Run Folder name)
 * 3) Lane
 * 4) Tile
 * 5) X Coordinate of cluster
 * 6) Y Coordinate of cluster
 * 7) Index string (Bland for a non-indexed run)
 * 8) Read number (1 or 2 for paired-read analysis, blank for a single-read analysis)
 * 9) Read
 * 10) Quality string—In symbolic ASCII format (ASCII character code = quality value + 64. B-F = low)
 * 11) Match chromosome—Name of chromosome match OR code indicating why no match resulted
 * 12) Match Contig—Gives the contig name if there is a match and the match chromosome is split into contigs (Blank if no match found)
 * 13) Match Position—Always with respect to forward strand, numbering starts at 1 (Blank if no match found)
 * 14) Match Strand—“F” for forward, “R” for reverse (Blank if no match found)
 * 15) Match Descriptor—Concise description of alignment (Blank if no match found). A numeral denotes a run of matching bases and a letter denotes substitution of a nucleotide. For a 35 base read, “35” denotes an exact match and “32C2” denotes substitution of a “C” at the 33rd position
 * 16) Single-Read Alignment Score—Alignment score of a single-read match, or for a paired read, alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat
 * 17) Paired-Read Alignment Score—Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat
 * 18) Partner Chromosome—Name of the chromosome if the read is paired and its partner aligns to another chromosome (Blank for single-read analysis)
 * 19) Partner Contig—Not blank if read is paired and its partner aligns to another chromosome and that partner is split into contigs (Blank for single-read analysis)
 * 20) Partner Offset—If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner (Blank for single-read analysis)
 * 21) Partner Strand—To which strand did the partner of the paired read align? “F” for forward, “R” for reverse (Blank if no match found, blank for single-read analysis)
 * 22) Filtering—Did the read pass quality filtering? “Y” for yes, “N” for no

_eland_results.txt obsolete! Alignment information for reads.

Sequence ID                    Sequence Read                           SCORE   #0mm    #1mm    #2mm   CHR HIT    POS HIT         STRND     unk. MISMATCH >HWI-EAS413_4:1:100:825:1989   CTAGAAGCAGAAGCAGGTATTTGGGGGGAGGGTTG     R0      3       0       0 >HWI-EAS413_4:1:100:1076:1671  AACTGCTTTGAGATAGGGTCTCTCTTGTTCACTTT     NM      0       0       0 >HWI-EAS413_4:1:100:573:1957   TCGAGACGTAAACTAGCTAACCTACATTATCCCCT     NM      0       0       0 >HWI-EAS413_4:1:100:1784:660   AATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     R0      204     255     255 >HWI-EAS413_4:1:100:133:987    CGCGATGATGTCTCAATACACCCCCCCGCTACCAG     NM      0       0       0 >HWI-EAS413_4:1:100:1361:1636  CATGTCATGCGCTCTAATCTCTGGGCATCTTGAGA     NM      0       0       0 >HWI-EAS413_4:1:100:1733:932   CCGAACTTCTGACAGGTTTGAGCCTTCTGCTCAAG     U1      0       1       0       chr9.fa     110761807       F       DD      13A >HWI-EAS413_4:1:100:992:1902   CAATTAAATAATAATAAACTAACACACAATACAAA     NM      0       0       0 >HWI-EAS413_4:1:100:1230:1718  TCAGCAAACAAACCCCCAACATAAAATCCATTATG     NM      0       0       0 >HWI-EAS413_4:1:100:324:130    TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC     U0      1       0       0       chr14.fa    33191761        F       DD

Sequence ID: From the sequence.txt file Sequence Read: self explanatory Score: 8 possible scores: 0,1,2mm: Number of times the sequence was found in the genome with 0, 1 or 2 mismatches. max=255 Chr/Pos Hit: Location of the read in the genome (low number based on 36nt). Only reported for U0, U1, U2. Strand: Strand hit (F or R) Note: a hit a pos 1000 with a R actually has its 5' end at 1035! Mismatch: Position of the mismatch (eg 13A = pos13 should be an A). listed as in sequence.
 * U0,U1,U2 - unique hit in the genome with 0, 1, or 2 mismatches
 * R0,R1,R2 - multiple hit in the genome with 0, 1, or 2 mismatches
 * NM - No match
 * QC - failed quality score

_realign.txtobsolete! Final quality-filtered sequence alignments

sequence read                      score #hit  Chr:Pos      Strd  best match                         next best score CCACAAAGTTCAGATGAGGGTGGGTGTTGCTTGTT 17500 1    chrX:116488401  F CCACAAAGTTCAGATGAGGGTGGGTGTTGCTTGTT 14359 AGACTGGCCTGCCTCTGCCTCCTGAGTACTGAGAC 17500 1    chr8:73831045   R GTCTCAGTACTCAGGAGGCAGAGGCAGGCCAGTCT 16453 CTTGGGCTACCAGGCCTGGTTAGCCCACTCCCGCT 17500 1    chrX:95524949   R AGCGGGAGTGGGCTAACCAGGCCTGGTAGCCCAAG 14359 CAGAAAGGAGCAAGAATCCGCTAAGCAGGGCCGGG 17500 1    chr11:119057429 R CCCGGCCCTGCTTAGCGGATTCTTGCTCCTTTCTG 14359 ACTATTTCACAAAATAGAAGTAGAAGGTACTCTAC 17500 255 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGC 0    0 CTATTGTGAATCATACTTCAATGAACATGAGTATG 0    0 CTAGAAGCAGAAGCAGGTATTTGGGGGGAGGGTTG 17500 3 CCGAACTTCTGACAGGTTTGAGCCTTCTGCTCAAG 16453 1    chr9:110761807  F CCGAACTTCTGAAAGGTTTGAGCCTTCTGCTCAAG 14359 TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC 17500 1    chr14:33191761  F TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC 14359

Parameter File
The parameter file contains information about the pipeline for GERALD. This is an example file:

12345678:ANALYSIS eland_extended 12345678:ELAND_SEED_LENGTH 25 123:ELAND_GENOME /data/Genome/Eland/RefGenome_Eland_sacCersgd 5:ELAND_GENOME /data/Genome/Eland/RefGenome_Eland_PhiX 4678:ELAND_GENOME /data/Genome/Eland/RefGenome_Eland_mm9 USE_BASES Y*n ELAND_MULTIPLE_INSTANCES 4

ANALYSIS TYPES = eland_extended (>32nt), eland_paired, eland_tag, sequence (no alignment), or none (failed lanes only)
 * "eland" is no longer supported in 1.3.2, so keep 1.0 around if you want to keep iterative ELAND.

USE_BASE = Y*n is the standard, using every base except the last but is mutable. For example, if you must omit the first two bases, you could insert nnY*n instead.
 * Multiplex: Using multiplex can be done with the I tag. For example Y*nI6n would take all bases upto the final 8 with the middle 6 being used as the index.

ELAND_MULTIPLE_INSTANCES can only be 1,2,4,8. Using any other value will default to 1.

Pipeline Execution
The following commands will execute the pipeline on the BMC servers. $ cd /mnt/bmc-store/current/run<1-4>// $ /data/apps/pkg/GAPipeline/GAPipeline-1.0/Goat/goat_pipeline.py --GERALD= --control-lane= Images/ or IPARR/ or  $ /data/apps/pkg/GAPipeline/GAPipeline-1.3.2/bin/goat_pipeline.py --GERALD= --control-lane= Images/ or  $ /data/apps/pkg/GAPipeline/GAPipeline-1.3.2/bin/bustard.py --GERALD= --control-lane= Data/IPAR_1.3/

Use "--control-lane=" to select a lane  that is to be used to estimate phasing and matrix correction for all other lanes. Control lanes are necessary for samples with skewed base compositions.

if you execute without errors, proced to make the pipeline $ /data/apps/pkg/GAPipeline/GAPipeline-1.0/Goat/goat_pipeline.py --GERALD= --control-lane= --make Images/ or IPARR/ or  $ /data/apps/pkg/GAPipeline/GAPipeline-1.3.2/bin/goat_pipeline.py --GERALD= --control-lane= --make Images/ or  $ /data/apps/pkg/GAPipeline/GAPipeline-1.3.2/bin/bustard.py --GERALD= --control-lane= Data/IPAR_1.3/

Move to the Firecrest Directory (for 1.0) or Bustard directory (for 1.3.2): $ cd Data/C1-36_Firecrest.... or $ cd Data/IPAR_1.3/Bustard... then $ make recursive -j8 >& _Analysis.log &

.

BioMicroCenter:DataArchiving