BioMicroCenter:CoverageCalculations: Difference between revisions

From OpenWetWare
 


==Determining ideal read length and depth of coverage==
The BioMicro Center offers a wide variety of read lengths, both in single-end and paired-end formats. The Illumina GAIIx is capable of single-end and paired-end sequencing from 36nt to a maximum of 150nt in units of 36, while the HiSeq2000 is capable of single-end and paired-end sequencing from 40nt to 100nt in units of 40.  
The BioMicro Center offers a wide variety of read lengths, both in single-end and paired-end formats. Often, it is useful to calculate the expected average coverage. <BR><BR>


As a flowcell is being run, reads undergo internal quality control, filtering out unreliable reads.  The GAIIx averages ~25 million reliable reads (clusters) per lane, while the HiSeq2000 averages ~100 million.  Determining the ideal parameters for a sequencing run requires knowledge of the genome being sequenced.
Coverage for genomic samples can be calculated as:


For example, say a 150Mbp genome needs to be sequenced at 5X coverage, requiring 750Mbp of data output (150Mbp*5). This can be reliably sequenced using a standard 36bp single-end lane on the Illumina GAIIx, which produces ~900Mbp of data on average (36bp/read * 25M reads). If a larger genome is being sequenced, for example, one that is 300Mbp, 1.5Gbp is the target data output (300Mbp*5), so a standard 36bp single-end lane may not be sufficient. However, a 72nt single-end lane (72bp/read * 25M reads = 1.8Gbp) or a 36nt paired-end run (36bp/read * 2 * 25M reads = 1.8Gbp) would be fine.
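The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the ~25 million reads per lane is the GAIIx average quoted above, not a guaranteed yield.

```python
# Sketch of the worked example: does one lane yield enough bases?
# Function names are illustrative; 25M reads/lane is the quoted GAIIx average.

GAIIX_READS_PER_LANE = 25_000_000


def required_output_bp(genome_size_bp, target_coverage):
    """Bases needed to cover the genome at the target depth."""
    return genome_size_bp * target_coverage


def lane_output_bp(read_length_bp, reads_per_lane, paired_end=False):
    """Expected bases from one lane; paired-end runs read each cluster twice."""
    reads_per_cluster = 2 if paired_end else 1
    return read_length_bp * reads_per_cluster * reads_per_lane


if __name__ == "__main__":
    needed = required_output_bp(300_000_000, 5)  # 300 Mbp genome at 5X
    for label, length, paired in [
        ("36nt single-end", 36, False),
        ("72nt single-end", 72, False),
        ("36nt paired-end", 36, True),
    ]:
        output = lane_output_bp(length, GAIIX_READS_PER_LANE, paired)
        verdict = "sufficient" if output >= needed else "not sufficient"
        print(f"{label}: {output / 1e9:.1f} Gbp, {verdict}")
```

Because actual yields vary from lane to lane, it is probably wise to leave some margin rather than plan a run that only just meets the target.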
  no.reads(1/2) * readlength * no.cluster
  ---------------------------------------
              genome size


Multiplexing is useful for applications requiring a lower data output per sample.  Sequencing Saccharomyces, which has a ~12.5Mbp genome, at 5X coverage requires 62.5Mbp of data.  Multiplexing 10 samples on one lane in a 36nt single read flowcell would require 625Mbp of output to achieve the desired coverage.  As stated above, the average output for one lane is ~900Mbp of data, so multiplexing the 10 samples into one lane provides sufficient coverage while reducing cost. It is important to note that while the multiplexing process adds 6-bp barcodes to the libraries, they are read separately from the main read and therefore do not affect read length.
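The multiplexing estimate can be sketched the same way. The helper below is hypothetical and assumes that reads are split evenly across the barcoded samples, which real lanes only approximate.

```python
# Sketch: how many samples fit in one lane at a given coverage?
# Assumes reads are divided evenly among samples (an idealization).


def max_samples_per_lane(lane_output_bp, genome_size_bp, target_coverage):
    """Largest number of samples whose combined requirement fits in one lane."""
    per_sample_bp = genome_size_bp * target_coverage
    return lane_output_bp // per_sample_bp


if __name__ == "__main__":
    lane_bp = 36 * 25_000_000  # 36nt single-end GAIIx lane, ~900 Mbp
    yeast_bp = 12_500_000      # Saccharomyces genome, ~12.5 Mbp
    print(max_samples_per_lane(lane_bp, yeast_bp, 5))
```

With the figures quoted above, ten yeast samples at 5X need 625Mbp, which is within the ~900Mbp lane average.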
For ChIP samples, the following modified formula can be used:


Paired-end runs sequence DNA in both the forward and reverse directions from the two ends of the same DNA fragments, allowing for the use of long-range sequence information during alignment of the genome. Paired-end, long read (>80nt) runs are preferred for some applications such as de novo sequencing.
      no.reads * readlength * no.cluster
  -----------------------------------------
  no.sites * site.length / % reads in sites
 
Some standard genome sizes:
{| border=1
|Species
|Genome size
|-
|E. coli
|4 Mbp
|-
|S. cerevisiae
|12.5 Mbp
|-
|C. elegans / Drosophila
|100-150 Mbp
|-
|Human / Mouse / Rat
|3 Gbp
|}

Latest revision as of 19:21, 10 January 2013


Determining ideal read length and depth of coverage

The BioMicro Center offers a wide variety of read lengths, both in single-end and paired-end formats. Often, it is useful to calculate the expected average coverage.

Coverage for genomic samples can be calculated as:

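 no.reads(1/2) * readlength * no.cluster
 ---------------------------------------
              genome size

Here no.reads is 1 for a single-end run and 2 for a paired-end run, readlength is the read length in bases, no.cluster is the number of reliable reads (clusters) obtained, and genome size is in bases.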
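As a rough sketch, the genomic formula and its inverse (the number of clusters needed for a target coverage) might be written as follows. The function and parameter names are illustrative and are not part of any BioMicro Center tool.

```python
# Sketch of the genomic coverage formula and its inverse.
# reads_per_cluster: 1 for single-end, 2 for paired-end.


def genomic_coverage(reads_per_cluster, read_length_bp, clusters, genome_size_bp):
    """Expected average coverage: sequenced bases divided by genome size."""
    return reads_per_cluster * read_length_bp * clusters / genome_size_bp


def clusters_needed(target_coverage, genome_size_bp, reads_per_cluster, read_length_bp):
    """Smallest whole number of clusters that reaches the target coverage."""
    bases_needed = target_coverage * genome_size_bp
    bases_per_cluster = reads_per_cluster * read_length_bp
    return -(-bases_needed // bases_per_cluster)  # integer ceiling division


if __name__ == "__main__":
    # 36nt single-end, 25 million clusters, 150 Mbp genome
    print(genomic_coverage(1, 36, 25_000_000, 150_000_000))
    # 3 Gbp genome at 5X with 100nt paired-end reads
    print(clusters_needed(5, 3_000_000_000, 2, 100))
```

The result is an average only; coverage at any individual position will vary around it.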

For ChIP samples, the following modified formula can be used:

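      no.reads * readlength * no.cluster
  -----------------------------------------
  no.sites * site.length / % reads in sites

Here no.sites is the number of enriched sites, site.length is their typical length in bases, and % reads in sites is the fraction of reads that fall within those sites.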
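A minimal sketch of the ChIP formula follows. It assumes that "% reads in sites" is supplied as a fraction between 0 and 1. The site count, site length, and fraction in the usage example are made-up illustrative values, not recommendations.

```python
# Sketch of the ChIP coverage formula.
# Assumption: fraction_reads_in_sites is a fraction in (0, 1], not a percentage.


def chip_coverage(reads_per_cluster, read_length_bp, clusters,
                  n_sites, site_length_bp, fraction_reads_in_sites):
    """Expected average coverage over enriched sites."""
    if not 0 < fraction_reads_in_sites <= 1:
        raise ValueError("fraction_reads_in_sites must be in (0, 1]")
    # Dividing by the fraction inflates the target size to account for
    # reads that land outside the sites of interest.
    effective_target_bp = n_sites * site_length_bp / fraction_reads_in_sites
    return reads_per_cluster * read_length_bp * clusters / effective_target_bp


if __name__ == "__main__":
    # Hypothetical experiment: 10,000 sites of 500 bp, 10% of reads in sites,
    # 36nt single-end reads, 25 million clusters.
    print(round(chip_coverage(1, 36, 25_000_000, 10_000, 500, 0.1), 1))
```

Written this way, the coverage over sites scales directly with the fraction of reads in sites, so a cleaner enrichment needs proportionally fewer reads.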

Some standard genome sizes:

{| border=1
|Species
|Genome size
|-
|E. coli
|4 Mbp
|-
|S. cerevisiae
|12.5 Mbp
|-
|C. elegans / Drosophila
|100-150 Mbp
|-
|Human / Mouse / Rat
|3 Gbp
|}