Difference between revisions of "User:Jarle Pahr/Sequencing"
Revision as of 14:27, 8 April 2013
Nature focus issue - sequencing technology: http://www.nature.com/nbt/journal/v30/n11/index.html
- 1 Technologies
- 1.1 Sanger sequencing (chain termination method)
- 1.2 Pyrosequencing ("454 sequencing")
- 1.3 Illumina (Solexa) sequencing
- 1.4 Ion semiconductor sequencing
- 1.5 Nanopore sequencing
- 1.6 Single molecule real time sequencing (Pacific Biosciences)
- 1.7 SOLiD sequencing (Applied Biosystems)
- 1.8 DNA nanoball sequencing
- 2 Concepts
- 2.1 De Bruijn graph
- 2.2 Bridge amplification
- 2.3 RNA-Seq
- 2.4 Genotyping by Sequencing (GBS)
- 2.5 ROC
- 2.6 Edit distance
- 2.7 Color Space/2-base encoding
- 2.8 Targeted sequencing
- 2.9 Scaffolding
- 2.10 Paired-end reads
- 2.11 N50 Statistic
- 2.12 Haplotypes
- 2.13 Loss of Heterozygosity
- 2.14 Copy number variants (CNVs)
- 2.15 Short Tandem Repeats (STRs)
- 3 Databases
- 4 Sequence alignment/Assembly
- 5 Sequencing services
- 6 Sequencing-based techniques
- 7 Sequencing/genomics centres
- 8 Primers
- 9 Software
- 10 Sequencing projects
- 11 File formats
- 12 Bibliography
- 13 Commentary
- 14 Misc
For a comparison of next-generation sequencing methods, see http://en.wikipedia.org/wiki/Dna_sequencing#Next-generation_methods
SeqAnswers.com Tech summaries: http://seqanswers.com/index.php?pageid=summaries
Sanger sequencing (chain termination method)
Pyrosequencing ("454 sequencing")
Pyrosequencing is a "sequence by synthesis" method developed by Mostafa Ronaghi and Pål Nyrén at the Royal Institute of Technology, Stockholm. Sequences are determined by observation of light emission upon addition of a nucleotide complementary to the first unpaired nucleotide of the template.
Quote from Wikipedia:Pyrosequencing:
"ssDNA template is hybridized to a sequencing primer and incubated with the enzymes DNA polymerase, ATP sulfurylase, luciferase and apyrase, and with the substrates adenosine 5´ phosphosulfate (APS) and luciferin."
Sequencing proceeds as follows:
- Addition of one of the four dNTPs (dATPαS is substituted for dATP, as dATPαS, unlike dATP, is not a substrate for luciferase). If the dNTP is complementary, DNA polymerase incorporates the nucleotide, releasing pyrophosphate (PPi).
- ATP sulfurylase catalyzes the reaction of PPi and adenosine 5' phosphosulfate to create ATP.
- ATP fuels luciferase-catalyzed conversion of luciferin to oxyluciferin, generating visible light.
- Unincorporated nucleotides and ATP are degraded by apyrase.
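The cycle above can be illustrated with a small simulation. This is only an idealised sketch: it assumes the light signal is exactly proportional to the number of bases incorporated per dispensation, ignores noise, and uses a hypothetical cyclic dispensation order ("TACG") chosen for the example. The function names are made up for illustration.

```python
from itertools import cycle


def flowgram(read, flow_order="TACG", max_flows=400):
    """Simulate idealised pyrosequencing signals for a synthesized strand.

    Returns one (nucleotide, signal) pair per dispensation, where signal is
    the number of bases incorporated in that flow (the homopolymer length).
    flow_order and max_flows are assumptions made for this sketch.
    """
    read = read.upper()
    pos = 0
    flows = []
    for nucleotide in cycle(flow_order):
        # Stop when the whole read is synthesized or the flow budget is spent.
        if pos >= len(read) or len(flows) >= max_flows:
            break
        incorporated = 0
        # A complementary dNTP is incorporated once per matching position,
        # so a homopolymer run gives a proportionally stronger signal.
        while pos < len(read) and read[pos] == nucleotide:
            incorporated += 1
            pos += 1
        flows.append((nucleotide, incorporated))
    return flows


def call_bases(flows):
    """Reconstruct the sequence from (nucleotide, signal) pairs."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


example_flows = flowgram("TTAGGC")
example_read = call_bases(example_flows)
```

Flows with a signal of 0 correspond to dispensations where the nucleotide was not complementary and was simply degraded by apyrase.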
454 sequencing performs massively parallel pyrosequencing. Library DNA containing adapter sequences is adsorbed to DNA-capturing beads. The DNA bound to each bead is then amplified by emulsion PCR, in which the beads with bound DNA are mixed with PCR reagents and emulsion oil to create a water-in-oil emulsion containing many "microreactors" consisting of beads surrounded by water. Following PCR amplification, the DNA-binding beads are isolated and deposited into the wells of a microtiter plate. Beads with pyrosequencing enzymes are then added to the plate. Finally, pyrosequencing is performed by processing the plate in a sequencing machine. 400 000+ DNA fragments/beads can be processed per plate.
Using "multiplex identifiers", different genomic libraries can be bar-coded, facilitating sequencing of several libraries in the same sequencing run.
|Platform||Throughput (bases/run)||Time per run||Average (a)/mode (m) read length (nt)||Accuracy||Introduced (year)|
|GS FLX+||700 Mbp||23 h||Up to 1000||700 bp (m)|
|GS Junior||35 Mbp||12 h||400||400 bp (a) at Phred20/read|
Introductory paper, 454 sequencing: http://www.ncbi.nlm.nih.gov/pubmed/16056220?dopt=Abstract&holding=npg
Overview of 454 sequencing: http://classes.soe.ucsc.edu/bme215/Spring09/PPT/BME%20215-5.pdf
Illumina (Solexa) sequencing
|Platform||Throughput (bases/run) (maximum)||Time per run||Read length (nt)||Accuracy||Features||Introduced (year)|
|MiSeq Personal Sequencer||Up to 8.5 Gbp||4 - 48 h||250||>70% bases higher than Q30 at read length 2 x 300 bp|
|HiSeq 2500/1500||600 Gb||2 x 100||>80 % higher than Q30|
|HiSeq 2000/1000||300 Gb||2 x 100||>80 % higher than Q30|
|Genome Analyzer IIx||95 Gb||2 x 150||>80 % higher than Q30|
Side by side comparison of Illumina sequencers: http://www.illumina.com/systems/sequencing.ilmn
Illumina - an introduction to NGS: http://www.illumina.com/Documents/products/Illumina_Sequencing_Introduction.pdf
Ion semiconductor sequencing
Ion Torrent: http://www.invitrogen.com/site/us/en/home/brands/Ion-Torrent.html?cid=fl-iontorrent
Platforms:
|Platform||Throughput (bases/run)||Time per run||Typical read length||Accuracy||Introduced (year)|
|Ion PGM sequencer||10 Mb to 1 Gb||90 min+||35-400 bp|
|Ion Proton sequencer||1 human genome||2 h+||100 bp|
Oxford Nanopore: http://www.nanoporetech.com/
Single molecule real time sequencing (Pacific Biosciences)
Microscopic wells on a chip (zero-mode waveguides, ZMWs) each contain a single DNA polymerase enzyme bound to the bottom of the well, which accepts a single DNA molecule as template. Fluorescently labelled dNTPs are used for DNA synthesis. Upon incorporation of a dNTP, the fluorescent tag is cleaved from the nucleotide and diffuses out of the observation area within the ZMW. The sequence is determined optically by observing incorporation events.
SOLiD sequencing (Applied Biosystems)
DNA nanoball sequencing
De Bruijn graph
See also Compeau et al. 2011, Nature Biotechnology - How to apply de Bruijn graphs to genome assembly: http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html
Genotyping by Sequencing (GBS)
Color Space/2-base encoding
Targeted "capture kits" may be used to sequence a subset of genomic DNA. The human exome (as defined by the Consensus CDS (CCDS) project) totals about 38 Mb, covering about 1.22 % of the human genome.
N50 length: Given a collection of contigs, the largest length L such that the contigs of length L or longer together contain at least half of the total length of all contigs in the collection.
NG50: As N50, except that the threshold is half of the genome size rather than half of the total contig length.
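The two definitions can be expressed as one short function. This is a minimal sketch: the function name and the `target_total` parameter are made up for illustration, and the contig lengths are invented example data.

```python
def n50(contig_lengths, target_total=None):
    """Return the N50 of a list of contig lengths.

    If target_total is given (e.g. an estimated genome size), the threshold
    is half of that value instead, which gives the NG50. Returns None when
    the contigs never reach the threshold.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths) if target_total is None else target_total
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    return None


contigs = [80, 70, 50, 40, 30, 20]       # invented contig lengths, total 290
n50_value = n50(contigs)                  # 80 + 70 = 150 >= 145, so N50 is 70
ng50_value = n50(contigs, 400)            # 80 + 70 + 50 = 200 >= 200, so NG50 is 50
ng50_missing = n50(contigs, 1000)         # contigs cover less than half: None
```

Because NG50 uses the genome size as the reference, it can be compared between assemblies of the same genome even when their total assembled lengths differ.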
Loss of Heterozygosity
Copy number variants (CNVs)
Short Tandem Repeats (STRs)
Genotyping of STRs is used to produce forensic DNA profiles. See http://massgenomics.org/2013/01/identifying-samples-genomic-data.html
Sequence Read Archive: http://www.ncbi.nlm.nih.gov/sra
European Nucleotide Archive: http://www.ebi.ac.uk/ena/
Compendium of HTS mappers: http://wwwdev.ebi.ac.uk/fg/hts_mappers/
Comparison of aligners (ROC curves): http://lh3lh3.users.sourceforge.net/alnROC.shtml
Bowtie - An ultrafast memory-efficient short read aligner: http://bowtie-bio.sourceforge.net/index.shtml
Primers and reviews:
NCBI primer on genome assembly methods: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/assembly.shtml
Nature Biotechnology Primer - How to map billions of short reads onto genomes: http://www.nature.com/nbt/journal/v27/n5/full/nbt0509-455.html
Bioinformatics, 2012: Tools for mapping high-throughput sequencing data: http://bioinformatics.oxfordjournals.org/content/28/24/3169
A survey of sequence alignment algorithms for next-generation sequencing: http://bib.oxfordjournals.org/content/11/5/473.full
De novo assembly:
Optimal Assembly for High Throughput Shotgun Sequencing: http://arxiv.org/abs/1301.0068
Counterintuitively, too high coverage can be problematic: http://seqanswers.com/forums/showthread.php?t=24965
|Service||Sample specification||Primer specification||Ship to||Link|
|GATC LightRun||Add 5 uL DNA (80-100 ng/uL plasmid or 20-80 ng/uL purified PCR product) + 5 uL 5 uM (5 pmol/uL) primer to the same tube||Tm 52-58 C, 17-19 bp, (8-9 G+C for 18-mer) G or C at 3' end (max 3 Gs or Cs), maximum 4 bp run.||GATC Biotech AG. European Custom Sequencing Centre. Gottfried-Hagen-Strasse 20. 51105 Köln.||http://www.gatc-biotech.com/en/lp4/new-lightrun-sequencing.html|
|Macrogen Single-pass||Add 20 uL DNA (100 ng/uL plasmid or 50 ng/uL purified PCR product) to one tube. Add 20 uL primer (10 pmol/uL) to a separate tube.||18-25 bp, 40-60 % GC, Tm 55-60 C||Macrogen Europe, IWO, Kamer IA3-195, Meibergdreef 39, 1105 AZ Amsterdam Zuid-Oost, Netherlands. Attention: J.S. Park.|
New York Genome Center: http://nygenome.org/
See also: http://omicsmaps.com/
|Name||Length (bp)||Sequence||Tm (C) [calculated]||Tm (C) [Analytical]||GC (% / bp)||Comment|
|pJP-1_seq5||18||CAGCGTGCGAGTGATTAT||53.9/60.6 (2)/52.6 (3)||50||Binds upstream of XylS region in pSB-M1g|
|pJP-1_seq6||18||AGACCACATGGTCCTTCT||57.5 (2)/52.8 (3)||53.9||50||Binds near end of GFPmut3 in pSB-M1g|
|SeqMG1||AGCAGATCCACATCCTTGAA||62.7 (2)/53.7 (3)||Binds at nt 5672 of pSB-M1g, upstream of AgeI site. Designed to Macrogen sequencing primer criteria.|
|pSB-SeqA||18||TGCAAGAAGCGGATACAG||56 / 60.7 (2)/52.3 (3)||50||Binds at nt 7729 of pSB-M1g, upstream of Pm promoter and PciI site.|
http://www.generi-biotech.com/sequencing-universal-seguencing-primers/
http://www.synthesisgene.com/tools/Universal-Primers.pdf
http://www.genewiz.com/public/universalprimers.aspx
https://secure.eurogentec.com/product/research-universal-primers.html
Tm calculations: 1: CloneManager 2: Thermo Scientific 3: IDT Oligoanalyzer
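For a quick sanity check of a primer against length and GC criteria such as those quoted by the sequencing services, something like the following could be used. This is a rough sketch only: the Tm estimate uses the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is an assumption made here and not the method used by CloneManager, Thermo Scientific or IDT Oligoanalyzer, so the values will differ from those in the table. The function names and default limits (18-25 bp, 40-60 % GC) are illustrative.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    gc = sum(primer.count(base) for base in "GC")
    return 100.0 * gc / len(primer)


def wallace_tm(primer):
    """Rule-of-thumb Tm in degrees C: 2 * (A + T) + 4 * (G + C).

    Only a crude approximation for short oligos; nearest-neighbour
    calculators give different (more reliable) values.
    """
    primer = primer.upper()
    at = sum(primer.count(base) for base in "AT")
    gc = sum(primer.count(base) for base in "GC")
    return 2 * at + 4 * gc


def meets_criteria(primer, min_len=18, max_len=25, min_gc=40.0, max_gc=60.0):
    """Check length and GC content against (assumed) service limits."""
    return (min_len <= len(primer) <= max_len
            and min_gc <= gc_percent(primer) <= max_gc)


for name, seq in [("pJP-1_seq5", "CAGCGTGCGAGTGATTAT"),
                  ("SeqMG1", "AGCAGATCCACATCCTTGAA")]:
    print(name, len(seq), gc_percent(seq), wallace_tm(seq), meets_criteria(seq))
```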
A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers http://www.biomedcentral.com/1471-2164/13/341
Chromatogram viewers: http://www.dnaseq.co.uk/chrom_view.html
CodonCode aligner: http://www.codoncode.com/aligner/
About SCF (sequence chromatogram format) files: http://staden.sourceforge.net/manual/formats_unix_2.html
High-throughput sequencing tools:
SAM tools: http://samtools.sourceforge.net/
Burrows-Wheeler Aligner (BWA): http://bio-bwa.sourceforge.net/
Sequencing quality and standards:
Sequence Alignment/Map (SAM) format: "A generic format for storing large nucleotide sequence alignments". Tab-delimited text format consisting of a header section (optional) and an alignment section.
Binary Alignment/Map (BAM) format: Binary, compressed file format containing the same information as SAM files.
From https://wiki.nci.nih.gov/display/TCGA/Binary+Alignment+Map : "Centers align sequence reads to a reference genome to produce a Sequence Alignment Map (SAM) format file. The SAM file is then converted into a binary form, or Binary-sequence Alignment Format (BAM) file"
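As an illustration of the tab-delimited layout, the sketch below splits one SAM alignment line into its 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and skips header lines, which start with "@". The class and function names are made up for this example, and the alignment line is invented data. For real work a dedicated tool such as SAM tools is the better choice.

```python
from typing import NamedTuple, Optional, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; return None for header lines."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment lines need at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


def is_unmapped(record: SamRecord) -> bool:
    """Flag bit 0x4 marks a read as unmapped."""
    return bool(record.flag & 0x4)


# Invented example line, not taken from a real data set.
example = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
```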
Variant Call Format (VCF):
Standard created by the 1000 Genomes Project.
"The VCF format is a tab delimited format for storing variant calls and individual genotypes. It is able to store all variant calls from single nucleotide variants to large scale insertions and deletions."
ABI (Applied Biosystems) format:
FASTQ files encode identified nucleotides together with their corresponding quality scores. The interpretation of the quality scores may vary depending on the source of the sequence, but the most widely used is the "Sanger format" (Phred quality scores).
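In the Sanger format, each quality character encodes a Phred score Q as the ASCII code minus 33, and Q relates to the estimated error probability p by Q = -10 log10(p), so Q20 corresponds to a 1 in 100 chance of a wrong base call and Q30 to 1 in 1000. The sketch below decodes such records; it assumes the simple four-lines-per-record layout, the function names are made up for illustration, and the record is invented data.

```python
def phred33_to_scores(quality_string):
    """Decode a Sanger-format (Phred+33) quality string into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def parse_fastq(lines):
    """Yield (name, sequence, scores) from an iterable of FASTQ lines.

    Assumes four lines per record (no line-wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, phred33_to_scores(qual)


# Invented example record: qualities 'I', '5', '+', '!' decode to 40, 20, 10, 0.
records = list(parse_fastq(["@read1", "ACGT", "+", "I5+!"]))
```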
GAGE: A critical evaluation of genome assemblies and assembly algorithms: http://genome.cshlp.org/content/22/3/557
The state of NGS variant calling - Don't panic: http://blog.goldenhelix.com/?p=1725
Assemblies: The good, the bad and the ugly: http://www.nature.com/nmeth/journal/v8/n1/full/nmeth0111-59.html
A tale of three next generation sequencers: http://www.biomedcentral.com/content/pdf/1471-2164-13-341.pdf
SEQanswers wiki: http://seqanswers.com/wiki/SEQanswers
SEQanswers - how to: http://seqanswers.com/wiki/How-to
Genome Reference Consortium: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
List of NGS blogs: http://seqanswers.com/forums/showthread.php?t=5024
NGS Necropolis: http://blueseq.com/knowledgebank/ngs-necropolis/