RRedon:Protocols/Variation pipeline
get Reference Genome
See main article: RRedon:Protocols/Variation_pipeline/Reference_genome
Tools
- SAMTOOLS http://samtools.sourceforge.net/
- MAQ
- BWA
Align FASTQs vs the reference
Mapping quality =30 = GOOD Mapping quality is
p(alignment is incorrect)* phred_qual)
function of
- repeats on refseq
- base qual of the read
- alignment parameters
- paired read are better
BWA vs MAQ
- repeat : (citing Biostarts) "Maq maps a repeat read randomly"
- Mapping quality: (citing [ Biostars]):"If you want to find the SNPs, you do not really need to care about this. Maq will consider the mapping quality in genotype calling. If you want to pinpoint the structual variations with paired end reads, you should only pick up abnormal pairs with high mapping qualities (30, for example). If you are analysing ChIP-Seq data, setting a threshold on mapping quality may also be necessary."
- dans le papier de Li and Durbin "Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform" il est dit que MAQ surestime la mapping quality = proba(alignement incorrect) . BWA=a true hit can always be found. ← Fix this! english
With BWA
See main article: BWA
With MAQ
See main article about MAQ
Recalibrate
(citing)"After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome." ← Fix this! parameters, command line ?↓TODO
Remove Duplicates
(From Biostars:)Removing duplicates refers to multiple reads that match at the same position in the genome. This is different than one read (or read pair) mapping to multiple genome locations. MarkDuplicates finds sequence pairs that map to the same position, marking or removing the duplicates so you can work with unique pairs in downstream analyses. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:
← Fix this! command line
java -jar MarkDuplicates.jar I=chr1.sorted.bam O=chr.markdup.bam METRICS_FILE=jeter.metrics
Coverage
← Fix this! With GATK ?
← Fix this! 'min depth': how do we use it , when ?
SNP Calling
↓TODO
samtools
samtools pileup -vcf ${HG18}.fa file.bam |\ samtools/misc/samtools.pl varFilter -d 4 -D 1200 -Q 25
- d: Minimum depth is 4x (around 8x is recommended)
- D: Maximum depth is 1200x (but no more than around 3x mean is recommended)
- Genotype quality score >= 20 ← Fix this! in varFilter ?
- Snp quality score >= 25 ← Fix this! in varFilter ?
- Q: RMS mapping quality >= 25 (Root Mean Square)
varFilter can also be used to keep one SNP in a 10pb window: cf. option in samtools/misc/samtools.pl varFilter
Varscan
see http://varscan.sourceforge.net/using-varscan.html
Create the VCF
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I markdup.bam -R hg18.fa -varout markdup.bam.vcf -vf VCF -pl SOLEXA
Indels
- http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0
- PINDEL
- Dindel http://sites.google.com/site/keesalbers/soft/dindel
View the content of a BAM
samtools view output.sorted.bam chr1 | more
Consequences
Simple tool developed by Pierre.
Internal Sanger tool.
KG
PhdSNP
Panther
Ensembl
- problems with track at UCSC: see https://lists.soe.ucsc.edu/pipermail/genome/2010-May/022391.html
SIFT
See main article SIFT
Polyphen
See main article Polyphen
With SamTools
samtools pileup -vcf hg18.fa markdup.bam
Abbreviations
- PTR: Primary target region: exons. Regions that we wanted to target
- CTR: Capture target region (baits). Regions actually covered by baits
- PCCR : Union of CTR and PTR regions
- CTRplus :Capture Target Regions ± 100bp flank
References
InDels
- ParMap, an algorithm for the identification of small genomic insertions and deletions in nextgen sequencing data. Khiabanian H, Van Vlierberghe P, Palomero T, Ferrando AA, Rabadan R. BMC Res Notes. 2010 May 27;3(1):147. PMID: 20507604