User:Timothee Flutre/Notebook/Postdoc/2013/12/01: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
(→‎Entry title: first version)
(fix raw html notebook nav)
 
(18 intermediate revisions by one other user not shown)
Line 2: Line 2:
|-
|-
|style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]]<span style="font-size:22px;"> Project name</span>
|style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]]<span style="font-size:22px;"> Project name</span>
|style="background-color: #F2F2F2" align="center"|<html><img src="/images/9/94/Report.png" border="0" /></html> [[{{#sub:{{FULLPAGENAME}}|0|-11}}|Main project page]]<br />{{#if:{{#lnpreventry:{{FULLPAGENAME}}}}|<html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>[[{{#lnpreventry:{{FULLPAGENAME}}}}{{!}}Previous entry]]<html>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</html>}}{{#if:{{#lnnextentry:{{FULLPAGENAME}}}}|[[{{#lnnextentry:{{FULLPAGENAME}}}}{{!}}Next entry]]<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html>}}
|style="background-color: #F2F2F2" align="center"|[[File:Report.png|frameless|link={{#sub:{{FULLPAGENAME}}|0|-11}}]][[{{#sub:{{FULLPAGENAME}}|0|-11}}|Main project page]]<br />{{#if:{{#lnpreventry:{{FULLPAGENAME}}}}|[[File:Resultset_previous.png|frameless|link={{#lnpreventry:{{FULLPAGENAME}}}}]][[{{#lnpreventry:{{FULLPAGENAME}}}}{{!}}Previous entry]]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}}{{#if:{{#lnnextentry:{{FULLPAGENAME}}}}|[[{{#lnnextentry:{{FULLPAGENAME}}}}{{!}}Next entry]][[File:Resultset_next.png|frameless|link={{#lnnextentry:{{FULLPAGENAME}}}}]]}}
|-
|-
| colspan="2"|
| colspan="2"|
Line 8: Line 8:
==One-liners for high-throughput sequencing data==
==One-liners for high-throughput sequencing data==


* '''softwares''': see BWA, Bowtie, MOSAIK, etc; they take fastq files as input and return bam files as output
In this document, I'll focus on paired-end reads using, as an example, data from a GBS (RAD-seq) experiment in grapevine.
 
* '''mapping softwares''': see BWA, Bowtie, MOSAIK, etc; they (usually) take fastq files as input and return bam files as output


     IN1="reads_R1.fastq.gz"
     IN1="reads_R1.fastq.gz"
Line 14: Line 16:
     OUT="alignments.bam"
     OUT="alignments.bam"


* '''total number of reads in the fastq file''':
* '''number of input pairs''': there are 2,307,772 pairs and 4,615,544 reads
 
    nbLines=$(zcat $IN1 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772
    nbLines=$(zcat $IN2 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772
 
* '''understand one read pair in the Fastq format''':


     nbLines=$(zcat $IN1 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l
     zcat $IN1 | head -4
    zcat $IN2 | head -4


* '''total number of reads in the bam file''': should be equal to the nb of reads in the fastq file if no filtering was made
* '''understand one alignment in the SAM format''':


     samtools view $OUT | cut -f1 | sort | uniq | wc -l
     samtools view $OUT | head -1 | tr "\t" "\n" | nl -n ln


* '''flag statistics in SAM/BAM''':
* '''global alignment statistics''':


     samtools flagstat $OUT
     samtools flagstat $OUT
Line 28: Line 36:
which returns something like:
which returns something like:


     4635834 + 0 in total (QC-passed reads + QC-failed reads)
     line  1:  4635834 + 0 in total (QC-passed reads + QC-failed reads)
     20290 + 0 secondary
     line  2:  20290 + 0 secondary
     0 + 0 supplimentary
     line  3:  0 + 0 supplimentary
     0 + 0 duplicates
     line  4:  0 + 0 duplicates
     4443270 + 0 mapped (95.85%:-nan%)
     line  5:  4443270 + 0 mapped (95.85%:-nan%)
     4615544 + 0 paired in sequencing
     line  6:  4615544 + 0 paired in sequencing
     2307772 + 0 read1
     line  7:  2307772 + 0 read1
     2307772 + 0 read2
     line  8:  2307772 + 0 read2
     4299122 + 0 properly paired (93.14%:-nan%)
     line  9:  4299122 + 0 properly paired (93.14%:-nan%)
     4412810 + 0 with itself and mate mapped
     line 10:  4412810 + 0 with itself and mate mapped
     10170 + 0 singletons (0.22%:-nan%)
     line 11:  10170 + 0 singletons (0.22%:-nan%)
     57898 + 0 with mate mapped to a different chr
     line 12:  57898 + 0 with mate mapped to a different chr
     44330 + 0 with mate mapped to a different chr (mapQ>=5)
     line 13:  44330 + 0 with mate mapped to a different chr (mapQ>=5)
 
* '''list occurring BAM flags''': along with their number of occurrences
 
    samtools view $OUT | cut -f2 | sort | uniq -c | sort -k1,1nr -k2,2n
    # | awk '{sum+=$1}END{print sum}' # <- same as line 1
 
[http://www.htslib.org/doc/sam.html list] of bits used to form the flags
 
[http://broadinstitute.github.io/picard/explain-flags.html this] website is useful to interpret flags
 
for instance, if there are N entries with flag 99, there should also be N entries with flag 147; idem for 83 and 163; idem for 77 and 141; etc
 
* '''entries in the BAM file''': same as line 1; I use "entries" instead of "alignments" because the BAM file can contain unmapped reads (see below)
 
    samtools view -c $OUT # 4635834
 
* '''mapped entries''': same as line 5
 
    samtools view -F 0x0004 -c $OUT # 4443270
    echo "scale=3; 100 * 4443270 / 4635834" | bc -l # 95.846%
 
* '''unmapped entries''': same as line 1 - line 5
 
    samtools view -f 0x0004 -c $OUT # 192564
 
* '''pairs in the BAM file''': should be equal to the nb of input pairs
 
    samtools view $OUT | cut -f1 | sort | uniq | wc -l # 2307772
 
* '''not-primary/supplementary entries''': see [http://seqanswers.com/forums/showthread.php?t=40239 details]
 
    samtools view -f 0x0100 -c $OUT # 20290 <- same as line 2
    samtools view -f 0x0800 -c $OUT # 0 <- same as line 3
 
* '''R1 and R2 entries''':
 
    nbE1=$(samtools view -f 0x0040 -c $OUT) # 2316858
    nbE2=$(samtools view -f 0x0080 -c $OUT) # 2318976
    echo "$nbE1 + $nbE2" | bc -l # 4635834 <- same as line 1
 
* '''R1 and R2 reads''':
 
    samtools view -f 0x0040 $OUT | cut -f1 | sort | uniq -c > tmp1
    wc -l < tmp1 # 2307772 <- same as line 7
    samtools view -f 0x0080 $OUT | cut -f1 | sort | uniq -c > tmp2
    wc -l < tmp2 # 2307772 <- same as line 8
    echo "$(wc -l < tmp1) + $(wc -l < tmp2)" | bc -l # 4615544 <- same as line 6
    cat tmp1 | awk '{print $1}' | sort | uniq -c # some input reads can have multiple entries
    cat tmp2 | awk '{print $1}' | sort | uniq -c # idem
 
* '''entries properly paired in sequencing''': irrespective of mapping
 
    samtools view -f 0x0001 -c $OUT # 4635834 <- same as line 1 if all input reads are paired
    samtools view -f 0x0001 -F 0x0100 -c $OUT # 4615544 <- same as line 6
 
* '''entries properly paired in mapping''': same as line 9
 
    samtools view -f 0x0002 -F 0x0100 -c $OUT # 4299122
    echo "scale=3; 100 * 4299122 / (4635834 - 20290)" | bc -l # 93.144%


* '''total number of entries in the bam file''': same as line 1
* '''pairs, properly paired in mapping, in primary alignment''': to be compared to the number of input pairs (see above)


     samtools view $OUT | wc -l
     samtools view -f 0x0002 -F 0x0100 $OUT | cut -f1 | sort | uniq | wc -l # 2149561
    echo "scale=3; 100 * 2149561 / 2307772" | bc -l # 93.144


* '''list of different flags in the bam file''': along with their number of occurrences
* '''singletons''': same as line 11


     samtools view $OUT | cut -f2 | sort | uniq -c
     samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT # 10170


* '''total number of mapped entries''': same as line 5
* '''different chromosomes''': same as lines 12 and 13


     samtools view -F 4 $OUT | wc -l
     samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | cut -f7 | sort | uniq -c # to list all possible values the 7-th field is taking
    samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 57898
    samtools view -q 5 -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 44330


* '''total number of unmapped entries''': same as line 1 - line 5
* '''strand''': get all alignments on the reverse strand (use <code>-F 0x10</code> for the forward)


     samtools view -f 4 $OUT | wc -l
     samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c
    samtools flags 113 # idem for 117, 121, etc


<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->

Latest revision as of 01:02, 27 September 2017

Project name Main project page
Next entry

One-liners for high-throughput sequencing data

In this document, I'll focus on paired-end reads using, as an example, data from a GBS (RAD-seq) experiment in grapevine.

  • mapping softwares: see BWA, Bowtie, MOSAIK, etc; they (usually) take fastq files as input and return bam files as output
   IN1="reads_R1.fastq.gz"
   IN2="reads_R2.fastq.gz"
   OUT="alignments.bam"
  • number of input pairs: there are 2,307,772 pairs and 4,615,544 reads
   nbLines=$(zcat $IN1 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772
   nbLines=$(zcat $IN2 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772
  • understand one read pair in the Fastq format:
   zcat $IN1 | head -4
   zcat $IN2 | head -4
  • understand one alignment in the SAM format:
   samtools view $OUT | head -1 | tr "\t" "\n" | nl -n ln
  • global alignment statistics:
   samtools flagstat $OUT

which returns something like:

   line  1:  4635834 + 0 in total (QC-passed reads + QC-failed reads)
   line  2:  20290 + 0 secondary
   line  3:  0 + 0 supplimentary
   line  4:  0 + 0 duplicates
   line  5:  4443270 + 0 mapped (95.85%:-nan%)
   line  6:  4615544 + 0 paired in sequencing
   line  7:  2307772 + 0 read1
   line  8:  2307772 + 0 read2
   line  9:  4299122 + 0 properly paired (93.14%:-nan%)
   line 10:  4412810 + 0 with itself and mate mapped
   line 11:  10170 + 0 singletons (0.22%:-nan%)
   line 12:  57898 + 0 with mate mapped to a different chr
   line 13:  44330 + 0 with mate mapped to a different chr (mapQ>=5)
  • list occurring BAM flags: along with their number of occurrences
   samtools view $OUT | cut -f2 | sort | uniq -c | sort -k1,1nr -k2,2n
   # | awk '{sum+=$1}END{print sum}' # <- same as line 1

list of bits used to form the flags

this website is useful to interpret flags

for instance, if there are N entries with flag 99, there should also be N entries with flag 147; idem for 83 and 163; idem for 77 and 141; etc

  • entries in the BAM file: same as line 1; I use "entries" instead of "alignments" because the BAM file can contain unmapped reads (see below)
   samtools view -c $OUT # 4635834
  • mapped entries: same as line 5
   samtools view -F 0x0004 -c $OUT # 4443270
   echo "scale=3; 100 * 4443270 / 4635834" | bc -l # 95.846%
  • unmapped entries: same as line 1 - line 5
   samtools view -f 0x0004 -c $OUT # 192564
  • pairs in the BAM file: should be equal to the nb of input pairs
   samtools view $OUT | cut -f1 | sort | uniq | wc -l # 2307772
  • not-primary/supplementary entries: see details
   samtools view -f 0x0100 -c $OUT # 20290 <- same as line 2
   samtools view -f 0x0800 -c $OUT # 0 <- same as line 3
  • R1 and R2 entries:
   nbE1=$(samtools view -f 0x0040 -c $OUT) # 2316858
   nbE2=$(samtools view -f 0x0080 -c $OUT) # 2318976
   echo "$nbE1 + $nbE2" | bc -l # 4635834 <- same as line 1
  • R1 and R2 reads:
   samtools view -f 0x0040 $OUT | cut -f1 | sort | uniq -c > tmp1
   wc -l < tmp1 # 2307772 <- same as line 7
   samtools view -f 0x0080 $OUT | cut -f1 | sort | uniq -c > tmp2
   wc -l < tmp2 # 2307772 <- same as line 8
   echo "$(wc -l < tmp1) + $(wc -l < tmp2)" | bc -l # 4615544 <- same as line 6
   cat tmp1 | awk '{print $1}' | sort | uniq -c # some input reads can have multiple entries
   cat tmp2 | awk '{print $1}' | sort | uniq -c # idem
  • entries properly paired in sequencing: irrespective of mapping
   samtools view -f 0x0001 -c $OUT # 4635834 <- same as line 1 if all input reads are paired
   samtools view -f 0x0001 -F 0x0100 -c $OUT # 4615544 <- same as line 6
  • entries properly paired in mapping: same as line 9
   samtools view -f 0x0002 -F 0x0100 -c $OUT # 4299122
   echo "scale=3; 100 * 4299122 / (4635834 - 20290)" | bc -l # 93.144%
  • pairs, properly paired in mapping, in primary alignment: to be compared to the number of input pairs (see above)
   samtools view -f 0x0002 -F 0x0100 $OUT | cut -f1 | sort | uniq | wc -l # 2149561
   echo "scale=3; 100 * 2149561 / 2307772" | bc -l # 93.144
  • singletons: same as line 11
   samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT # 10170
  • different chromosomes: same as lines 12 and 13
   samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | cut -f7 | sort | uniq -c # to list all possible values the 7-th field is taking
   samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 57898
   samtools view -q 5 -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 44330
  • strand: get all alignments on the reverse strand (use -F 0x10 for the forward)
   samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c
   samtools flags 113 # idem for 117, 121, etc