User:Timothee Flutre/Notebook/Postdoc/2013/12/01: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
Line 108: Line 108:
     samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT
     samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT


* '''about strand''': get all alignments on the reverse strand (use <code>-F 0x10</code> for the forward)
* '''different chromosomes''': same as lines 12 and 13
 
    samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | cut -f7 | sort | uniq -c # to list all possible values the 7-th field is taking
    samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l
    samtools view -q 5 -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l
 
* '''strand''': get all alignments on the reverse strand (use <code>-F 0x10</code> for the forward)


     samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c
     samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c

Revision as of 03:26, 14 July 2015

Project name <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page
Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html>

One-liners for high-throughput sequencing data

  • softwares: see BWA, Bowtie, MOSAIK, etc; they take fastq files as input and return bam files as output
   IN1="reads_R1.fastq.gz"
   IN2="reads_R2.fastq.gz"
   OUT="alignments.bam"
  • total number of reads in a fastq file:
   nbLines=$(zcat $IN1 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l
  • understand one alignment in the SAM format:
   samtools view $OUT | head -1 | tr "\t" "\n" | nl -n ln
  • flag statistics in SAM/BAM:
   samtools flagstat $OUT

which returns something like:

   line  1:  4635834 + 0 in total (QC-passed reads + QC-failed reads)
   line  2:  20290 + 0 secondary
   line  3:  0 + 0 supplimentary
   line  4:  0 + 0 duplicates
   line  5:  4443270 + 0 mapped (95.85%:-nan%)
   line  6:  4615544 + 0 paired in sequencing
   line  7:  2307772 + 0 read1
   line  8:  2307772 + 0 read2
   line  9:  4299122 + 0 properly paired (93.14%:-nan%)
   line 10:  4412810 + 0 with itself and mate mapped
   line 11:  10170 + 0 singletons (0.22%:-nan%)
   line 12:  57898 + 0 with mate mapped to a different chr
   line 13:  44330 + 0 with mate mapped to a different chr (mapQ>=5)
  • list of different flags in the bam file: along with their number of occurrences
   samtools view $OUT | cut -f2 | sort | uniq -c | sort -k1,1nr -k2,2n
   # | awk '{sum+=$1}END{print sum}' # <- same as line 1

list of bits used to form the flags

this website is useful to interpret flags

for instance, if there are N entries with flag 99, there should also be N entries with flag 147; idem for 83 and 163; idem for 77 and 141; etc

  • total number of entries in the bam file: same as line 1
   samtools view -c $OUT
  • total number of mapped entries: same as line 5
   samtools view -F 0x0004 -c $OUT
  • total number of unmapped entries: same as line 1 - line 5
   samtools view -f 0x0004 -c $OUT
  • total number of reads in the bam file: should be equal to the nb of reads in the fastq file if no filtering was made
   samtools view $OUT | cut -f1 | sort | uniq | wc -l
  • not-primary/supplementary entries: see details
   samtools view -f 0x0100 -c $OUT # same as line 2
   samtools view -f 0x0800 -c $OUT 
  • R1 and R2 entries:
   nbE1=$(samtools view -f 0x0040 -c $OUT)
   nbE2=$(samtools view -f 0x0080 -c $OUT)
   echo "$nbE1 + $nbE2" | bc -l # same as line 1
  • R1 and R2 reads:
   samtools view -f 0x0040 $OUT | cut -f1 | sort | uniq -c > tmp1
   samtools view -f 0x0080 $OUT | cut -f1 | sort | uniq -c > tmp2
   wc -l < tmp1 # same as line 7
   wc -l < tmp2 # same as line 8
   cat tmp1 | awk '{print $1}' | sort | uniq -c
   cat tmp2 | awk '{print $1}' | sort | uniq -c
  • entries which read is properly paired in sequencing: irrespective of mapping
   samtools view -f 0x0001 -c $OUT # same as line 1 if both input fastq files have the same number of reads
   samtools view -f 0x0001 -F 0x0100 -c $OUT # same as line 6

Note that this is different from line 6 which is equal to line 1 - line 2.

  • entries which read is properly paired in mapping: same as line 9
   samtools view -f 0x0002 -F 0x0100 -c $OUT
  • total number of read pairs, properly paired in mapping, in primary alignment: to be compared to the total number of input read pairs
   samtools view -f 0x0002 -F 0x0100 $OUT | cut -f1 | sort | uniq | wc -l
  • singletons: same as line 11
   samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT
  • different chromosomes: same as lines 12 and 13
   samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | cut -f7 | sort | uniq -c # to list all possible values the 7-th field is taking
   samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l
   samtools view -q 5 -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l
  • strand: get all alignments on the reverse strand (use -F 0x10 for the forward)
   samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c
   samtools flags 113 # idem for 117, 121, etc