User:Timothee Flutre/Notebook/Postdoc/2013/12/01: Difference between revisions
From OpenWetWare
(Autocreate 2013/12/01 Entry for User:Timothee_Flutre/Notebook/Postdoc) |
(fix raw html notebook nav) |
||
(19 intermediate revisions by one other user not shown) | |||
Line 2: | Line 2: | ||
|- | |- | ||
|style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]]<span style="font-size:22px;"> Project name</span> | |style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]]<span style="font-size:22px;"> Project name</span> | ||
|style="background-color: #F2F2F2" align="center"| | |style="background-color: #F2F2F2" align="center"|[[File:Report.png|frameless|link={{#sub:{{FULLPAGENAME}}|0|-11}}]][[{{#sub:{{FULLPAGENAME}}|0|-11}}|Main project page]]<br />{{#if:{{#lnpreventry:{{FULLPAGENAME}}}}|[[File:Resultset_previous.png|frameless|link={{#lnpreventry:{{FULLPAGENAME}}}}]][[{{#lnpreventry:{{FULLPAGENAME}}}}{{!}}Previous entry]] }}{{#if:{{#lnnextentry:{{FULLPAGENAME}}}}|[[{{#lnnextentry:{{FULLPAGENAME}}}}{{!}}Next entry]][[File:Resultset_next.png|frameless|link={{#lnnextentry:{{FULLPAGENAME}}}}]]}} | ||
|- | |- | ||
| colspan="2"| | | colspan="2"| | ||
<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit above this line unless you know what you are doing. ##### --> | ||
== | ==One-liners for high-throughput sequencing data== | ||
In this document, I'll focus on paired-end reads using, as an example, data from a GBS (RAD-seq) experiment in grapevine. | |||
* '''mapping softwares''': see BWA, Bowtie, MOSAIK, etc; they (usually) take fastq files as input and return bam files as output | |||
IN1="reads_R1.fastq.gz" | |||
IN2="reads_R2.fastq.gz" | |||
OUT="alignments.bam" | |||
* '''number of input pairs''': there are 2,307,772 pairs and 4,615,544 reads | |||
nbLines=$(zcat $IN1 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772 | |||
nbLines=$(zcat $IN2 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772 | |||
* '''understand one read pair in the Fastq format''': | |||
zcat $IN1 | head -4 | |||
zcat $IN2 | head -4 | |||
* '''understand one alignment in the SAM format''': | |||
samtools view $OUT | head -1 | tr "\t" "\n" | nl -n ln | |||
* '''global alignment statistics''': | |||
samtools flagstat $OUT | |||
which returns something like: | |||
line 1: 4635834 + 0 in total (QC-passed reads + QC-failed reads) | |||
line 2: 20290 + 0 secondary | |||
line 3: 0 + 0 supplimentary | |||
line 4: 0 + 0 duplicates | |||
line 5: 4443270 + 0 mapped (95.85%:-nan%) | |||
line 6: 4615544 + 0 paired in sequencing | |||
line 7: 2307772 + 0 read1 | |||
line 8: 2307772 + 0 read2 | |||
line 9: 4299122 + 0 properly paired (93.14%:-nan%) | |||
line 10: 4412810 + 0 with itself and mate mapped | |||
line 11: 10170 + 0 singletons (0.22%:-nan%) | |||
line 12: 57898 + 0 with mate mapped to a different chr | |||
line 13: 44330 + 0 with mate mapped to a different chr (mapQ>=5) | |||
* '''list occurring BAM flags''': along with their number of occurrences | |||
samtools view $OUT | cut -f2 | sort | uniq -c | sort -k1,1nr -k2,2n | |||
# | awk '{sum+=$1}END{print sum}' # <- same as line 1 | |||
[http://www.htslib.org/doc/sam.html list] of bits used to form the flags | |||
[http://broadinstitute.github.io/picard/explain-flags.html this] website is useful to interpret flags | |||
for instance, if there are N entries with flag 99, there should also be N entries with flag 147; idem for 83 and 163; idem for 77 and 141; etc | |||
* '''entries in the BAM file''': same as line 1; I use "entries" instead of "alignments" because the BAM file can contain unmapped reads (see below) | |||
samtools view -c $OUT # 4635834 | |||
* '''mapped entries''': same as line 5 | |||
samtools view -F 0x0004 -c $OUT # 4443270 | |||
echo "scale=3; 100 * 4443270 / 4635834" | bc -l # 95.846% | |||
* '''unmapped entries''': same as line 1 - line 5 | |||
samtools view -f 0x0004 -c $OUT # 192564 | |||
* '''pairs in the BAM file''': should be equal to the nb of input pairs | |||
samtools view $OUT | cut -f1 | sort | uniq | wc -l # 2307772 | |||
* '''not-primary/supplementary entries''': see [http://seqanswers.com/forums/showthread.php?t=40239 details] | |||
samtools view -f 0x0100 -c $OUT # 20290 <- same as line 2 | |||
samtools view -f 0x0800 -c $OUT # 0 <- same as line 3 | |||
* '''R1 and R2 entries''': | |||
nbE1=$(samtools view -f 0x0040 -c $OUT) # 2316858 | |||
nbE2=$(samtools view -f 0x0080 -c $OUT) # 2318976 | |||
echo "$nbE1 + $nbE2" | bc -l # 4635834 <- same as line 1 | |||
* '''R1 and R2 reads''': | |||
samtools view -f 0x0040 $OUT | cut -f1 | sort | uniq -c > tmp1 | |||
wc -l < tmp1 # 2307772 <- same as line 7 | |||
samtools view -f 0x0080 $OUT | cut -f1 | sort | uniq -c > tmp2 | |||
wc -l < tmp2 # 2307772 <- same as line 8 | |||
echo "$(wc -l < tmp1) + $(wc -l < tmp2)" | bc -l # 4615544 <- same as line 6 | |||
cat tmp1 | awk '{print $1}' | sort | uniq -c # some input reads can have multiple entries | |||
cat tmp2 | awk '{print $1}' | sort | uniq -c # idem | |||
* '''entries properly paired in sequencing''': irrespective of mapping | |||
samtools view -f 0x0001 -c $OUT # 4635834 <- same as line 1 if all input reads are paired | |||
samtools view -f 0x0001 -F 0x0100 -c $OUT # 4615544 <- same as line 6 | |||
* '''entries properly paired in mapping''': same as line 9 | |||
samtools view -f 0x0002 -F 0x0100 -c $OUT # 4299122 | |||
echo "scale=3; 100 * 4299122 / (4635834 - 20290)" | bc -l # 93.144% | |||
* '''pairs, properly paired in mapping, in primary alignment''': to be compared to the number of input pairs (see above) | |||
samtools view -f 0x0002 -F 0x0100 $OUT | cut -f1 | sort | uniq | wc -l # 2149561 | |||
echo "scale=3; 100 * 2149561 / 2307772" | bc -l # 93.144 | |||
* '''singletons''': same as line 11 | |||
samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT # 10170 | |||
* '''different chromosomes''': same as lines 12 and 13 | |||
samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | cut -f7 | sort | uniq -c # to list all possible values the 7-th field is taking | |||
samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 57898 | |||
samtools view -q 5 -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 44330 | |||
* '''strand''': get all alignments on the reverse strand (use <code>-F 0x10</code> for the forward) | |||
samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c | |||
samtools flags 113 # idem for 117, 121, etc | |||
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> |
Latest revision as of 01:02, 27 September 2017
Project name | Main project page Next entry |
One-liners for high-throughput sequencing dataIn this document, I'll focus on paired-end reads using, as an example, data from a GBS (RAD-seq) experiment in grapevine.
IN1="reads_R1.fastq.gz" IN2="reads_R2.fastq.gz" OUT="alignments.bam"
nbLines=$(zcat $IN1 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772 nbLines=$(zcat $IN2 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772
zcat $IN1 | head -4 zcat $IN2 | head -4
samtools view $OUT | head -1 | tr "\t" "\n" | nl -n ln
samtools flagstat $OUT which returns something like: line 1: 4635834 + 0 in total (QC-passed reads + QC-failed reads) line 2: 20290 + 0 secondary line 3: 0 + 0 supplimentary line 4: 0 + 0 duplicates line 5: 4443270 + 0 mapped (95.85%:-nan%) line 6: 4615544 + 0 paired in sequencing line 7: 2307772 + 0 read1 line 8: 2307772 + 0 read2 line 9: 4299122 + 0 properly paired (93.14%:-nan%) line 10: 4412810 + 0 with itself and mate mapped line 11: 10170 + 0 singletons (0.22%:-nan%) line 12: 57898 + 0 with mate mapped to a different chr line 13: 44330 + 0 with mate mapped to a different chr (mapQ>=5)
samtools view $OUT | cut -f2 | sort | uniq -c | sort -k1,1nr -k2,2n # | awk '{sum+=$1}END{print sum}' # <- same as line 1 list of bits used to form the flags this website is useful to interpret flags for instance, if there are N entries with flag 99, there should also be N entries with flag 147; idem for 83 and 163; idem for 77 and 141; etc
samtools view -c $OUT # 4635834
samtools view -F 0x0004 -c $OUT # 4443270 echo "scale=3; 100 * 4443270 / 4635834" | bc -l # 95.846%
samtools view -f 0x0004 -c $OUT # 192564
samtools view $OUT | cut -f1 | sort | uniq | wc -l # 2307772
samtools view -f 0x0100 -c $OUT # 20290 <- same as line 2 samtools view -f 0x0800 -c $OUT # 0 <- same as line 3
nbE1=$(samtools view -f 0x0040 -c $OUT) # 2316858 nbE2=$(samtools view -f 0x0080 -c $OUT) # 2318976 echo "$nbE1 + $nbE2" | bc -l # 4635834 <- same as line 1
samtools view -f 0x0040 $OUT | cut -f1 | sort | uniq -c > tmp1 wc -l < tmp1 # 2307772 <- same as line 7 samtools view -f 0x0080 $OUT | cut -f1 | sort | uniq -c > tmp2 wc -l < tmp2 # 2307772 <- same as line 8 echo "$(wc -l < tmp1) + $(wc -l < tmp2)" | bc -l # 4615544 <- same as line 6 cat tmp1 | awk '{print $1}' | sort | uniq -c # some input reads can have multiple entries cat tmp2 | awk '{print $1}' | sort | uniq -c # idem
samtools view -f 0x0001 -c $OUT # 4635834 <- same as line 1 if all input reads are paired samtools view -f 0x0001 -F 0x0100 -c $OUT # 4615544 <- same as line 6
samtools view -f 0x0002 -F 0x0100 -c $OUT # 4299122 echo "scale=3; 100 * 4299122 / (4635834 - 20290)" | bc -l # 93.144%
samtools view -f 0x0002 -F 0x0100 $OUT | cut -f1 | sort | uniq | wc -l # 2149561 echo "scale=3; 100 * 2149561 / 2307772" | bc -l # 93.144
samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT # 10170
samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | cut -f7 | sort | uniq -c # to list all possible values the 7-th field is taking samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 57898 samtools view -q 5 -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 44330
samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c samtools flags 113 # idem for 117, 121, etc |