User:Timothee Flutre/Notebook/Postdoc/2013/12/01: Difference between revisions
From OpenWetWare
(→One-liners for high-throughput sequencing data: finish prop. paired in seq) |
(fix raw html notebook nav) |
||
(7 intermediate revisions by one other user not shown) | |||
Line 2: | Line 2: | ||
|- | |- | ||
|style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]]<span style="font-size:22px;"> Project name</span> | |style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]]<span style="font-size:22px;"> Project name</span> | ||
|style="background-color: #F2F2F2" align="center"| | |style="background-color: #F2F2F2" align="center"|[[File:Report.png|frameless|link={{#sub:{{FULLPAGENAME}}|0|-11}}]][[{{#sub:{{FULLPAGENAME}}|0|-11}}|Main project page]]<br />{{#if:{{#lnpreventry:{{FULLPAGENAME}}}}|[[File:Resultset_previous.png|frameless|link={{#lnpreventry:{{FULLPAGENAME}}}}]][[{{#lnpreventry:{{FULLPAGENAME}}}}{{!}}Previous entry]] }}{{#if:{{#lnnextentry:{{FULLPAGENAME}}}}|[[{{#lnnextentry:{{FULLPAGENAME}}}}{{!}}Next entry]][[File:Resultset_next.png|frameless|link={{#lnnextentry:{{FULLPAGENAME}}}}]]}} | ||
|- | |- | ||
| colspan="2"| | | colspan="2"| | ||
Line 8: | Line 8: | ||
==One-liners for high-throughput sequencing data== | ==One-liners for high-throughput sequencing data== | ||
* '''softwares''': see BWA, Bowtie, MOSAIK, etc; they take fastq files as input and return bam files as output | In this document, I'll focus on paired-end reads using, as an example, data from a GBS (RAD-seq) experiment in grapevine. | ||
* '''mapping softwares''': see BWA, Bowtie, MOSAIK, etc; they (usually) take fastq files as input and return bam files as output | |||
IN1="reads_R1.fastq.gz" | IN1="reads_R1.fastq.gz" | ||
Line 14: | Line 16: | ||
OUT="alignments.bam" | OUT="alignments.bam" | ||
* ''' | * '''number of input pairs''': there are 2,307,772 pairs and 4,615,544 reads | ||
nbLines=$(zcat $IN1 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772 | |||
nbLines=$(zcat $IN2 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772 | |||
* '''understand one read pair in the Fastq format''': | |||
zcat $IN1 | head -4 | |||
zcat $IN2 | head -4 | |||
* ''' | * '''understand one alignment in the SAM format''': | ||
samtools view $OUT | | samtools view $OUT | head -1 | tr "\t" "\n" | nl -n ln | ||
* ''' | * '''global alignment statistics''': | ||
samtools flagstat $OUT | samtools flagstat $OUT | ||
Line 42: | Line 50: | ||
line 13: 44330 + 0 with mate mapped to a different chr (mapQ>=5) | line 13: 44330 + 0 with mate mapped to a different chr (mapQ>=5) | ||
* '''list | * '''list occurring BAM flags''': along with their number of occurrences | ||
samtools view $OUT | cut -f2 | sort | uniq -c | sort -k1,1nr -k2,2n | samtools view $OUT | cut -f2 | sort | uniq -c | sort -k1,1nr -k2,2n | ||
Line 53: | Line 61: | ||
for instance, if there are N entries with flag 99, there should also be N entries with flag 147; idem for 83 and 163; idem for 77 and 141; etc | for instance, if there are N entries with flag 99, there should also be N entries with flag 147; idem for 83 and 163; idem for 77 and 141; etc | ||
* ''' | * '''entries in the BAM file''': same as line 1; I use "entries" instead of "alignments" because the BAM file can contain unmapped reads (see below) | ||
samtools view -c $OUT # 4635834 | |||
* '''mapped entries''': same as line 5 | |||
* | samtools view -F 0x0004 -c $OUT # 4443270 | ||
echo "scale=3; 100 * 4443270 / 4635834" | bc -l # 95.846% | |||
* '''unmapped entries''': same as line 1 - line 5 | |||
samtools view -f 0x0004 -c $OUT # 192564 | |||
samtools view | * '''pairs in the BAM file''': should be equal to the nb of input pairs | ||
samtools view $OUT | cut -f1 | sort | uniq | wc -l # 2307772 | |||
* '''not-primary/supplementary entries''': see [http://seqanswers.com/forums/showthread.php?t=40239 details] | * '''not-primary/supplementary entries''': see [http://seqanswers.com/forums/showthread.php?t=40239 details] | ||
samtools view -f 0x0100 -c $OUT # same as line 2 | samtools view -f 0x0100 -c $OUT # 20290 <- same as line 2 | ||
samtools view -f 0x0800 -c $OUT | samtools view -f 0x0800 -c $OUT # 0 <- same as line 3 | ||
* '''R1 and R2 entries''': | * '''R1 and R2 entries''': | ||
nbE1=$(samtools view -f 0x0040 -c $OUT) | nbE1=$(samtools view -f 0x0040 -c $OUT) # 2316858 | ||
nbE2=$(samtools view -f 0x0080 -c $OUT) | nbE2=$(samtools view -f 0x0080 -c $OUT) # 2318976 | ||
echo "$nbE1 + $nbE2" | bc -l # same as line 1 | echo "$nbE1 + $nbE2" | bc -l # 4635834 <- same as line 1 | ||
* '''R1 and R2 reads''': | * '''R1 and R2 reads''': | ||
samtools view -f 0x0040 $OUT | cut -f1 | sort | uniq -c > tmp1 | samtools view -f 0x0040 $OUT | cut -f1 | sort | uniq -c > tmp1 | ||
wc -l < tmp1 # 2307772 <- same as line 7 | |||
samtools view -f 0x0080 $OUT | cut -f1 | sort | uniq -c > tmp2 | samtools view -f 0x0080 $OUT | cut -f1 | sort | uniq -c > tmp2 | ||
wc -l < | wc -l < tmp2 # 2307772 <- same as line 8 | ||
wc -l < tmp2 # same as line | echo "$(wc -l < tmp1) + $(wc -l < tmp2)" | bc -l # 4615544 <- same as line 6 | ||
cat tmp1 | awk '{print $1}' | sort | uniq -c | cat tmp1 | awk '{print $1}' | sort | uniq -c # some input reads can have multiple entries | ||
cat tmp2 | awk '{print $1}' | sort | uniq -c | cat tmp2 | awk '{print $1}' | sort | uniq -c # idem | ||
* '''entries properly paired in sequencing''': irrespective of mapping | |||
samtools view -f 0x0001 -c $OUT # 4635834 <- same as line 1 if all input reads are paired | |||
samtools view -f 0x0001 -F 0x0100 -c $OUT # 4615544 <- same as line 6 | |||
* '''entries properly paired in mapping''': same as line 9 | |||
samtools view -f 0x0002 -F 0x0100 -c $OUT # 4299122 | |||
echo "scale=3; 100 * 4299122 / (4635834 - 20290)" | bc -l # 93.144% | |||
* '''pairs, properly paired in mapping, in primary alignment''': to be compared to the number of input pairs (see above) | |||
samtools view -f 0x0002 -F 0x0100 $OUT | cut -f1 | sort | uniq | wc -l # 2149561 | |||
echo "scale=3; 100 * 2149561 / 2307772" | bc -l # 93.144 | |||
* '''singletons''': same as line 11 | |||
samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT # 10170 | |||
* '''different chromosomes''': same as lines 12 and 13 | |||
samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | cut -f7 | sort | uniq -c # to list all possible values the 7-th field is taking | |||
samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 57898 | |||
samtools view -q 5 -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 44330 | |||
* ''' | * '''strand''': get all alignments on the reverse strand (use <code>-F 0x10</code> for the forward) | ||
samtools view -f | samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c | ||
samtools flags 113 # idem for 117, 121, etc | |||
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> |
Latest revision as of 01:02, 27 September 2017
Project name | Main project page Next entry |
One-liners for high-throughput sequencing dataIn this document, I'll focus on paired-end reads using, as an example, data from a GBS (RAD-seq) experiment in grapevine.
IN1="reads_R1.fastq.gz" IN2="reads_R2.fastq.gz" OUT="alignments.bam"
nbLines=$(zcat $IN1 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772 nbLines=$(zcat $IN2 | wc -l); echo "scale=0; "${nbLines}"/4" | bc -l # 2307772
zcat $IN1 | head -4 zcat $IN2 | head -4
samtools view $OUT | head -1 | tr "\t" "\n" | nl -n ln
samtools flagstat $OUT which returns something like: line 1: 4635834 + 0 in total (QC-passed reads + QC-failed reads) line 2: 20290 + 0 secondary line 3: 0 + 0 supplimentary line 4: 0 + 0 duplicates line 5: 4443270 + 0 mapped (95.85%:-nan%) line 6: 4615544 + 0 paired in sequencing line 7: 2307772 + 0 read1 line 8: 2307772 + 0 read2 line 9: 4299122 + 0 properly paired (93.14%:-nan%) line 10: 4412810 + 0 with itself and mate mapped line 11: 10170 + 0 singletons (0.22%:-nan%) line 12: 57898 + 0 with mate mapped to a different chr line 13: 44330 + 0 with mate mapped to a different chr (mapQ>=5)
samtools view $OUT | cut -f2 | sort | uniq -c | sort -k1,1nr -k2,2n # | awk '{sum+=$1}END{print sum}' # <- same as line 1 list of bits used to form the flags this website is useful to interpret flags for instance, if there are N entries with flag 99, there should also be N entries with flag 147; idem for 83 and 163; idem for 77 and 141; etc
samtools view -c $OUT # 4635834
samtools view -F 0x0004 -c $OUT # 4443270 echo "scale=3; 100 * 4443270 / 4635834" | bc -l # 95.846%
samtools view -f 0x0004 -c $OUT # 192564
samtools view $OUT | cut -f1 | sort | uniq | wc -l # 2307772
samtools view -f 0x0100 -c $OUT # 20290 <- same as line 2 samtools view -f 0x0800 -c $OUT # 0 <- same as line 3
nbE1=$(samtools view -f 0x0040 -c $OUT) # 2316858 nbE2=$(samtools view -f 0x0080 -c $OUT) # 2318976 echo "$nbE1 + $nbE2" | bc -l # 4635834 <- same as line 1
samtools view -f 0x0040 $OUT | cut -f1 | sort | uniq -c > tmp1 wc -l < tmp1 # 2307772 <- same as line 7 samtools view -f 0x0080 $OUT | cut -f1 | sort | uniq -c > tmp2 wc -l < tmp2 # 2307772 <- same as line 8 echo "$(wc -l < tmp1) + $(wc -l < tmp2)" | bc -l # 4615544 <- same as line 6 cat tmp1 | awk '{print $1}' | sort | uniq -c # some input reads can have multiple entries cat tmp2 | awk '{print $1}' | sort | uniq -c # idem
samtools view -f 0x0001 -c $OUT # 4635834 <- same as line 1 if all input reads are paired samtools view -f 0x0001 -F 0x0100 -c $OUT # 4615544 <- same as line 6
samtools view -f 0x0002 -F 0x0100 -c $OUT # 4299122 echo "scale=3; 100 * 4299122 / (4635834 - 20290)" | bc -l # 93.144%
samtools view -f 0x0002 -F 0x0100 $OUT | cut -f1 | sort | uniq | wc -l # 2149561 echo "scale=3; 100 * 2149561 / 2307772" | bc -l # 93.144
samtools view -f 0x0008 -F 0x0004 -F 0x0100 -c $OUT # 10170
samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | cut -f7 | sort | uniq -c # to list all possible values the 7-th field is taking samtools view -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 57898 samtools view -q 5 -F 0x0004 -F 0x0008 -F 0x0100 $OUT | awk '($7 != "=")' | wc -l # 44330
samtools view -f 0x10 $OUT | cut -f2 | sort | uniq -c samtools flags 113 # idem for 117, 121, etc |