User:Timothee Flutre/Notebook/Postdoc/2012/05/25: Difference between revisions
From OpenWetWare
(→About one-liners in data wrangling: add get length fasta) |
Line 15:
** [https://en.wikipedia.org/wiki/Grep grep]
** [https://en.wikipedia.org/wiki/Sed sed]
** [https://en.wikipedia.org/wiki/GNU_Core_Utilities GNU coreutils] (head, tail, cut, uniq, sort, tr, od, ...)
* '''Tutorials''':
** [http://en.flossmanuals.net/command-line/index/ Introduction to the command-line]
** [http://www.oliverelliott.org/article/computing/tut_unix/ Introduction to Unix] by Oliver Elliott
** [http://www.ibm.com/developerworks/aix/library/au-unixtext/index.html Introduction to text manipulation on UNIX-based systems] by Brad Yoes (IBM)
** [https://github.com/jlevy/the-art-of-command-line The Art of the Command Line] by Joshua Levy
** [http://www.tldp.org/LDP/abs/html/ Advanced Bash-Scripting Guide] by Mendel Cooper
** [http://www.shellcheck.net/ ShellCheck] ([http://explainshell.com/ explainshell])
** [http://quinlanlab.org/tutorials/cshl2013/bedtools.html tutorial for bedtools]
** [http://www.commentcamarche.net/faq/8386-kit-de-survie-linux kit de survie Linux] (in French)
Line 35: | Line 38:
* '''Use absolute values:'''
$ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}'
* '''Summarize numbers:''' with R
$ for i in {1..10}; do echo $i; done | Rscript -e 'summary(read.table("stdin"))'
Line 62: | Line 69:
for(i=1;i<=length(a[2]);i++)printf "}"; printf"\n"}' probes.fa > probes.fq
</nowiki>
* '''Extract sequence from fasta''':
$ echo -e ">chr1\nAAA\n>chr2\nTTT\n>chr3\nGGG\n" | awk 'BEGIN{RS=">"} /chr2/ {print $0}'
Line 80: | Line 92:
$ awk 'BEGIN{RS=">"} {split($0,a,"\n"); if(length(a)==0) next; seqlen=0; for(i=2;i<=length(a);++i){seqlen += length(a[i])}; printf a[1]"\t"seqlen"\n"}' sequences.fa
* '''Get the bases 6 to 9 of each sequence in a fastq file''': provided that each read uses exactly 4 lines
$ zcat reads.fq.gz | awk '(NR % 4 == 2)' | cut -c 6-9
* '''Reverse-complement a DNA sequence''':
$ echo "AAATGAGCC" | rev | tr ATGC TACG
* '''Identify a non-breaking space''':
** download additional file 4 (Table S3) of [http://dx.doi.org/10.1186/s12870-016-0754-z this] article
** open it with LibreOffice Calc
** save it as "Text CSV" with "Character set = Unicode (UTF-8)", "Field delimiter = {Tab}", "Text delimiter = " (i.e. empty), and keep "Save cell content as shown" checked
** play with the following commands (and see the [http://ascii-code.com/ ASCII code] table):
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | od -An -c -b
L i s z t e s 302 240 f e h e r \n
114 151 163 172 164 145 163 302 240 146 145 150 145 162 012
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b
L i s z t e s f e h e r \n
114 151 163 172 164 145 163 040 146 145 150 145 162 012
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
Revision as of 05:42, 27 June 2016
About one-liners in data wrangling
$ for i in {1..10}; do echo $i; done | sed 3,6d
$ for i in {1..20}; do echo $i; done | sed -n 3,5p
$ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}'
$ for i in {1..10}; do echo $i; done | Rscript -e 'summary(read.table("stdin"))'
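When R is not installed, a rough stand-in can be written in awk alone. The sketch below is not equivalent to R's `summary`: the hypothetical `summarize` function (a name invented here) reports only the count, minimum, maximum and mean, with no quartiles, and assumes one number per input line.

```shell
# Hypothetical helper, not part of the original notebook.
# Reads one number per line and prints count, min, max and mean.
summarize() {
  awk '
    NR == 1 { min = max = $1 + 0 }   # initialise both bounds with the first value
    {
      x = $1 + 0                     # force numeric interpretation
      if (x < min) min = x
      if (x > max) max = x
      sum += x
    }
    END { if (NR > 0) printf "n=%d min=%g max=%g mean=%g\n", NR, min, max, sum / NR }
  '
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | summarize   # prints: n=10 min=1 max=10 mean=5.5
```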
$ echo -e "gene\tsnp\tpvalue\ng1\ts1\t0.3\ng1\ts2\t0.002\ng2\ts2\t0.7\ng2\ts3\t0.05" > dat.txt
$ cat dat.txt
gene snp pvalue
g1 s1 0.3
g1 s2 0.002
g2 s2 0.7
g2 s3 0.05
$ cat dat.txt | sed 1d | sort -k1,1 -k3,3g | awk '{print $3"\t"$2"\t"$1}' | uniq -f2 | awk '{print $3"\t"$2"\t"$1}'
g1 s2 0.002
g2 s3 0.05
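The output shown suggests the goal is to keep, for each gene, the row with the smallest p-value. Under that assumption, the same result can be sketched in a single awk pass instead of the sort/uniq combination. The function name `best_per_gene` is made up for this illustration; it assumes a one-line header and whitespace-separated columns gene, snp, pvalue.

```shell
# Hypothetical helper, not part of the original notebook.
# Keeps, per gene (column 1), the input row with the smallest p-value (column 3).
best_per_gene() {
  awk '
    NR > 1 {                                   # skip the header line
      p = $3 + 0                               # numeric p-value, also handles 1e-5
      if (!($1 in best) || p < best[$1]) {
        best[$1] = p
        line[$1] = $0
      }
    }
    END { for (g in line) print line[g] }
  ' | sort -k1,1                               # awk array order is unspecified
}

printf 'gene\tsnp\tpvalue\ng1\ts1\t0.3\ng1\ts2\t0.002\ng2\ts2\t0.7\ng2\ts3\t0.05\n' | best_per_gene
```

Because the comparison is numeric, this variant does not depend on how `sort` orders strings such as `1e-5` and `0.3`.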
$ subgroups=("s1" "s2" "s3" "s4"); for i in {0..2}; do let a=$i+1; for j in $(seq $a 3); do s1=${subgroups[$i]}; s2=${subgroups[$j]}; echo $s1 $s2; done; done
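The nested loop above hard-codes the array length (indices 0..2 and 3). A sketch that works for any number of items, using only positional parameters so that it also runs in a plain POSIX shell, might look as follows; `all_pairs` is a name invented for this example.

```shell
# Hypothetical helper, not part of the original notebook.
# Prints every unordered pair of its arguments, one pair per line.
all_pairs() {
  while [ "$#" -gt 1 ]; do
    first=$1
    shift                          # remaining arguments are the partners of $first
    for other in "$@"; do
      echo "$first $other"
    done
  done
}

all_pairs s1 s2 s3 s4
```

For four items this prints the same six pairs, in the same order, as the loop over `subgroups`.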
$ awk 'BEGIN{RS=">"} {if(NF==0)next; split($0,a,"\n"); printf "@"a[1]"\n"a[2]"\n+\n"; \
for(i=1;i<=length(a[2]);i++)printf "}"; printf"\n"}' probes.fa > probes.fq
$ echo -e ">chr1\nAAA\n>chr2\nTTT\n>chr3\nGGG\n" | awk 'BEGIN{RS=">"} /chr2/ {print $0}'
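Note that the pattern `/chr2/` is matched against the whole record, so it would also select `chr20`, `chr21`, and so on, and the printed record loses its leading `>`. A possible stricter variant is sketched below; the function name `fasta_extract` is hypothetical, and the sketch assumes the sequence identifier is the first word of the header line.

```shell
# Hypothetical helper, not part of the original notebook.
# usage: fasta_extract NAME < in.fa
# Prints the record whose header's first word is exactly NAME.
fasta_extract() {
  awk -v name="$1" '
    BEGIN { RS = ">"; FS = "\n" }       # one record per sequence, one field per line
    {
      split($1, h, " ")                 # h[1] is the identifier, the rest is description
      if (h[1] == name) printf ">%s", $0
    }
  '
}

printf '>chr1\nAAA\n>chr2\nTTT\n>chr20\nCCC\n' | fasta_extract chr2
```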
$ { echo -e "x\ty"; for i in {1..10}; do echo -e "$i\t$RANDOM"; done; } | (read -r; printf "%s\n" "$REPLY"; sort -k2,2n)
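The idea of sorting a table while leaving its header line in place can be wrapped in a small function. This is only a sketch: `sort_keep_header` is a name invented here, it assumes exactly one header line, and it passes its arguments straight to `sort`.

```shell
# Hypothetical helper, not part of the original notebook.
# usage: sort_keep_header [sort options] < table.txt
sort_keep_header() {
  IFS= read -r header              # consume the first line only
  printf '%s\n' "$header"          # print it unchanged
  sort "$@"                        # sort whatever is left on stdin
}

printf 'x\ty\n1\t30\n2\t10\n3\t20\n' | sort_keep_header -k2,2n
```

The header and the data must arrive through the same pipe for this to work, which is why the `printf` above emits both.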
$ echo -e "gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1" > file_all.txt
$ echo -e "gene1\tsnp1" > file_subset.txt
$ awk 'NR==FNR{a[$1$2]++;next;}{x=$1$2;if(x in a)print $0}' file_subset.txt <(sed 1d file_all.txt)
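One caveat with building the key as `$1$2` is that different pairs can collapse to the same string (for instance `gene1` + `2snp` and `gene12` + `snp`). A hedged alternative is awk's multi-dimensional subscript `a[$1,$2]`, which joins the parts with `SUBSEP`. The sketch below uses a made-up function name, `subset_rows`, and skips the header of the second file with `FNR > 1` rather than with process substitution.

```shell
# Hypothetical helper, not part of the original notebook.
# usage: subset_rows SUBSET_FILE ALL_FILE
# Prints the rows of ALL_FILE (header skipped) whose first two columns
# appear together in SUBSET_FILE.
subset_rows() {
  awk '
    NR == FNR { keep[$1, $2]; next }        # first file: remember each (gene, snp) pair
    FNR > 1 && (($1, $2) in keep)           # second file: skip header, print matches
  ' "$1" "$2"
}

tmpdir=$(mktemp -d)
printf 'gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1\n' > "$tmpdir/file_all.txt"
printf 'gene1\tsnp1\n' > "$tmpdir/file_subset.txt"
subset_rows "$tmpdir/file_subset.txt" "$tmpdir/file_all.txt"
rm -r "$tmpdir"
```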
$ awk 'BEGIN{RS=">"} {split($0,a,"\n"); if(length(a)==0) next; seqlen=0; for(i=2;i<=length(a);++i){seqlen += length(a[i])}; printf a[1]"\t"seqlen"\n"}' sequences.fa
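Two details of the one-liner above may matter in practice: `length(a)` on an array is not supported by every awk, and using the header as the `printf` format string misbehaves if it contains a `%`. A sketch that avoids both, relying on the return value of `split` and an explicit format, is given below; the function name `fasta_lengths` is an invention for this example.

```shell
# Hypothetical helper, not part of the original notebook.
# usage: fasta_lengths < sequences.fa
# Prints "header<TAB>length" for each record, summing over wrapped sequence lines.
fasta_lengths() {
  awk '
    BEGIN { RS = ">" }
    {
      n = split($0, a, "\n")       # a[1] is the header, a[2..n] are sequence lines
      if (n == 0) next             # empty record before the first ">"
      seqlen = 0
      for (i = 2; i <= n; i++) seqlen += length(a[i])
      printf "%s\t%d\n", a[1], seqlen
    }
  '
}

printf '>s1\nACGT\nAC\n>s2\nGGG\n' | fasta_lengths
```

With the toy input, `s1` spans two lines (4 + 2 bases) and `s2` one line, so the reported lengths are 6 and 3.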
$ zcat reads.fq.gz | awk '(NR % 4 == 2)' | cut -c 6-9
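Selecting every line with `NR % 4 == 2` is only correct when each read occupies exactly four lines. One way to check that assumption before trusting the result is sketched below; `check_fastq4` is a hypothetical name, and the check is deliberately shallow (it only looks at the first character of header and separator lines and at the total line count).

```shell
# Hypothetical helper, not part of the original notebook.
# Exits 0 if stdin looks like 4-line FASTQ, 1 otherwise.
check_fastq4() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }   # header lines start with @
    NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }   # separator lines start with +
    END { if (bad || NR % 4 != 0) exit 1 }
  '
}

# Two made-up reads used only for illustration
toy=$(printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nTTTTGGGGCC\n+\nIIIIIIIIII\n')
printf '%s\n' "$toy" | check_fastq4 &&
  printf '%s\n' "$toy" | awk 'NR % 4 == 2' | cut -c 6-9   # prints CGTA then GGGC
```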
$ echo "AAATGAGCC" | rev | tr ATGC TACG
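The `tr ATGC TACG` mapping above leaves lowercase bases untouched. A small extension that also complements lowercase letters, and passes any other character such as `N` through unchanged, could be sketched like this (`revcomp` is a name chosen for the example; it handles one sequence per line, not FASTA headers):

```shell
# Hypothetical helper, not part of the original notebook.
# Reverse-complements each input line; handles upper and lower case.
revcomp() {
  rev | tr 'ACGTacgt' 'TGCAtgca'
}

echo "AAATGAGCC" | revcomp   # prints GGCTCATTT
echo "acgtN" | revcomp       # prints Nacgt
```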
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | od -An -c -b
L i s z t e s 302 240 f e h e r \n
114 151 163 172 164 145 163 302 240 146 145 150 145 162 012
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b
L i s z t e s f e h e r \n
114 151 163 172 164 145 163 040 146 145 150 145 162 012
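The octal pair `302 240` is the UTF-8 encoding of the no-break space (bytes 0xC2 0xA0). The `\xC2\xA0` escape in the sed pattern is a GNU extension, so a variant that builds the two bytes with `printf` octal escapes may be more portable. The sketch below reproduces the situation without the downloaded file; the names `nbsp`, `strip_nbsp` and `count_nbsp_lines` are invented for this example.

```shell
# Hypothetical helpers, not part of the original notebook.
nbsp=$(printf '\302\240')                      # UTF-8 no-break space as two raw bytes

strip_nbsp() { sed "s/$nbsp/ /g"; }            # replace each no-break space by a plain space
count_nbsp_lines() { grep -c "$nbsp"; }        # number of lines containing at least one

# Made-up input mimicking the cell from the table
printf 'Lisztes\302\240feher\n' | od -An -c
printf 'Lisztes\302\240feher\n' | strip_nbsp | od -An -c
printf 'Lisztes\302\240feher\nplain text\n' | count_nbsp_lines
```

Note that `grep -c` exits with status 1 when the count is zero, which matters if the function is used under `set -e`.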