User:Timothee Flutre/Notebook/Postdoc/2012/05/25

From OpenWetWare

Revision as of 04:20, 7 January 2014

About one-liners in data wrangling

  • Motivation: After receiving raw data and before drawing robust conclusions, we (almost) always need to reformat them and extract a few key summary statistics. Fortunately, this activity, called data wrangling, is particularly quick and easy on GNU/Linux computers. For instance, using GNU utilities via the command-line interface, we can write a "one-liner": a sequence of tools chained by pipes, in which the output of one tool is the input of the next. This is not only easy but also very powerful, as shown below.
  • Toolbox: often available by default on many computers with GNU/Linux
      ○ Bash
      ○ AWK
      ○ grep
      ○ sed
      ○ GNU coreutils (head, tail, cut, uniq, sort, tr, ...)
  • Tutorials:
      ○ Introduction to the command-line: http://en.flossmanuals.net/command-line/index/
      ○ Introduction to text manipulation on UNIX-based systems, by Brad Yoes (IBM): http://www.ibm.com/developerworks/aix/library/au-unixtext/index.html
      ○ Advanced Bash-Scripting Guide, by Mendel Cooper: http://www.tldp.org/LDP/abs/html/
      ○ tutorial for bedtools: http://quinlanlab.org/tutorials/cshl2013/bedtools.html
      ○ kit de survie Linux (Linux survival kit, in French): http://www.commentcamarche.net/faq/8386-kit-de-survie-linux

  • Skip a subset of successive lines:
$ for i in {1..10}; do echo $i; done | sed 3,6d


  • Extract a subset of successive lines:
$ for i in {1..20}; do echo $i; done | sed -n 3,5p
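
On a very large input, sed keeps reading to the end of the file even after the last wanted line has been printed. A minimal sketch of a variant that stops early is given here, together with an awk equivalent as a cross-check. The function names extract_lines and extract_lines_awk are hypothetical helpers introduced for this illustration only.

```shell
# Hypothetical helper (not part of the original notes):
# print lines $1..$2 of standard input, then stop reading.
extract_lines() {
    # "p" prints the lines of the range; "q" quits once the last one is reached
    sed -n "${1},${2}p;${2}q"
}

# Same selection written with awk.
extract_lines_awk() {
    # exit as soon as we are past the last wanted line; print from the first one on
    awk -v first="$1" -v last="$2" 'NR > last { exit } NR >= first'
}

seq 1 20 | extract_lines 3 5
seq 1 20 | extract_lines_awk 3 5
```

Both calls should print the lines 3, 4 and 5. The early exit only matters for big files; on small ones the plain "sed -n 3,5p" is just as good.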


  • Use absolute values:
$ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}'


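  • Extract the best snp per gene: that is, the snp with the smallest p-value; we sort by gene then by p-value (-g compares numbers, including scientific notation), move the gene to the last column, and let "uniq -f2" keep the first row of each gene, hence the reversed column order in the output
$ echo -e "gene\tsnp\tpvalue\ng1\ts1\t0.3\ng1\ts2\t0.002\ng2\ts2\t0.7\ng2\ts3\t0.05" > dat.txt
$ cat dat.txt
gene    snp     pvalue
g1      s1      0.3
g1      s2      0.002
g2      s2      0.7
g2      s3      0.05
$ sed 1d dat.txt | sort -k1,1 -k3,3g | awk '{print $3"\t"$2"\t"$1}' | uniq -f2
0.002   s2      g1
0.05    s3      g2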
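
The same selection can also be sketched with awk alone, which keeps the header and the original column order and does not need sorted input. This is only an illustrative alternative, not the approach used above; the function name best_snp_per_gene is a hypothetical helper.

```shell
# Rebuild the toy data set used in the example above.
printf 'gene\tsnp\tpvalue\ng1\ts1\t0.3\ng1\ts2\t0.002\ng2\ts2\t0.7\ng2\ts3\t0.05\n' > dat.txt

# Hypothetical helper: for each gene, keep the row with the smallest p-value.
best_snp_per_gene() {
    awk -F'\t' '
        NR == 1 { print; next }                  # pass the header through unchanged
        !($1 in best) { order[++n] = $1 }        # remember genes in order of first appearance
        !($1 in best) || $3 + 0 < best[$1] {     # first row of a gene, or a smaller p-value
            best[$1] = $3 + 0                    # "+ 0" forces a numeric comparison
            row[$1] = $0
        }
        END { for (i = 1; i <= n; i++) print row[order[i]] }
    ' "$1"
}

best_snp_per_gene dat.txt
```

On dat.txt this should print the header followed by the rows of s2 for g1 and of s3 for g2. The price is that one row per gene is held in memory, which is usually fine for gene-level summaries.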


  • Loop over pairs:
$ subgroups=("s1" "s2" "s3" "s4"); for i in {0..2}; do let a=$i+1; for j in $(seq $a 3); do s1=${subgroups[$i]}; s2=${subgroups[$j]}; echo $s1 $s2; done; done
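
The loop above hard-codes the number of subgroups (indices 0..2 and 3). A minimal sketch of a version that works for any number of items is given here; it uses the positional parameters instead of a Bash array, and the function name pairs is a hypothetical helper.

```shell
# Hypothetical helper: print every unordered pair of its arguments, one pair per line.
pairs() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift                        # "$@" now holds only the items after "first"
        for other in "$@"; do
            printf '%s %s\n' "$first" "$other"
        done
    done
}

pairs s1 s2 s3 s4
```

With four subgroups this should list the six pairs, from "s1 s2" to "s3 s4", in the same order as the one-liner above.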


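  • Convert a fasta file into fastq, with dummy quality scores: this assumes that each sequence is written on a single line
$ awk 'BEGIN{RS=">"} {if(NF==0)next; split($0,a,"\n"); printf "@%s\n%s\n+\n", a[1], a[2]; \
for(i=1;i<=length(a[2]);i++)printf "}"; printf "\n"}' probes.fa > probes.fq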
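
The awk command above only uses the first line after each ">" header, so a sequence wrapped over several lines would be truncated. A minimal sketch that joins the wrapped lines first is given here; the file name probes_multiline.fa and the function name fasta_to_fastq are made up for the illustration, and the "}" quality character is simply the placeholder used above.

```shell
# Toy fasta file (hypothetical) in which the first sequence is wrapped over two lines.
printf '>probe1\nACGT\nAC\n>probe2\nGGCC\n' > probes_multiline.fa

# Hypothetical helper: convert a fasta file to fastq with a constant dummy quality.
fasta_to_fastq() {
    awk 'BEGIN { RS = ">" }
        NF == 0 { next }                              # empty record before the first ">"
        {
            n = split($0, lines, "\n")                # lines[1] is the header
            seq = ""
            for (i = 2; i <= n; i++) seq = seq lines[i]          # join wrapped sequence lines
            qual = ""
            for (i = 1; i <= length(seq); i++) qual = qual "}"   # one "}" per base
            printf "@%s\n%s\n+\n%s\n", lines[1], seq, qual
        }' "$1"
}

fasta_to_fastq probes_multiline.fa
```

Each record should come out as four lines: the header prefixed with "@", the joined sequence, a "+" line, and a quality string of the same length as the sequence.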


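  • Sort a file with a header line: that is, we don't want the first line to be sorted; the first pair of parentheses groups the header and the data so that both go through the pipe, otherwise the header would bypass it and the first data row would be left unsorted
$ (echo -e "x\ty"; for i in {1..10}; do echo -e "$i\t$RANDOM"; done) | (read -r; printf "%s\n" "$REPLY"; sort -k2,2n)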
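
The same idea can be wrapped in a small reusable function, sketched here under the assumption that the header is exactly one line. The name sort_keep_header is a hypothetical helper; its arguments are passed as-is to sort.

```shell
# Hypothetical helper: copy the first line of standard input unchanged,
# then sort the remaining lines with the options given as arguments.
sort_keep_header() {
    IFS= read -r header          # consume only the header line from the stream
    printf '%s\n' "$header"
    sort "$@"                    # sort reads whatever is left on standard input
}

printf 'x\ty\na\t3\nb\t1\nc\t2\n' | sort_keep_header -k2,2n
```

This should print the header "x y" first, then the rows b, c and a, ordered by their second column.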


  • Get rows from a big file which are also in a small file: example of using awk with 2 input files; awk first loads the keys of the small file into an array in memory, then reads the big file line by line and prints each line whose key is in the array
$ echo -e "gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1" > file_all.txt
$ echo -e "gene1\tsnp1" > file_subset.txt
$ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' file_subset.txt <(sed 1d file_all.txt)
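
The opposite selection (rows of the big file which are not in the small file) follows the same two-file pattern. A minimal sketch is given here; it also keeps the header of the big file and avoids the process substitution, and the function name rows_not_in_subset is a hypothetical helper.

```shell
# Rebuild the two toy files used in the example above.
printf 'gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1\n' > file_all.txt
printf 'gene1\tsnp1\n' > file_subset.txt

# Hypothetical helper: $1 is the small file (gene, snp), $2 the big file with a header.
rows_not_in_subset() {
    awk -F'\t' '
        FNR == NR { seen[$1, $2]; next }   # small file: remember each (gene, snp) key
        FNR == 1 { print; next }           # big file: pass its header through
        !(($1, $2) in seen)                # print rows whose key was not in the small file
    ' "$1" "$2"
}

rows_not_in_subset file_subset.txt file_all.txt
```

On the toy files this should print the header and the two gene2 rows. Note that the FNR == NR test assumes the small file is not empty; with an empty first file, the big file would be mistaken for it.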


