User:Timothee Flutre/Notebook/Postdoc/2012/05/25

About one-liners in data wrangling

Motivation: After receiving raw data and before drawing robust conclusions, we (almost) always need to reformat them and extract a few key summary statistics. Fortunately, this activity, called data wrangling, is particularly quick and easy on GNU/Linux computers. For instance, using GNU utilities via the command-line interface, we can write a "one-liner": a sequence of tools in which the output of each tool is the input of the next. This is not only easy but also very powerful, as shown below.

Toolbox: usually available by default on GNU/Linux computers
- Bash
- AWK
- grep
- sed
- GNU coreutils (head, tail, cut, uniq, sort, tr, od, ...)
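
As a minimal illustration of chaining a few of these tools, the sketch below counts how often each value occurs in a column and lists the most frequent first. The function name "count_values" and the input values are made up for this example.

```shell
# Count occurrences of each input line, most frequent first.
# "count_values" is a hypothetical helper name; the input below is made up.
count_values() {
  # sort groups identical lines, uniq -c counts them,
  # the second sort ranks by count, awk puts the value first
  sort | uniq -c | sort -k1,1nr -k2,2 | awk '{print $2"\t"$1}'
}

printf 'b\na\nb\nc\nb\na\n' | count_values
```

Each tool does one small job, and the pipe "|" hands its output to the next one.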

Tutorials:
- Introduction to the command-line
- Introduction to Unix by Oliver Elliott
- Introduction to text manipulation on UNIX-based systems by Brad Yoes (IBM)
- The Art of the Command Line by Joshua Levy
- Advanced Bash-Scripting Guide by Mendel Cooper
- ShellCheck (explainshell)
- tutorial for bedtools
- Linux survival kit (in French)

Skip a subset of successive lines:

$ for i in {1..10}; do echo $i; done | sed 3,6d

Extract a subset of successive lines:

$ for i in {1..20}; do echo $i; done | sed -n 3,5p

Use absolute values:

$ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}'

Summarize numbers: with R

$ for i in {1..10}; do echo $i; done | Rscript -e 'summary(read.table("stdin"))'
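
When R is not installed, a rough summary can also be obtained with awk alone. This is only a sketch: the function name "summarize" is made up, it reads one number per line, and it reports fewer statistics than R's summary() (no quartiles).

```shell
# Print count, minimum, maximum and mean of the first column read from stdin.
# "summarize" is a hypothetical helper name.
summarize() {
  awk 'NR==1 {min=$1+0; max=$1+0}
       {if($1+0 < min) min=$1+0; if($1+0 > max) max=$1+0; sum+=$1}
       END {if(NR > 0) printf "n=%d min=%g max=%g mean=%g\n", NR, min, max, sum/NR}'
}

seq 1 10 | summarize
```

For the numbers 1 to 10 this prints "n=10 min=1 max=10 mean=5.5".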

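Extract the best SNP per gene: that is, the SNP with the lowest p-value. We sort by gene, then numerically by p-value ("-g" also handles scientific notation), and keep the first line of each gene.

$ echo -e "gene\tsnp\tpvalue\ng1\ts1\t0.3\ng1\ts2\t0.002\ng2\ts2\t0.7\ng2\ts3\t0.05" > dat.txt
$ cat dat.txt
gene    snp     pvalue
g1      s1      0.3
g1      s2      0.002
g2      s2      0.7
g2      s3      0.05

$ sed 1d dat.txt | sort -k1,1 -k3,3g | awk '{print $3"\t"$2"\t"$1}' | uniq -f2 | awk '{print $3"\t"$2"\t"$1}'
g1      s2      0.002
g2      s3      0.05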
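
The same result can probably be obtained in a single awk pass, without sorting the whole file, by remembering the lowest p-value seen so far for each gene. This is a sketch under the assumption that the input has a header line and the columns gene, snp, pvalue; the function name "best_per_gene" is made up.

```shell
# Keep, for each gene (column 1), the row with the smallest p-value (column 3).
# "best_per_gene" is a hypothetical helper name; the header line is skipped.
best_per_gene() {
  awk 'NR > 1 && (!($1 in best) || $3 + 0 < best[$1]) { best[$1] = $3 + 0; row[$1] = $0 }
       END { for (g in row) print row[g] }' | sort -k1,1
}

printf 'gene\tsnp\tpvalue\ng1\ts1\t0.3\ng1\ts2\t0.002\ng2\ts2\t0.7\ng2\ts3\t0.05\n' | best_per_gene
```

Only one row per gene is held in memory, and the final sort only orders the few retained rows by gene.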

Loop over pairs:

$ subgroups=("s1" "s2" "s3" "s4"); for i in {0..2}; do let a=$i+1; for j in $(seq $a 3); do s1=${subgroups[$i]}; s2=${subgroups[$j]}; echo $s1 $s2; done; done
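
The nested loop above can also be written without array indices, by repeatedly removing the first item and pairing it with all remaining ones. A minimal sketch, where the function name "pairs" is made up:

```shell
# Print every unordered pair of the arguments, one pair per line.
# "pairs" is a hypothetical helper name.
pairs() {
  while [ "$#" -gt 1 ]; do
    first=$1
    shift                      # the remaining arguments are the possible partners
    for other in "$@"; do
      echo "$first $other"
    done
  done
}

pairs s1 s2 s3 s4
```

This version works for any number of items, since it does not hard-code the index range.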

Convert file from fasta to fastq: we can use the built-in variable "RS" (record separator) and the string function "split". Each sequence is assumed to be on a single line, and "}" is used as a dummy quality score:

$ awk 'BEGIN{RS=">"} {if(NF==0)next; split($0,a,"\n"); printf "@%s\n%s\n+\n", a[1], a[2]; \
for(i=1;i<=length(a[2]);i++)printf "}"; printf "\n"}' probes.fa > probes.fq
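
If sequences are wrapped over several lines, the lines of each record can first be joined. The sketch below does this, under the assumption that a constant placeholder quality ("}", as above) is acceptable; the function name "fasta_to_fastq" and the example records are made up.

```shell
# Convert fasta (possibly multi-line sequences) read from stdin into fastq.
# "fasta_to_fastq" is a hypothetical helper name; "}" is a dummy quality score.
fasta_to_fastq() {
  awk 'BEGIN { RS = ">" }
       NF == 0 { next }                             # skip the empty record before the first ">"
       {
         n = split($0, a, "\n"); seq = ""
         for (i = 2; i <= n; i++) seq = seq a[i]    # join wrapped sequence lines
         qual = seq; gsub(/./, "}", qual)           # one quality character per base
         printf "@%s\n%s\n+\n%s\n", a[1], seq, qual
       }'
}

printf '>p1\nACGT\nAC\n>p2\nGG\n' | fasta_to_fastq
```

Passing the header and sequence as printf arguments, rather than as the format string, avoids surprises when a header contains a "%" character.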

Extract sequence from fasta:

$ echo -e ">chr1\nAAA\n>chr2\nTTT\n>chr3\nGGG\n" | awk 'BEGIN{RS=">"} $1=="chr2" {printf ">%s", $0}'

Sort a file with a header line: that is, keep the first line in place and sort only the remaining lines

$ { echo -e "x\ty"; for i in {1..10}; do echo -e $i"\t"$RANDOM; done; } | (read -r; printf "%s\n" "$REPLY"; sort -k2,2n)

Get rows from a big file which are also in a small file: example of using awk with 2 input files. The key columns of the small file are first loaded into an array in memory; the big file is then parsed line by line, and each line is printed if its key is in the array.

$ echo -e "gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1" > file_all.txt
$ echo -e "gene1\tsnp1" > file_subset.txt
$ awk 'NR==FNR{a[$1,$2]++;next;}{if(($1,$2) in a)print $0}' file_subset.txt <(sed 1d file_all.txt)

Get length of each sequence in a fasta file:

$ awk 'BEGIN{RS=">"} {n=split($0,a,"\n"); if(n==0) next; seqlen=0; for(i=2;i<=n;++i){seqlen += length(a[i])}; printf "%s\t%d\n", a[1], seqlen}' sequences.fa

Get the bases 6 to 9 of each sequence in a fastq file: provided that each read uses exactly 4 lines

$ zcat reads.fq.gz | awk '(NR % 4 == 2)' | cut -c 6-9

Reverse-complement a DNA sequence:

$ echo "AAATGAGCC" | rev | tr ATGC TACG
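
The command above only translates upper-case bases. A small variant, sketched here with the made-up function name "revcomp", also handles lower-case bases and leaves any other character (such as "N") unchanged:

```shell
# Reverse-complement DNA read from stdin, one sequence per line.
# "revcomp" is a hypothetical helper name.
revcomp() {
  rev | tr 'ACGTacgt' 'TGCAtgca'
}

echo "AAATGAGCC" | revcomp
```

For the sequence above this prints "GGCTCATTT".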

Identify a non-breaking space:
- download additional file 4 (Table S3) of this article
- open it with LibreOffice Calc
- save it as "Text CSV" with "Character set = Unicode (UTF-8)", "Field delimiter = {Tab}", "Text delimiter = " (i.e. empty), and keep "Save cell content as shown" as checked
- play with the following commands (and look at the octal byte values: "302 240" is the UTF-8 encoding of the non-breaking space):

$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | od -An -c -b
   L   i   s   z   t   e   s 302 240   f   e   h   e   r  \n
 114 151 163 172 164 145 163 302 240 146 145 150 145 162 012
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b
   L   i   s   z   t   e   s       f   e   h   e   r  \n
 114 151 163 172 164 145 163 040 146 145 150 145 162 012
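
To check a whole file rather than a single cell, we can count the lines containing the two bytes identified above and replace them everywhere. This is a sketch on made-up input; the names "nbsp", "count_nbsp_lines" and "strip_nbsp" are hypothetical.

```shell
# The UTF-8 non-breaking space, built from its two octal byte values.
nbsp=$(printf '\302\240')

# Count the input lines that contain at least one non-breaking space.
count_nbsp_lines() { grep -c "$nbsp"; }

# Replace every non-breaking space by a regular space.
strip_nbsp() { sed "s/$nbsp/ /g"; }

printf 'Liszt\302\240feher\nplain text\n' | count_nbsp_lines
printf 'Liszt\302\240feher\n' | strip_nbsp
```

Building the character with printf keeps the command readable in terminals and editors that display a non-breaking space like a regular one.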

Extract substring based on regex: uses the third argument of match() to capture regex groups, which is specific to GNU awk

$ echo "project_all-lanes/H3NHKBBXX_7/demultiplex/H3NHKBBXX_7_A3-30-10-10_R1.fastq.gz" | awk '{match($0, /([a-zA-Z0-9-]*)_(R[12])/, a); print a[1]}'
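
Where GNU awk is not available, a similar extraction can be sketched with sed and extended regular expressions. The function name "sample_name" is made up, and the pattern assumes, as above, that the sample name is followed by "_R1" or "_R2".

```shell
# Print the run of letters, digits and hyphens that precedes "_R1" or "_R2".
# "sample_name" is a hypothetical helper name; lines without a match print nothing.
sample_name() {
  sed -n -E 's/(^|.*[^A-Za-z0-9-])([A-Za-z0-9-]+)_R[12].*/\2/p'
}

echo "project_all-lanes/H3NHKBBXX_7/demultiplex/H3NHKBBXX_7_A3-30-10-10_R1.fastq.gz" | sample_name
```

For the path above this prints "A3-30-10-10".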
