User:Timothee Flutre/Notebook/Postdoc/2012/05/22

From OpenWetWare

One-liners to split an input directory

(I spent some time thinking about this, so it may be useful to others.)

  • A joint analysis combines the results of several separate analyses. The program implementing such a joint analysis therefore needs the results of each separate analysis as input. When each separate analysis outputs a lot of results (i.e. a large file), it is not recommended to gather the results from all separate analyses into a single file, because this file is likely to be enormous. A solution is to give, as input to the joint-analysis program, each individual file from each separate analysis. However, if there are many separate analyses to combine, one doesn't want to write a kilometer-long command line... A better solution is to give a directory as input to the joint-analysis program, which then lists its contents to determine how many separate analyses have to be combined.
  • Furthermore, it is often the case that the joint analysis can be performed in parallel. In that case, one needs to split the results from each separate analysis coherently, and put them into as many directories as one wants to parallelize over. Below are two one-liners that do this efficiently.
  • Let's assume we want to combine S separate analyses. The directory "original_inputs/" contains S files (the first line has sample names, the first column has gene names, and each subsequent line contains expression values). We need to split each of these S files into chunks of N lines (here N=100, but it can be any number). In the end, if S=3, each original file contains 5000 data lines and N=100, we obtain 3 x 50 = 150 "split" files, each with 101 lines (don't forget the header!).
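To try the one-liners below without real data, one can first generate a toy "original_inputs/" directory. The file names, sample names, and expression values here are made up for illustration (here S=2 and 250 data lines per file, just to keep it small):

```shell
# toy input directory: S=2 separate analyses, each file being
# 1 header line (sample names) + 250 data lines (gene + expression values)
mkdir -p original_inputs
for s in 1 2; do
  {
    printf "gene\tsampleA\tsampleB\n"      # first line: sample names
    for i in $(seq 1 250); do
      printf "gene%d\t0.5\t1.5\n" "$i"     # dummy expression values
    done
  } > original_inputs/analysis${s}.txt
done
```

With N=100, each of these two files will yield 3 split files (100 + 100 + 50 data lines).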
mkdir split_inputs
cd split_inputs
ls ../original_inputs/* | while read f; do \
echo $f; \
head -1 $f > header.txt; \
sed 1d $f | split -d -l 100 - $(basename ${f})_split; \
a=($(ls $(basename ${f})*split*)); echo "nb split: "${#a[@]}; \
for g in ${a[@]}; do \
cat header.txt $g > tmp.txt; \
mv tmp.txt $g; \
done; \
rm header.txt; \
done

  • Now we create one directory per split index, and move into it the corresponding split file from each separate analysis.
ls *_split* | awk -F'_' '{print $NF}' | sort | uniq \
| while read i; do \
mkdir $i; \
mv *_${i} $i; \
done
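As a quick sanity check (not part of the original one-liners), each resulting directory should contain exactly S files, one per separate analysis. A self-contained toy version of this check, faking the layout with made-up file names in a hypothetical "check_demo/" directory:

```shell
# fake the layout produced by the one-liners above: S=2 analyses,
# two split indices (00 and 01), one file per analysis in each directory
mkdir -p check_demo/split00 check_demo/split01
touch check_demo/split00/analysis1.txt_split00 check_demo/split00/analysis2.txt_split00
touch check_demo/split01/analysis1.txt_split01 check_demo/split01/analysis2.txt_split01

S=2
for d in check_demo/split*/; do
  n=$(ls "$d" | wc -l | tr -d ' ')       # number of files in this directory
  echo "$d contains $n files"
  [ "$n" -eq "$S" ] || echo "WARNING: expected $S files in $d"
done
```

If a warning is printed, some split files were probably not moved, e.g. because a directory with the same name already existed.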