Butlin:Unix for Bioinformatics - advanced tutorial

From OpenWetWare
Jump to navigationJump to search

Overview

This session will be more challenging if you are still new to Unix. However, it’s main aim is to demonstrate that Unix is much more than an environment in which you can execute big programmes like `bwa` or `samtools`. Rather, with it’s suite of text searching, extraction and manipulation programmes, it is itself a powerful tool for the many small bioinformatic tasks that you will encounter during everyday work. This module expects that you have done the steps in the Basic Unix module, i. e. it expects that you have already created certain folders and have copied the example data into your account. There are still small (hopefully not large) bugs lurking in this protocol. Please help improve it by adding comments.

Before you start

Log into your iceberg account and change to the directory NGS_workshop, that you created in the Basic Unix module:

cd NGS_workshop/Unix_module


I have downloaded my illumina read data. Now I want to know how many reads my sequence run has yielded.

zless sequence.fastq.gz
zcat sequence.fastq.gz | wc -l
man wc

gives you the number of lines in your sequence file, which is the number of reads times four for fastq format. Note, you can usually avoid uncompressing data files, which saves disk space.


I have a .fastq file with raw sequences from a RAD library. How can I find out how many of the reads contain the correct sequence of the remainder of the restriction site?

Let's assume you had 5bp long barcode sequences incorporated into the single-end adapters, which should show up at the beginning of each sequence read. Let's also assume that you have used the restriction enzyme SbfI for the creation of the library, which has the following recognition sequence: CCTGCAGG . So the correct remainder of the restriction site that you expect to see after the 5bp barcode is "TGCAGG". First have a look at your fastq file again:

zless -N sequence.fastq.gz

Each sequence record contains four lines. The actual sequences are on the 2nd, 6th, 10th, 14th, 18th line and so on. The following can give you the count:

zcat sequences.fastq.gz | perl -ne 'print unless ($.-2)%4;'  | grep -c "^.....TGCAGG"

This is a pipeline which first uncompresses your sequence read file and pipes it into the perl command, which extracts only the DNA sequence part of each fastq record. “$.” in perl stands for the current line number, “%4” returns modulo 4 of the current line number minus 2. If that results in a floating point number (non-interger), then the line is not printed out. Get it?

zcat sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | less

Grep searches each line from the output of the perl command for the regular expression given in quotation marks. ^ stands for the beginning of a line. A dot stands for any single character. There are five dots, because the barcodes are all 5 base pairs long. The -c switch makes grep return the number of lines in which it has found the search pattern at least once.

man grep
man perlrun
zcat sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | grep  "^.....TGCAGG" | less

In less, type /^.....TGCAGG then hit enter.


I have split my reads by barcode and I have quality filtered them. Now I want to know how many reads I have left from each (barcoded) individual. How can I find that out?

for file in *.fq; \
do echo -n "$file " >> retained; \
cat $file | perl -ne 'print unless ($.-2)%4;' | wc -l >> retained; \
done &
less retained

This bash ‘for’ loop goes sequentially over each file in the current directory which has the file ending ".fq". It prints that file name to the output file "retained". The >> redirection character makes sure that all output is appended to the file “retained”. Otherwise only the output from the last command in the loop and from the last file in the list of files would be stored in the file “retained”. The ampersand & at the end of the command line sends it into the background, which means you get your command line prompt back immediately.


I have 30 output files from a programme, but their names are not informative. I want to insert the keyword "cleaned" into their names? How can I do this?

Renaming files is a very common and important task. I’ll show you a simple and fast way to do that. There are as always many ways to solve a task. First, you could type 30 mv commands ⇒ that’s the fastest way to rename a single file but doing that 30 times is very tedious and error prone. Second, you could use a bash 'for' loop as above in combination with echo and the substitution command sed, like this:

for file in *.out; do mv $file `echo $file | sed 's|^\(.*\)\(\.out\)$|\1_cleaned\2|'`; done

but that's a bit tricky if you are not using sed every day. Note the two backticks characters, one before echo, the other right before the last semicolon. Everything in backticks will be evaluated by bash and then replaced by it’s output. So in this case it’s the modified file name that is provided as second argument to the mv command. If you want to find out what the sed command does, take a look at the beginning of this sed tutorial. The best way, however, to solve the task is by downloading the perl rename script, put it into a directory that is in your PATH (i. e. “~/prog”) and make that text file executable with chmod. You can download Larry Wall's rename.pl script from here with 'wget':

cd ~/src
wget http://tips.webdesign10.com/files/rename.pl.txt

or use firefox if you are in an interactive session.

mv rename.pl.txt ~/prog/rename.pl

Note, this cut/pastes and renames at the same time.

ll ~/prog

Let’s call the programme:

rename.pl

You should get “Permission denied”. That’s because we first have to tell bash that we permit its execution:

chmod u+x ~/prog/rename.pl
man chmod

The “u” stands for “user” (that’s you) and “x” stands for “executable”. Let’s try again:

rename.pl

How to make the documentation in the script file visible?

cd ~/prog
pod2man rename.pl > rename.pl.1
mv rename.pl.1 ~/man/man1

Then look at the documentation with:

man rename.pl

First let's create 30 empty files to rename:

cd
mkdir test
cd test
for i in {1..30}; do touch test_$i.out; done
ll


rename.pl -nv 's/^(.+)(\.out)/$1_cleaned$2/' *.out

You already know that bash expands *.out at the end of the command line into a list of all the file names in the current directory that end with “.out”. So the command does the renaming on all the 30 files we created. Let’s look at the stuff in single quotes. s turns on substitution, since we are substituting old file names with new file names. The forward slashes “/” separate the search and the replacement patterns. The search pattern begins with ^, which stands for the beginning of the file name. The dot stands for any single character. The + means “one or more of the preceding character”. So .+ will capture everything from the beginning of the file name until “.out”. In the second pair of brackets we needed to escape to the dot with a backslash to indicate that we mean dot in its literal sense. Otherwise, it would have been interpreted as representing any single character. Anything that matches the regular expressions in the brackets will be automatically saved, so that it can be called in the replacement pattern. Everything that matches the pattern in the first pair of brackets is saved to $1, everything that matches the pattern in the second pair of brackets is saved to $2. Fore more on regular expressions:

man grep
man perlrequick

Notes

Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input!

  1. List troubleshooting tips here.
  2. You can also link to FAQs/tips provided by other sources such as the manufacturer or other websites.
  3. Anecdotal observations that might be of use to others can also be posted here.

Please sign your name to your note by adding '''*~~~~''': to the beginning of your tip.

References

Relevant papers and books

  1. Goldbeter, A and Koshland, DE (1981) - Proc Natl Acad Sci U S A 78(11) 6840-4 PMID 6947258
  2. Jacob, F and Monod, J J (1961) - Mol Biol 3(3) 318-56 PMID 13718526
  3. Ptashne, M (2004) Genetic Switch: Phage Lambda Revisited - Cold Spring Harbor Laboratory Press ISBN 0879697164

Contact

  • Who has experience with this protocol?

or instead, discuss this protocol.