Butlin:Unix for Bioinformatics - advanced tutorial
<!--
<div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><big><font color="red">This tutorial is still under construction. Please come back.</font></big></div>
-->


==Overview==


This session will be more challenging if you are still new to Unix. However, its main aim is to demonstrate that Unix is much more than an environment in which you can execute big programmes like <code>bwa</code> or <code>samtools</code>. Rather, with its suite of text searching, extraction and manipulation programmes, it is itself a powerful tool for the many small bioinformatic tasks that you will encounter during everyday work. This module expects that you have done the steps in the [[Butlin:Unix for Bioinformatics - basic tutorial | Basic Unix module]], i.e. it expects that you have already created certain folders and have copied the example data into your account. There are still small (hopefully not large) bugs lurking in this protocol. Please help improve it by correcting those mistakes or adding comments to the [[Talk:Butlin:Unix for Bioinformatics - advanced tutorial | talk]] page. Many thanks.


==Before you start==


Log into your ''iceberg'' account. On the head node ''iceberg1'', type:


  $ qrsh


Then change to the directory ''NGS_workshop/Unix_module'' that you created in the [[Butlin:Unix_for_Bioinformatics_-_basic_tutorial | Basic Unix module]]:


  $ cd NGS_workshop/Unix_module


== TASK 3: I have downloaded my Illumina read data. Now I want to know how many reads my sequence run has yielded.==


$ zless Unix_tut_sequences.fastq.gz
$ zcat Unix_tut_sequences.fastq.gz | wc -l
$ man wc


gives you the number of lines in your sequence file, which is the number of reads times four for [http://en.wikipedia.org/wiki/FASTQ_format#Encoding fastq] format. Note, you can usually avoid uncompressing data files, which saves disk space.
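
If you want the read count itself rather than the line count, you can let <code>bash</code> do the division for you; a minimal sketch, assuming a well-formed fastq file with exactly four lines per record:

  $ echo $(( $(zcat Unix_tut_sequences.fastq.gz | wc -l) / 4 ))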


== TASK 4: I have a .fastq file with raw sequences from a RAD library. How can I find out how many of the reads contain the correct sequence of the remainder of the restriction site?==


Let's assume you had 5bp long barcode sequences incorporated into the single-end adapters, which should show up at the beginning of each sequence read. Let's also assume that you have used the restriction enzyme [http://www.neb.uk.com/productcatalogue/productinfotransfer.aspx?id=/New%20England%20Biolabs/Restriction%20Endonucleases/Restriction%20Endonucleases/R3642 SbfI] for the creation of the library, which has the following recognition sequence: <tt>CCTGCAGG</tt>. So the correct remainder of the restriction site that you expect to see after the 5bp barcode is <tt>TGCAGG</tt>. First have a look at your fastq file again:


  $ zless -N Unix_tut_sequences.fastq.gz


Each sequence record contains four lines. The actual sequences are on the 2nd, 6th, 10th, 14th, 18th line and so on. The following can give you the count:


  $ zcat Unix_tut_sequences.fastq.gz | perl -ne 'print unless ($.-2)%4;' | grep -c "^.....TGCAGG"


This is a pipeline which first uncompresses your sequence read file and pipes it into the perl command, which extracts only the DNA sequence part of each fastq record. <tt>$.</tt> in perl stands for the current line number, and <code>($.-2)%4</code> returns the current line number minus 2, modulo 4. The modulo operator returns 0 if the division leaves no remainder (i.e. if the line number minus 2 is a multiple of 4) and 1, 2 or 3 otherwise. Zero is equivalent to FALSE and any integer other than zero is equivalent to TRUE, so <code>print unless</code> prints exactly the 2nd, 6th, 10th lines and so on. Get it?


  $ zcat Unix_tut_sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | less
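
If the line-number logic is still unclear, you can test the one-liner on a toy input; <code>seq</code> simply prints the numbers 1 to 12, one per line, so only the 2nd, 6th and 10th lines should survive the filter:

  $ seq 1 12 | perl -ne 'print unless ($.-2)%4;'

which should print 2, 6 and 10.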


<code>grep</code> searches each line from the output of the <code>perl</code> command for the [http://en.wikipedia.org/wiki/Regular_expression regular expression] given in quotation marks. <code>^</code> stands for the beginning of a line. A dot stands for any single character. There are five dots, because the barcodes are all 5 base pairs long. The <code>-c</code> switch makes <code>grep</code> return the number of lines in which it has found the search pattern at least once.
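
If you additionally wanted to insist that the five barcode positions are proper bases (e.g. to exclude reads with an <tt>N</tt> in the barcode), a slightly stricter variant (an optional extra, not part of the recipe above) would be:

  $ zcat Unix_tut_sequences.fastq.gz | perl -ne 'print unless ($.-2)%4;' | grep -c -E "^[ACGT]{5}TGCAGG"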


  $ man grep
  $ man perlrun


  $ zcat Unix_tut_sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | grep "^.....TGCAGG" | less


In <code>less</code>, type:


 /^.....TGCAGG


then hit Enter.
 
== TASK 5: I have split my reads by barcode and I have quality filtered them. Now I want to know how many reads I have left from each (barcoded) individual. How can I find that out?==
 
  $ for file in *.fq; \
  do echo -n "$file " >> retained; \
  cat $file | perl -ne 'print unless ($.-2)%4;' | wc -l >> retained; \
  done &


  $ less retained


This bash <code>for</code> loop goes sequentially over each file in the current directory which has the file ending <tt>.fq</tt>. It prints that file name to the output file ''retained''. The <code>>></code> redirection character makes sure that all output is appended to the file ''retained''. Otherwise only the output from the last command in the loop and from the last file in the list of files would be stored in the file ''retained''. The ampersand <code>&</code> at the end of the command line sends it into the background, which means you get your command line prompt back immediately while the process is running in the background.
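
As an aside, the same counts can be produced without the <code>perl</code> filter by dividing each file’s raw line count by four; a minimal sketch, again assuming exactly four lines per fastq record:

  $ for file in *.fq; do echo "$file $(( $(wc -l < $file) / 4 ))"; done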


== TASK 6: I have 30 output files from a programme, but their names are not informative. I want to insert the keyword ''cleaned'' into their names. How can I do this?==


Renaming files is a very common and important task. I’ll show you a simple and fast way to do that. There are, as always, many ways to solve a task. First, you could type 30 <code>mv</code> commands <Big>→</Big> that’s the fastest way to rename a single file, but doing that 30 times is very tedious and error prone. Second, you could use a bash <code>for</code> loop as above in combination with <code>echo</code> and the substitution command <code>sed</code>, like this:


  $ for file in *.out; do mv $file `echo $file | sed 's|^\(.*\)\(\.out\)$|\1_cleaned\2|'`; done


but that's a bit tricky if you are not using <code>sed</code> every day. Note the two backtick characters: one before <code>echo</code>, the other right before the last semicolon. Everything in backticks will be evaluated by <code>bash</code> and then replaced by its output. So in this case it’s the modified file name that is provided as second argument to the <code>mv</code> command. If you want to find out what the <code>sed</code> command does, take a look at the beginning of this [http://www.grymoire.com/Unix/Sed.html#uh-0 <code>sed</code> tutorial].
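
If backticks are new to you, here is a harmless way to watch them work; <code>date</code> is evaluated first and its output becomes part of the argument to <code>echo</code>:

  $ echo "today is `date`"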


The best way, however, to solve the task is to download the perl <code>rename</code> script (note, this is not the system <code>rename</code> command that comes with Unix, which has very limited capabilities), put it into a directory that is in your PATH (e.g. ''~/prog'') and make that text file executable with <code>chmod</code>.
You can download Larry Wall's <code>rename.pl</code> script from [http://tips.webdesign10.com/files/rename.pl.txt here] with <code>wget</code>:


  $ cd ~/src
  $ wget http://tips.webdesign10.com/files/rename.pl.txt


or use Firefox if you are in an interactive session.


  $ mv rename.pl.txt ~/prog/rename.pl


Note, this cut/pastes <u>and</u> renames at the same time.


  $ ll ~/prog


Let’s call the programme:
  $ rename.pl


You should get <tt>Permission denied</tt>. That’s because we first have to tell <code>bash</code> that we permit its execution:


  $ chmod u+x ~/prog/rename.pl
  $ man chmod


The <code>u</code> stands for ''user'' (that’s you) and <code>x</code> stands for ''executable''. Let’s try again:


  $ rename.pl


How do you make the documentation embedded in the script file visible?


  $ cd ~/prog
  $ pod2man rename.pl > rename.pl.1
  $ mv rename.pl.1 ~/man/man1


Then look at the documentation with:


  $ man rename.pl


First let's create 30 empty files to rename:
      
      
  $ cd
  $ mkdir test
  $ cd test
  $ for i in {1..30}; do touch test_$i.out; done
  $ ll


  $ rename.pl -nv 's/^(.+)(\.out)/$1_cleaned$2/' <span style="color: green">*.out</span>


You already know that bash expands <span style="color: green">*.out</span> at the end of the command line into a list of all the file names in the current directory that end with <tt>.out</tt>. So the command does the renaming on all the 30 files we created. Let’s look at the stuff in single quotes. <code>s</code> turns on substitution, since we are substituting old file names with new file names. The forward slashes <code>/</code> separate the search and the replacement patterns. The search pattern begins with <code>^</code>, which stands for the beginning of the file name. The dot stands for any single character. The <code>+</code> means ''one or more of the preceding character''. So <code>.+</code> will capture everything from the beginning of the file name until <tt>.out</tt>. In the second pair of brackets we needed to escape the dot with a backslash to indicate that we mean dot in its literal sense. Otherwise, it would have been interpreted as representing any single character. Anything that matches the regular expressions in the brackets will be automatically saved, so that it can be called in the replacement pattern. Everything that matches the pattern in the first pair of brackets is saved to <code>$1</code>, everything that matches the pattern in the second pair of brackets is saved to <code>$2</code>. For more on regular expressions, look up:


  $ man grep
  $ man perlrequick
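
A note on the two switches in the <code>rename.pl</code> call above: <code>-n</code> means ''no action'' (only show what would be renamed) and <code>-v</code> prints each renaming, so the call above was just a safe preview. Once the preview looks right, run the command again without <code>-n</code> to actually rename the files:

  $ rename.pl -v 's/^(.+)(\.out)/$1_cleaned$2/' *.out
  $ ll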


== TASK 7: I have got a text file originating from a Windows or an old Mac text editor. But when I open it in <code>less</code>, all lines are concatenated into one line, with the strange <tt>^M</tt> character repeated throughout. How can I fix this?==


 
  $ cd ~/NGS_workshop/Unix_module/TASK_7
  $ ll
  $ cat -v text_file_from_Windows.txt


Unix uses only linefeeds (<tt>\n</tt>) at the ends of lines, old Macs used only carriage returns (<tt>\r</tt>), and Windows uses both characters together to represent one line break. The following will remove the carriage returns from the ends of lines, leaving only linefeeds:


  $ tr -d '\r' < text_file_from_Windows.txt > file_from_Windows_fixed.txt
  $ cat -v file_from_Windows_fixed.txt
  $ man tr


Note, most Unix programmes print their output to STDOUT, which is the screen. If you want to save the output, it needs to be redirected into a new file. <span style="color: red">Never redirect output back into the input file, like this</span>:


  $ tr -d '\r' < text_file_from_Windows.txt > text_file_from_Windows.txt
  $ cat text_file_from_Windows.txt


You’ve just clobbered your input file.

  $ less text_file_from_Mac.txt
The old Mac text file has just one carriage return at the end of each line instead of a linefeed. That’s why all lines of the file are concatenated into one line. To fix this:


  $ tr '\r' '\n' < text_file_from_Mac.txt > file_from_Mac_fixed.txt
  $ less file_from_Mac_fixed.txt


As usual there are several ways to achieve the same result:


  $ sed 's/\r$//' < text_file_from_Windows.txt > file_from_Windows_fixed.txt
  $ dos2unix text_file_from_Windows.txt
  $ mac2unix text_file_from_Mac.txt


The last two programmes are available on ''iceberg'', but are not by default included in a Unix system. You would have to install them yourself. Note also that <code>dos2unix</code> and <code>mac2unix</code> do in-place editing, i.e. the old version will be overwritten by the new version.
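
By the way, a quick way to check what kind of line endings a file has, without opening it, is the <code>file</code> command, which typically reports something like ''ASCII text, with CRLF line terminators'' for a Windows file:

  $ file text_file_from_Windows.txt
  $ file text_file_from_Mac.txt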


== TASK 8: I have just mapped my reads against the reference sequences and I got a BAM file for each individual. How can I find out the proportion of reads from each individual that got mapped successfully?==


Note: it seems that the <code>samtools view</code> command should be able to do that with the <code>-f</code> and <code>-c</code> switches. However, in my trials it only returned the total number of reads in the BAM file, i.e. including those that did not get mapped. Fortunately, this is a really easy Unix task. First let’s have a look at the .bam file:


  $ cd ~/NGS_workshop/Unix_module/TASK_8
  $ samtools view alignment.bam | less


If that fits on your screen without line wrapping ... lucky bastard! If not, just turn off line wrapping in <code>less</code>:
      
      
  $ samtools view alignment.bam | less -S


Use the arrow keys on your keyboard to scroll left and right. The second column contains a numerical flag which indicates the result of the mapping for that read. A flag of 0 stands for a successful mapping of the read. A flag of 16 also stands for a successful mapping, but of the reverse complement of the read. So in order to get the number of reads that got mapped successfully, we need to count the lines with a 0 or a 16 in their second column. Let’s cut out the second column of the tab delimited .bam file:


  $ samtools view alignment.bam | cut -f 2 | less
  $ man cut


Next, we have to sort this column numerically in order to collapse it into unique numbers:


  $ samtools view alignment.bam | cut -f 2 | sort -n | uniq -c
  $ man sort


With the <code>-c</code> switch to <code>uniq</code> the output is a contingency table of the flags from column 2 that can be found in your alignment file. However, some reads with a 0 flag in the second column still have a mapping quality of 0 (don’t ask me why), which is in the 5th column. So, in order to get a count of the number of reads that got mapped successfully with a mapping quality above 0, use <code>gawk</code>:


  $ samtools view alignment.bam | awk '($2==0 || $2==16) && $5>0' | wc -l
  $ man awk


<code>'''G'''awk</code> is the GNU version of <code>awk</code>. Macs ship only with <code>awk</code>. Both are almost identical except that <code>awk</code> can’t handle large files (since we are mostly working on one input line at a time, this rarely makes a practical difference). <code>$2</code> stands for the content in the second column, <code>$5</code> for the fifth, <code>||</code> means logical OR and <code>&&</code> means logical AND. <code>Awk</code> prints only those lines to output which match our conditions. We then simply count those lines.
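
Since the task asked for the ''proportion'' of reads that got mapped, you can finish off with a little shell arithmetic; a minimal sketch using <code>bc</code> for the floating point division:

  $ total=$(samtools view alignment.bam | wc -l)
  $ mapped=$(samtools view alignment.bam | awk '($2==0 || $2==16) && $5>0' | wc -l)
  $ echo "scale=3; $mapped / $total" | bc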


== TASK 9: I have got a big multi-fasta file with thousands of sequences. How can I extract only those fasta records whose sequences contain at least 3 consecutive repeats of the dinucleotide <tt>AG</tt>?==


Believe it or not, this task can be solved with a combination of Unix commands, i.e. no programming is required. Let’s approach the task step by step. First have a look at the input file:


  $ cd ~/NGS_workshop/Unix_module/TASK_9
  $ less multi_fasta.fa


We could just use the programme <code>grep</code> to search each line of the fasta file for our pattern (i.e. 3 consecutive <tt>AG</tt>), but since <code>grep</code> is line based we would lose the fasta headers and some part of the sequence in the process. So, we somehow have to get the fasta headers and their corresponding sequences on the same line.


  $ tr '\n' '@' < multi_fasta.fa | less


Everything is on one line now (but <span style="color:red">don’t do this with really large files</span> as this causes the whole file to be read into memory).


  $ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | less


Note the <code>g</code> at the end of the <code>sed</code> command, which stands for global. With the global option <code>sed</code> does the replacement for every occurrence of a search pattern in a line, not just for the first.


  $ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | less


Ok, we are almost ready to search, but we first need to get rid of the @ sign in the middle of the sequences.


  $ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/g' | less


In the <code>sed</code> command the search and the replacement patterns are enclosed in forward slashes <tt>/</tt>. Anything that matches the pattern between <code>\(</code> and <code>\)</code> will be stored by <code>sed</code> and can be called again in the replacement pattern. The square brackets <code>[ ]</code> mean ''match any one character that is enclosed by them''. In the replacement pattern <code>\1</code> stands for the base before the @ sign, <code>\2</code> stands for the base after the @ sign.
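
You can watch the backreferences at work on a single artificial line; the two captured bases are glued back together without the <tt>@</tt>:

  $ echo "ACGT@ACGT" | sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/g'

Finally, let’s search for the microsats: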


  $ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | \
  sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/g' | egrep "@.*(AG){3,}.*@" | less


Note the use of <code>egrep</code> instead of just <code>grep</code> for extended regular expressions. The search pattern is enclosed in quotation marks. Our sequences are delimited by @’s on each side. The dot stands for any single character. The asterisk <code>*</code> means ''zero or more of the preceding character''. So <code>.*</code> could match exactly nothing or anything else. The <code>{3,}</code> means ''3 or more times the preceding character''. Without the brackets around <tt>AG</tt> this would only refer to the <tt>G</tt> in <tt>AG</tt>.
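
Again, it is worth testing such a pattern on toy input before trusting it; the first line below is printed (three consecutive <tt>AG</tt>’s), the second is not:

  $ echo "@TTAGAGAGTT@" | egrep "@.*(AG){3,}.*@"
  $ echo "@TTAGAGTT@" | egrep "@.*(AG){3,}.*@"

Now that we have all sequences with ≥3x <tt>AG</tt> microsats, let’s get them back into fasta format again: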


  $ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | \
  sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/g' | egrep "@.*(AG){3,}.*@" | \
  tr '@' '\n' | less
And let’s get rid of empty lines:


  $ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | \
  sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/g' | egrep "@.*(AG){3,}.*@" | \
  tr '@' '\n' | grep -v "^$" > AGAGAG.fa


The <code>grep</code> search pattern contains the regular expression symbols for the beginning and the end of the line with nothing in between <Big>→</Big> an empty line. The <code>-v</code> switch inverts the matching.
 
Open ''AGAGAG.fa'' in <code>less</code>, and type:
 
 /(AG){3,}
 
and hit Enter. Each sequence should have a highlighted match.  
The syntax of regular expressions for <code>grep</code>, <code>sed</code> and <code>less</code> (which are very similar) is one of the most useful things you can learn. The most flexible regular expressions, however, are provided by Perl. Finally, let’s count how many sequences are in the output file:
      
      
  $ grep -c "^>" AGAGAG.fa


== TASK 10: I have a large table from the output of the programme <tt>stacks</tt> containing the condensed information for each reference tag. Each column is tab delimited. The first column contains the tag ids. Another column the reference sequence for the tag. How can I create a multi-fasta file from this table with the tag ids as fasta headers?==


First, convince yourself that the table is actually tab delimited:
      
      
  $ cd ~/NGS_workshop/Unix_module/TASK_10
  $ cat -T stacks_output.tsv | less -S
      
      
Tabs will be replaced by <tt>^I</tt>. In which column is the ''Consensus sequence'' of each tag? We want the ''Catalog ID'' as fasta header. Let’s first extract the two columns we need:


  $ cut -f 1,5 stacks_output.tsv | less


The first line of the output contains the column headers of the input table. We don’t want them in the fasta file. So let’s remove this line:


  $ cut -f 1,5 stacks_output.tsv | tail -n +2 | less
  $ man tail


Now let’s insert a <tt>></tt> in front of the tag ids in order to mark them as the fasta headers.


  $ cut -f 1,5 stacks_output.tsv | tail -n +2 | sed 's/\(.*\)/>\1/' | less


In the <code>sed</code> command <code>\(.*\)</code> captures a whole line, which we call in the replacement pattern with <code>\1</code>. Finally we have to replace the tab that separates each header from its sequence by a return character in order to bring each sequence on the line below its header.
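
On a toy line the two steps look like this (a minimal sketch; <tt>17</tt> plays the role of a tag id and <code>printf</code> supplies the tab):

  $ printf '17\tACGT\n' | sed 's/\(.*\)/>\1/' | tr '\t' '\n'

which prints <tt>>17</tt> on one line and <tt>ACGT</tt> on the next. Applied to the real table: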


  $ cut -f 1,5 stacks_output.tsv | tail -n +2 | sed 's/\(.*\)/>\1/' | \
  tr '\t' '\n' | less


If you’re satisfied with the result, then redirect the output of <code>tr</code> into an output file. But what if you also wanted the multi fasta file to be sorted by tag id?


  $ cut -f 1,5 stacks_output.tsv | tail -n +2 | <span style="color:green">sort -nk 1,1</span> | \
  sed 's/\(.*\)/>\1/' | tr '\t' '\n' | less


With the <code>sort</code> command we turn on numerical sort with <code>-n</code> and the <code>-k</code> switch lets us specify on which column the sorting should be done (by default, <code>sort</code> would use the whole line for sorting).
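
To see why the column specification matters, compare the default (lexical, whole-line) sort with a numerical sort on the first column; the first command puts 10 before 2, the second orders them numerically:

  $ printf '10\tAA\n2\tCC\n' | sort
  $ printf '10\tAA\n2\tCC\n' | sort -nk 1,1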
 
== TASK 11: I want to map the reads from 96 individuals against a reference sequence (e.g. partial or full reference genome or transcriptome, etc.). A mapping programme takes one of the 96 input files and tries to find locations for each read in the reference sequence. How can I parallelise this task and thus get the job done in a fraction of the time?==


<span style="color: green">Note, this task makes use of a computer cluster running the job scheduler SGE. It also requires that you have the mapping programme [http://www.well.ox.ac.uk/project-stampy stampy] installed in a location that is in your PATH. Check this with:</span>
 
  $ stampy


Current hardware in the Iceberg computer cluster:


… that’s 96 × 4 + 31 × 8 + 96 × 12 = 1,784 CPUs !!! Let’s try to use up to 96 of them at the same time.


  $ cd ~/NGS_workshop/Unix_module/TASK_11
  $ ll ind_seqs


There are 96 fastq files with the reads from 96 individuals in this folder. All input files contain a number from 1 to 96 in their names. Otherwise, their names are identical. We now want to map the reads from those individuals against the ''reference_seq''. We have already prepared a so-called ''array'' job submission script for you. Let’s have a look at it.


  $ nano array_job.sh


The lines starting with <code>#$</code> are specific to the SGE. The first requests 2 Gigabytes of memory. The second asks for slightly less than 1 hour to complete each task of the job. Any job or task taking less than 8 hours will be submitted to the short queue by the SGE and there is almost no waiting time for this queue. The next two lines specify that you get an email when each task has begun and ended. <code>-j y</code> saves STDERR and STDOUT from each task in one file. Change the following two lines appropriately:
 
  # tell the SGE where to find your executable
  export PATH=/usr/local/extras/Genomics/Applications/bin:$PATH
 
  # change into the directory where the input files are
  cd /home/your_login_name/NGS_workshop/Unix_module/TASK_11/ind_seqs
 
 
The important line is the following:


  #$ -t 1-96
  -o ind_$SGE_TASK_ID.sam


This is the actual command line that executes <code>stampy</code>, our mapping programme. The <code>-M</code> switch to <code>stampy</code> takes the input file, the <code>-o</code> switch the output file name. Submitting this array job script to the SGE scheduler is equivalent to submitting 96 different job scripts, each with an explicit number instead of <tt>$SGE_TASK_ID</tt>.
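
For orientation, a complete script of this kind might look roughly as follows. This is a minimal sketch, not a copy of the workshop’s script: the exact resource flags (<code>-l mem</code>, <code>-l h_rt</code>) vary between clusters, and the <code>stampy</code> options <code>-g</code> and <code>-h</code> assume that a genome index and hash were built from the reference beforehand; only <code>-M</code>, <code>-o</code> and <code>-t 1-96</code> are taken from the text above.

  #!/bin/bash
  #$ -l mem=2G          # request 2 Gigabytes of memory per task (assumed flag)
  #$ -l h_rt=00:59:00   # slightly less than 1 hour per task (assumed flag)
  #$ -m be              # email when each task begins and ends
  #$ -M your_email_address  # (use your own address)
  #$ -j y               # save STDERR and STDOUT of each task in one file
  #$ -t 1-96            # array job: tasks numbered 1 to 96
 
  # tell the SGE where to find your executable
  export PATH=/usr/local/extras/Genomics/Applications/bin:$PATH
 
  # change into the directory where the input files are
  cd ~/NGS_workshop/Unix_module/TASK_11/ind_seqs
 
  # each task maps the reads of one individual; $SGE_TASK_ID runs from 1 to 96
  stampy -g ../reference_seq -h ../reference_seq \
    -M ind_$SGE_TASK_ID.fq \
    -o ind_$SGE_TASK_ID.sam

After exiting <code>nano</code>, let’s submit the job.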


  $ qsub array_job.sh
  $ Qstat




This is the end of the advanced Unix session. If you’ve made it until here CONGRATULATIONS !!!


==Notes==


Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input!


Please sign your name to your note by adding <font face="courier"><nowiki>'''*~~~~''':</nowiki></font> to the beginning of your tip.
 


<!-- You can tag this protocol with various categories.  See the [[Categories]] page for more information. -->


[[Category:Protocol]]
[[Category:In silico]]
[[Category:DNA]]
Revision as of 07:16, 16 September 2014


Overview

This session will be more challenging if you are still new to Unix. However, it’s main aim is to demonstrate that Unix is much more than an environment in which you can execute big programmes like bwa or samtools. Rather, with it’s suite of text searching, extraction and manipulation programmes, it is itself a powerful tool for the many small bioinformatic tasks that you will encounter during everyday work. This module expects that you have done the steps in the Basic Unix module, i. e. it expects that you have already created certain folders and have copied the example data into your account. There are still small (hopefully not large) bugs lurking in this protocol. Please help improve it by correcting those mistakes or adding comments to the talk page. Many thanks.

Before you start

Log into your iceberg account. On the head node iceberg1, type:

 $ qrsh

Then change to the directory NGS_workshop, that you created in the Basic Unix module:

$ cd NGS_workshop/Unix_module

TASK 3: I have downloaded my illumina read data. Now I want to know how many reads my sequence run has yielded.

$ zless Unix_tut_sequences.fastq.gz
$ zcat Unix_tut_sequences.fastq.gz | wc -l
$ man wc

gives you the number of lines in your sequence file, which is the number of reads times four for fastq format. Note, you can usually avoid uncompressing data files, which saves disk space.

TASK 4: I have a .fastq file with raw sequences from a RAD library. How can I find out how many of the reads contain the correct sequence of the remainder of the restriction site?

Let's assume you had 5bp long barcode sequences incorporated into the single-end adapters, which should show up at the beginning of each sequence read. Let's also assume that you have used the restriction enzyme SbfI for the creation of the library, which has the following recognition sequence: CCTGCAGG . So the correct remainder of the restriction site that you expect to see after the 5bp barcode is TGCAGG. First have a look at your fastq file again:

$ zless -N Unix_tut_sequences.fastq.gz

Each sequence record contains four lines. The actual sequences are on the 2nd, 6th, 10th, 14th, 18th line and so on. The following can give you the count:

$ zcat Unix_tut_sequences.fastq.gz | perl -ne 'print unless ($.-2)%4;'  | grep -c "^.....TGCAGG"

This is a pipeline which first uncompresses your sequence read file and pipes it into the perl command, which extracts only the DNA sequence part of each fastq record. $. in perl stands for the current line number, %4 returns modulo 4 of the current line number minus 2 (the modulo operator returns 0 if the result of the division is an integer (i. e. if the line number is a multiple of 4) and 1, 2, 3 etc. otherwise. Zero is equivalent to FALSE and any integer other than zero is equivalent to TRUE). Get it?

$ zcat Unix_tut_sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | less

Grep searches each line from the output of the perl command for the regular expression given in quotation marks. ^ stands for the beginning of a line. A dot stands for any single character. There are five dots, because the barcodes are all 5 base pairs long. The -c switch makes grep return the number of lines in which it has found the search pattern at least once.

$ man grep
$ man perlrun
$ zcat Unix_tut_sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | grep  "^.....TGCAGG" | less

In less, type:

/^.....TGCAGG 

then hit enter.

TASK 5: I have split my reads by barcode and I have quality filtered them. Now I want to know how many reads I have left from each (barcoded) individual. How can I find that out?

$ for file in *.fq; \
do echo -n "$file " >> retained; \
cat $file | perl -ne 'print unless ($.-2)%4;' | wc -l >> retained; \
done &
$ less retained

This bash for loop goes sequentially over each file in the current directory which has the file ending .fq. It prints that file name to the output file retained. The >> redirection character makes sure that all output is appended to the file retained. Otherwise only the output from the last command in the loop and from the last file in the list of files would be stored in the file retained. The ampersand & at the end of the command line sends it into the background, which means you get your command line prompt back immediately while the process is running in the background.

TASK 6: I have 30 output files from a programme, but their names are not informative. I want to insert the keyword cleaned into their names? How can I do this?

Renaming files is a very common and important task. I’ll show you a simple and fast way to do that. There are as always many ways to solve a task. First, you could type 30 mv commands that’s the fastest way to rename a single file but doing that 30 times is very tedious and error prone. Second, you could use a bash for loop as above in combination with echo and the substitution command sed, like this:

$ for file in *.out; do mv $file `echo $file | sed 's|^\(.*\)\(\.out\)$|\1_cleaned\2|'`; done

but that's a bit tricky if you are not using sed every day. Note the two backticks characters: one before echo, the other right before the last semicolon. Everything in backticks will be evaluated by bash and then replaced by it’s output. So in this case it’s the modified file name that is provided as second argument to the mv command. If you want to find out what the sed command does, take a look at the beginning of this sed tutorial.

The best way, however, to solve the task is by downloading the perl rename script (note, this is not the system rename command that comes with Unix, which has very limited capabilities), put it into a directory that is in your PATH (e. g. ~/prog) and make that text file executable with chmod. You can download Larry Wall's rename.pl script from here with wget:

$ cd ~/src
$ wget http://tips.webdesign10.com/files/rename.pl.txt

or use firefox if you are in an interactive session.

$ mv rename.pl.txt ~/prog/rename.pl

Note, this cut/pastes and renames at the same time.

$ ll ~/prog

Let’s call the programme:

rename.pl

You should get Permission denied. That’s because we first have to tell bash that we permit its execution:

$ chmod u+x ~/prog/rename.pl
$ man chmod

The u stands for user (that’s you) and x stands for executable. Let’s try again:

$ rename.pl

How to make the documentation in the script file visible?

$ cd ~/prog
$ pod2man rename.pl > rename.pl.1
$ mv rename.pl.1 ~/man/man1

Then look at the documentation with:

$ man rename.pl

First let's create 30 empty files to rename:

$ cd
$ mkdir test
$ cd test
$ for i in {1..30}; do touch test_$i.out; done
$ ll
$ rename.pl -nv 's/^(.+)(\.out)/$1_cleaned$2/' *.out

You already know that bash expands *.out at the end of the command line into a list of all the file names in the current directory that end with .out. So the command does the renaming on all the 30 files we created. Let’s look at the stuff in single quotes. s turns on substitution, since we are substituting old file names with new file names. The forward slashes / separate the search and the replacement patterns. The search pattern begins with ^, which stands for the beginning of the file name. The dot stands for any single character. The + means one or more of the preceding character. So .+ will capture everything from the beginning of the file name until .out. In the second pair of brackets we needed to escape the dot with a backslash to indicate that we mean dot in its literal sense. Otherwise, it would have been interpreted as representing any single character. Anything that matches the regular expressions in the brackets will be automatically saved, so that it can be called in the replacement pattern. Everything that matches the pattern in the first pair of brackets is saved to $1, everything that matches the pattern in the second pair of brackets is saved to $2. Fore more on regular expressions, look up:

$ man grep
$ man perlrequick

TASK 7: I have got a text file originating from a Windows or an old Mac text editor. But when I open it in less, all lines are concatenated into one line and with repeatedly the strange ^M character in it. How can I fix this?

$ cd ~/NGS_workshop/Unix_module/TASK_7
$ ll
$ cat -v text_file_from_Windows.txt

Unix has only linefeeds, the old Mac’s had only carriage returns and Windows uses both characters together to represent one return character. The following will remove the carriage return from the end of lines, leaving only linefeeds:

$ tr -d ‘\r’ < text_file_from_Windows.txt > file_from_Windows_fixed.txt
$ cat -v file_from_Windows_fixed.txt
$ man tr

Note, most Unix programmes print their output to STDOUT, which is the screen. If you want to save the output, it needs to be redirected into a new file. Never redirect output back into the input file, like this:

$ tr -d ‘\r’ < text_file_from_Windows.txt > text_file_from_Windows.txt 
$ cat text_file_from_Windows.txt

You’ve just clobbered your input file.

less text_file_from_Mac.txt

The old Mac text file has just one carriage return instead of a linefeed. That’s why, all lines of the file are concatenated into one line. To fix this:

$ tr ‘\r’ ‘\n’ < text_file_from_Mac.txt > file_from_Mac_fixed.txt
$ less file_from_Mac_fixed.txt

As usual there are several ways to achieve the same result:

$ sed ‘s/\r/\n/g’ < text_file_from_Windows.txt > \ file_from_Windows_fixed.txt
$ dos2unix text_file_from_Windows.txt
$ mac2unix text_file_from_Mac.txt

The last two programmes are available on iceberg, but are not by default included in a Unix system. You would have to install them yourself. Note also, that dos2unix and mac2unix do in place editing, i. e. the old version will be overwritten by the new version.

TASK 8: I have just mapped my reads against the reference sequences and I got a BAM file for each individual. How can I find out the proportion of reads from each individual that got mapped successfully?

Note: it seems that the samtools view command should be able to do that with the -f and -c switches. However, in my trials it only returned the total number of reads in the BAM file, i .e including those that did not get mapped. Fortunately, this is a really easy Unix task. First let’s have a look at the .bam file:

$ cd ~/NGS_workshop/Unix_module/TASK_8
$ samtools view alignment.bam | less

If that fits on your screen without line wrapping ... lucky bastard! If not, just turn off line wrapping in less:

$ samtools view alignment.bam | less -S

Use the arrow keys on your keyboard to scroll left and right. The second column contains a numerical flag which indicates the result of the mapping for that read. The flag of 0 stands for a successful mapping of the read. A flag of 16 also stands for successful mapping, but of the reverse complement of the read. So in order to get the number of reads that got mapped successfully we need to count the number of lines with zeros or 16's in their second column. Let’s cut out the second column of the tab delimited .bam file:

$ samtools view alignment.bam | cut -f 2 | less
$ man cut

Next, we have to sort this column numerically in order to collapse it into unique numbers:

$ samtools view alignment.bam | cut -f 2 | sort -n | uniq -c
$ man sort

With the -c switch to uniq the output is a contingency table of the flags from column 2 that can be found in your alignment file. However, some reads with a 0 flag in the second column still have a mapping quality of 0 (don’t ask me why), which is in the 5th column. So, in order to get a count of the number of reads that got mapped successfully with a mapping quality above 0, use gawk:

$ samtools view alignment.bam | awk '($2==0 || $2==16) && $5>0' | wc -l
$ man awk

Gawk is the GNU version of awk. Macs get only shipped with awk. Both are almost identical except that awk can’t handle large files (since we are mostly working on one input line at a time, this rarely makes a practical differnece). $2 stands for the content in the second column, $5 for the fifth, || means logical OR and && means logical AND. Awk prints only those lines to output which match our conditions. We then simply count those lines.

TASK 9: I have got a big multi-fasta file with thousands of sequences. How can I extract only those fasta records whose sequences contain at least 3 consecutive repeats of the dinucleotide AG?

Believe it or not, this task can be solved with a combination of Unix commands, i. e. no programming is required. Let’s approach the task step by step. First have a look at the input file:

$ cd ~/NGS_workshop/Unix_module/TASK_9
$ less multi_fasta.fa

We could just use the programme grep to search each line of the fasta file for our pattern (i. e. 3 consecutive AG), but since grep is line based we would lose the fasta headers and some part of the sequence in the process. So, we somehow have to get the fasta headers and their corresponding sequences on the same line.

$ tr '\n' '@' < multi_fasta.fa | less

Everything is on one line now (but don’t do this with really large files: the line-based tools further down the pipeline will then have to hold the whole file in memory as one single line).

$ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | less

Note the g at the end of the sed command, which stands for global. With the global option sed does the replacement for every occurrence of a search pattern in a line, not just for the first.
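A quick way to see the difference (just an illustration):

$ echo aaa | sed 's/a/b/'     # prints baa
$ echo aaa | sed 's/a/b/g'    # prints bbb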

$ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | less

Ok, we are almost ready to search, but we first need to get rid of the @ sign in the middle of the sequences.

$ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n'| sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/' | less

In the sed command the search and the replacement patterns are enclosed in forward slashes /. Anything that matches the pattern between \( and \) will be stored by sed and can be called again in the replacement pattern. The square brackets [ ] mean match any one character that is enclosed by them. In the replacement pattern \1 stands for the base before the @ sign, \2 stands for the base after the @ sign. Finally, let’s search for the microsats:

$ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | \
sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/' | egrep "@.*(AG){3,}.*@" | less

Note the use of egrep instead of just grep for extended regular expressions. The search pattern is enclosed in quotation marks. Our sequences are delimited by @’s on each side. The dot stands for any single character. The asterisk * means zero or more of the preceding character, so .* can match nothing at all or anything else. The {3,} means 3 or more repeats of the preceding pattern. Without the parentheses around AG this would only refer to the G in AG. Now that we have all sequences with ≥3x AG microsats, let’s get them back into fasta format again:

$ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | \
sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/' | egrep "@.*(AG){3,}.*@" | \
tr '@' '\n' | less

And let’s get rid of empty lines:

$ tr '\n' '@' < multi_fasta.fa | sed 's/>/#>/g' | tr '#' '\n' | \
sed 's/\([AGCT]\)@\([AGCT]\)/\1\2/' | egrep "@.*(AG){3,}.*@" | \
tr '@' '\n' | grep -v "^$" > AGAGAG.fa

The grep search pattern contains the regular expression symbols for the beginning and the end of a line with nothing in between, i. e. an empty line. The -v switch inverts the matching, so only non-empty lines pass.
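A quick check of what this pattern does (just an illustration):

$ printf "a\n\nb\n" | grep -v "^$"    # prints a and b, the empty line is gone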

Open AGAGAG.fa in less, and type:

/(AG){3,}

and hit Enter. Each sequence should have a highlighted match. The syntax of regular expressions for grep, sed and less (which is very similar in all three) is one of the most useful things you can learn. The most flexible regular expressions, however, are provided by Perl. Finally, let’s count how many sequences are in the output file:

$ grep -c "^>" AGAGAG.fa
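As an aside, the whole extraction can also be done in a single awk pass by making each fasta record one awk record. A sketch, assuming gawk (older awks may not understand the {3,} interval):

$ awk 'BEGIN {RS=">"; FS="\n"} NR>1 {seq=""; for (i=2; i<=NF; i++) seq = seq $i; if (seq ~ /(AG){3,}/) printf(">%s\n%s\n", $1, seq)}' multi_fasta.fa | less

With RS=">" every fasta record becomes one awk record; $1 is then its header line and the remaining fields are its sequence lines, which the loop concatenates before the pattern test.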

== TASK 10: I have a large table from the output of the programme Stacks containing the condensed information for each reference tag. The columns are tab delimited; the first contains the tag ids and another the consensus sequence for each tag. How can I create a multi-fasta file from this table with the tag ids as fasta headers?==

First, convince yourself that the table is actually tab delimited:

$ cd ~/NGS_workshop/Unix_module/TASK_10
$ cat -T stacks_output.tsv | less -S
   

Tabs will be displayed as ^I. In which column is the Consensus sequence of each tag? We want the Catalog ID as fasta header. Let’s first extract the two columns we need:

$ cut -f 1,5 stacks_output.tsv | less

The first line of the output contains the column headers of the input table. We don’t want them in the fasta file. So let’s remove this line:

$ cut -f 1,5 stacks_output.tsv | tail -n +2 | less
$ man tail

Now let’s insert a > in front of the tag ids in order to mark them as the fasta headers.

$ cut -f 1,5 stacks_output.tsv | tail -n +2 | sed 's/\(.*\)/>\1/' | less

In the sed command \(.*\) captures a whole line, which we call in the replacement pattern with \1. Finally, we have to replace the tab that separates each header from its sequence with a newline character in order to bring each sequence onto the line below its header.

$ cut -f 1,5 stacks_output.tsv | tail -n +2 | sed 's/\(.*\)/>\1/' | \
tr '\t' '\n' | less

If you’re satisfied with the result, then redirect the output of tr into an output file. But what if you also wanted the multi-fasta file to be sorted by tag id?

$ cut -f 1,5 stacks_output.tsv | tail -n +2 | sort -nk 1,1 | \
sed 's/\(.*\)/>\1/' | tr '\t' '\n' | less

With the sort command we turn on numerical sort with -n and the -k switch lets us specify on which column the sorting should be done (by default, sort would use the whole line for sorting).
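The same conversion can also be written with awk instead of sed and tr; a sketch (tags_sorted.fa is just an illustrative output name):

$ tail -n +2 stacks_output.tsv | sort -nk 1,1 | \
awk -F'\t' '{printf(">%s\n%s\n", $1, $5)}' > tags_sorted.fa

Here -F'\t' sets the field separator to tab, so $1 is the tag id and $5 the sequence.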

== TASK 11: I want to map the reads from 96 individuals against a reference sequence (e. g. partial or full reference genome or transcriptome, etc.). A mapping programme takes one of the 96 input files and tries to find locations for each read in the reference sequence. How can I parallelise this task and thus get the job done in a fraction of the time?==

Note, this task makes use of a computer cluster running the job scheduler SGE. It also requires that you have the mapping programme stampy installed in a location that is in your PATH. Check this with:

$ stampy

Current hardware in the Iceberg computer cluster:

SGE allows jobs to be submitted to a cluster of:

* 96 Sun X2200s (each with 4 CPUs and 16 GB memory)
* 31 Sun X2200s (each with 8 CPUs, 32 GB memory and Infiniband)
* 96 Dell C6100 nodes (each with 12 CPUs, 24 GB memory and Infiniband)

… that’s 96 × 4 + 31 × 8 + 96 × 12 = 1,784 CPUs !!! Let’s try to use up to 96 of them at the same time.

$ cd ~/NGS_workshop/Unix_module/TASK_11
$ ls -l ind_seqs

There are 96 fastq files with the reads from 96 individuals in this folder. All input files contain a number from 1 to 96 in their names. Otherwise, their names are identical. We now want to map the reads from those individuals against the reference_seq. We have already prepared a so-called array job submission script for you. Let’s have a look at it.

$ nano array_job.sh

The lines starting with #$ are specific to the SGE. The first requests 2 Gigabytes of memory; the second asks for slightly less than 1 hour to complete each task of the job. Any job or task requesting less than 8 hours will be submitted to the short queue by the SGE, where there is almost no waiting time. The next two lines specify that you get an email when each task begins and when it ends. -j y saves STDERR and STDOUT from each task in one file. Change the following two lines appropriately:

# tell the SGE where to find your executable
export PATH=/usr/local/extras/Genomics/Applications/bin:$PATH
# change into the directory where the input files are
cd /home/your_login_name/NGS_workshop/Unix_module/TASK_11/ind_seqs
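
For reference, the #$ header lines described above might look something like this. This is only a sketch: the exact resource names and values are assumptions and may differ on your cluster:

#!/bin/bash
#$ -l mem=2G         # request 2 Gigabytes of memory per task
#$ -l h_rt=00:59:00  # slightly less than 1 hour of run time per task
#$ -m be             # email at the beginning and end of each task
#$ -M your.email@example.com
#$ -j y              # merge STDERR into STDOUT, one file per task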


The important line is the following:

#$ -t 1-96

This initialises 96 task ids, which we can call in the rest of the array job submission script with $SGE_TASK_ID.

stampy \
-g ../reference_seq \
-h ../reference_seq \
-M ind_$SGE_TASK_ID.fq \
-o ind_$SGE_TASK_ID.sam

This is the actual command line that executes stampy, our mapping programme. The -M switch to stampy takes the input file and the -o switch the output file name. Submitting this array job script to the SGE scheduler is equivalent to submitting 96 different job scripts, each with an explicit number in place of $SGE_TASK_ID. After exiting nano, let’s submit the job.
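For example, for the task with id 23 the scheduler sets SGE_TASK_ID to 23, so the command effectively becomes (purely illustrative):

stampy -g ../reference_seq -h ../reference_seq -M ind_23.fq -o ind_23.sam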

$ qsub array_job.sh
$ Qstat


This is the end of the advanced Unix session. If you’ve made it this far: CONGRATULATIONS !!!

==Notes==

Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input!

Please sign your name to your note by adding '''*~~~~''': to the beginning of your tip.