
Winterschool program

Software list

Basics

  1. Linux: Ubuntu 12.04.3 vs Debian 7.1 (think about 32- vs 64-bit versions)
  2. java http://www.java.com/en/download/linux_manual.jsp?locale=en

Specific tools 1

  1. FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  2. TagDust: http://genome.gsc.riken.jp/osc/english/software/src/tagdust.tgz
  3. fastareformat from exonerate-2.2.0 [1]
  4. fixing fasta headers (gff fields) with python? small script
  5. GEM [2] http://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%203/
     CAVEAT: (problem with cores on different laptops...)
  6. BWA http://sourceforge.net/projects/bio-bwa/files/
  7. Stampy http://www.well.ox.ac.uk/~gerton/software/Stampy/stampy-1.0.22r1848.tgz
  8. last http://last.cbrc.jp/ (the 362 version has split and splice-mapping options)
  9. bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
  10. samtools http://sourceforge.net/projects/samtools/files/
  11. picard http://sourceforge.net/projects/picard/files/
  12. IGV / IGVtools http://www.broadinstitute.org/software/igv/download
  13. bamtools https://github.com/pezmaster31/bamtools (requires cmake: http://www.cmake.org/files/v2.8/cmake-2.8.12.tar.gz or apt-get)
  14. bedtools http://code.google.com/p/bedtools/downloads/list
  15. GATK http://www.broadinstitute.org/gatk/auth?package=GATK (download yourself: license!)
  16. vcftools http://sourceforge.net/projects/vcftools/files/

Specific tools 2/RNA-Seq

  1. tophat http://tophat.cbcb.umd.edu/
  2. cufflinks http://cufflinks.cbcb.umd.edu/ (may require Boost libs!)
  3. GEMtools https://github.com/gemtools/gemtools

Vagrant fixes

For X11 forwarding, the Vagrantfile has to contain:

config.ssh.forward_x11 = true


Introduction to Linux and the command line

  1. why Linux?
    1. runs on everything from cell phones to supercomputers
    2. long history of stable tools
    3. most of the bioinformatics software was written and intended to run on Linux


logging in, connecting to other servers with ssh / sftp

As with other computers, one requires a username/password combination to connect to a specific computer. This combination can be specific to each of the computers or shared between, e.g., all workstations at a given location.

SSH is the name for secure, encrypted connections between computers. It consists of two components: an ssh server running on the remote machine and an ssh client on your laptop / workstation. The client is included in the default installations of recent Linux and OS X (Mac), but on Windows one has to install it ( http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html ). It is not required for the course, but it is a good idea to have it on your computer if you want to access remote servers on the command line.

With a properly configured ssh connection (on Linux and Mac) you can run not just command line programs but also programs with a graphical user interface. This is also possible on Windows, but it is way more complicated to set up.
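
A few typical invocations (the username and server name below are made up; replace them with the ones given during the course):

ssh student@remote.server.org    # log in to a remote machine on the command line
ssh -X student@remote.server.org # log in with X11 forwarding, so graphical programs display locally
sftp student@remote.server.org   # transfer files to / from the remote machine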

XXX Emilio: start vagrant or Linux from inside your virtual box machine, log in XXX


File and directory naming

Linux is case sensitive, so MyFile.txt is different from MYFILE.TXT or myfile.txt. Try to use some consistent naming scheme for your project directories, input data and result files. You can use very long, descriptive names, like this result file from the ENCODE project:

wgEncodeUwTfbsNhdfneoCtcfStdAlnRep0.bam_VS_wgEncodeUwTfbsNhdfneoInputStdAlnRep1.bam.regionPeak.gz

Things to keep in mind:

    1. never ever use spaces/tabs in your file/directory names
    2. avoid Unix special characters in file names (!?"'%&^~*$|/\{}[]()<>:)


Linux directory structure/navigating

As with Windows and Mac systems, one can imagine the system of directories (a synonym of folders) as a giant tree, where each of the branches has an upper directory and may contain lower level directories (sub-directories). At the top of this branched hierarchy there is just one place from where everything starts (no C:, D:, X: drives), and in Unix speak it is called the root directory. Some examples of directory naming:

/home/linus/

/usr/local/bin

/home/linus/bioinf/programs

/home/linus/projects/chicken

Individual users have separate directories in which they store their data, results etc.

Some shortcuts to remember (what is after the # sign is a description):


/ #root directory

~ #your home directory

.. #one directory above

.  #current directory  
 

There are a few commands to move between directories, create new ones, and list what is located there:


pwd #print working directory = show where you are

ls #list what is located in the current directory

ls -l #list what is located in the current directory with details

cd /home/linus/projects/chicken #go to this directory

cd - # special shortcut to go back to the directory you were in before

cd / #go to the root directory

cd ~ #go to your home directory

cd .. #go to the directory one level above the current one

mkdir mynew_directory #create new directory

rmdir myold_directory #remove some directory (it must be empty!)

Absolute vs relative directory naming

We can do all the operations on directories or files using two conventions:

# absolute directory path starting with root "/"

ls /home/linus/projects/chicken/genome.fasta

#relative directory path (let's assume we are in /home/linus/projects/banana/)

ls ../chicken/genome.fasta

This is a useful shortcut for saving typing and avoiding errors when, e.g., working on different systems.

Permissions

In order to restrict actions associated with a given file or directory, each entity has 3 flags (r = read, w = write, x = execute) for 3 groups of users (the file owner, the group to which the file owner belongs, e.g. students, and "all" for all the remaining users on that computer). So we have 9 fields describing what each of these groups can do with the file. On top of that (or rather at the front of the string) we have another flag to tell us what kind of thing it is ("-" = just a regular file, d = directory, l = link, etc.)

By default, the owner has read+write permissions, the group members can read but not write (modify or delete) the files, and the rest of the world has no right to see the content of the files. These permissions are visible when listing the content of a directory with "ls -l":


-rwxr-xr-x 1 vagrant users    49 Nov  7 14:43 my_program.py
-rw-r----- 1 vagrant users  1812 Nov  7 14:39 myfile01.txt
---------- 1 vagrant users  9045 Nov  7 14:41 myfile02.txt
-r--r--r-- 1 vagrant users  2016 Nov  7 14:39 myfile03.txt
-rw------- 1 vagrant users 67863 Nov  7 14:40 myfile04.txt
-rw-rw-rw- 1 vagrant users  8125 Nov  7 14:40 myfile05.txt
-rw-rw---- 1 vagrant users  7233 Nov  7 14:40 myfile06.txt
lrwxrwxrwx 1 vagrant users    12 Nov  7 14:50 myfile07.txt -> myfile06.txt
drwxr-xr-x 2 vagrant users     0 Nov  7 14:50 test_dir1

To change permissions, we first specify the group of users we want to modify, then the action (add or remove permissions), then the permissions themselves, and finally the name of the file(s):

chmod a+x my_new_script.py #add execute permission for all

chmod a-w my_new_script.py # remove write permission for owner, group and the rest

This is often useful for sharing data / results with other users on the same machine or when writing scripts/executing some programs downloaded from the net.
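
The same permissions can also be written as numeric (octal) codes, where read=4, write=2 and execute=1 are summed up separately for the owner, the group and the rest:

chmod 755 my_program.py   # rwxr-xr-x: owner can read/write/execute, group and the rest can read and execute
chmod 640 myfile01.txt    # rw-r-----: owner can read/write, group can read, the rest gets nothing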

copy, rename/move files, create symbolic links

Often we need to organize files in directories from the existing ones, which requires moving things around, making copies, renaming directories and files. To save space or to avoid using long path names, we can create links to other existing files or folders.

CAVEAT: both cp and mv commands will perform the task without warning if the destination file exists. It is one of the ways of destroying your data. To avoid this error use "cp -i" or "mv -i", which will ask you for confirmation when you are overwriting an existing file.

#copy move rename pattern: 
cp/mv source destination

cp file1.txt file2.txt

cp file1.txt /some/directory/of/your/choice/ 

cp /home/linus/projects/linus/file1.txt . #notice the space and the dot "." at the end of the command. The command copies the file to the current directory

Instead of doing a copy (the original file stays intact) one can do mv (move/rename), where only the new copy remains and the old one is removed.

mv file1.txt file2.txt

mv file1.txt /some/directory/of/your/choice/ 

mv /home/linus/projects/linus/file1.txt . #notice the space and the dot "." at the end of the command. The command moves the file to the current directory


Linking is a way of organizing your data. E.g. you have one big genome fasta file and want to map your next generation sequencing reads using different mappers, which will require using this one file as an input to several programs. You can store your fasta file in one directory (i.e. /home/vagrant/projects/chicken/genome1.fa) and create links to this file in several subdirectories (/home/vagrant/projects/chicken/bwa_mapping/, /home/vagrant/projects/chicken/bowtie2_mapping/, etc.). The commands:

cd /home/vagrant/projects/chicken/bwa_mapping/
ln -s  /home/vagrant/projects/chicken/genome1.fa .

cd /home/vagrant/projects/chicken/bowtie2_mapping/
ln -s  /home/vagrant/projects/chicken/genome1.fa .


view files (more/less, head, tail), count (wc)

There are two basic types of files, text files and binary files. A lot of file formats in bioinformatics are simple text files, so basic viewing/processing of them is essential. Never try to open big text files (say a few-GB FASTQ file) in an editor, be it an editor on Linux or Microsoft Word...

Basic checking of the file content:

more myfile04.txt #displays content of the file screen by screen

less myfile04.txt #same as above but you can scroll back

head myfile04.txt #displays just the first 10 top lines of the file

head -100 myfile04.txt #displays the first 100 lines 

tail myfile04.txt  #displays just the last 10 lines of the file

tail -100 myfile04.txt  #displays  the last 100 lines of the file

wc myfile04.txt #newline, word, and byte counts for the file

wc -l myfile04.txt # just count the lines of the file

Using these commands is very useful for basic sanity checks, so we can be sure that the FASTA file looks OK, that we have X lines in the file etc. First practical NGS application:

Q: how many reads do we have in our FASTQ file? A: count the lines in the FASTQ file and divide by 4 (each read occupies 4 lines).
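
On the command line (reads.fastq is a made-up file name here):

wc -l reads.fastq                        # total number of lines in the file
echo $(( $(wc -l < reads.fastq) / 4 ))   # number of reads = number of lines divided by 4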

pattern matching, redirection and piping

Instead of using the full name of a file each time, we can substitute part of the file name with special characters. Say we want to list the names of all the files with a .txt suffix ( myfile01.txt, myfile02.txt, etc. ) in a directory. And let's say we have about 500 of these:

ls *.txt

On Linux we can capture the output of one simple command (assuming it produces text output) and do something useful with it, instead of just reading it from the screen. If we want to count how many such files we have, we can pipe "|" (vertical line, not the letter "I") the output of the ls command to the wc command:


ls *.txt | wc -l #displays the number of .txt files in this directory

We can also redirect the output to a file, creating e.g. a list of txt files with names starting with some letter (redirection with a single ">"):

ls m*.txt > list_of_txt_files_starting_with_m.fof  #we create a new file listing m*.txt files 

Or we can keep on adding text to some file, which will be created if it does not exist (redirection with a double ">>"):

ls a*.txt >> list_of_txt_files_starting_with_a_or_m.fof  #file is created and we get just a*.txt file names in it
ls m*.txt >> list_of_txt_files_starting_with_a_or_m.fof  #file exists, so we are just appending   m*.txt file names to it
more list_of_txt_files_starting_with_a_or_m.fof

search for strings / replace strings (grep & sed)

We can search text files using a command called "grep". A few examples:

#general pattern:
grep some_word text_file_name

grep ITALY my_file.txt #prints all lines containing the word ITALY

grep -v ITALY my_file.txt #"reverse grep" prints all lines not containing ITALY

grep -c ITALY my_file.txt #just counts the number of lines containing ITALY

grep "^>" multiple_sequences.fasta #prints all FASTA headers (lines starting with ">" sign)

grep -c "^>" multiple_sequences.fasta # prints number of sequences in the multiple fasta file

Instead of opening an editor and doing search and replace, we can do it on the command line using a utility called sed, operating either on files or on pipes.

sed 's/ITALY/SPAIN/g' my_file.txt # replaces each occurrence of ITALY with SPAIN

grep "^>" multiple_sequences.fasta  | sed 's/>//g' # we get all FASTA headers and then remove the ">" sign, getting just the proper FASTA IDs

compressing / uncompressing files (gzip, tar)

The sequence files are often really big, and we save space and time during downloads by compressing them. There are several compression programs, but the most frequently used is gzip:

gzip some_file #we will get some_file.gz, and the original uncompressed one will be removed

gunzip some_file.gz # reversing the process, we get some_file

Program sources (almost always) or data are distributed not as a single file but bundled into one file with the "tar" program, then compressed for easier transfer/storage.

#we downloaded program_123.tar

tar xfv program_123.tar #this will unpack the content of the tar archive

#we downloaded program_123.tar.gz
tar xfvz program_123.tar.gz #this will unpack the content of the gzip-compressed tar archive

#we want to create a tar archive containing all data_001.txt to data_100.txt (assuming there are only 100 such files in this directory)
#pattern
#tar cfv result.tar files_to_be_archived

tar cfv mydata.tar data_*.txt

#to create a gzipped archive
tar cfvz mydata.tar.gz data_*.txt

awk in few minutes

Awk is a simple programming language used for basic text processing. It is still being used because it is often faster to write and execute a command in awk than to write a small script in a more advanced language, such as Perl or Python. The basic concept is that awk splits the lines of the text into individual words/numbers (== great for text files in column form), which we can then access using $column_number notation.

awk '{print $1}' my_data.txt # prints just first column of the file

awk '{print $1"_my_new_label"}' # prints first column but also "_my_new_label" string just after it

We can perform simple calculations (assuming we have columns of numbers).

#this will simply sum all the numbers from the first column then print that number
awk '{ sum+=$1} END {print sum}' my_data.txt 
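
Along the same lines, awk keeps the number of processed lines in the built-in variable NR, so we can also print the mean of the first column:

awk '{ sum+=$1} END {print sum/NR}' my_data.txt   # mean of the numbers in the first column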


where to go from there

more information

All the commands used above have multiple options, impossible to cover in this short introduction. Once you know the name of the program, you can learn about its functions and usage by reading the installed manual:

man ls
man grep

The recommended reference book is:

Unix Power Tools, 3rd Edition

By Jerry Peek, Shelley Powers, Tim O'Reilly, Mike Loukides

Publisher: O'Reilly Media

Released: October 2002

Pages: 1156

shell scripts

Sometimes we need to do the same thing on a set of files, like submitting a hundred jobs to a computer cluster using the same program (let's say for mapping FASTQ files) but with different input data (FASTQ files) and differently named outputs (BAM files). To do this we first need to create 100 individual "program files" (== shell scripts), here called "map_script...", e.g. using Python (next section). But instead of submitting the jobs one by one we can do this on the command line:

 
for file in map_script* #press enter
do #press enter
qsub $file #press enter
done #press enter
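
Creating those 100 small scripts can itself be done with a loop. A minimal sketch (the file names and the bwa mem command line are only an assumption; adjust them to your own data and mapper):

for fq in sample_*.fastq #press enter
do #press enter
echo "bwa mem genome.fa $fq > ${fq%.fastq}.sam" > map_script_${fq%.fastq}.sh #press enter
done #press enter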

Python

Despite the fact that there is a lot of bioinformatics software written in Perl (ENSEMBL, GBrowse, BioPerl), Python is a good choice as a first (and maybe even only) programming language you need for day-to-day bioinformatics tasks. It is easier to learn and read than Perl, but still fast enough to deal with e.g. FASTQ files.

XXX two python programs XXX

1) Fasta header fixer

2) ??? some mapper runner ???

Computer clusters

Individual workstations or even multi-CPU servers are not good enough for running hundreds of CPU-demanding jobs. For that we have computer clusters, where we use tens or hundreds of computer nodes. To distribute the load on such clusters we use special software maintained by system admins, e.g. Sun Grid Engine (SGE). It takes care that not all jobs are run at once the moment they are submitted, and it distributes the jobs among the nodes, balancing the load. Because individual cluster installations at your institutions are likely different from each other (they may use different software than SGE), this section is mostly for convincing you that whenever you have access to a cluster, you should investigate how you can use it and run your next generation sequencing jobs on it.

XXX qsub XXX qmon XXX qdel
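
The basic SGE commands look like this (the script name and job ID are made up; the options actually required, e.g. queue names or memory limits, differ between installations, so ask your local admins):

qsub map_script_001.sh   # submit one job to the queue
qstat                    # list your queued and running jobs
qdel 12345               # remove the job with this ID from the queue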

FASTQ

  1. Illumina file formats (quality encodings)
  2. paired / unpaired reads
  3. quality checking (fastqc)
  4. trimming & filtering (TagDust)

TagDust is a program for removing Illumina adapter sequences from reads that contain them. Reads containing 6-8 bases that do not come from the genome will be impossible to map with typical mappers, which often have a limit of just 2 mismatched bases. (XXX check if it is true for bwa/bowtie)

  5. source of published FASTQ data: Short Read Archive vs ENA
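
Running FastQC from the command line is simple (file names are made up); it writes a quality report per input file that you can open in a browser:

fastqc reads_1.fastq reads_2.fastq   # one report per input file, written next to the data
fastqc -o qc_reports reads_1.fastq   # same, but the report goes into an existing qc_reports directory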

Genomic fasta and gtf/gff gene annotation

  1. resources at ENSEMBL
  2. basic checks and reformatting
  • grepping fasta headers
  • fasta reformat from exonerate??
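
A minimal sketch of such checks, reusing the grep commands from the Linux part and fastareformat from exonerate (genome.fa is a made-up file name; fastareformat is assumed to write the reformatted sequences to standard output):

grep -c "^>" genome.fa                           # number of sequences in the file
grep "^>" genome.fa | head                       # look at the first few sequence names
fastareformat genome.fa > genome_reformatted.fa  # rewrap sequences to uniform line lengths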

Mapping genomic reads

overview of mappers

There are tens of different mappers for aligning millions of short reads to genomes. There are huge differences in speed, memory usage and handling of errors in the reads. We will take a look at just a few, most commonly used by various pipelines. For the time being we will treat these programs as black boxes, not going into the algorithms used inside.

We need a reference genome (FASTA) from the same species to which we will map our reads, and short read data (Illumina FASTQ). Before mapping we need to do some basic checks.

  1. genome check

If you get the genome of a canonical species from ENSEMBL (cow, chicken, pig) you can trust that it contains the proper number of chromosomes and the mitochondrial sequence. This is not the case with draft genomes available from individual research groups. It is always a good idea to look at the number and names of the individual contigs (see the grep section) and to check that all expected sequences are present. With genomes lacking certain sequences we have a higher chance of mapping reads to wrong locations, which in turn will give us wrong coverage, SNPs, or expression in the case of RNA-Seq.

  2. checking reads

See FastQC and TagDust above.

Before we can use the genome for mapping we have to transform it into an index, a format specific to each mapper that allows for much faster searches and lower memory usage.
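
For example, for bwa and bowtie2 the index is built once from the genome FASTA (genome.fa is a made-up file name) and then reused for every mapping run:

bwa index genome.fa                  # creates genome.fa.bwt, genome.fa.pac, ... next to the FASTA
bowtie2-build genome.fa genome_idx   # creates genome_idx.*.bt2 index files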


    1. GEM
    2. bwa +/- stampy
    3. bowtie2
    4. last

This is a less popular but sometimes quite useful mapper that reports unique mappings only. It can handle a large number of mismatches: it simply removes the non-matching parts of the read, as long as what is left is sufficient to secure a unique mapping. It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently). Standard usage:

lastdb -mXXXXX last_database_name fasta_file

lastal -QXXX last_database_name input_reads
 
  1. mapping steps (for each mapper)
  2. genome indexing
  3. mapping
  4. +/- postprocessing
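
As an illustration of these steps, a minimal bwa run could look like this (file names are made up; bwa mem is just one of several bwa mapping modes):

bwa mem genome.fa reads_1.fastq reads_2.fastq > mapped.sam   # map a paired-end library
samtools view -bS mapped.sam > mapped.bam                    # convert SAM to binary BAM (see next section)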

SAM and BAM file formats

  1. Analyzing BAM files
  2. sorting / indexing

During the mapping, reads are processed first in, first out, so the output is ordered by the order of the reads, not by position along the chromosomes. Querying such BAM files is very inefficient, and the files themselves are unnecessarily large. So every BAM file to be used downstream needs to be first sorted and then indexed. The BAM index (.bai suffix) lets external programs access a given part of the genome even more efficiently. For sorting and indexing we will use samtools:

samtools sort unsorted_input.bam sorted_output #notice no bam suffix!!!
samtools index sorted_output.bam #creates sorted_output.bam.bai  
  3. viewing the mappings in IGV

IGV is a Java program primarily for viewing mappings of short reads to a genome. But it can also be used for viewing SNPs (VCF files), genome annotation or even genome-to-genome alignments (not a typical usage). The order of steps:

  1. start IGV (it needs to be started specifying the amount of RAM to be used by the program). This depends on the coverage and the number of BAM files opened at the same time. In short, the more RAM assigned, the faster the scrolling.
  2. select exactly the same version of the genome, with contigs named the same way as in your BAM files
  3. open the BAM files (they need to be sorted and indexed), plus any annotation you may need.
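
One way to start it with more memory, assuming the downloaded igv.jar is in the current directory (the exact file name depends on the IGV version):

java -Xmx2g -jar igv.jar   # start IGV with 2 GB of RAM assigned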


tools for processing BAM files

  1. samtools
  2. picard
  3. bamtools

getting mapping stats

  1. extracting reads mapping to regions
  2. getting coverage info for selected regions
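
A few samtools / bedtools one-liners covering these points (the BAM file name and region are made up; the BAM must be sorted and indexed as above):

samtools flagstat sorted_output.bam                               # overall mapping statistics
samtools view sorted_output.bam chr1:10000-20000 | wc -l          # count reads mapped to a region
samtools view -b sorted_output.bam chr1:10000-20000 > region.bam  # save the reads from that region as a new BAM
bedtools genomecov -ibam sorted_output.bam                        # coverage histogram along the genome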

Detecting SNPs

  1. general procedure

The detection of SNPs depends on the mapping, which in turn depends on the input read length, quality and coverage. Very short reads (e.g. 36 bp) are hard to map to many regions.

  1. GATK pipeline
  2. other SNP calling programs [tba]
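
A sketch of a single GATK call (file names are made up; the full pipeline also includes duplicate marking, local realignment and base quality recalibration, skipped here, and the reference needs a samtools faidx index plus a Picard sequence dictionary):

java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R genome.fa -I sorted_output.bam -o raw_variants.vcf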

Working with VCF files

  1. VCF file format
  2. viewing VCFs in IGV
  3. filtering SNPs by quality
  4. set operations on VCF files (common SNPs, unique SNPs)
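
Hedged examples of the last two points, using vcftools for quality filtering and bedtools for set operations (file names and the quality cutoff are made up; bedtools intersect compares positions, not alleles):

vcftools --vcf raw_variants.vcf --minQ 30 --recode --out filtered      # keep sites with quality >= 30, writes filtered.recode.vcf
bedtools intersect -a sample1.vcf -b sample2.vcf > common_snps.vcf     # SNPs from sample1 overlapping positions in sample2
bedtools intersect -v -a sample1.vcf -b sample2.vcf > sample1_only.vcf # SNPs found only in sample1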

RNASeq

  1. caveats (ribosomal RNA contamination)
  2. mapping RNASeq
  3. tophat
  4. GRAPE
  5. creating gene models from RNASeq (cufflinks)
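
A minimal tophat + cufflinks sketch (file and index names are made up; tophat reuses the bowtie2 index built earlier):

tophat -o tophat_out genome_idx reads_1.fastq reads_2.fastq   # splice-aware mapping, writes tophat_out/accepted_hits.bam
cufflinks -o cufflinks_out tophat_out/accepted_hits.bam       # assemble transcripts / gene models from the mappings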