VonHoldt:High Throughput Sequencing Resources



Della and Tigress

The DELLA processing and TIGRESS data storage servers of Princeton's High Performance Computing center are our analytical powerhouses, and we use specific locations on each server for specific jobs. The hardware lives in a lovely server closet, so the way to access it is through a secure shell (ssh). Your username and password are obtained from the IT staff. Once you have logged in, there is a set of commands and "server etiquette" you will need to follow. The PU website has more information on basic usage and tutorials if you are interested.

You should familiarize yourself with some basic Unix commands by doing a few tutorials. Here is also a nice website with a large number of Linux commands.


Login

  • ssh netid@della.princeton.edu --- to log in securely
  • slogin netid@della.princeton.edu --- an equivalent secure-login command
  • uname -a --- to learn about the server
  • passwd --- to change the default password you are given
  • logout (or control+D) --- to logout


Rules

  • Della is only to be used to execute code via a formal job submission program (the qsub command)
    • You only have 1GB of space in your Della home directory
    • You have 500GB of SCRATCH space on Della, separate from your home directory
  • Tigress is for storage of all data! Write all output to this server as well
    • ln -s --- soft-link your Tigress data files into your Della home with this command (see the sketch below)
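A minimal sketch of the soft-link workflow; the netid and project names here are hypothetical:

 cd ~                                    # your small (1GB) Della home directory
 ln -s /tigress/netid/project_data data  # "data" now points at the Tigress directory
 ls -l data                              # the arrow in the listing shows where the link resolves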


Submitting jobs using qsub

 #!/bin/bash
 #PBS -l nodes=1:ppn=1,walltime=1:05:00   # request 1 node, 1 processor, and 1h05m of walltime
 #PBS -m abe                              # email when the job aborts, begins, or ends
 #PBS -N cluster_data                     # job name shown in the queue
 #PBS -M vonholdt@princeton.edu           # address for those emails
 cd /tigress/vonholdt/RRBS_foxes          # work from the Tigress data directory
 ./cluster_data.sh                        # the script that does the actual work
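To run it, save those lines in a job script and hand it to qsub (the file name here is hypothetical):

 qsub cluster_data_job.sh    # the scheduler replies with a job ID such as 12345.della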

Usage

  • qsub --- to submit a script (e.g. jobs_to_run.sh) on Della, which can point to perl/python/R/shell scripts on Tigress that do the actual work
  • Job length: initially request about 2x the amount of time you think your job will take to complete; you can refine this estimate over time.
    • Test queue
      • 1-hour limit
      • 2-job maximum per user; NOT to be used for production runs
    • Short queue
      • 24-hour limit
      • 40-job maximum
    • Medium queue
      • 72-hour limit
      • 16-job maximum per user
      • 432 total cores
  • qstat --- to check the job progress on Della
  • You can ssh into any node once you have the node ID from your qsub to check on the job status using traditional commands (see the sketch after this list):
    • htop --- use to view real-time CPU usage
    • top --- displays the top CPU processes and provides an ongoing look at processor activity in real time; it lists the most CPU-intensive tasks on the system, offers an interactive interface for manipulating processes, and can sort tasks by CPU usage, memory usage, and runtime
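A minimal submit-and-monitor cycle might look like this (the job script and node names are hypothetical):

 qsub jobs_to_run.sh     # prints the job ID, e.g. 12345.della
 qstat -u $USER          # list only your jobs and their queue states
 ssh della-r1c1n1        # hypothetical node name taken from the qstat output
 top                     # watch your process on that node; press q to quit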


Basic unix

  • If you don't know something, use the manual
    • man ls --- to look up the functionality of the ls tool; otherwise use Google, or ask the admins (Jonathan or Ron) or in-lab experts (Rena or Pedro)
  • mpstat --- displays the utilization of each CPU individually; it reports processor-related statistics
  • mpstat -P ALL --- displays activities for each available processor, processor 0 being the first one; global average activities across all processors are also reported
  • sar --- displays the contents of selected cumulative activity counters in the operating system
  • ps -u username --- lists all the current processes for the specified user (use your own username to see your jobs)
  • kill PID --- kills (ends) the process with that process ID (see the sketch below)
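A quick sketch of finding and stopping one of your own processes (the PID shown is hypothetical):

 ps -u $USER       # list your processes; the PID is in the first column
 kill 28451        # sends SIGTERM, a polite request to stop
 kill -9 28451     # last resort if the process ignores SIGTERM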


Installing programs yourself (locally on the lab computers)

  • Check if it's already installed
  • mkdir ~/bin --- to create a directory in your home folder
  • cat ~/.bash_profile --- check whether ~/bin is already on your PATH; if not, add the next two lines to that file
  • PATH=$PATH:$HOME/bin
  • export PATH
  • compile it with a prefix pointing at your home directory so programs install into ~/bin (see the sketch below)
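A sketch of a local install, assuming an autotools-style source tarball (the program name is hypothetical); with --prefix=$HOME, the binaries land in $HOME/bin:

 tar -xzf someprogram-1.0.tar.gz    # unpack the hypothetical source
 cd someprogram-1.0
 ./configure --prefix=$HOME         # install locations become $HOME/bin, $HOME/lib, etc.
 make
 make install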


Data transfer (network)

  • scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 --- Command Line Interface (CLI) for moving files (see the example below)
  • scp -r user@host_source:path/to/dir user@host_dest:/dest/path --- Command Line Interface (CLI) for moving directories
  • FileZilla, Cyberduck, Fugu, etc. --- Graphical User Interface (GUI) options
  • df -h --- check disk usage of mounted filesystems
  • du -hs /path --- check total disk space used by a directory
  • du -h --max-depth=1 /path --- check disk space used by each subdirectory, one level deep
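A concrete sketch with hypothetical netid and paths: push a directory of reads to Tigress, then pull one result file back to the current local directory:

 scp -r RRBS_reads/ netid@della.princeton.edu:/tigress/netid/
 scp netid@della.princeton.edu:/tigress/netid/results.txt .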


Files

  • ls --- lists your files
  • ls -l --- lists your files in long format
  • ls -a --- shows hidden files; this is actually a critical command! If you *think* you are using little space, a million hidden files may be the culprit... voila, now they can be managed.
  • ls -t --- sorted by time modified instead of name
  • ls -h --- lists your files in "human" format
  • ls -hla --- combines the three options above; it's beautiful.
  • more filename --- shows the first part of a file; hit the space bar to see more
  • head filename --- print the first 10 lines (by default) of the specified file to the screen
  • tail filename --- print the last 10 lines (by default) of the specified file to the screen
  • emacs filename --- an editor for editing a file
  • cp filename1 filename2 --- copies a file in your current location
  • cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
  • rm filename --- permanently remove a file (Caution! This cannot be undone!)
  • diff filename1 filename2 --- compares files and shows where they differ
  • wc filename --- tells you how many lines, words, and characters (bytes) are in a file
  • wc -l filename --- tells you how many newline-delimited lines are in a file
  • wc -w filename --- tells you how many whitespace-delimited words are in a file
  • wc -c filename --- tells you how many characters (bytes) are in a file
  • chmod options filename --- change the read, write, and execute permissions for a file (Google the option syntax; examples below)
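A couple of common chmod patterns (the file names are hypothetical):

 chmod u+x cluster_data.sh    # make a script executable by you (the owner)
 chmod 644 results.txt        # owner can read/write; everyone else read-only
 ls -l cluster_data.sh        # confirm the x bit in the permission string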


File compression [see also the gzip usage website]

  • gzip filename --- compresses a file, producing filename.gz
  • gzip -c filename > filename.gz --- writes the compressed stream to stdout; the ">" redirects it into filename.gz
  • gunzip filename --- uncompress a gzipped file
  • tar -xzf filename.tar.gz --- extract a gzip-compressed tar archive
  • gzcat filename --- lets you look at a gzipped file without having to gunzip it (the command is zcat on many Linux systems)
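Compression plays nicely with pipes. For instance (the file name is hypothetical), you can count the reads in a gzipped fastq without unpacking it, since each read occupies four lines:

 gzcat sample1.fastq.gz | wc -l    # divide the line count by 4 for the read count (use zcat on Linux)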


Directories

  • pwd --- prints working directory (your current location)
  • cd /path/to/desired/location --- change directories by providing path
  • cd ../ --- go up one directory
  • mkdir directoryName --- make a new directory
  • rmdir directoryName --- remove directory (must be empty)...Remember that you cannot undo this move!
  • rm -r directoryName --- recursively remove a directory and the files it contains... Remember that you cannot undo this move!


Finding things

  • whereis [filename, command] --- lists all occurrences of the filename or command
  • ff --- finds files anywhere on the system
  • ff -p --- finds a file by just typing in the beginning of the file name
  • grep string filename(s) --- looks for strings in the files (use man grep for more information)
  • ~/path --- the tilde designates a shortcut for the path to your home directory
  • nohup commands & --- to initiate a no-hangup background job (writes stdout to nohup.out unless redirected; see the example below)
  • screen --- to initiate a new screen session for a background job (ctrl+a then d to detach; screen -ls to list running screens; screen -r pid to reattach)
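For example (the barcode and file name are hypothetical), you can count reads carrying a barcode and keep the search running after you log out:

 nohup grep -c ACGTGT sample1.fastq > barcode_count.txt &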


Data editing

  • vim filename --- to edit the file


History

  • ctrl+r --- search the command history incrementally
  • history --- display the numbered command history
  • !n --- re-run command number n from the history list
  • The up arrow is a shortcut to scroll through recently used commands




High throughput (HT) platform and read types

Take a moment to check out this Cornell site describing the specs of a few platforms!

  • ABI-SOLiD
  • Illumina single-end vs. paired-end
  • Ion Torrent
  • MiSeq
  • Roche-454
  • Solexa


CBI Collaboratory

UCLA's Computational Biosciences Institute Collaboratory hosts a variety of 3-day workshops that provide both a general introduction to genome/bioinformatic sciences and more advanced, focused workshops (e.g. ChIP-Seq, BS-Seq, exome sequencing). The CBI Collaboratory teaches a set of publicly available resources, ranging from the web-based bioinformatic tool Galaxy/UCLA (a central hub of HT workflows and tools for multiple platforms and data types) to tools such as R and Matlab. The introductory workshops do not require any programming experience, and the Collaboratory Fellows additionally serve as a counseling resource for data analysis.


Getting your HT sequence data

1. Walk a hard drive over (e.g. Freimer Lab)

  • Not deplexed
  • bcl files hold the base calls the machine writes out during sample sequencing... this is the NEW way of producing results files
  • Convert to qseq using the program CASAVA

2. rsync (e.g. Pellegrini Lab)

  • Retrieve qseq (not deplexed) files

3. ftp site (e.g. Berkeley)

  • Added cost for library prep ($150/sample), Bioanalyzer run, qPCR, and quantification
  • Conversion of bcl to qseq
  • Option to retrieve data as fastq
  • Added cost (?) for deplexing

4. MiSeq (e.g. UCLA Human Genetics Core)

  • Retrieve fasta file formats
  • They can deplex and map data


File formats and conversions

  • bcl
  • qseq
  • fastq


Deplexing using barcoded sequence tags

  • Edit (or Hamming) distance --- how many barcode mismatches to tolerate when assigning reads to samples


Quality control

  • Fastx tools
  • Using mapping as the quality control for reads
  • For PE data, FastQC is preferable to Fastx



Trimming and clipping

    • Trim based on low quality scores per nucleotide position within a read
    • Clip sequence artefacts (e.g. adapters, primers); see the sketch after this list
      • cutadapt for SE reads --- download and run from your personal programs or scripts folder
      • trimgalore for PE reads --- download and run from your personal programs or scripts folder (it also runs fastqc, which is installed on sirius)
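A minimal sketch of both tools (the adapter sequence and file names are hypothetical; check the flags against the installed versions):

 cutadapt -a AGATCGGAAGAGC -o sample1.trimmed.fastq sample1.fastq    # SE reads
 trim_galore --paired --fastqc sample1_R1.fastq sample1_R2.fastq     # PE reads, with a fastqc pass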


FASTQC and FASTX tools


BED and SAM tools


GATK variant calling

GATK and GATK Guide


R basics

Here is a file with some helpful R commands for inputting data, making basic plots, statistics, etc. courtesy of Los Lobos.

Also, refer to the following websites for help:

Additionally, with high throughput genome sequence data, we often need modules that are implemented in R's Bioconductor. Here is a great website and course material from a short course on using R and Bioconductor


Python basics

Here is a file with helpful commands in Python, BioPython, EggLib, etc., from Los Lobos.

Also, here are several links to help you get going:


HT sequence analysis using R (and Bioconductor)


DNA sequence analysis


DNA methylation analysis

Primarily, the vonHoldt lab works on reduced representation bisulfite sequencing (RRBS) of wild populations. Here are some suggested analytical methods with tutorials.

To use methylKit the data must have the following columns:
[1]ID (e.g. Chromosome.position)
[2]Chromosome
[3]Position
[4]Read direction (F/R)
[5]# of reads
[6]%Methylated Reads (as a number 0-100)
[7]%Unmethylated Reads (as a number 0-100)

E.g.:
chr01.11979 chr01 11979 F 4 100.0 0.0

Then follow the tutorial for analysis steps.

Other notes:

  • The analysis is memory-limited.
  • To look at CpG islands, you must provide defined islands (either from genomic data or from a predefined grouping of the data).


And for any sort of GO enrichment analysis



RNA-seq analysis

Common objectives of transcriptome analysis:

For a reasonably thorough list of RNA-seq bioinformatic tools, please see this site!


SOLiD software tools


Passing Arguments to Scripts and Programs Using xargs

  • xargs passes commands from the bash shell command line to a shell script and to other scripts or programs called in the script.
    • Although the argument is always simply referenced as $1 in the script, xargs works iteratively, going through the script with the first argument, then the second, and so on.
  • Create this simple script:
 #! /bin/bash
 #check that a base file name argument was supplied
 if [ $# -eq 0 ]  # if no arguments were entered the script will complain and then stop
   then
     echo "Please supply argument .... "
     echo "Useage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
 else
     echo $1
 fi
  • Call it using:
 echo arg1 arg2 arg3 | xargs -n 1 script.sh 
  • The -n flag to xargs specifies how many arguments at a time to supply to the given command. -n 1 tells xargs to supply 1 argument to the command. The command will be invoked repeatedly until all input is exhausted.
    • This means you can also use xargs for a command that needs two or more arguments.
      • For instance you could use this to supply read group information to the picard AddReadGroups command.
  • Another option, -P #, tells xargs to split the job across # different cores; -P 4 uses 4 cores.
    • This only works if you have multiple jobs that can be run in PARALLEL, i.e. one command run multiple times, once with each xarg or set of xargs (see the sketch below)
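For example (reusing the hypothetical scriptname.sh from the usage strings above), to run the script once per chromosome, four chromosomes at a time:

 echo chr1 chr2 chr3 chr4 chr5 | xargs -n 1 -P 4 ./scriptname.sh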


  • You can pass arguments to a program like fastqc, tophat, samtools etc.
    • I split up my aligned reads by chromosome to speed up processing.
      • With xargs I can call them all at once and process them on more than one core, something that samtools can't do by itself.
      • The following command would pile up three samples and do it sequentially for however many chromosomes I call in xargs.
 #! /bin/bash
 #check that a base file name argument was supplied
 if [ $# -eq 0 ]  # if no arguments were entered
   then
     echo "Please supply argument .... "
     echo "Useage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
 else
     samtools mpileup -uf referencefilename /path/sample1$1.bam /path/sample2$1.bam /path/sample3$1.bam | bcftools view -bvcg - > /path/$1var.raw.bcf
 fi


  • You can pass the arguments to a python script by using sys.argv to supply arguments to the python script and calling the python script as myscript.py arg1
  • Save this simple script:
 #! /bin/bash
 if [ $# -eq 0 ]  # if no arguments were entered
   then
     echo "Please supply argument .... "
     echo "Useage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
 else
     test.py $1
 fi
  • Save the following as test.py. It will be called by the last shell script above.
    • This is a very simple example but number could just as easily designate a file to be opened by the python script.
 #! /usr/bin/env python
 import sys                                  # command-line arguments live in sys.argv
 number = sys.argv[1]                        # the first argument after the script name
 print "This is argument number ", number    # Python 2 print statement