# Wayne: High Throughput Sequencing Resources

## Basic unix and usage of Sirius (our lab server)

Sirius is our analytical powerhouse (64 cores, great for parallel computing; 512 GB of memory; a 64-bit x86_64 system), and we have specific locations on the server for specific jobs. It lives in a server closet, so the way to access it is through a secure shell (ssh). Your username and password are obtained through our IT staff. Once you have logged on, there is a set of commands and "server etiquette" you will need to follow. For the PDF, click here.

You should familiarize yourself with some basic Unix commands by doing a few tutorials. Here is also a nice website with a large number of Linux commands.

• ssh user@sirius.eeb.ucla.edu --- log in over a secure shell
• uname -a --- to learn about the server
• passwd --- to change the default password you are given
• logout (or control+D) --- to logout

Structure and organization

• Your home (user) directory on Sirius holds <5Gb of data (be aware!)
• /home/user
• For genomes and databases
• /databases
• Location of installed programs
• /usr/local/bin
• /opt/
• The location to store your data
• /data/
• /data/user
• You can create your own personal directory there if you'd like (see the example after this list)
• The location to place scripts and data ONLY while you are working with it
• /work/user
• du -a username --- lists the space used by everything in your user directory; make sure to run this from the parent directory of your user directory
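• For example, a minimal sketch of setting up a personal directory under /data and checking its size (the username and file name below are placeholders):

    mkdir /data/yourusername                 # substitute your own username
    mv big_reads.fastq /data/yourusername/   # keep large data here, not in your home directory
    cd /data
    du -sh yourusername                      # summary of the space you are using
    du -a yourusername | sort -n | tail      # the largest files, if you need the detail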

Rules

• Developing a pipeline:
• copy a small but representative part of your data to sirius
• run all the programs you need on them
• debug and save final version of pipeline (e.g. in a text file)
• run your pipeline on all data
• debug and update pipeline
• move results wherever you want
• erase data
• Never start more jobs than the number of available cores (e.g. if 50 jobs are already running, do NOT submit more than 14, for a total of 64)!!
• Look at the memory and CPU usage before you start loading Sirius with commands (see the example after this list)
• htop --- use to view real-time CPU usage
• top --- displays the most CPU-intensive processes and provides an ongoing, interactive look at processor activity in real time; tasks can be sorted by CPU usage, memory usage, and runtime
• If you don't know something, use the manual (man) pages
• man ls --- look up the functionality of the ls tool; you can also Google it, ask the admins (Jonathan or Ron), or ask in-lab (Rena or Pedro)
• mpstat --- displays the utilization of each CPU individually; it reports processor-related statistics
• mpstat -P ALL --- displays activities for each available processor, processor 0 being the first one; global averages across all processors are also reported
• sar --- displays the contents of selected cumulative activity counters in the operating system
• kill PID --- kills (ends) the process with that process ID
• ps -u username --- lists all the current jobs for a specified username
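• As a sketch of the checks above before launching a long job (the script name is a placeholder):

    htop                     # or top: how busy are the cores and memory right now?
    mpstat -P ALL            # per-CPU utilization
    ps -u yourusername       # what do I already have running?

    # if there are free cores, start the job in the background so it survives logout
    nohup ./my_pipeline.sh > my_pipeline.log 2>&1 &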

Installing programs yourself

• Check if it's already installed
• mkdir ~/bin --- create a bin directory in your home folder
• cat ~/.bash_profile --- check whether ~/bin is already in your PATH; if not, add these two lines:
• PATH=$PATH:$HOME/bin
• export PATH
• build the program so its binaries end up in ~/bin (e.g. configure it with a prefix under your home directory); see the sketch after this list
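• A minimal sketch of installing a program from source into your home directory (the URL, program name, and build steps are placeholders; real projects vary):

    mkdir -p ~/bin ~/src
    cd ~/src
    wget http://example.org/sometool-1.0.tar.gz   # placeholder download
    tar -xzf sometool-1.0.tar.gz
    cd sometool-1.0
    ./configure --prefix=$HOME                    # binaries will land in ~/bin
    make && make install

    # make sure ~/bin is on your PATH (as in .bash_profile above), then test it
    source ~/.bash_profile
    sometool --version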

Data transfer (network)

• scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 --- Command Line Interface (CLI) for moving files (see the example after this list)
• scp -r user@host_source:path/to/dir user@host_dest:/dest/path --- Command Line Interface (CLI) for moving directories
• FileZilla, Cyberduck, Fugu, etc. --- Graphical User Interface (GUI)
• df -h --- check free disk space on each filesystem
• du -hs /path --- check disk space used by a directory
• du -h --max-depth=1 /path --- check disk space used by each subdirectory, one level deep
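• For example (hostnames, usernames, and paths below are placeholders), moving data between your own machine and Sirius might look like:

    # copy one file from your laptop up to your data directory on Sirius
    scp reads.fastq.gz user@sirius.eeb.ucla.edu:/data/user/

    # copy a results directory from Sirius back down to your laptop
    scp -r user@sirius.eeb.ucla.edu:/data/user/results ./results

    # check that there is room before transferring something large
    df -h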

Files

• ls --- lists your files
• ls -l --- lists your files in long format
• ls -a --- shows hidden files. This is actually a critical command: if you *think* you are using little space but it turns out you have a million hidden files, voila, now they can be found and managed.
• ls -t --- sorted by time modified instead of name
• ls -h --- used with -l, shows file sizes in human-readable units (e.g. K, M, G)
• ls -hla --- combines the three options above; it's beautiful.
• more filename --- shows first part of a file; hit space bar to see more
• head filename --- prints the first 10 lines (by default) of the specified file to the screen
• tail filename --- prints the last 10 lines (by default) of the specified file to the screen
• emacs filename --- an editor for editing a file
• cp filename1 filename2 --- copies a file in your current location
• cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
• rm filename --- permanently remove a file (Caution! This cannot be undone!)
• diff filename1 filename2 --- compares files and shows where they differ
• wc filename --- tells you how many lines, words, and characters (bytes) are in a file
• wc -l filename --- tells you how many lines (newline-delimited) are in a file
• wc -w filename --- tells you how many words are in a file
• wc -c filename --- tells you how many characters (bytes) are in a file
• chmod options filename --- change the read, write, and execute permissions for a file (Google this!)

• gzip filename --- compresses files to make a file with a .gz extension
• gzip -c filename >filename.gz --- writes the compressed data to stdout; the ">" redirects it into filename.gz, leaving the original file in place
• gunzip filename --- uncompress a gzip file
• tar -xzf filename.tar.gz --- decompressing a tar.gz file
• zcat filename (gzcat on some systems) --- lets you look at a gzipped file without having to gunzip it
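• A quick sketch of the usual compress/peek/extract cycle (file names are placeholders):

    gzip reads.fastq              # produces reads.fastq.gz
    zcat reads.fastq.gz | head    # peek at the first lines without uncompressing
    gunzip reads.fastq.gz         # get the plain file back
    tar -xzf some_program.tar.gz  # unpack a downloaded .tar.gz archive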

Directories

• pwd --- prints working directory (your current location)
• cd /path/to/desired/location --- change directories by providing path
• cd ../ --- go up one directory
• mkdir directoryName --- make a new directory
• rmdir directoryName --- remove a directory (must be empty)...Remember that you cannot undo this move!
• rm -r directoryName --- recursively remove a directory and the files it contains...Remember that you cannot undo this move!
• rm filename --- remove the specified file...Remember that you cannot undo this move!

Finding things

• whereis command --- lists all locations (binary, source, man page) of a command
• ff --- finds files anywhere on the system
• ff -p --- finds a file by just typing in the beginning of the file name
• grep string filename(s) --- looks for strings in the files (use man grep for more information)
• ~/path --- the tilde designates a shortcut for the path to your home directory
• nohup commands & --- to initiate a no-hangup background job (writes stdout to nohup.out)
• screen --- initiate a new screen session in which to start a background job (ctrl+a then d to detach; screen -ls to list running screens; screen -r pid to reattach); see the example below
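• For example, a minimal screen workflow for a long-running job (the session and script names are placeholders):

    screen -S myjob      # start a new session named "myjob"
    ./long_job.sh        # launch the job inside the session
    # press ctrl+a then d to detach and leave it running
    screen -ls           # list running screen sessions
    screen -r myjob      # reattach later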

Data editing

• vim filename --- to edit the file

History

• ctrl+r --- searching history
• history --- display history
• !cmd_num --- re-run command number cmd_num from the history list (e.g. !120)
• The up arrow is a shortcut to scroll through recently used commands

## High throughput (HT) platform and read types

Take a moment to check out this Cornell site describing the specs of a few platforms!

• ABI-SOLiD
• Illumina single-end vs. paired-end
• Ion Torrent
• MiSeq
• Roche-454
• Solexa

## CBI Collaboratory

UCLA's Computational Biosciences Institute (CBI) Collaboratory hosts a variety of 3-day workshops that provide both a general introduction to genome/bioinformatic sciences and more advanced, focused workshops (e.g. ChIP-Seq, BS-Seq, exome sequencing). The CBI Collaboratory focuses on a set of publicly available resources, ranging from the web-based bioinformatic platform Galaxy/UCLA (a resource for HT workflows and a central location for a variety of HT tools for multiple platforms and data types) to tools such as R and MATLAB. The introductory workshops do not require any programming experience, and the Collaboratory Fellows additionally serve as a consulting resource for data analysis.

## Getting your HT sequence data

1. Walk a hard drive over (e.g. Freimer Lab)

• Not deplexed
• bcl files are base call files in which the machine stores read data during sample sequencing...this is the NEW way of producing results files
• Convert to qseq using the program CASAVA

2. rsync (e.g. Pellegrini Lab)

• Retrieve qseq (not deplexed) files

3. ftp site (e.g. Berkeley)

• Added cost for library preps (150/sample), running the bioanalyzer, qPCR, and quantification
• Conversion of bcl to qseq
• Option to retrieve data as fastq
• Added cost (?) for deplexing

4. MiSeq (e.g. UCLA Human Genetics Core)

• Retrieve fasta file formats
• They can deplex and map data

## File formats and conversions

• bcl
• qseq
• fastq

## Deplexing using barcoded sequence tags

• Edit (or Hamming) distance

## Quality control

• Fastx tools
• Using mapping as the quality control for reads
• For PE data, Fastqc is preferable to Fastx

## Trimming and clipping

• Trim based on low quality scores per nucleotide position within a read
• Clip sequence artefacts (e.g. adapters, primers)
• cutadapt for SE reads --- download cutadapt and run it from your personal programs or scripts folder
• trimgalore for PE reads --- download trimgalore and run it from your personal programs or scripts folder (it also runs fastqc, which is installed on sirius)

## FASTQC and FASTX tools

## BED and SAM tools

## GATK variant calling

## R basics

Here is a file with some helpful R commands for inputting data, making basic plots, statistics, etc., courtesy of Los Lobos. Also, refer to the following websites for help:

## Python basics

Here is a file with helpful commands in Python, BioPython, EggLib, etc., from Los Lobos. Also, here are several links to help you get going:

## HT sequence analysis using R (and Bioconductor)

## DNA sequence analysis

## RNA-seq analysis

Common objectives of transcriptome analysis:

• Quantifying and annotating aligned reads
• Normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG) (R packages)
• Detection of alternative splice junctions

For a reasonably thorough list of RNA-seq bioinformatic tools, please see this site!

## SOLiD software tools

## Passing Arguments to Scripts and Programs Using xargs

• xargs passes arguments from the bash shell command line to a shell script and to other scripts or programs called within the script.
• Although the argument is always referenced simply as $1 in the script, xargs works iteratively, going through the script with the first argument, then the second, and so on.
• Create this simple script:
    #!/bin/bash
    # check that a base file name argument was supplied
    if [ $# -eq 0 ]   # if no arguments were entered the script will complain and then stop
    then
        echo "Please supply an argument .... "
        echo "Usage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
    else
        echo $1
    fi

• Call it using:
    echo arg1 arg2 arg3 | xargs -n 1 script.sh

• The -n flag to xargs specifies how many arguments at a time to supply to the given command. -n 1 tells xargs to supply 1 argument to the command. The command will be invoked repeatedly until all input is exhausted.
• This means you can also use xargs for a command that needs two or more arguments.
• For instance you could use this to supply read group information to the picard AddReadGroups command.
• Another option, -P n, tells xargs to run up to n jobs in parallel. -P 4 uses 4 cores.
• This only works if you have multiple jobs that can be run in PARALLEL, i.e. one command run multiple times, once with each xarg or set of xargs (see the sketch after this list).
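• As a sketch (file and script names are placeholders; fastqc is used only as an example command), one argument per job on four cores, and two arguments per job, look like:

    # run fastqc on every fastq.gz in the directory, four files at a time
    ls *.fastq.gz | xargs -n 1 -P 4 fastqc

    # supply two arguments per invocation (e.g. a sample name and a chromosome)
    echo sampleA chr1 sampleA chr2 | xargs -n 2 ./myscript.sh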

• You can pass arguments to a program like fastqc, tophat, samtools etc.
• I split up my aligned reads by chromosome to speed up processing.
• With xargs I can call them all at once and process them on more than one core, something that samtools can't do by itself.
• The following command would pile up three samples and do it sequentially for however many chromosomes I call in xargs.
    #!/bin/bash
    # check that a base file name argument was supplied
    if [ $# -eq 0 ]   # if no arguments were entered
    then
        echo "Please supply an argument .... "
        echo "Usage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
    else
        samtools mpileup -uf referencefilename /path/sample1$1.bam /path/sample2$1.bam /path/sample3$1.bam | bcftools view -bvcg - > /path/$1var.raw.bcf
    fi

• You can pass arguments to a python script by using sys.argv inside the script and calling the script as myscript.py arg1
• Save this simple script:

    #!/bin/bash
    # check that an argument was supplied
    if [ $# -eq 0 ]   # if no arguments were entered
    then
        echo "Please supply an argument .... "
        echo "Usage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
    else
        test.py $1
    fi

• Save the following as test.py. It will be called by the last shell script above.
• This is a very simple example, but the number could just as easily designate a file to be opened by the python script.
    #!/usr/bin/env python
    import sys
    number = sys.argv[1]
    print("This is argument number", number)