Wayne:High Throughput Sequencing Resources

Revision as of 08:44, 20 February 2013


Basic unix and usage of Sirius (our lab server)

Sirius is our analytical powerhouse (64 cores, well suited to parallel computing; 512 GB of memory; 64-bit x86_64 architecture), and specific locations on the server are set aside for specific jobs. It is stored in a lovely server closet, so the way to access it is through a secure shell (ssh). Your username and password are issued by our IT staff. Once you have logged on, there is a set of commands and "server etiquette" you will need to follow. The rules are also available as a PDF: http://openwetware.org/images/5/5e/Sirius_rules.pdf

Login

  • ssh user@sirius.eeb.ucla.edu
  • slogin user@sirius.eeb.ucla.edu
  • To learn about the server:
    • uname -a
  • To change the default password you are given, use:
    • passwd
  • To log out of the server
    • logout (or control+D)


Structure and organization

  • Your home (user) directory holds <5 GB of data (be aware!)
    • /home/user
  • For genomes and databases
    • /databases
  • Location of installed programs
    • /usr/local/bin
    • /opt/
  • The location to store your data
    • /data/
    • /data/user
      • You can create your own personal directory if you'd like (see below for commands)
  • The location to place scripts and data ONLY while you are working with it
    • /work/user
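
Because the home directory is so small, it helps to check how much space a directory uses before adding to it. The following is a minimal sketch, assuming the 5 GB figure above is the limit to compare against; usage_kb and check_quota are hypothetical helper names, and the demo runs on a throwaway directory rather than on /home/user.

```shell
#!/bin/sh
# Report how much space a directory uses and flag it when it reaches a limit.
# usage_kb and check_quota are illustrative names, not tools installed on Sirius.

usage_kb() {
    # du -sk prints the total size in kilobytes followed by the path
    du -sk "$1" | awk '{print $1}'
}

check_quota() {
    dir=$1
    limit_kb=${2:-5242880}   # default: 5 GB expressed in KB (5 * 1024 * 1024)
    used_kb=$(usage_kb "$dir")
    if [ "$used_kb" -ge "$limit_kb" ]; then
        echo "OVER: $dir uses ${used_kb} KB (limit ${limit_kb} KB)"
    else
        echo "OK: $dir uses ${used_kb} KB (limit ${limit_kb} KB)"
    fi
}

# Demo on a temporary directory so nothing real is touched
demo_dir=$(mktemp -d)
printf 'ACGT\n' > "$demo_dir/reads.txt"
check_quota "$demo_dir"
```

On the server itself, the same check would be run as check_quota "$HOME".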


Rules

  • Developing a pipeline:
    • copy a small but representative part of your data to sirius
    • run all the programs you need on them
    • debug and save final version of pipeline e.g. in a text file
    • copy all your data
    • run your pipeline on all data
    • debug and update pipeline
    • mv results wherever you want
    • erase data
  • Never start more jobs than the number of available cores (e.g., if 50 jobs are already running, do NOT submit more than 14, for a total of 64 jobs)!!
  • Look at the memory and CPU usage before you start to load Sirius with commands
    • htop --- shows real-time CPU and memory usage
    • top --- displays the most CPU-intensive processes/jobs and gives an ongoing, real-time look at processor activity. It offers an interactive interface for manipulating processes and can sort tasks by CPU usage, memory usage, and runtime.
    • mpstat --- reports processor-related statistics (by default, the average across all CPUs)
    • mpstat -P ALL --- displays activity for each available processor individually (processor 0 being the first one), along with the global average across all processors
    • sar --- displays the contents of selected cumulative activity counters in the operating system
  • If you don't know something, use the manual
    • man ls --- looks up the functionality of the ls tool
    • You can also use Google, or ask the admins (Jonathan or Ron) or someone in the lab (Rena or Pedro)


Installing programs yourself

  • Check if it's already installed
  • mkdir ~/bin --- creates a bin directory in your home folder
  • cat ~/.bash_profile --- check whether ~/bin is already in your PATH; if it is not, add the following two lines to that file
  • PATH=$PATH:$HOME/bin
  • export PATH
  • compile it with prefix ~/bin --- installs the program into your bin directory
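
The PATH step can be made safe to repeat, so that logging in several times does not keep appending the same directory. The sketch below assumes a POSIX shell; add_to_path is a hypothetical helper name, and a temporary directory stands in for ~/bin so the demo does not modify your home folder.

```shell
#!/bin/sh
# add_to_path (hypothetical helper): append a directory to PATH only if it
# is not already listed there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$1" ;;       # same form as PATH=$PATH:$HOME/bin
    esac
    export PATH
}

# Demo with a throwaway directory; in practice this would be "$HOME/bin"
demo_bin=$(mktemp -d)
add_to_path "$demo_bin"
add_to_path "$demo_bin"             # second call changes nothing

# Show the matching PATH entry
printf '%s\n' "$PATH" | tr ':' '\n' | grep -F -x "$demo_bin"
```

To make the change permanent, the function and the call add_to_path "$HOME/bin" would go in ~/.bash_profile.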


Data transfer (network)

  • scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 --- Command Line Interface (CLI) for copying files between hosts
  • scp -r user@host_source:path/to/dir user@host_dest:/dest/path --- the same for directories (-r means recursive)
  • FileZilla, Cyberduck, Fugu, etc. --- Graphical User Interface (GUI) alternatives
  • df -h --- check free disk space (do this before moving data, to make sure there is enough room)
  • du -hs /path --- check the total disk space used by a directory
  • du -h --max-depth=1 /path --- check the disk space used by each immediate subdirectory
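
Checking for room before a transfer combines du (how big is the data?) with df (how much space is free at the destination?). The following is a minimal sketch; enough_space is a hypothetical helper, and the source and destination here are local temporary locations rather than real paths on Sirius.

```shell
#!/bin/sh
# enough_space (hypothetical helper): succeed if the file system holding
# "target" has at least "needed_kb" kilobytes available.
enough_space() {
    target=$1
    needed_kb=$2
    # df -P gives one line per file system; -k reports sizes in KB.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$target" | awk 'NR == 2 {print $4}')
    [ "$avail_kb" -ge "$needed_kb" ]
}

# Demo: measure a small source directory, then check the destination
src=$(mktemp -d)
printf 'ACGT\n' > "$src/sample.txt"
needed=$(du -sk "$src" | awk '{print $1}')

dest=${TMPDIR:-/tmp}
if enough_space "$dest" "$needed"; then
    echo "enough room in $dest for $needed KB"
else
    echo "not enough room in $dest for $needed KB"
fi
```

For a transfer to the server, du would be run on the local machine and df on Sirius (for example in /data/user) before starting scp.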

Data editing

  • vim filename --- opens the file in the vim editor, for small modifications to a file on the server

History

  • ctrl+r --- search the history
  • history --- display the history
  • !cmd_num --- rerun the command with that number in the history list (e.g. !42)
  • Arrow up is a shortcut to scroll through recently used commands

Files

  • ls --- lists your files
  • ls -l --- lists your files in long format
  • ls -a --- shows hidden files
  • ls -t --- sorted by time modified instead of name
  • more filename --- shows first part of a file; hit space bar to see more
  • head filename --- print to screen the top 10 lines or so of the specified file
  • tail filename --- print to screen the last 10 lines or so of the specified file
  • emacs filename --- an editor for editing a file
  • cp filename1 filename2 --- copies a file in your current location
  • cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
  • rm filename --- permanently remove a file (Caution! This cannot be undone!)
  • diff filename1 filename2 --- compares files and shows where they differ
  • wc filename --- tells you how many lines (newline delimited), words (whitespace delimited), and characters (bytes) are in a file
  • wc -l filename --- tells you how many lines are in a file
  • wc -w filename --- tells you how many words are in a file
  • wc -c filename --- tells you how many characters (bytes) are in a file
  • chmod options filename --- change the read, write, and execute permissions for a file (Google this!)
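
Several of these commands combine naturally when you carve a small test subset out of a large file. The sketch below uses made-up file names in a temporary directory; nothing in it is specific to Sirius.

```shell
#!/bin/sh
work=$(mktemp -d)

# Build a 12-line stand-in for a large data file
i=1
while [ "$i" -le 12 ]; do
    echo "record $i"
    i=$((i + 1))
done > "$work/full.txt"

# Keep only the first 4 lines as a small test subset
head -n 4 "$work/full.txt" > "$work/subset.txt"

# Count lines; reading from stdin keeps the file name out of the output
subset_lines=$(wc -l < "$work/subset.txt")
echo "subset has $((subset_lines)) lines"

# Peek at the end of the full file
tail -n 2 "$work/full.txt"

# Copy the subset, then confirm the copy is identical
# (diff exits with status 0 when the files match)
cp "$work/subset.txt" "$work/subset_copy.txt"
if diff "$work/subset.txt" "$work/subset_copy.txt" > /dev/null; then
    echo "copy matches original"
fi

# Remove write permission for everyone, protecting the copy from edits
chmod a-w "$work/subset_copy.txt"
```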


File compression

  • gzip filename --- compresses the file, replacing it with filename.gz
  • gzip -c filename >filename.gz --- compresses the file into filename.gz while keeping the original; -c writes to standard output and ">" redirects that output to filename.gz
  • gunzip filename.gz --- uncompresses a gzip file
  • tar -xzf filename.tar.gz --- unpacks a tar.gz archive
  • gzcat filename --- lets you look at a gzipped file without having to gunzip it (on most Linux systems the command is zcat)
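
The difference between compressing a single file and bundling files into an archive is easiest to see in a round trip. This is a small sketch with made-up file names in a temporary directory; gunzip -c is used for viewing because it behaves the same on Linux and macOS.

```shell
#!/bin/sh
work=$(mktemp -d)
printf 'line one\nline two\n' > "$work/notes.txt"

# Compress to a separate .gz file; -c leaves the original in place
gzip -c "$work/notes.txt" > "$work/notes.txt.gz"

# View the compressed contents without unpacking them on disk
gunzip -c "$work/notes.txt.gz"

# Bundle the file into a tar.gz archive (-c create, -z gzip, -f archive name);
# -C switches to the given directory before adding or extracting files
tar -czf "$work/bundle.tar.gz" -C "$work" notes.txt

# Unpack the archive into a separate directory
mkdir "$work/unpacked"
tar -xzf "$work/bundle.tar.gz" -C "$work/unpacked"
ls "$work/unpacked"
```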


Directories

  • pwd --- prints working directory (your current location)
  • cd /path/to/desired/location --- change directories by providing path
  • cd ../ --- go up one directory
  • mkdir directoryName --- make a new directory
  • rmdir directoryName --- remove directory (must be empty)...Remember that you cannot undo this move!
  • rm -r directoryName --- recursively remove a directory and the files it contains (rmdir has no -r option)...Remember that you cannot undo this move!
  • rm filename --- remove the specified file...Remember that you cannot undo this move!
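
The behaviour of rmdir versus rm -r can be tried safely in a scratch directory. The sketch below uses hypothetical directory names and also shows mkdir -p, which creates any missing parent directories.

```shell
#!/bin/sh
base=$(mktemp -d)

# -p creates project, project/raw and project/raw/run1 in one step
mkdir -p "$base/project/raw/run1"
touch "$base/project/raw/run1/reads.txt"

# rmdir only removes empty directories, so this attempt is refused
if rmdir "$base/project/raw" 2> /dev/null; then
    echo "removed (directory was empty)"
else
    echo "rmdir refused: directory is not empty"
fi

# Empty the innermost directory first, then rmdir succeeds
rm "$base/project/raw/run1/reads.txt"
rmdir "$base/project/raw/run1"

# rm -r removes a directory and everything below it; there is no undo
rm -r "$base/project"
```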


Finding things

  • whereis command --- lists the locations of the binary, source, and manual page files for a command
  • ~/path --- the tilde is a shortcut for the path to your home directory
  • nohup commands & --- starts a no-hangup background job that keeps running after you log out (writes stdout to nohup.out)
  • screen --- starts a new screen session in which to run a long job (ctrl+a then d to detach; screen -ls to list running screens; screen -r pid to reattach)
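
A background job started with nohup is easier to track if its output goes to a named log file instead of the default nohup.out. The following is a minimal sketch; the one-second sleep stands in for a real analysis command, and the log location is a temporary directory chosen for the demo.

```shell
#!/bin/sh
work=$(mktemp -d)

# Start the job in the background; stdout and stderr go to job.log
nohup sh -c 'sleep 1; echo "job finished"' > "$work/job.log" 2>&1 &
job_pid=$!
echo "started background job with PID $job_pid"

# In an interactive session you could log out here and the job would keep
# running. For the demo, wait for it and then show the log.
wait "$job_pid"
cat "$work/job.log"
```

For work that you want to return to interactively, screen is usually the better choice than nohup.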



High throughput (HT) platform and read types

  • ABI-SOLiD
  • Illumina single-end vs. paired-end
  • Ion Torrent
  • MiSeq
  • Roche-454
  • Solexa


CBI Collaboratory

UCLA's Computational Biosciences Institute Collaboratory hosts a variety of 3-day workshops that provide both a general introduction to genome/bioinformatic sciences and more advanced (focus) workshops (e.g. ChIP-Seq, BS-Seq, exome sequencing). The CBI Collaboratory focuses on a set of publicly available resources: the web-based bioinformatic tool Galaxy/UCLA (a resource for HT workflows and a central location for a variety of HT tools for multiple platforms and data types), as well as tools such as R and Matlab. The introductory workshops do not require any programming experience, and the Collaboratory Fellows additionally serve as a counseling resource for data analysis.


File formats and conversions

  • bcl
  • qseq
  • fastq


Deplexing using barcoded sequence tags

  • Edit (or Hamming) distance
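
The Hamming distance between two tags of equal length is the number of positions at which they differ; edit distance additionally allows insertions and deletions and is not covered here. The sketch below is an illustration only: hamming is a hypothetical helper written with awk, and the read tag and barcodes are made-up sequences, not ones used in the lab.

```shell
#!/bin/sh
# hamming (hypothetical helper): print the number of mismatched positions
# between two equal-length strings; fail if the lengths differ.
hamming() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) exit 1
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        print d
    }'
}

# Compare one observed tag against a few candidate barcodes
read_tag=ACGTTC
for barcode in ACGTAC TTGCAA GGATCC; do
    echo "$barcode $(hamming "$read_tag" "$barcode")"
done
```

A read would typically be assigned to a barcode only when the distance is small (for example 0 or 1) and no other barcode is equally close.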


Quality control

  • Fastx tools
  • Using mapping as the quality control for reads



Trimming and clipping

  • Trim based on low quality scores per nucleotide position within a read
  • Clip sequence artefacts (e.g. adapters, primers)



FASTQC and FASTX tools


BED and SAM tools


GATK variant calling


R basics



Python basics


HT sequence analysis using R (and Bioconductor)


DNA sequence analysis


RNA-seq analysis

Common objectives of transcriptome analysis:


SOLiD software tools