Wayne:High Throughput Sequencing Resources

From OpenWetWare



Revision as of 08:51, 20 February 2013


Basic unix and usage of Sirius (our lab server)

Sirius is our analytical powerhouse (64 cores, well suited to parallel computing; 512 GB of memory; 64-bit system in the x86_64 configuration), and specific locations on the server are set aside for specific jobs. It is stored in a lovely server closet, so the way to access it is through a secure shell (ssh). Your username and password are obtained through our IT staff. Once you have logged on, there is a set of commands and "server etiquette" you will need to follow. For the PDF, click here.

Login

  • ssh user@sirius.eeb.ucla.edu --- to secure login
  • slogin user@sirius.eeb.ucla.edu --- to secure login
  • uname -a --- to learn about the server
  • passwd --- to change the default password you are given
  • logout (or control+D) --- to logout


Structure and organization

  • Your home (user) directory holds <5 GB of data (be aware!)
    • /home/user
  • For genomes and databases
    • /databases
  • Location of installed programs
    • /usr/local/bin
    • /opt/
  • The location to store your data
    • /data/
    • /data/user
      • You can create your own personal directory if you'd like (see below for commands)
  • The location to place scripts and data ONLY while you are working with it
    • /work/user
  • whoami --- returns your username


Rules

  • Developing a pipeline:
    • copy a small but representative part of your data to sirius
    • run all the programs you need on them
    • debug and save final version of pipeline e.g. in a text file
    • copy all your data
    • run your pipeline on all data
    • debug and update pipeline
    • mv results wherever you want
    • erase data
  • Never start more jobs than the number of available cores (e.g. if 50 jobs are already running, do NOT submit more than 14, for a total of 64 jobs)!!
  • Look at the memory and cpu usage before you start to load sirius with commands (cmd)
    • htop --- use to view real-time CPU usage
    • top --- displays the top CPU processes/jobs and provides an ongoing look at processor activity in real time. It displays a listing of the most CPU-intensive tasks on the system, and can provide an interactive interface for manipulating processes. It can sort the tasks by CPU usage, memory usage and runtime.
  • If you don't know something, use the manual
    • man ls --- to look up the functionality of the ls tool; otherwise use Google, or ask the admins (Jonathan or Ron) or someone in the lab (Rena or Pedro)
  • mpstat --- to display the utilization of each CPU individually. It reports processor-related statistics
  • mpstat -P ALL --- displays activity for each available processor, processor 0 being the first one. Global average activity across all processors is also reported
  • sar --- displays the contents of selected cumulative activity counters in the operating system
  • ps -u yourusername --- lists your processes
  • kill PID --- kills (ends) the process with that process ID
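
The core-count rule above comes down to simple arithmetic that can be checked before submitting anything. The following is only a sketch: the function name free_slots is hypothetical (it is not installed on the server), and the numbers are the 64 cores and 50 running jobs from the example in the rule.

```shell
#!/usr/bin/env bash
# free_slots TOTAL_CORES RUNNING_JOBS
# Prints how many more jobs can be started without exceeding the core count.
# (Hypothetical helper for illustration only.)
free_slots() {
    local total=$1 running=$2
    local free=$(( total - running ))
    # Never report a negative number of slots.
    if (( free < 0 )); then free=0; fi
    echo "$free"
}

# Example from the rule above: 64 cores, 50 jobs already running.
free_slots 64 50    # prints 14
```

On a live machine the first argument could come from nproc (which prints the number of available cores), and the second from what htop or top shows at that moment.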


Installing programs yourself

  • Check if it's already installed
  • mkdir ~/bin --- to create a bin directory in your home folder
  • cat ~/.bash_profile --- check whether ~/bin is already in your path; if it is not, add the next two lines to that file
  • PATH=$PATH:$HOME/bin
  • export PATH
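  • compile it with an install prefix chosen so that the executables land in ~/bin --- for configure-based builds, --prefix=$HOME typically installs programs to $HOME/bin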
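
The PATH step above can be made safe to repeat. This is a minimal sketch, assuming a bash login shell; the helper name add_to_path is hypothetical and simply wraps the PATH=$PATH:$HOME/bin and export PATH lines listed above so that the directory is not appended twice.

```shell
#!/usr/bin/env bash
# add_to_path DIR
# Appends DIR to PATH only if it is not already listed, so sourcing the
# profile repeatedly does not keep growing PATH. (Hypothetical helper name.)
add_to_path() {
    local dir=$1
    case ":$PATH:" in
        *":$dir:"*) ;;                 # already present: do nothing
        *) PATH="$PATH:$dir" ;;        # otherwise append to the end
    esac
    export PATH
}

mkdir -p "$HOME/bin"       # create the personal bin directory if it is missing
add_to_path "$HOME/bin"
add_to_path "$HOME/bin"    # second call changes nothing
echo "$PATH"
```

Appending (rather than prepending) means system-wide programs of the same name still take precedence over the ones in ~/bin.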


Data transfer (network)

  • scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 -- Command Line Interface (CLI) for moving files
  • scp -r user@host_source:path/to/dir user@host_dest:/dest/path -- Command Line Interface (CLI) for moving directories
  • FileZilla, Cyberduck, Fugu, etc..... -- Graphical User Interface (GUI)
  • df -h --- check used and free disk space on each mounted file system
  • du -hs /path --- check the total disk space used by a directory
  • du -h --max-depth=1 /path --- check the disk space used by each subdirectory, one level deep
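
As a rough illustration of the du commands above, the sketch below builds a throwaway directory tree and reports which subdirectory is largest. The directory and file names (raw, results, reads.dat, summary.dat) are made up for the example, and --max-depth assumes GNU du as found on Linux.

```shell
#!/usr/bin/env bash
# Build a small throwaway directory tree so the example is self-contained.
demo=$(mktemp -d)
mkdir -p "$demo/raw" "$demo/results"
head -c 200000 /dev/urandom > "$demo/raw/reads.dat"        # ~200 kB placeholder
head -c 1000   /dev/urandom > "$demo/results/summary.dat"  # ~1 kB placeholder

# Size of each first-level subdirectory in kB, largest first.
# (--max-depth is the GNU du spelling; BSD/macOS du uses -d 1.)
du -k --max-depth=1 "$demo" | sort -rn

# The first line is the total for $demo itself, so the largest
# subdirectory is on the second line; the path is the second tab field.
largest=$(du -k --max-depth=1 "$demo" | sort -rn | sed -n '2p' | cut -f2)
echo "largest: $largest"

rm -r "$demo"    # clean up the throwaway tree
```

The same pipeline pointed at /data/user or /work/user is a quick way to see what is taking up space before transferring more data.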


Files

  • ls --- lists your files
  • ls -l --- lists your files in long format
  • ls -a --- shows hidden files
  • ls -t --- sorted by time modified instead of name
  • more filename --- shows first part of a file; hit space bar to see more
  • head filename --- print to screen the first 10 lines of the specified file (use -n to change the number)
  • tail filename --- print to screen the last 10 lines of the specified file (use -n to change the number)
  • emacs filename --- an editor for editing a file
  • cp filename1 filename2 --- copies a file in your current location
  • cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
  • rm filename --- permanently remove a file (Caution! This cannot be undone!)
  • diff filename1 filename2 --- compares files and shows where they differ
  • wc filename --- tells you how many lines (newline delimited), words (whitespace delimited), and bytes are in a file
  • wc -l filename --- tells you how many lines are in a file (newline delimited)
  • wc -w filename --- tells you how many words are in a file
  • wc -c filename --- tells you how many bytes are in a file (use wc -m to count characters)
  • chmod options filename --- change the read, write, and execute permissions for a file (Google this!)
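
A short session tying several of the file commands above together might look like the following. It is only a sketch that works on a temporary file, so nothing of yours is touched; the choice of chmod u+x is just one example of a permission change.

```shell
#!/usr/bin/env bash
# Create a 12-line throwaway file to experiment on.
f=$(mktemp)
seq 1 12 > "$f"

head -n 3 "$f"              # first three lines: 1 2 3
tail -n 2 "$f"              # last two lines: 11 12
n_lines=$(wc -l < "$f")     # reading from stdin gives the bare count
echo "lines: $n_lines"

cp "$f" "$f.copy"           # copy, then confirm the two files are identical
diff "$f" "$f.copy" && echo "files are identical"

chmod u+x "$f.copy"         # add execute permission for the owner only
ls -l "$f.copy"             # long format shows the new permission bits

rm "$f" "$f.copy"           # permanent: there is no undo
```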


File compression

  • gzip filename --- compresses a file, replacing it with a file that has a .gz extension
  • gzip -c filename >filename.gz --- compresses the file into filename.gz and keeps the original; -c writes to standard output and the ">" redirects that output to filename.gz
  • gunzip filename --- uncompresses a gzip file
  • tar -xzf filename.tar.gz --- decompresses and unpacks a tar.gz archive
  • gzcat filename --- lets you look at a gzipped file without having to gunzip it (on most Linux systems the command is zcat)
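
The difference between a plain .gz file and a .tar.gz archive is easiest to see by making both. This is a minimal sketch on made-up data (the file name seqs.txt is hypothetical); it uses gzip -dc, which does the same job as zcat or gzcat and is available wherever gzip is.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)
printf 'ACGT\nTTGA\n' > "$work/seqs.txt"

# -c writes the compressed stream to stdout, so the original file is kept.
gzip -c "$work/seqs.txt" > "$work/seqs.txt.gz"

# Read the compressed file without unpacking it on disk.
roundtrip=$(gzip -dc "$work/seqs.txt.gz")
echo "$roundtrip"

# A tar.gz is different: tar bundles files into one archive (-c),
# and -z compresses that archive with gzip.
tar -czf "$work/bundle.tar.gz" -C "$work" seqs.txt
listing=$(tar -tzf "$work/bundle.tar.gz")   # list contents without extracting
echo "$listing"

rm -r "$work"
```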


Directories

  • pwd --- prints working directory (your current location)
  • cd /path/to/desired/location --- change directories by providing path
  • cd ../ --- go up one directory
  • mkdir directoryName --- make a new directory
  • rmdir directoryName --- remove directory (must be empty). Remember that you cannot undo this!
  • rm -r directoryName --- recursively remove a directory and the files it contains. Remember that you cannot undo this!
  • rm filename --- remove the specified file. Remember that you cannot undo this!
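
The directory commands above can be tried safely inside a temporary playground. The sketch below assumes nothing about the server layout; the names project and notes.txt are made up, and it shows that rmdir refuses a non-empty directory while rm -r does not.

```shell
#!/usr/bin/env bash
start=$PWD                        # remember where we began
base=$(mktemp -d)                 # throwaway playground
mkdir "$base/project"             # make a new directory
cd "$base/project"
pwd                               # prints the current location
touch notes.txt                   # put one file inside
cd ../                            # go up one directory

# rmdir only works on empty directories, so this attempt is refused.
rmdir "$base/project" 2>/dev/null || echo "rmdir refused: directory not empty"

rm -r "$base/project"             # removes the directory and its contents
cd "$start"                       # step back out before deleting the playground
rm -r "$base"
```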


Finding things

  • whereis command --- lists the locations of the binary, source, and manual page files for command
  • ff --- finds files anywhere on the system
  • ff -p --- finds a file by just typing in the beginning of the file name
  • grep string filename(s) --- looks for strings in the files (use man grep for more information)
  • ~/path --- the tilde is a shortcut for the path to your home directory
  • nohup commands & --- to initiate a no-hangup background job (writes stdout to nohup.out)
  • screen --- to start a new screen session in which to run a long job (ctrl+a then d to detach; screen -ls to list running screens; screen -r pid to reattach)
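
As an illustration of grep from the list above, the sketch below counts and locates lines in a small made-up file. The FASTA-style content (headers starting with ">") is an assumption chosen for the example, not something prescribed by this page.

```shell
#!/usr/bin/env bash
fa=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$fa"

n_seqs=$(grep -c '^>' "$fa")     # -c counts lines that start with ">"
echo "sequences: $n_seqs"

grep -n 'GGCC' "$fa"             # -n shows each matching line with its line number

rm "$fa"
```

Quoting the pattern matters here: an unquoted > would be taken by the shell as an output redirection.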


Data editing

  • vim filename --- to edit the file


History

  • ctrl+r --- searching history
  • history --- display history
  • !cmd_num --- re-run the command with that number in the history list
  • Arrow up is a shortcut to scroll through recently used commands




High throughput (HT) platform and read types

  • ABI-SOLiD
  • Illumina single-end vs. paired-end
  • Ion Torrent
  • MiSeq
  • Roche-454
  • Solexa


CBI Collaboratory

UCLA's Computational Biosciences Institute Collaboratory hosts a variety of 3-day workshops, ranging from a general introduction to genome and bioinformatic sciences to more advanced, focused workshops (e.g. ChIP-Seq, BS-Seq, exome sequencing). The CBI Collaboratory focuses on a set of publicly available resources: the web-based bioinformatic tool Galaxy/UCLA (a resource for HT workflows and a central location for a variety of HT tools for multiple platforms and data types) as well as tools such as R and Matlab. The introductory workshops do not require any programming experience, and the Collaboratory Fellows additionally serve as a counseling resource for data analysis.


File formats and conversions

  • bcl
  • qseq
  • fastq
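
For orientation on the fastq format listed above, the sketch below writes a two-read file and counts its records. It assumes the common layout of exactly four lines per record (header, sequence, "+" separator, quality string); the read names and sequences are made up.

```shell
#!/usr/bin/env bash
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGACCAA
+
IIII####
EOF

# Each record is assumed to span exactly four lines, so the number of
# reads is the total line count divided by four.
n_reads=$(awk 'END { print NR / 4 }' "$fq")
echo "reads: $n_reads"

# Print only the sequence lines (line 2 of every 4-line record).
awk 'NR % 4 == 2' "$fq"

rm "$fq"
```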


Deplexing using barcoded sequence tags

  • Edit (or Hamming) distance
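
To make the distance idea concrete: the Hamming distance counts mismatched positions between two strings of equal length, whereas a full edit distance would also allow insertions and deletions. The sketch below implements only the Hamming case and uses it to assign a read to a barcode with at most one mismatch. The barcodes, the read, the threshold of one mismatch, and the function name are all made up for illustration.

```shell
#!/usr/bin/env bash
# hamming A B: number of positions at which two equal-length strings differ.
hamming() {
    local a=$1 b=$2 i d=0
    if (( ${#a} != ${#b} )); then
        echo "hamming: strings must have equal length" >&2
        return 1
    fi
    for (( i = 0; i < ${#a}; i++ )); do
        if [[ ${a:i:1} != "${b:i:1}" ]]; then
            d=$(( d + 1 ))
        fi
    done
    echo "$d"
}

# Assign a read to the first barcode within one mismatch of its prefix.
barcodes=(ACGTAC TGCATG)          # hypothetical barcode set
read_seq=ACGAACGGGTTTAAA          # hypothetical read; its prefix has one error
assigned=none
for bc in "${barcodes[@]}"; do
    prefix=${read_seq:0:${#bc}}   # leading bases, same length as the barcode
    if (( $(hamming "$prefix" "$bc") <= 1 )); then
        assigned=$bc
        break
    fi
done
echo "assigned to: $assigned"
```

A tolerance of one mismatch is only safe when every pair of barcodes in the set differs by at least three positions; otherwise a read with one error could match two barcodes equally well.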


Quality control

  • Fastx tools
  • Using mapping as the quality control for reads



Trimming and clipping

  • Trim based on low quality scores per nucleotide position within a read
  • Clip sequence artefacts (e.g. adapters, primers)
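
The two operations above can be sketched in a few lines of bash. This is a toy illustration, not a replacement for dedicated tools: it assumes Phred+33 quality encoding, exact adapter matches only, and made-up sequences; the function names clip_adapter and trim_3prime are hypothetical.

```shell
#!/usr/bin/env bash
# clip_adapter READ ADAPTER: drop the first exact occurrence of the adapter
# and everything after it. Reads without the adapter are returned unchanged.
clip_adapter() {
    local read_seq=$1 adapter=$2
    echo "${read_seq%%"$adapter"*}"
}

# trim_3prime SEQ QUAL MINQ: remove bases from the 3' end while their
# quality is below MINQ. Assumes Phred+33 encoded quality characters.
trim_3prime() {
    local seq=$1 qual=$2 minq=$3 len=${#1} q
    while (( len > 0 )); do
        # printf with a leading quote yields the character's ASCII code.
        q=$(( $(printf '%d' "'${qual:len-1:1}") - 33 ))
        if (( q >= minq )); then break; fi
        len=$(( len - 1 ))
    done
    echo "${seq:0:len}"
}

clip_adapter ACGTACGTAGATCGGAAG AGATCGGAAG    # prints ACGTACGT
trim_3prime  TTGACCAA 'IIII####' 20           # prints TTGA
```

In the second call, "I" encodes quality 40 and "#" encodes quality 2 under Phred+33, so the last four bases fall below the threshold of 20 and are removed.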



FASTQC and FASTX tools


BED and SAM tools


GATK variant calling


R basics



Python basics


HT sequence analysis using R (and Bioconductor)


DNA sequence analysis


RNA-seq analysis

Common objectives of transcriptome analysis:


SOLiD software tools