Wayne:High Throughput Sequencing Resources
Revision as of 16:20, 19 February 2013
Sirius Usage (our lab server)
Sirius is our analytical powerhouse (64 cores, well suited to parallel computing; 512 GB of memory; 64-bit x86_64 system), and we have specific locations on the server for specific jobs. It is stored in a lovely server closet, so the way to access it is through a secure shell (ssh). Obtain your username and password from our IT staff. Once you have logged on, there is a set of commands and some "server etiquette" you will need to follow.
Login
- ssh user@sirius.eeb.ucla.edu
- slogin user@sirius.eeb.ucla.edu
- to learn about the server:
- uname -a
Structure and organization
- Your home (user) directory holds < 5 GB of data (be aware!)
- /home/user
- For genomes and databases
- /databases
- Location of installed programs
- /usr/local/bin
- /opt/
- The location to store your data
- /data/
- /data/user
- You can create your own personal directory if you'd like (see below for commands)
- The location to place scripts
- /work/user
Rules
- Developing a pipeline:
- copy a small but representative part of your data to sirius
- run all the programs you need on them
- debug and save final version of pipeline e.g. in a text file
- copy all your data
- run your pipeline on all data
- debug and update pipeline
- mv results wherever you want
- erase data
- Never start more jobs than the number of available cores
- Check memory and CPU usage before you load Sirius with commands (cmd)
- To view real-time CPU usage, use:
- htop
- top
- If you don't know something, use the manual
- man ls
- Ask admins (Jonathan or Ron) or in-lab (Rena or Pedro)
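The "never start more jobs than available cores" rule can be checked before launching anything. The sketch below is only an illustration: `free_cores` is a hypothetical helper name, and it treats the 1-minute load average as a rough count of busy cores, which is an approximation rather than an exact measure.

```shell
# free_cores TOTAL LOAD
# Prints TOTAL minus the load average rounded up (never below 0).
# Assumption: the load average roughly equals the number of busy cores.
free_cores() {
    awk -v total="$1" -v load="$2" 'BEGIN {
        busy = int(load)
        if (busy < load) busy++          # round the load average up
        free = total - busy
        if (free < 0) free = 0
        print free
    }'
}

# On the server (nproc is GNU coreutils; /proc/loadavg is Linux-specific):
# free_cores "$(nproc)" "$(cut -d" " -f1 /proc/loadavg)"
```

If the number printed is smaller than the number of jobs you plan to start, wait or start fewer jobs.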
Installing programs yourself
- Check if it's already installed
- Create a dir in your home folder
- mkdir ~/bin
- Put it in your path, or check whether it's already there
- cat ~/.bash_profile
- PATH=$PATH:$HOME/bin
- export PATH
- Install programs to bin
- compile it with prefix ~/bin
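A minimal sketch of the PATH step above, written so that ~/bin is only appended when it is missing. `path_has` is a hypothetical helper name, not an existing command; the lines outside the function could go in ~/.bash_profile.

```shell
# path_has DIR
# Succeeds if DIR is already one of the colon-separated entries in $PATH.
path_has() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Append ~/bin only when it is not there yet, then export.
path_has "$HOME/bin" || PATH="$PATH:$HOME/bin"
export PATH

# Many source packages built with GNU autotools accept --prefix; with
# --prefix="$HOME" their executables are installed into ~/bin.
# Not every program follows this convention, so check its README/INSTALL.
# ./configure --prefix="$HOME" && make && make install
```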
Data transfer (network)
- Command Line Interface (CLI)
- scp [options] user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2
- if you copy directories, use the option -r for recursive
- Graphical User Interface (GUI)
- FileZilla, Cyberduck, Fugu, etc.
- First check if there is enough space available for you to move data
- check disk usage
- df -h
- check disk space used by a dir
- du -hs /path
- du -h --max-depth=1 /path
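The space check and the copy can be combined into one step. This is a sketch under assumptions: `avail_kb` is a hypothetical helper, `./run1` and `/data/user` are placeholder paths, and the commented example is meant to be run from the machine that holds the data.

```shell
# avail_kb
# Reads `df -Pk` output on stdin and prints the "Available" column
# (in kilobytes) of the first filesystem line.
avail_kb() {
    awk 'NR == 2 { print $4 }'
}

# Example (placeholder paths; run from the machine holding the data):
# need=$(du -sk ./run1 | cut -f1)
# have=$(ssh user@sirius.eeb.ucla.edu df -Pk /data/user | avail_kb)
# [ "$have" -gt "$need" ] && scp -r ./run1 user@sirius.eeb.ucla.edu:/data/user/
```

The -P option asks df for the POSIX output format, which keeps each filesystem on a single line so the column position stays predictable.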
Data editing
- Small modifications to a file on the server
- vim filename
- vi
Basic server commands (for Sirius)
Here is a list of commonly used Linux commands:
Command | Usage |
ssh username@sirius.eeb.ucla.edu | Secure shell login to the Sirius server |
logout (or control+D) | Logout of the Sirius server |
pwd | Print working directory (your current location) |
ls | List (all contents of current location) |
ls options | ls -a (hidden files), ls -l (long/detailed list), ls -t (sorted by time modified instead of name) |
cd /give/path | Change directories |
cd .. | Go up one directory |
mkdir directoryName | Make a new directory |
rmdir directoryName | Remove directory (must be empty). Remember that you cannot undo this! |
rm -r directoryName | Recursively remove a directory and the files it contains. Remember that you cannot undo this! |
rm filename | Remove the specified file. Remember that you cannot undo this! |
head filename | Print the first 10 lines (by default) of the specified file to the screen |
tail filename | Print the last 10 lines (by default) of the specified file to the screen |
more filename | Send file contents or piped output to the screen one page at a time |
less filename | Like more, but also lets you scroll backward and search within the file |
wc filename | Print byte, word, and line counts |
wc [options] filename | -c (bytes); -l (lines); -w (words) delimited by whitespace or newline |
whereis command | Locate the binary, source, and manual page files for a command |
mv current/path destination/path | Move (akin to cut/paste); the file no longer exists in the current location. Also used to rename files |
cp current/path destination/path | Copy (akin to copy/paste); the original stays in the current path |
~/path | The tilde is a shortcut for the path to your home directory |
nohup commands & | To initiate a no-hangup background job |
screen | Start a new screen session, which keeps running after you disconnect (useful for long jobs) |
tar -xzf filename.tar.gz | Decompress and extract a tar.gz archive |
gzip -c filename >filename.gz | Compress a file into filename.gz, keeping the original; the ">" redirects the output to filename.gz |
Here is a list of commonly used Linux commands for checking CPU utilization:
Command | Usage |
top | Displays the top CPU processes/jobs and provides an ongoing look at processor activity in real time. It lists the most CPU-intensive tasks on the system and provides an interactive interface for manipulating processes. It can sort the tasks by CPU usage, memory usage and runtime. |
mpstat | Reports processor-related statistics; with no options, it shows the average across all CPUs. |
mpstat -P ALL | Displays activity for each available processor, processor 0 being the first one. The global average across all processors is also reported. |
sar | Displays the contents of selected cumulative activity counters in the operating system |
High throughput (HT) platform and read types
- ABI-SOLiD
- Illumina single-end vs. paired-end
- Ion Torrent
- MiSeq
- Roche-454
- Solexa
CBI Collaboratory
UCLA's
File formats and conversions
- bcl
- qseq
- fastq
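As one example of a format conversion, FASTQ can be reduced to FASTA with awk. This is a minimal sketch that assumes unwrapped FASTQ with exactly four lines per read; `fastq_to_fasta` is a hypothetical helper name, and dedicated tools are usually a safer choice for real data.

```shell
# fastq_to_fasta < in.fastq > out.fasta
# Assumes exactly 4 lines per read: @header, sequence, +, qualities.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # @header becomes >header
         NR % 4 == 2 { print }'                     # keep the sequence line
}
```

The quality line is discarded, so the conversion cannot be reversed.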
Deplexing using barcoded sequence tags
- Edit (or Hamming) distance
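To make the distance idea concrete, here is a small sketch of the Hamming distance between two barcode tags of equal length, counting mismatched positions only (insertions and deletions would need a full edit distance). `hamming` is a hypothetical helper name, not part of any deplexing tool.

```shell
# hamming TAG1 TAG2
# Prints the number of positions at which the two tags differ.
# Fails (exit status 1) if the tags have different lengths.
hamming() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) {
            print "tags differ in length" > "/dev/stderr"
            exit 1
        }
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        print d
    }'
}
```

A read would typically be assigned to the barcode with the smallest distance, provided that distance is below a chosen cutoff and the best match is unambiguous.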
Quality control
- Fastx tools
- Using mapping as the quality control for reads
Trimming and clipping
- Trim based on low quality scores per nucleotide position within a read
- Clip sequence artefacts (e.g. adapters, primers)
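Quality trimming can be illustrated with a simple 3'-end trimmer. This is a sketch under assumptions, not the method used by any particular tool: it assumes unwrapped four-line FASTQ with Phred+33 quality encoding, and `trim_3prime` is a hypothetical helper name.

```shell
# trim_3prime MINQ < in.fastq > out.fastq
# Removes bases from the 3' end of each read until the last base has
# quality >= MINQ. Reads that become empty are dropped.
# Assumptions: 4 lines per read, Phred+33 quality encoding.
trim_3prime() {
    awk -v minq="$1" '
    BEGIN { for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33 }
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 0 {
        keep = length($0)
        while (keep > 0 && phred[substr($0, keep, 1)] < minq) keep--
        if (keep > 0) {
            print header
            print substr(seq, 1, keep)
            print "+"
            print substr($0, 1, keep)
        }
    }'
}
```

For example, `trim_3prime 20 < reads.fastq > trimmed.fastq` (placeholder file names) would keep each read up to its last base with quality 20 or higher.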
FASTQC and FASTX tools
BED and SAM tools
GATK variant calling
R basics
HT sequence analysis using R (and Bioconductor)
DNA sequence analysis
RNA-seq analysis
Common objectives of transcriptome analysis:
- Quantifying and annotating aligned reads
- Normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG) (R packages):
- easyRNASeq (simplifies read counting per genome feature)
- DEXSeq (Inference of differential exon usage)
- baySeq (also see: segmentSeq)
- Genominator (Bullard et al. 2010)
- Detection of alternative splice junctions
SOLiD software tools