BioMicroCenter:Software: Difference between revisions

From OpenWetWare

Revision as of 08:32, 11 June 2016


A large amount of bioinformatics software is available at MIT. This page summarizes the packages we are asked about most often. The BioMicro Center collaborates with the Koch Institute Bioinformatics Computing Core and the MIT Libraries to support different packages.

Desktop Software

Desktop software is available from our Download Page. Access may be limited to MIT users only. Below is a list of the software available:

  • Agilent 2100 Expert: This software package controls the Agilent 2100 Bioanalyzer and analyzes its output, including microfluidic and electrophoretic assays for RNA, DNA and proteins, as well as two-color flow cytometry. It can be installed on your desktop so that you can do additional analyses.
  • SSH: The software we recommend for UNIX access to rous and for downloading files from our servers.
  • Spotfire: A widely used data analysis and visualization tool. It can handle a number of clustering functions and statistical tests and has very robust graphical capabilities. The BioMicro Center operates a Spotfire server that is available to anyone at MIT. Licenses for Spotfire are available through the BioMicro Center on a yearly basis.
  • MATLAB: A mathematical programming language used for mathematical modeling, as well as for analyzing and visualizing data. Contact Stephen Goldman for access.
  • Tecan EvoWare Standard: This software is available as part of our robotics service. Identical to the software used on the Tecan EVO 150s, it includes a simulator that can be used to design your robotics experiments at your bench. Note that this software is on a different server.
  • MacVector: A comprehensive Macintosh application that provides sequence editing, primer design, internet database searching, protein analysis, sequence confirmation, multiple sequence alignment, phylogenetic reconstruction, coding region analysis, and a wide variety of other functions.
  • Lasergene v8.0: A software package that provides sequence assembly (including next-generation sequence analysis), simplified primer design, and expanded SNP reporting and management.

Galaxy

Front Page of the MIT Galaxy Site

Galaxy is a bioinformatics platform designed to bring complicated informatics tools to bench scientists. It lets you run analyses without installing or downloading anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much more. For many users, the public Galaxy instance at Penn State is a very robust tool.

To make things even easier, we have created a Galaxy server here at MIT. The Galaxy server acts as a separate head node for ROUS. Users are required to have data storage space on Rowley or BMC-PUB and may be required to purchase a queue on ROUS.

Additional Resources

Software from MIT Libraries

  • BioBASE: The BIOBASE Knowledge Library (BKL) contains comprehensive sets of protein databases such as HumanPSD, WormPD, GPGR-PD, PombePD, and MycopathPD in addition to analysis tools such as TRANSFAC, TRANSPATH, and ExPlain. BKL brings together curated data, analysis tools, and gene-centered information. BKL is one of the best ways to quickly assess a vast set of protein properties for a given protein or set of proteins.
  • GeneGO Metacore: GeneGo is a leading provider of data mining and analysis solutions in systems biology. MetaCore, GeneGo's flagship product, is an integrated software suite for functional analysis of experimental data. MetaCore is based on a curated database of human protein-protein and protein-DNA interactions, transcription factors, signaling and metabolic pathways, disease and toxicity, and the effects of bioactive molecules.
  • INGENUITY PATHWAY ANALYSIS: Software that helps researchers model, analyze, and understand complex biological and chemical systems relevant to their experimental data. Researchers can search the scientific literature and find insights most relevant to their experimental data; analyze and build pathway models related to their experimental data; and share and collaborate with colleagues. IPA is currently licensed through June 2012.

UNIX SERVER

A large amount of software is installed on our cluster server. Please see the ROUS page.

BMC-BCC Pipeline

The pipeline processes flowcell directories as they are generated by the Illumina sequencer software and postprocesses the output for use in downstream biological analyses. It is intended to be used by core facilities who own and/or operate Illumina sequencers for automation and consistency of processing Illumina data. The pipeline is a collection of command line utilities written primarily in the Python programming language. The commands are tied together using the ruffus pipelining package.

Release Notes 1.5.1 (06/12/2016)

  • Added support for the HT3DGE project
  • Added publishing of the infosite to the Filemaker database


Release Notes 1.5 (01/26/2016)

  • Added phiX percent perfect plot to calculate sequencing error rate for HiSeq and MiSeq

    The percent perfect plot created by the PPPQC script is designed to estimate the next-generation sequencing error rate. The calculation can be applied to paired-end or single-end sequencing on NextSeq, MiSeq, or HiSeq, depending on the specific sequencing run. The script compares the sequenced spike-in PhiX reads with the PhiX reference genome sequence. To avoid potential alignment issues with poor-quality reads, the script first aligns the first 30 base pairs of each read to identify PhiX reads and to distinguish forward reads from reverse reads. The full-length PhiX forward and reverse reads are then retrieved and compared to the reference sequence. The percentages of reads with zero, <=1, <=2, <=3, and <=4 mismatches are calculated and plotted at each nucleotide position. For NextSeq sequencing, the reads from each camera are processed separately. For HiSeq sequencing, the reads from each lane are processed separately. For paired-end reads, the reads from each mate are processed separately. Because Illumina sequencing has a very low indel error rate, reads with indels relative to the reference sequence are excluded from the current calculation.

  • Added CNV quality control plot for ChIP, ReSeq and CGHSeq sample types

    The CNV quality control plot created by the CNVQC script uses downsampled BAM files to plot DNA copy numbers along the reference genome. Both mappability and GC% are considered during normalization. Potential gains are marked in red and losses are marked in green. Currently it supports the hg19 and mm9 genomes.

  • Upgraded software tools including fastqc, bwa, samtools, and bedtools
    • fastqc upgraded from 0.11.2 to 0.11.4
    • bwa upgraded from 0.7.10 to 0.7.12
    • samtools upgraded from 0.1.19 to 1.3
    • bedtools upgraded from 2.20.1 to 2.25.0
  • Improved performance and robustness
    • Added a precheck flag -c to check the Filemaker database to avoid human error
    • Allowed recursive pulling of samples in a subpool when creating the sample JSON file
    • Improved the robustness of the sample JSON file when handling mixed barcodes
    • Added a second person to receive delivery email if specified
    • Enabled creating tarball of the flowcell directory after pipeline run ends
    • Simplified the process to create a new release
    • Reworked the code on publishing project data to avoid intermittent file system error
    • Added flowcell as part of SGE job name to easily identify pipeline runs in the cluster
    • Used 32 threads as default instead of 16 after new nodes were added to the rous cluster
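The release notes do not describe how the CNVQC script normalizes for mappability and GC%, so the following is only a minimal sketch of one common approach, not the script's actual method. It assumes per-bin read counts are already available, and the bin dictionary keys, the mappability cutoff, the GC stratum width, and the gain/loss threshold are all hypothetical.

```python
import math
from statistics import median


def gc_normalized_log2(bins, min_mappability=0.5, gc_step=0.05):
    """Return one log2 copy-number ratio per bin (None for masked bins).

    Each bin is a dict with hypothetical keys 'count', 'gc' (fraction 0-1)
    and 'mappability' (fraction 0-1). Counts are scaled by mappability and
    then divided by the median of all bins in the same GC stratum.
    """
    def usable(b):
        return b["mappability"] >= min_mappability and b["count"] > 0

    # Group mappability-corrected counts by GC stratum.
    strata = {}
    for b in bins:
        if usable(b):
            key = int(b["gc"] / gc_step)
            strata.setdefault(key, []).append(b["count"] / b["mappability"])
    medians = {key: median(values) for key, values in strata.items()}

    ratios = []
    for b in bins:
        if not usable(b):
            ratios.append(None)  # masked: too few uniquely mappable bases
            continue
        corrected = b["count"] / b["mappability"]
        ratios.append(math.log2(corrected / medians[int(b["gc"] / gc_step)]))
    return ratios


def classify(log2_ratio, threshold=0.3):
    """Label a bin as 'gain', 'loss', 'neutral' or 'masked' (assumed cutoff)."""
    if log2_ratio is None:
        return "masked"
    if log2_ratio >= threshold:
        return "gain"
    if log2_ratio <= -threshold:
        return "loss"
    return "neutral"
```

In a plot like the one described above, bins labelled "gain" would be drawn in red and bins labelled "loss" in green.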

Release Notes 1.4 (01/01/2015)

  • The quality scores of fastq files are now in Sanger format (previously the quality scores were in the Illumina 1.3+ format)
  • Added support for NextSeq.
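The quality score change in release 1.4 comes down to a fixed offset: the Illumina 1.3+ format encodes a Phred score as the ASCII character with code score + 64, while the Sanger format uses score + 33. A minimal sketch of the conversion follows; the function name is illustrative and is not taken from the pipeline.

```python
def illumina13_to_sanger(quality):
    """Convert a quality string from Illumina 1.3+ (Phred+64) to Sanger (Phred+33)."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode the Phred score from the Phred+64 character
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Illumina 1.3+ quality character")
        converted.append(chr(score + 33))  # re-encode with the Sanger offset
    return "".join(converted)
```

For example, the character h (Phred 40 in the Illumina 1.3+ format) becomes I in the Sanger format.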

Release Notes 1.3 (07/25/2014)

  • Paired-end quality control is added for samples aligned to genomes other than phiX. It summarizes basic mapping metrics from the BWA alignments to identify properly mapped reads and provides a distribution of insert lengths based on these mappings.
  • RNAseq quality control is added for RNASeq data for a list of genomes other than phiX. It checks the distribution of the reads, 5' to 3' bias, strand specificity and ribosomal RNA contamination. It also checks gene expression correlation between samples when applicable.
  • Software upgrade: BWA is upgraded to 0.7.10 and fastqc is upgraded to 0.11.2
  • Performance enhancement. It uses 16 threads as default instead of 8, which reduces the pipeline runtime significantly for a HiSeq run.
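The insert length distribution mentioned in the paired-end quality control item can be derived from the FLAG and TLEN columns of the SAM output. The sketch below is an illustration under that assumption, not the pipeline's own code. It counts each properly paired fragment once by keeping only the mate with a positive TLEN.

```python
from collections import Counter

PROPER_PAIR = 0x2  # SAM flag bit: each segment properly aligned


def insert_length_distribution(sam_lines):
    """Return a Counter mapping insert length to number of proper pairs."""
    histogram = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])  # TLEN, the observed template length
        # The leftmost mate carries the positive TLEN, so each pair counts once.
        if flag & PROPER_PAIR and tlen > 0:
            histogram[tlen] += 1
    return histogram
```

The function accepts any iterable of SAM text lines, such as an open file handle.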

Release Notes 1.2 (01/01/2014)

  • An information site about the pipeline run is delivered to MIT users
  • Sample data directory includes the flowcell code
  • Bug fix for pipeline re-run. When the pipeline was re-run, data could be duplicated in the fastq files. This is now fixed.
  • Performance enhancement. Data is written directly to the published directory for users, and copy is avoided whenever possible. This not only reduces disk storage, but also allows users to get their data faster.

Release Notes 1.0.2 (08/19/2013)

  • Switch from Bowtie to BWA for default alignments for generating SAM files.

    BWA version 0.7.5a is used by default for alignment. For Illumina reads up to 70 bp, the alignment is done with aln/samse/sampe (the BWA-backtrack algorithm). For reads longer than 70 bp, the mem subcommand (the BWA-MEM algorithm) is used.

  • Bug fix for large SAM/BAM files

    When processing large fastq files to generate a SAM file, the SAM file could be corrupted at the end under certain circumstances if it was larger than 40GB. As a result, the SAM-to-BAM conversion could fail with a core dump. This is now fixed.
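The read-length rule for choosing between the two BWA algorithms can be sketched as a small planning function. This is an illustration only: the function name, the output file names, and the return format are assumptions, and the real pipeline may pass additional options. Each step pairs a bwa command line with the file its standard output would be redirected to.

```python
def plan_bwa_alignment(reference, read_length, fastq1, fastq2=None, cutoff=70):
    """Return (algorithm, steps); each step is (argv, stdout_path)."""
    fastqs = [fastq1] if fastq2 is None else [fastq1, fastq2]

    # Reads longer than the cutoff go straight to BWA-MEM.
    if read_length > cutoff:
        return "BWA-MEM", [(["bwa", "mem", reference] + fastqs, "aligned.sam")]

    # Shorter reads use BWA-backtrack: one 'aln' per fastq file,
    # then 'samse' (single end) or 'sampe' (paired end) to produce SAM.
    steps = []
    sai_files = []
    for fastq in fastqs:
        sai = fastq + ".sai"
        steps.append((["bwa", "aln", reference, fastq], sai))
        sai_files.append(sai)
    pairing = "samse" if fastq2 is None else "sampe"
    steps.append((["bwa", pairing, reference] + sai_files + fastqs, "aligned.sam"))
    return "BWA-backtrack", steps
```

A caller could run each step with subprocess.run(argv, stdout=open(stdout_path, "w"), check=True).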

Release Notes 0.9 (10/18/2011)

Implemented all core functionality:

  • setting up and converting qseq files
  • qseq to fastq
  • fastqc and tag count statistics on flowcell-level sequences
  • splitting of barcoded samples into individual directories
  • individual fastqc
  • genome alignment using bowtie plus statistics
  • contamination qc checking
  • tag counts
  • conversion of alignments from SAM to BAM
  • production of bigWig files from SAM alignments
  • publishing user data to web directories
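As an illustration of the qseq to fastq step listed above, here is a minimal sketch of converting one qseq record. It assumes the standard 11-column, tab-separated qseq layout (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag). The read name layout and the choice to drop reads that fail the filter are assumptions, not details confirmed by the pipeline.

```python
def qseq_to_fastq(line, keep_failed=False):
    """Convert one qseq line to a four-line fastq record, or None if filtered out."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 qseq columns, found {len(fields)}")
    machine, run, lane, tile, x, y, index, read_no, seq, qual, passed = fields

    # The last column is 1 when the read passed the chastity filter.
    if passed != "1" and not keep_failed:
        return None

    # Assumed read name layout; qseq marks uncalled bases with '.'.
    name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read_no}"
    return f"@{name}\n{seq.replace('.', 'N')}\n+\n{qual}\n"
```

The quality string is copied unchanged, so the output keeps whatever encoding the qseq file used.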

PPR Program

Generates Percent Perfect Reads for MiSeq, NextSeq, and HiSeq data with a PhiX spike-in.
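The percent perfect calculation described in the 1.5 release notes can be sketched as follows. This is a minimal illustration, not the PPR program itself. It assumes the PhiX reads have already been identified, oriented, and paired with the matching stretch of the reference, and it reads "mismatches at each nucleotide position" as the cumulative mismatch count up to that position. Reads whose length differs from their reference segment stand in for the excluded indel reads.

```python
def percent_perfect(reads, references, max_mismatches=4):
    """Return {k: [percent of reads with <= k mismatches up to each position]}.

    reads[i] is compared base by base with references[i].
    """
    pairs = [(r, ref) for r, ref in zip(reads, references) if len(r) == len(ref)]
    if not pairs:
        return {}
    length = min(len(read) for read, _ in pairs)
    counts = {k: [0] * length for k in range(max_mismatches + 1)}

    for read, ref in pairs:
        mismatches = 0
        for pos in range(length):
            if read[pos] != ref[pos]:
                mismatches += 1
            # The read still counts toward every threshold k >= mismatches.
            for k in range(mismatches, max_mismatches + 1):
                counts[k][pos] += 1

    total = len(pairs)
    return {k: [100.0 * c / total for c in row] for k, row in counts.items()}
```

Plotting each list against position gives one curve per mismatch threshold, and the k = 0 curve is the percentage of perfect reads.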