Open writing projects/Python all a scientist needs

From OpenWetWare
Revision as of 18:21, 17 February 2008 by Julius B. Lucks (talk | contribs) (→‎MatPlotLib: rough in Figure 1)
Jump to navigationJump to search

This is a paper/presentation for Pycon 2008 that I am writing on OWW. The paper is about how I used python and its libraries and extensions as a complete scientific programming package for a recent comparitive genomics study. I am experimenting with writing the paper on OWW and eventually submitting it to the arXiv. If anyone is interested in the topic, or the process, please email me through OWW.

Abstract

Any cutting-edge scientific research project requires a myriad of computational tools for data generation, analysis and visualization. Python is a flexible and extensible scientific programming platform that offered the perfect solution in our recent comparative genomics investigation (http://arxiv.org/abs/0708.2038). In this talk, I discuss the challenges of this project, and how the combined power of Biopython (http://biopython.org), SWIG (http://www.swig.org) and Matplotlib (http://matplotlib.sourceforge.net) were a perfect solution. I finish by discussing how python promotes good scientific practice, and how its use should be encouraged within the scientific community.

The Scientists Dilemma

A typical scientific research project requires a variety of computational tasks to be performed. At the very heart of every investigation is the generation of data to test hypotheses. An experimental physicist might build instruments to collect light scattering data; a crystallographer will collect X-ray diffraction data; a biologist might collect flouresence intensity data for a variety of reporter genes or dna sequence data; and a computational researcher might write programs to collect simulation data. All of these types of data collection require computer programming to control instruments (or simulation code) to collect and manage data in an electronic format.

Once data is collected, the next task is to analyze it in the context of hypothesis-driven models that help scientists understand the phenomenon they are studying. In the case of light, or X-ray scattering data, there is a well-proven physical theory of light scattering that is used to process the data and calculate the observed structure function of the material being studied (reference). This structure function is then compared to predictions made by the hypotheses begin tested. In the case of biological reporter gene data, the light intensity data is matched up with genetic or DNA sequencing data, and statistically analyzed for trends. Large scale biomedical genetic studies might search for correlations between DNA sequences and cancer rates among patients.

What is clear is that the original raw data in each case is typically extensively processed by more computational programs. Visualization tools to create a variety of scientific plots are often a preferred tool for both troubleshooting ongoing experiments (laboratory or computational). Scientific plots and charts are often the end product of a scientific investigation in the form data-rich graphics that demonstrate the truth of a hypothesis compared to its alternatives (cite Tufte).

Unfortunately, too often scientists resort to a grab-bag of tools to perform these varied computational tasks. For physicists, C or FORTRAN is often used to generate simulation data, and C code is used to control experimental apparatus. Data analysis is performed in external software packages such as Matlab or Mathematica for equation solving (reference), or Stata, SPSS or R for statistical data (reference). Furthermore, separate data visualization packages can be used making the toolset extremely varied.

Such a hodge-podge of tools is an inadequate solution for a variety of reasons. From a computational perspective, most of these tools cannot be pipelined easily which necessitates many manual steps or excessive glue code that most scientists are not trained to write. Far more important than just an inconvenience associated with gluing these tools together is the extreme burden placed on the scientist in terms of data management. This in turn provides poor at best data provenance when it is in fact of utmost importance in scientific studies where data integrity is the foundation of every conclusion reached and every fact established. In such a complicated system there are often a plethora of different data files in several different formats residing at many different locations. Most tools do not produce adequate metadata for these files, and scientists typically fall back on cryptic file naming schemes to indicate what type of data the files contain and how it was generated. Such complications can easily lead to mistakes.

Furthermore, when data files are manually moved around from tool to tool, it is not clear if an error is due to program error, or human error in using the wrong file. Analyses can only be repeated by following work flows that have to be manually recorded in a paper or electronic lab notebook. This makes steps easily forgotten, and hard to pass on to future generations of scientists.

The Python programming language and associated community tools (reference) can help scientists overcome some of these problems by providing a general scientific programming platform that allows scientists to generate, analyze, visualize and manage their data within the same computational framework. Python can be used to generate simulation data, or control instrumentation to capture data. Data analysis can be accomplished in the same way, and there are graphics libraries that can produce scientific charts and graphs. Furthermore python code can be used to glue all of these python solutions together so that visualization code resides alongside the code that generates the data it is applied to. This allows streamlined generation of data which makes data management feasible. Most importantly such a uniform tool set allows the scientist to record the steps used in data work flows to be written down in python code allowing automatic provenance tracking.

In this paper, we outline a recent comparative genomics case study where python and associated libraries were used as a complete scientific programming platform. We introduce several specific python libraries, and how they were used to facilitate input of standardized biological data, generate comparative genomic data based on a desgned statistical test, and provide solutions to speed bottle-necks in the code. Throughout, we point to resources for further reading on these topics. We conclude with ideas about how python promotes good scientific programing practice, and ...

Comparative Genomics Case Study

  • Brief Project Description - Compare DNA sequences of viruses

Recently we performed a comparative genomics study of the genomic DNA sequences of the 74 sequenced bacteriophages that infect E. coli, P. aeruginosa, or L. lactis (reference). Bacteriophages are viruses that infect bacteria, and are known ...

Specifically, we examined the codon usage of the protein coding genes in these bacteria for any non-random patterns, and performed statistical tests to associate these patterns with certain aspects about the proteins


The computational requirements of this study included:

  • Downloading and parsing the genome files for viruses from GenBank in order to get the genomic DNA sequence, the gene regions and annotations: For each virus, we needed to obtain the genomic DNA sequence, the parts of the genome that code for genes, and the annotated function of these genes. Figure 1 displays this information for lambda phage, a well-studied bacteropphage that infects E. coli (reference), in GenBank format (reference), obtained from the National Center for Biotechnology Information (NCBI) (reference). Once these files were downlosaded and stored, they were parsed for the required information.
  • Storing the genomic information: The parsed information was stored in a custom genome python class which also included methods for detecting which genes overlapped each other and removing these overlaps.
  • Drawing random genomes to compare to the sequenced genome: For each genome, we drew random genomes according to the degeneracy rules of the genetic code so that each random genome would theoretically encode the same proteins as the sequenced genome. These genomes were then visually compared to the sequenced genome through 'genome landscape' plots defined below.
  • Visualize the comparisons through 'genome landscape' plots: Genome landscapes were computed for both the sequenced genome, and each drawn genome. The genome landscape of the sequenced genome was compared to the distribution of genome landscapes generated from the random genomes to detect regions of the genomes that have extremely non-random patterns in codon usage.
  • Statistically analyzing the non-random regions with annotation and host information: To understand the observed trends, we performed analysis of variance (ANOVA) (reference) analysis to detect correlations between protein function annotation or host lifestyle information with these regions.

Figure 1 Lambda phage GenBank file snippet. The full file can be found online - see (reference).

LOCUS       NC_001416              48502 bp    DNA     linear   PHG 28-NOV-2007
DEFINITION  Enterobacteria phage lambda, complete genome.
ACCESSION   NC_001416
VERSION     NC_001416.1  GI:9626243
PROJECT     GenomeProject:14204
KEYWORDS    .
SOURCE      Enterobacteria phage lambda
  ORGANISM  Enterobacteria phage lambda
            Viruses; dsDNA viruses, no RNA stage; Caudovirales; Siphoviridae;
            Lambda-like viruses.
REFERENCE   1  (sites)
  AUTHORS   Chen,C.Y. and Richardson,J.P.
  TITLE     Sequence elements essential for rho-dependent transcription
            termination at lambda tR1
  JOURNAL   J. Biol. Chem. 262 (23), 11292-11299 (1987)
   PUBMED   3038914
...
FEATURES             Location/Qualifiers
     source          1..48502
                     /organism="Enterobacteria phage lambda"
                     /mol_type="genomic DNA"
                     /specific_host="Escherichia coli"
                     /db_xref="taxon:10710"
     gene            191..736
                     /gene="nu1"
                     /locus_tag="lambdap01"
                     /db_xref="GeneID:2703523"
     CDS             191..736
                     /gene="nu1"
                     /locus_tag="lambdap01"
                     /codon_start=1
                     /transl_table=11
                     /product="DNA packaging protein"
                     /protein_id="NP_040580.1"
                     /db_xref="GI:9626244"
                     /db_xref="GeneID:2703523"
                     /translation="MEVNKKQLADIFGASIRTIQNWQEQGMPVLRGGGKGNEVLYDSA
                     AVIKWYAERDAEIENEKLRREVEELRQASEADLQPGTIEYERHRLTRAQADAQELKNA
                     RDSAEVVETAFCTFVLSRIAGEIASILDGLPLSVQRRFPELENRHVDFLKRDIIKAMN
                     KAAALDELIPGLLSEYIEQSG"
...
ORIGIN      
        1 gggcggcgac ctcgcgggtt ttcgctattt atgaaaattt tccggtttaa ggcgtttccg
       61 ttcttcttcg tcataactta atgtttttat ttaaaatacc ctctgaaaag aaaggaaacg
      121 acaggtgctg aaagcgaggc tttttggcct ctgtcgtttc ctttctctgt ttttgtccgt
      181 ggaatgaaca atggaagtca acaaaaagca gctggctgac attttcggtg cgagtatccg
      241 taccattcag aactggcagg aacagggaat gcccgttctg cgaggcggtg gcaagggtaa
      301 tgaggtgctt tatgactctg ccgccgtcat aaaatggtat gccgaaaggg atgctgaaat
      361 tgagaacgaa aagctgcgcc gggaggttga agaactgcgg caggccagcg aggcagatct
      421 ccagccagga actattgagt acgaacgcca tcgacttacg cgtgcgcagg ccgacgcaca
...

Python was used in every aspect of this computational work flow. Below we discuss in more detail how python was used in several of these areas specifically.

BioPython

Biopython is an open-source suite of bioinfomatics tools for the python language (http://biopython.org/wiki/Main_Page). The suite is comprehensive in scope, and offers python modules and routines to parse bio-database files, facilitate the computation of alignments between biological sequences (DNA and protein), interact with biological web-services such as xxx, examine protein crystallographic data to name a few.

In this project, Biopython was used both to download and parse genomic viral DNA sequence files from the NCBI Genbank database (reference) as outlined in Example 1.


Example 1: Downloading and parsing the GenBank genome file for lambda phage (refseq number NC_001416).

import biopython (1)
...

(1) Import the biopython module. ...


The benefits of using Biopython in this project are several including

  1. Not having to write or maintain this code yourself. This is an important point as the number of web-available databases and services come online. These often change rapidly, and require rigorous maintenance to keep up with tweaks to API's and formats - a monumental task that is completed by an international group of volunteers for the Biopython project.
  2. The Biopython parsing code can be wrapped in custom classes that make sense for a particular project. Example 2 illustrates the latter by outlining a custom genome class used in this project to store the location of coding sequences for genes (CDS_seq), and to remove overlapping sequence segments in genes.

Example 2: A custom Genome class which wraps the biopython parsing code outlined in Example 1.

import biopython (1)

class Genome(object): (2)
    def __init__:
...

(1) Import the biopython module. (2) Defining a custom Genome class ...


MatPlotLib

Matplotlib is a suite of open-source python modules that provide a framework for creating scientific plots similar to the Matlab graphical tools (reference). In this project, matplotlib was used to create genome landscape plots both to have a quick look at data as it was generated, and to produce publication quality figures.

In this project, matplotlib was used to quickly generate graphics to test raw simulation data as it was being generated, as is illustrated in Example 3.


Example 3: Sample matplotlib script that calculates and plots the zero-mean cumulative sum of the numbers listed in a single column of an input file.

import matplotlib # (1)

(1) Importing matplotlib for plotting, and fileinput for easy reading of files.


Matplotlib was also used to make custom graphics classes for creating publication-quality plots. Example 4 lists python code that decsribes a LandscapePlot class that combines the code in Example 3 to input a single list of numbers, and plot the zero-mean cumulative sum of the numbers. It also shows how the mean plus or minus the variance of a distribution of genome landscapes can be plotted. This example illustrates the object oriented interface to matplotlib plotting routines.


Example 4: A custom LandscapePlot class that plots zero-mean cumulative sums as well as the mean plus or minus the variance of a distribution of genome landscapes. An example of the plots generated by this code for the lambda phage genome under the randomization test preserving amino-acid content of the genome exactly (see Section xxx) is shown in Figure 1.

import matplotlib # (1)

class LandscapePlot(matplotlib.figure): # (2)
    __init__:

(1) Importing matplotlib for plotting. (2) Defining the LandscapePlot class, which is derived from the matplotlib figure class.



Figure 1: (put the lambda phage genome landscape with AA constraint test here, and refer to paper).


The benefits of using matplotlib in this project were several:

  1. The code that produced the scientific plots resided alongside the code that produced the underlying data that was used to produce the plots. The importance of this cannot be stressed enough as having the code structured in this way removed many opportunities for human error involved in manually shuffling raw data files into separate graphical programs. Moreover, the instructions for producing the plots from the underlying raw data was python code, which not only described these instructions, but could be executed to produce the plots. Imagine instead the often practiced use of spreadsheets to create plots from raw data - in these spreadsheets, formulas are hidden by the results of the calculations, and it is often very confusing to construct a picture of the computational flow used to produce a specific plot.
  2. Having the graphics instructions in code allowed for quick trouble shooting when creating the plots, or evaluating raw data as it was generated.
  3. Complicated plots were easily regenerated by tweaking the code for particular graphical plots.

SWIG

  • Overview of SWIG
    • allows you to speed up selected parts of an application by writing in another language (C,C++)
  • Use of SWIG in this project
    • speed up of the random genome drawing routine
    • example code
  • Benefits of using SWIG
    • get all the benefits of python with the speed for critical parts
    • sped up parts are used in the exact same context - no need for glue code
    • can leverage experience in other languages that scientists typically have, within python
  • code that draws numbers from a specified distribution

Conclusions

  • Practical Conclusions
    • community modules are useful for a variety of scientific tasks
    • python can easily be used by more scientists
  • Bigger picture conclusions for good scientific practice
    • code readability and package structure promotes good scientific practice
    • python and its modules provide a consistent framework to promote data provenance
    • can plug into other community tools and practices to help science - e.g. unit testing


References/Resources