Open writing projects/Python all a scientist needs: Difference between revisions

From OpenWetWare
This is a paper/presentation for [http://us.pycon.org/2008/ PyCon 2008] that I am writing on OWW.  The paper is about how I used Python and its libraries and extensions as a complete scientific programming package for a [http://arxiv.org/abs/0708.2038 recent comparative genomics study]. I am experimenting with writing the paper on OWW and eventually submitting it to the [http://arxiv.org arXiv].  If anyone is interested in the topic, or the process, please [[Special:Emailuser/Julius_B._Lucks|email me through OWW.]]
== Summary ==
Any cutting-edge scientific research project requires a myriad of computational tools for data generation, analysis and visualization. Python is a flexible and extensible scientific programming platform that offered the perfect solution in our recent comparative genomics investigation (http://arxiv.org/abs/0708.2038). In this talk, I discuss the challenges of this project, and how the combined power of Biopython (http://biopython.org), SWIG (http://www.swig.org) and Matplotlib (http://matplotlib.sourceforge.net) overcame them. I finish by discussing how Python promotes good scientific practice, and why its use should be encouraged within the scientific community.


== Outline ==
=== The Scientist's Dilemma ===
* A typical research project requires a variety of computational tasks
** Data generation
** Data analysis
** Data visualization
* The most common solution is to use separate tools for each task
** Data generation in C
** Data analysis in proprietary software
** Data visualization in a separate graphing package
* This is an inadequate solution
** These tools can't be pipelined easily
*** Many manual steps have to be repeated if something changes
** Data provenance is poor at best
*** Unclear whether an error is due to a program bug or human error
*** Can only repeat analysis by following written steps in a lab notebook
*** Steps are easily forgotten and hard to pass on
* Python overcomes these weaknesses
=== Comparative Genomics Case Study ===
* Brief Project Description - Compare DNA sequences of viruses
** Download and parse the genome files of many viruses
** Store the genome in a project-specific genome class
** Draw random genomes to compare to the 'real' genome
** Visualize the genomic data in a 'genome landscape' plot
=== Biopython ===
* Overview of Biopython
** A suite of bioinformatics tools for tasks such as parsing bio-database files, computing alignments between biological sequences, and interacting with bio web-services
* Use of Biopython in this project
** Parsing GenBank files from the National Center for Biotechnology Information
** example code
* Benefits of using Biopython
** parsing code can be wrapped in custom classes that make sense for the particular project
=== Matplotlib ===
* Overview of Matplotlib
** MATLAB-like graphical environment
* Use of Matplotlib in this project
** generating genome landscapes
** example code
* Benefits of using Matplotlib
** graphics code resides alongside data generation code
** quick troubleshooting
** can easily regenerate complicated plots by tweaking the code
=== SWIG ===
* Overview of SWIG
** allows you to speed up selected parts of an application by writing them in another language (C, C++) and wrapping them for Python
* Use of SWIG in this project
** speed up of the random genome drawing routine
** example code
* Benefits of using SWIG
** get all the benefits of Python, with compiled-code speed for the critical parts
** sped-up parts are used in the exact same context - no need for hand-written glue code
** can leverage, within Python, the experience in other languages that scientists typically have
=== Conclusions ===
* Practical Conclusions
** community modules are useful for a variety of scientific tasks
** Python can easily be used by more scientists
* Bigger-picture conclusions for good scientific practice
** code readability and package structure promote good scientific practice
** Python and its modules provide a consistent framework to promote data provenance
** can plug into other community tools and practices to help science - e.g. unit testing


== References/Resources ==

Revision as of 18:14, 6 February 2008

This is a paper/presentation for PyCon 2008 that I am writing on OWW. The paper is about how I used Python and its libraries and extensions as a complete scientific programming package for a recent comparative genomics study. I am experimenting with writing the paper on OWW and eventually submitting it to the arXiv. If anyone is interested in the topic, or the process, please email me through OWW.

Summary

Any cutting-edge scientific research project requires a myriad of computational tools for data generation, analysis and visualization. Python is a flexible and extensible scientific programming platform that offered the perfect solution in our recent comparative genomics investigation (http://arxiv.org/abs/0708.2038). In this talk, I discuss the challenges of this project, and how the combined power of Biopython (http://biopython.org), SWIG (http://www.swig.org) and Matplotlib (http://matplotlib.sourceforge.net) overcame them. I finish by discussing how Python promotes good scientific practice, and why its use should be encouraged within the scientific community.

Outline

The Scientist's Dilemma

  • A typical research project requires a variety of computational tasks
    • Data generation
    • Data analysis
    • Data visualization
  • The most common solution is to use separate tools for each task
    • Data generation in C
    • Data analysis in proprietary software
    • Data visualization in a separate graphing package
  • This is an inadequate solution
    • These tools can't be pipelined easily
      • Many manual steps have to be repeated if something changes
    • Data provenance is poor at best
      • Unclear whether an error is due to a program bug or human error
      • Can only repeat analysis by following written steps in a lab notebook
      • Steps are easily forgotten and hard to pass on
  • Python overcomes these weaknesses

Comparative Genomics Case Study

  • Brief Project Description - Compare DNA sequences of viruses
    • Download and parse the genome files of many viruses
    • Store the genome in a project-specific genome class
    • Draw random genomes to compare to the 'real' genome
    • Visualize the genomic data in a 'genome landscape' plot

Biopython

  • Overview of Biopython
    • A suite of bioinformatics tools for tasks such as parsing bio-database files, computing alignments between biological sequences, and interacting with bio web-services
  • Use of Biopython in this project
    • Parsing GenBank files from the National Center for Biotechnology Information
    • example code
  • Benefits of using Biopython
    • parsing code can be wrapped in custom classes that make sense for the particular project
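
The "example code" item above is still a placeholder in this revision. As a rough illustration only, GenBank parsing wrapped in a project-specific class might look something like the sketch below. It assumes Biopython's `Bio.SeqIO` interface; the `Genome` class, its fields, and the `load_genome` helper are hypothetical names chosen for this sketch, not the code used in the study.

```python
from dataclasses import dataclass


@dataclass
class Genome:
    """Hypothetical project-specific container for one parsed genome."""
    accession: str
    description: str
    sequence: str
    cds_count: int

    @classmethod
    def from_record(cls, record):
        # `record` is expected to look like a Biopython SeqRecord:
        # it has .id, .description, .seq and a list of .features.
        cds = [f for f in record.features if f.type == "CDS"]
        return cls(
            accession=record.id,
            description=record.description,
            sequence=str(record.seq).upper(),
            cds_count=len(cds),
        )

    def gc_fraction(self):
        """Fraction of G and C bases in the sequence (0.0 if empty)."""
        if not self.sequence:
            return 0.0
        gc = sum(1 for base in self.sequence if base in "GC")
        return gc / len(self.sequence)


def load_genome(path):
    """Parse a single-record GenBank file into a Genome."""
    # Imported here so the class above can be used without Biopython.
    from Bio import SeqIO
    record = SeqIO.read(path, "genbank")
    return Genome.from_record(record)
```

With a GenBank file downloaded from NCBI (the file name here is made up), usage would be along the lines of `genome = load_genome("virus.gb")`. Keeping the parsing behind one class means the rest of the analysis never touches the file format directly.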

Matplotlib

  • Overview of Matplotlib
    • MATLAB-like graphical environment
  • Use of Matplotlib in this project
    • generating genome landscapes
    • example code
  • Benefits of using Matplotlib
    • graphics code resides alongside data generation code
    • quick troubleshooting
    • can easily regenerate complicated plots by tweaking the code
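
The outline does not define a "genome landscape", so the sketch below makes an assumption purely for illustration: the landscape is a cumulative walk along the sequence that steps up for G or C and down for A or T. The function names `gc_landscape` and `plot_landscape` are hypothetical. The point is only that the plotting code sits next to the code that generates the data.

```python
from itertools import accumulate


def gc_landscape(sequence):
    """Cumulative walk: +1 for G or C, -1 for A or T, 0 for anything else.

    This scoring rule is an illustrative assumption, not the definition
    used in the study.
    """
    steps = []
    for base in sequence.upper():
        if base in "GC":
            steps.append(1)
        elif base in "AT":
            steps.append(-1)
        else:
            steps.append(0)
    return list(accumulate(steps))


def plot_landscape(sequence, out_path):
    """Save a line plot of the landscape to `out_path`."""
    # Imported here so gc_landscape() works without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    heights = gc_landscape(sequence)
    fig, ax = plt.subplots()
    ax.plot(range(1, len(heights) + 1), heights)
    ax.set_xlabel("Position (bases)")
    ax.set_ylabel("Cumulative score")
    ax.set_title("Genome landscape (illustrative)")
    fig.savefig(out_path)
    plt.close(fig)
```

Because the figure is produced by a function call, changing the scoring rule or the axis labels and regenerating the plot is a one-line edit followed by a rerun.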

SWIG

  • Overview of SWIG
    • allows you to speed up selected parts of an application by writing them in another language (C, C++) and wrapping them for Python
  • Use of SWIG in this project
    • speed up of the random genome drawing routine
    • example code
  • Benefits of using SWIG
    • get all the benefits of Python, with compiled-code speed for the critical parts
    • sped-up parts are used in the exact same context - no need for hand-written glue code
    • can leverage, within Python, the experience in other languages that scientists typically have
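
A full SWIG example needs C source and an interface file, which this outline does not yet include. The sketch below shows only the Python side of the idea: a compiled routine, if one is available, replaces a pure-Python one under the same name, so the calling code does not change. The module name `fastgenome` is hypothetical, and "drawing a random genome" is assumed here, for illustration only, to mean shuffling the bases of the real sequence.

```python
import random

try:
    # Hypothetical SWIG-generated extension module; the name is an assumption.
    from fastgenome import draw_random_genome
except ImportError:
    def draw_random_genome(sequence, seed=None):
        """Pure-Python fallback: return a shuffled copy of `sequence`."""
        rng = random.Random(seed)
        bases = list(sequence)
        rng.shuffle(bases)
        return "".join(bases)


def draw_many(sequence, count, seed=0):
    """Draw `count` random genomes, each with its own reproducible seed."""
    return [draw_random_genome(sequence, seed + i) for i in range(count)]
```

The caller, `draw_many`, is identical whether the fast or the slow version was imported, which is the "exact same context" benefit listed above. Passing explicit seeds also keeps each draw repeatable.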

Conclusions

  • Practical Conclusions
    • community modules are useful for a variety of scientific tasks
    • Python can easily be used by more scientists
  • Bigger-picture conclusions for good scientific practice
    • code readability and package structure promote good scientific practice
    • Python and its modules provide a consistent framework to promote data provenance
    • can plug into other community tools and practices to help science - e.g. unit testing


References/Resources