Talk:Open writing projects/Python all a scientist needs: Difference between revisions

From OpenWetWare
(outline)
 
(→‎Software Carpentry: new section)
 
(8 intermediate revisions by 3 users not shown)
== Outline ==
== Please Leave A Comment ==


=== The Scientist's Dilemma ===
[[User:Adrian Del Maestro|Adrian Del Maestro]] 15:04, 24 February 2008 (EST): Very nice article on the scientific uses of Python.  Using Python to produce publication-quality plots via matplotlib has saved me hours of time as the scope and results of a project evolve.  I also enjoyed your comments on data provenance, which is a very important topic that many scientists doing numerics are somewhat cavalier about.
* A typical research project requires a variety of computational tasks
** Data generation
** Data analysis
** Data visualization
* The most common solution is to use separate tools for each task
** Data generation in C
** Data analysis in proprietary software
** Data visualization in a separate graphing package
* This is an inadequate solution
** These tools can't be pipelined easily
*** Many manual steps have to be repeated if something changes
** Data provenance is poor at best
*** Not sure if an error is due to a program or human error
*** Can only repeat analysis by following written steps in a lab notebook
*** Steps are easily forgotten and hard to pass on
* Python overcomes these weaknesses


=== Comparative Genomics Case Study ===
My usual approach to large numerical projects is to use Python as scripting glue for analysis, provenance and plotting of data produced by large-scale C++ codes.  After reading the article, I think that I will attempt to do all my prototyping, coding and profiling completely in Python, then use SWIG where appropriate for my next numerical project.
* Brief Project Description - Compare DNA sequences of viruses
** Download and parse the genome files of many viruses
** Store the genome in a project-specific genome class
** Draw random genomes to compare to the 'real' genome
** Visualize the genomic data in a 'genome landscape' plot


=== BioPython ===
One thing that might be helpful for scientists who are new to Python would be a more detailed discussion of the confusion surrounding the various array packages (e.g. numarray vs. numpy).
* Overview of BioPython
** A suite of bioinformatics tools for tasks such as parsing bio-database files, computing alignments between biological sequences, and interacting with bio web services
* Use of BioPython in this project
** Parsing GenBank files from the National Center for Biotechnology Information
** example code
* Benefits of using Biopython
** parsing code can be wrapped in custom classes that make sense for the particular project


=== MatPlotLib ===
Great work, keep the articles coming!
* Overview of MatPlotLib
** Matlab-like graphical environment
* Use of MatPlotLib in this project
** generating genome landscapes
** example code
* Benefits of using MatPlotLib
** graphics code resides alongside data generation code
** quick troubleshooting
** can easily regenerate complicated plots by tweaking the code


=== SWIG ===
* Overview of SWIG
** allows you to speed up selected parts of an application by writing them in another language (C, C++)
* Use of SWIG in this project
** speed up of the random genome drawing routine
** example code
* Benefits of using SWIG
** get all the benefits of python with the speed for critical parts
** sped up parts are used in the exact same context - no need for glue code
** can leverage experience in other languages that scientists typically have, within python


=== Conclusions ===
----
* Practical Conclusions
 
** community modules are useful for a variety of scientific tasks
Joao Xavier: I read your article quickly and I liked it very much. Here are a few comments. It's great that you wrote up your experience with the genomic project. I certainly relate to the scientist who struggles with a number of programming tools for each project (Java, Matlab and Python scripts). Although this works well for me, it's a nightmare when I have to pass it on to somebody else. Not to mention that it is even embarrassing trying to explain all that "just do it" type of code that I use to glue together the many steps for processing large data sets. As you say, Python could be the solution for this. My main problem with Python is that, being completely open, there are many tools available to perform the same task. This can be overwhelming for someone starting out, like me. It would be great if you could say in your paper how you coped with this. For example, I'm sure you tried other libraries before deciding to use matplotlib or the numeric packages you use. What do you suggest people do to avoid having to try many packages themselves? Are there web pages or discussion groups where Python scientists can go for advice?
** python can easily be used by more scientists
 
* Bigger picture conclusions for good scientific practice
Also, did you try "sage", the free software for Python that does a lot of maths, including symbolic math? A colleague told me it's great and that it's growing among scientists.
** code readability and package structure promotes good scientific practice
 
** python and its modules provide a consistent framework to promote data provenance
* [[User:Julius B. Lucks|Julius B. Lucks]] 02:04, 5 May 2008 (EDT): Recently [[User:Marshall_Hampton|Marshall Hampton]] wrote [[Open_writing_projects/Sage_and_cython_a_brief_introduction|an introductory article on Sage and Cython]]
** can plug into other community tools and practices to help science - e.g. unit testing
 
----
 
The most complicated part of this discussion is the SWIG typemap stuff. The whole point of SWIG is to avoid having to write these types of functions yourself. I think you could just have passed in a numpy array (of type "float64", I think) and it would just have worked without any need for a typemap. Alternatively, SWIG provides some convenience functions for just this purpose. Just add the following to your interface file, and it provides the user with a double_array function which does the Python list --> C array conversion for them.
<pre>%include "carrays.i"
%array_class(double, doubleArray)
%pythoncode %{
def double_array(mylist):
    """Create a C array of doubles from a list."""
    c = doubleArray(len(mylist))
    for i,v in enumerate(mylist):
        c[i] = v
    return c
%}</pre>
 
*'''[[User:Noel M. O'Boyle|baoilleach]] 07:01, 18 April 2008 (EDT)''':
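
For anyone who wants to see what the double_array helper above does without building a SWIG module first, here is a minimal sketch of the same list --> C array conversion using ctypes from the Python standard library. This is a stand-in for illustration only, not SWIG's carrays.i: the function reuses the name double_array from the comment above, and the sample values are made up.

```python
import ctypes


def double_array(values):
    """Copy a Python sequence of numbers into a newly allocated C array of doubles.

    Illustrative ctypes stand-in for the SWIG-generated helper; not SWIG code.
    """
    # Multiplying a ctypes type by an integer creates a fixed-length array type.
    array_type = ctypes.c_double * len(values)
    # Instantiating the array type copies (and converts) each element.
    return array_type(*values)


if __name__ == "__main__":
    c = double_array([1.0, 2.5, 4.0])
    # The result supports len() and indexing, like the SWIG doubleArray object.
    print(len(c), c[1])
```

The resulting object can be passed to any C function loaded through ctypes that expects a double pointer. With SWIG, the doubleArray class generated by %array_class plays the equivalent role for the wrapped functions.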
 
== Comment from Peter Cock ==
 
Hello Julius,
 
I've just stumbled across your page:
http://openwetware.org/wiki/Julius_B._Lucks/Projects/Python_All_A_Scientist_Needs
 
I just thought I'd point out a slight improvement to the Biopython
issue raised in Note (5),
 
<pre>
gb_parsed_record = SeqIO.parse(gb_file,"genbank").next() # (5)
...
(5) The Bio.SeqIO.parse method can parse a variety of formats. Here we
use it to parse the GenBank files on our local disk using the "genbank"
format parameter. The method returns a generator, who's next() method
is used to retrieve an object representing the parsed file.
</pre>
 
I see you were using Biopython 1.44, but I just wanted to let you know that
Biopython 1.45 introduced another function, Bio.SeqIO.read()  for use in exactly
this situation (when there is one and only one record in the sequence file).
 
i.e.  
 
<pre>
gb_parsed_record = SeqIO.read(gb_file,"genbank")
</pre>
 
If the file contained no records, or more than one, then an exception would
be raised.  This prevents the possible problem of silently ignoring an
unexpected second record which could happen with the original code using parse(...).next().
 
Peter
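
The difference Peter describes can be sketched without Biopython installed. The helper below is hypothetical (read_exactly_one is not a Biopython function) and only mimics the "one and only one record" behaviour described above, using plain strings in place of parsed GenBank records. It uses the built-in next(), which replaces the .next() method in Python 3.

```python
def read_exactly_one(records):
    """Return the single item of an iterable; raise ValueError otherwise.

    Hypothetical helper mimicking the behaviour described for Bio.SeqIO.read().
    """
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found") from None
    try:
        next(iterator)
    except StopIteration:
        # The iterator is exhausted after one item, which is the expected case.
        return first
    raise ValueError("More than one record found")


if __name__ == "__main__":
    two_records = ["record A", "record B"]  # made-up stand-ins for parsed records

    # Taking only the first item silently ignores the second record.
    print(next(iter(two_records)))

    # The stricter helper rejects the unexpected second record.
    try:
        read_exactly_one(two_records)
    except ValueError as err:
        print("rejected:", err)
```

In real code, calling SeqIO.read(gb_file, "genbank") as shown above gives this check for free, so the hand-written helper is only needed for iterators that do not come from Biopython.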
 
== Software Carpentry ==
 
 
Not sure if you have seen this, but it's based on Python:
 
http://www.osl.iu.edu/~lums/swc/

Latest revision as of 07:58, 19 June 2008
