Difference between revisions of "Julius B. Lucks/Projects/Python Articles/Scientific Pipelines"

From OpenWetWare

Revision as of 13:46, 13 April 2007

I have been planning to share my experiences with learning scientific programming for a long time. In this computer age in which we do science, it is very surprising to me that there is no great resource to teach nascent scientists the art of scientific computer programming. Computers enter the data collection and analysis steps of practically every scientific investigation done today, yet the typical training methods seem to be a hodge-podge collection of labmate hand-me-downs that do not meet the needs of someone learning to program computers for their science.

This is not only an inconvenience; it is unfortunate for the quality of the science being produced. People often resort to tedious and time-consuming ways of doing things when a simple script could do the job. It is important to use a script instead of a human for several reasons: it frees up the human's time to think about other things; there is less human involvement and so less human error; and, perhaps most importantly, the script represents a record of exactly what was done! The script can be treated as an entry in a lab notebook and, with a few good coding practices such as documentation and version control, can replace the lab notebook, at least for these types of investigations.

But where can scientists learn these techniques? If they are lucky, they will have a good lab mate who knows them and will pass them along. That doesn't happen very often, because there is usually little incentive to learn a new computing technique. Many people settle on what they know and just concentrate on the science they are doing, without appreciating that a new tool will improve the science, and even open new doors.

There are many tools and programming techniques developed by the computer programming industry that have not migrated to the scientists. While I am no professional programmer, I have taken an interest in improving scientific programming, and I hope to write a series of pieces toward that end. As a start, I introduce my own path to where I am now. This is mostly to show all the things I have tried, and in what situations, and where I made mistakes or picked up new techniques along the way. I hope some of you find it familiar, and motivating to learn a few more things to improve your scientific programming.

The Beginning

I first have to start off by explaining a bit about my crusade to find the perfect tools to do science. I started programming with FORTRAN as an undergraduate chemistry student at UNC Chapel Hill. Back in the day, FORTRAN was evidently a godsend to scientists because it gave them a language so close to the machine that they could write programs that made the absolute most of precious memory and clock cycles. While this is a good thing, it is not a fun thing. I was delighted at first to even be able to program a computer, but my quest to improve my tools began when, as a master's student at Cambridge University, I encountered CADPAC, a comment-less, 300,000-line monolithic FORTRAN program.

Now you could argue that there was more to the huge pain that CADPAC was than FORTRAN (such as the almost complete lack of comments), but a lot of it was FORTRAN. In particular, FORTRAN code does not look much different from a string of 0's and 1's if you look at it for long enough, which, as I mentioned earlier, was a good thing for people who did not want to waste a single resource. However, this leads to some really awkward-looking code that is hard to read, and in which it is hard to remember what you were doing even if you were the author. In other words, FORTRAN code is not very self-documenting, something I now consider very valuable.

So at that time I was given a brand new project, and in my mind a brand new environment in which to learn a new tool. A lot of my friends knew C++, so I thought I would give it a try. Now C++, with its object-oriented style, was a totally new world to me. It was both delightful in the doors it opened in terms of thinking about a problem, and a royal pain in how to make it do some things. C++ still requires a very keen eye for detail, and a novice can quickly get lost in it. I was thrilled to be able to create variables when I needed them (unlike FORTRAN, where all memory usage was declared at the beginning of the program so compilers could optimize the code more easily), but I was confused by having to tidy up after myself, especially after I had already created far more object complexity than I needed just because I could. I want to stress that I was a novice, and that the experts don't have these problems. But I was also trying to do a lot of science with my code and had other things to worry about. I needed a solution that was a little more friendly to the novice.

Along came a trip to a damaged book sale where I picked up Randal Schwartz and Tom Christiansen's Learning Perl. I know that Perl and C++ don't even try to do the same thing, so they are really not comparable from the language point of view. But I am telling the story of a novice scientific programmer who has many different tasks at hand. I had heard of Perl, so I bought the book for 2 pounds and read it over the weekend. It was the first time I realized that there is room for more than one tool in my scientific toolkit.

Well, I shouldn't say the first time. I was already familiar with Unix systems and the variety of tools they offer. I had even spent a considerable amount of time becoming a relative awk master (at least compared to my other skills). Perl opened my eyes to a consistent tool in which I could write all my other, non-number-crunching tasks.

So my first project pipeline started to develop. I would write code to do serious number crunching in C++. This code took a parameter file and made several output files. I used Perl to write the parameter files, to move the output files around, and to script multiple runs of the number-crunching code. This system worked pretty well, but the C++ code took me a long time to develop, and there was the disadvantage of having two languages. This only got more complicated as I started to need to make plots and graphics of the results.
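The driver side of such a pipeline can be sketched in a few lines. This is only an illustration, written in Python rather than the Perl actually used: the "name = value" parameter file format, the directory layout, and the names `write_param_file` and `run_sweep` are all assumptions, and a tiny stand-in script plays the role of the compiled C++ number cruncher.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the compiled number cruncher: it reads
# "name = value" lines and writes the sum of squares to an output file.
SOLVER_SOURCE = """
import sys
params = {}
for line in open(sys.argv[1]):
    name, value = line.split("=")
    params[name.strip()] = float(value)
with open(sys.argv[2], "w") as out:
    out.write(str(sum(v * v for v in params.values())))
"""


def write_param_file(path, params):
    """Write one 'name = value' line per parameter (assumed format)."""
    lines = [f"{name} = {value}" for name, value in sorted(params.items())]
    Path(path).write_text("\n".join(lines) + "\n")


def run_sweep(work_dir, param_sets):
    """Run the solver once per parameter set, each in its own directory."""
    work_dir = Path(work_dir)
    solver = work_dir / "solver.py"
    solver.write_text(SOLVER_SOURCE)
    results = {}
    for index, params in enumerate(param_sets):
        run_dir = work_dir / f"run_{index:03d}"
        run_dir.mkdir()
        param_file = run_dir / "params.txt"
        output_file = run_dir / "output.txt"
        write_param_file(param_file, params)
        # check=True stops the sweep as soon as one run fails.
        subprocess.run(
            [sys.executable, str(solver), str(param_file), str(output_file)],
            check=True,
        )
        results[run_dir.name] = float(output_file.read_text())
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sweep = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
        for name, value in run_sweep(tmp, sweep).items():
            print(name, value)
```

Keeping each run's parameter file next to its output means the directory tree itself records what was run, which fits the lab notebook idea from earlier.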

I won't bore you with the details of my exploration into how to make good graphics, but I will tell you that I finally added a third tool to my toolkit, MATLAB. The idea was two-fold: I could prototype number-crunching code in MATLAB and eventually move it to C++ if I needed the speed, and I could make graphics in MATLAB. So eventually I had another step in my pipeline: Perl scripts would mash up data generated by the C++ code, write MATLAB input files, and call batch runs of MATLAB to generate the graphics. At the time I was also starting to get into the idea of putting my data on the web for my collaborators to see, so I was using Perl to generate a lot of HTML with these graphics as well.

Scientific Pipelines

It took me about 7 years to come up with this pipeline, in the midst of the many scientific projects I was working on at various times. There are two important points here. The minor point is that I had many different tools to accomplish the various tasks of the pipeline. This was not a bad learning experience, but it was certainly hard to maintain. Sometimes I would go months without having to scale a code up with C++, so I would naturally be rusty when I came back to it.

The major point is that the scientific pipeline process is more general than my particular experience. I did not receive any formal training in this stuff, which I think speaks volumes about the level of training of the majority of scientific programmers. Most classes I have seen listed involve the algorithms, but not the day to day of how you bring those algorithms to bear on your projects.

So I want to extract the abstract ideas of what a good scientific pipeline is. First comes a development and prototyping phase. This usually happens when you are playing around with various hypotheses in your research. You need to be able to test many ideas in rapid succession, as well as keep track of what you have tried and whether it worked or not (the lab notebook concept). At this stage you don't want to write code that will save computer time; you want to write code that will save your time. After all, you aren't going for the gold run here, you are merely sniffing around a bit to find out where the big leads and clues are.

Next comes the first round of production. You have isolated a hypothesis or two and you need to do some serious analysis. Here you might want to consider re-writing some portions of your code so that they will save computer time.

After that, you still want the flexibility to go back and readjust your code rapidly because inevitably you made some mistakes in your thinking. You are thus back to the development and prototyping phase, armed with new knowledge you learned in your first production phase. Typically you can go back and forth between these phases many times in the life of a single project.
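One way to move between the two phases without losing track of correctness is to keep the slow prototype around as a reference for the faster rewrite. The sketch below is only an illustration: the moving-average task and the function names are made up for this example and do not come from any particular project.

```python
def moving_average_prototype(values, window):
    """Prototype: re-sum every window. Easy to read, O(n * window)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def moving_average_production(values, window):
    """Production rewrite: slide a running sum along the data. O(n)."""
    if window > len(values):
        return []
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Add the value entering the window, drop the one leaving it.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


def agree(a, b, tolerance=1e-9):
    """True if both result lists match element by element within tolerance."""
    return len(a) == len(b) and all(
        abs(x - y) <= tolerance for x, y in zip(a, b)
    )


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    slow = moving_average_prototype(data, 2)
    fast = moving_average_production(data, 2)
    print(agree(slow, fast))
```

When the thinking changes and you drop back to prototyping, the readable version is the one to edit first; the fast version is then brought back into agreement with it.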

At every point you need to be able to communicate your results with others, and often scientific graphics are a primary means of communication. They are useful as intermediate results, and are often crucial for that final stamp of approval of a project, the publication.


This discussion finally turns to why I started writing this article in the first place. Recently I have been proselytizing Python as my language of choice for almost any new programming project I undertake. Every time I talk to someone, I think of new reasons to like Python, and I find better ways to explain my existing reasons. I think Python and the Python community effectively address each step of the scientific pipeline, and I think it could become a new de facto scientific computing language. When you couple this with all of its industry-related uses, it becomes an extremely powerful platform for blending the latest technology with scientific methods.

Briefly, here is why I love it:

  • It is object-oriented from the ground up (unlike Perl, which tacked it on later), so it has a better structure than Perl.
  • It is a little more verbose than other languages, which naturally makes the code more self-documenting than Perl. This makes it easy to write code fast that you will still understand many months later.
  • The code is very clean-looking, which is extremely important for maintaining code.
  • It comes with a very good unit-testing module that makes it trivial to change the internals of the code, while still making sure it does the job you want it to. (Unit-testing is not prevalent in scientific programming yet, something which I hope changes.)
  • It has an interactive interpreter (unlike Perl), with great object introspection, so you can really develop fast in it.
  • The scientific support is very extensive:
    • It has a very good numerical module, numpy, so you can prototype serious computations in it.
    • It has a BioPython project similar to BioPerl's (though not as mature), supporting many common biological tasks.
    • You can write C++/C extensions in it very easily so you can speed up parts of the code to be truly production speed.
    • It has R bindings so you can do statistics in it.
    • It has a very good plotting library, matplotlib, that is designed to be similar to MATLAB so you can do all of your graphics in it.
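
To give a flavor of the unit-testing point in the list above, here is a minimal sketch using the standard library's `unittest` module. The function `gc_content` is a hypothetical helper written for this example (it is not part of BioPython or any other library); the tests pin down its behavior so its internals can later be rewritten with confidence.

```python
import unittest


def gc_content(sequence):
    """Fraction of bases in a DNA sequence that are G or C.

    Hypothetical helper for illustration only.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class GCContentTest(unittest.TestCase):
    def test_known_values(self):
        self.assertAlmostEqual(gc_content("GGCC"), 1.0)
        self.assertAlmostEqual(gc_content("ATAT"), 0.0)
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_case_insensitive(self):
        self.assertAlmostEqual(gc_content("atgc"), gc_content("ATGC"))

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved in a file such as test_gc.py (the file name is an assumption), the tests can be run with `python -m unittest test_gc`. If the body of `gc_content` is later replaced with a faster implementation, the same three tests still say whether it does the job.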

I plan to write several articles highlighting these features of python, and how these ideas from software engineering can be applied to scientific pipelines.