Open writing projects/Beginning Python for Bioinformatics

From OpenWetWare
Revision as of 21:06, 17 July 2008 by Ricardo Vidal (talk | contribs)
Jump to navigationJump to search
This page is part of the Open Writing Project Beginning_Python_for_Bioinformatics. (More Open Writing Projects.)



01

Beginning Perl for Bioinformatics by James Tisdal

which was published in 2001. My idea here is to follow the structure of the book, analysing each chapter and converting the Perl scripts into Python. The original book is very well written and an excellent starting point for any aspiring bioinformatician, either if you are a biologist that does not understand programming or a computer scientist that does not know a lot of biology and maybe even Perl.

In no way this site tries to plagiarize the book, as it is only used as an starting point (a very good one indeed) to this journey into Python. Here you will not find biological concept explanations and criticisms towards Perl. Making this clear, I will start from the beginning.

Why Python (and not Perl)?

02

According to the official Python website: http://www.python.org/doc/essays/comparisons.html

Python and Perl come from a similar background (Unix scripting, which both have long outgrown) [to learn more about that check this tutorial], and sport many similar features, but have a different philosophy. Perl emphasizes support for common application-oriented tasks, e.g. by having built-in regular expressions, file scanning and report generating features. Python emphasizes support for common programming methodologies such as data structure design and object-oriented programming, and encourages programmers to write readable (and thus maintainable) code by providing an elegant but not overly cryptic notation. As a consequence, Python comes close to Perl but rarely beats it in its original application domain; however Python has an applicability well beyond Perl's niche.

I couldn't explain better than that. But still I have to give my take on why I prefer Python over Perl, and why I decided to use it in my day-to-day programming. First I have to admit that I am lousy Perl programmer (not even close to an apprentice monger) and I always get confused by its syntax. Second I come from a Basic/Pascal/C++ background, all of them having slightly better syntax than Perl. Thus, it was natural to get on the Python bandwagon, and as the paragraph above states Python code is "extremely" readable (emphasis are mine); in no-time you can grasp it completely. OK, I admit that it has at least one odd feature for the non-computer savvy: the "mandatory" indentation. In Python you have to tabulate (using tabs or space <- recommended) loops, if clauses, functions, anything. Maybe this is the first and only hard step to get, but after a couple of hours of coding you will be satisfied on how good your code look.

03

Hands on code: Sequences and strings - part I

As pointed in Beginning Perl for Bioinformatics, a large percentage of bioinformatics procedures deals with strings, especially DNA and amino acids sequence data. As is largely known DNA is composed of four different nucleotides: A, C, T and G and proteins can contain up to 20 amino acids. Each one of these elements have one letter of the alphabet assigned to them. In the DNA case some letters represent one or more nucleotides that can be present at some sequence position (click here for more http://www.cns.fr/externe/English/Projets/Resultats/iupaciub.html).

So as the amino acid is the basic building block of proteins (AKA polypeptides), strings containing sequence is our most basic building block, from where all the bioinformatics magic will work on.

Usually in Perl a string is represented by the dollar sign in front of the variable name, like this $sequence. Python is dynamically typed, meaning variable types are assigned/discovered by the interpreter at run time. This means that the value after the equal sign will tell the interpreter what variable type you are declaring. So in Python if you want to store a DNA sequence you can just enter

<syntax type=python>myDNA = "ACGTACGTACGTACGTACGTACGT"</syntax>

a quick note: Python can be used with the interpreter command line or by previously saved scripts. I will try to use the latter in the code examples.

OK, we are ready to create our first Bioinformatics Python Hello World Script. Let's get the sequence above and print it on the screen. The first line of our code will tell the operating system what to use and where to find it

<syntax type=python>#! /usr/bin/env python</syntax>

Next we will create the variable myDNA and assign the corresponding sequence

<syntax type=python>myDNA = "ACGTACGTACGTACGTACGTACGT"</syntax>

And finally, we will print the contents of the variable to the screen:

<syntax type=python> print myDNA </syntax>

As mentioned above, Python mandates that you have your code indented, so our final code will look like this:

<syntax type=python>

  1. ! /usr/bin/env python
   myDNA = "ACGTACGTACGTACGTACGTACGT"
   print myDNA

</syntax>

04

Another important task for many biologists is to merge/concatenate different strings of DNA in one unique sequence. We modify the previous script in order to have two distinct DNA sequences in one.

We start using code_01 structure:

<syntax type=python>

  1. ! /usr/bin/env python

myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print myDNA, myDNA2 </syntax>

So far, we added a new string containing an extra DNA sequence and we print both sequences. In Python print statement automatically adds a new line at the end of the string to be printed, unless you add a comma (,) at the end. The comma is also needed if you are going to print more than one string in order to separate them (try removing the comma from the code above).

Now, how do we merge myDNA and myDNA2? Easy in Python: just sum them with a plus signal:

<syntax type=python> myDNA3 = myDNA + myDNA2 print myDNA3 </syntax>

Notice that in Python strings are immutable, meaning they cannot be changed. This immutability confer some advantages to the code where strings (in Python strings are not variables) cannot be modified anywhere in the program and also allowing some performance gain in the interpreter. So, in order to have our sequences merged we created a third sequence that received both strings. Finally our code will be (some captions were added):

<syntax type=python>

  1. ! /usr/bin/env python

myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print "First and Second sequences" print myDNA, myDNA2 myDNA3 = myDNA + myDNA2 print "Concatenated sequence" print myDNA3 </syntax>