Open writing projects/Beginning Python for Bioinformatics

From OpenWetWare
Revision as of 21:12, 17 July 2008 by Ricardo Vidal (talk | contribs)
Jump to navigationJump to search
This page is part of the Open Writing Project Beginning_Python_for_Bioinformatics. (More Open Writing Projects.)



01

Beginning Perl for Bioinformatics by James Tisdal

which was published in 2001. My idea here is to follow the structure of the book, analysing each chapter and converting the Perl scripts into Python. The original book is very well written and an excellent starting point for any aspiring bioinformatician, either if you are a biologist that does not understand programming or a computer scientist that does not know a lot of biology and maybe even Perl.

In no way this site tries to plagiarize the book, as it is only used as an starting point (a very good one indeed) to this journey into Python. Here you will not find biological concept explanations and criticisms towards Perl. Making this clear, I will start from the beginning.

Why Python (and not Perl)?

02

According to the official Python website: http://www.python.org/doc/essays/comparisons.html

Python and Perl come from a similar background (Unix scripting, which both have long outgrown) [to learn more about that check this tutorial], and sport many similar features, but have a different philosophy. Perl emphasizes support for common application-oriented tasks, e.g. by having built-in regular expressions, file scanning and report generating features. Python emphasizes support for common programming methodologies such as data structure design and object-oriented programming, and encourages programmers to write readable (and thus maintainable) code by providing an elegant but not overly cryptic notation. As a consequence, Python comes close to Perl but rarely beats it in its original application domain; however Python has an applicability well beyond Perl's niche.

I couldn't explain better than that. But still I have to give my take on why I prefer Python over Perl, and why I decided to use it in my day-to-day programming. First I have to admit that I am lousy Perl programmer (not even close to an apprentice monger) and I always get confused by its syntax. Second I come from a Basic/Pascal/C++ background, all of them having slightly better syntax than Perl. Thus, it was natural to get on the Python bandwagon, and as the paragraph above states Python code is "extremely" readable (emphasis are mine); in no-time you can grasp it completely. OK, I admit that it has at least one odd feature for the non-computer savvy: the "mandatory" indentation. In Python you have to tabulate (using tabs or space <- recommended) loops, if clauses, functions, anything. Maybe this is the first and only hard step to get, but after a couple of hours of coding you will be satisfied on how good your code look.

03

Hands on code: Sequences and strings - part I

As pointed in Beginning Perl for Bioinformatics, a large percentage of bioinformatics procedures deals with strings, especially DNA and amino acids sequence data. As is largely known DNA is composed of four different nucleotides: A, C, T and G and proteins can contain up to 20 amino acids. Each one of these elements have one letter of the alphabet assigned to them. In the DNA case some letters represent one or more nucleotides that can be present at some sequence position (click here for more http://www.cns.fr/externe/English/Projets/Resultats/iupaciub.html).

So as the amino acid is the basic building block of proteins (AKA polypeptides), strings containing sequence is our most basic building block, from where all the bioinformatics magic will work on.

Usually in Perl a string is represented by the dollar sign in front of the variable name, like this $sequence. Python is dynamically typed, meaning variable types are assigned/discovered by the interpreter at run time. This means that the value after the equal sign will tell the interpreter what variable type you are declaring. So in Python if you want to store a DNA sequence you can just enter

<syntax type=python>myDNA = "ACGTACGTACGTACGTACGTACGT"</syntax>

a quick note: Python can be used with the interpreter command line or by previously saved scripts. I will try to use the latter in the code examples.

OK, we are ready to create our first Bioinformatics Python Hello World Script. Let's get the sequence above and print it on the screen. The first line of our code will tell the operating system what to use and where to find it

<syntax type=python>#! /usr/bin/env python</syntax>

Next we will create the variable myDNA and assign the corresponding sequence

<syntax type=python>myDNA = "ACGTACGTACGTACGTACGTACGT"</syntax>

And finally, we will print the contents of the variable to the screen:

<syntax type=python> print myDNA </syntax>

As mentioned above, Python mandates that you have your code indented, so our final code will look like this:

<syntax type=python>

  1. ! /usr/bin/env python
   myDNA = "ACGTACGTACGTACGTACGTACGT"
   print myDNA

</syntax>

04

Another important task for many biologists is to merge/concatenate different strings of DNA in one unique sequence. We modify the previous script in order to have two distinct DNA sequences in one.

We start using code_01 structure:

<syntax type=python>

  1. ! /usr/bin/env python

myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print myDNA, myDNA2 </syntax>

So far, we added a new string containing an extra DNA sequence and we print both sequences. In Python print statement automatically adds a new line at the end of the string to be printed, unless you add a comma (,) at the end. The comma is also needed if you are going to print more than one string in order to separate them (try removing the comma from the code above).

Now, how do we merge myDNA and myDNA2? Easy in Python: just sum them with a plus signal:

<syntax type=python> myDNA3 = myDNA + myDNA2 print myDNA3 </syntax>

Notice that in Python strings are immutable, meaning they cannot be changed. This immutability confer some advantages to the code where strings (in Python strings are not variables) cannot be modified anywhere in the program and also allowing some performance gain in the interpreter. So, in order to have our sequences merged we created a third sequence that received both strings. Finally our code will be (some captions were added):

<syntax type=python>

  1. ! /usr/bin/env python

myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print "First and Second sequences" print myDNA, myDNA2 myDNA3 = myDNA + myDNA2 print "Concatenated sequence" print myDNA3 </syntax>

05

Just an apart from the bioinformatics aspect of programming: Python's print statement.

As in most computer languages Python allows an easy way to write to the standard output. Python's print only accepts output of strings, and if the variable sent to it is not a string it is first converted and then output.

The print always put a line-break ('\n' or "\n") at the end of the expression to be output, except when the print statement ends with a comma. For example:

<syntax type=python>print "This is a"</syntax> <syntax type=python>print "test"</syntax>

will print

This is a

test

while

<syntax type=python>print "This is a",</syntax> <syntax type=python>print "test",</syntax>

will print

This is a test


Of course Python's print statement allows any programming escape character, such as '\n' and '\t'.


Concatenating strings on output

To concatenate two strings on output there are two possible ways in Python. You can either separate the strings with a comma, like we did here

<syntax type=python>print myDNA, myDNA2</syntax>

or you can use the "+" sign in order to obtain almost the same result. This is similar to what was used here

<syntax type=python>myDNA3 = myDNA + myDNA2</syntax>

but instead we would use the print command as

<syntax type=python>print myDNA3 + myDNA</syntax>

In the latter case, both strings will not be separated by a space and will be merged. To get the same result you would have to concatenate an extra space between the strings like

<syntax type=python>print myDNA3 + " " + myDNA</syntax>

06

Tisdal's book on Perl introduces next the ability to transcripts DNA sequences to RNA. In order to do that we need to check a different aspect of programming: regular expressions (or regex). Regular expressions is a pattern/string expression that works matching/describing/filtering other strings.

Let's say you want to examine or extract all vowels contained in one phrase, one page, one word. Another example would be to remove all HTML tags from a downloaded webpage. As HTML tags are encapsulated between < and > signs we can create a regex that will search for any characters in between the signs and remove (parse) them from our page. We will deal very briefly with regex, and if you are interested in learning more about it you can search for countless references on the Internet (such as <a href="http://www.regular-expressions.info/"> this one</a>).

In order to use regular expression in Python we need to check another concept present in the language: importing modules. Python core functionality provides most of the usual operations and also comes with a built-in library of functions that "access to operations that are not part of the core of the language but are nevertheless built in". One of this operations is the ability to interpret regular expression that in Python is located in the re module. Apart from the language core, built in modules, Python can be further extended by using third-party modules imported into the language. Anyone can create a module and distribute to every Python user and programmer.

So, in order to have regex capabilities we literally have to import the regex module. We do that by entering the line:

<syntax type=python> import re</syntax>

Python's code style guide suggests that import statements should be on separate lines

<syntax type=python>import sys import re ...</syntax>

and to be put on the top of the file (usually below the line that tells your OS to use Python's interpreter on the script).

So the first two lines of our new script would be

<syntax type=python>

  1. ! /usr/bin/env python

import re </syntax>

Now that we have the ability to use regex, we need to create one expression that will transcribe our DNA sequences to RNA.