Open writing projects/Beginning Python for Bioinformatics: Difference between revisions

Revision as of 18:44, 23 July 2008

This page is part of the Open Writing Project Beginning_Python_for_Bioinformatics. (More Open Writing Projects.)

01

Beginning Perl for Bioinformatics by James Tisdal

which was published in 2001. My idea here is to follow the structure of the book, analysing each chapter and converting the Perl scripts into Python. The original book is very well written and an excellent starting point for any aspiring bioinformatician, either if you are a biologist that does not understand programming or a computer scientist that does not know a lot of biology and maybe even Perl.

In no way this site tries to plagiarize the book, as it is only used as an starting point (a very good one indeed) to this journey into Python. Here you will not find biological concept explanations and criticisms towards Perl. Making this clear, I will start from the beginning.

Why Python (and not Perl)?

02

According to the official Python website: http://www.python.org/doc/essays/comparisons.html

Python and Perl come from a similar background (Unix scripting, which both have long outgrown) [to learn more about that check this tutorial], and sport many similar features, but have a different philosophy. Perl emphasizes support for common application-oriented tasks, e.g. by having built-in regular expressions, file scanning and report generating features. Python emphasizes support for common programming methodologies such as data structure design and object-oriented programming, and encourages programmers to write readable (and thus maintainable) code by providing an elegant but not overly cryptic notation. As a consequence, Python comes close to Perl but rarely beats it in its original application domain; however Python has an applicability well beyond Perl's niche.

I couldn't explain better than that. But still I have to give my take on why I prefer Python over Perl, and why I decided to use it in my day-to-day programming. First I have to admit that I am lousy Perl programmer (not even close to an apprentice monger) and I always get confused by its syntax. Second I come from a Basic/Pascal/C++ background, all of them having slightly better syntax than Perl. Thus, it was natural to get on the Python bandwagon, and as the paragraph above states Python code is "extremely" readable (emphasis are mine); in no-time you can grasp it completely. OK, I admit that it has at least one odd feature for the non-computer savvy: the "mandatory" indentation. In Python you have to tabulate (using tabs or space <- recommended) loops, if clauses, functions, anything. Maybe this is the first and only hard step to get, but after a couple of hours of coding you will be satisfied on how good your code look.

03

Hands on code: Sequences and strings - part I

As pointed in Beginning Perl for Bioinformatics, a large percentage of bioinformatics procedures deals with strings, especially DNA and amino acids sequence data. As is largely known DNA is composed of four different nucleotides: A, C, T and G and proteins can contain up to 20 amino acids. Each one of these elements have one letter of the alphabet assigned to them. In the DNA case some letters represent one or more nucleotides that can be present at some sequence position (click here for more http://www.cns.fr/externe/English/Projets/Resultats/iupaciub.html).

So as the amino acid is the basic building block of proteins (AKA polypeptides), strings containing sequence is our most basic building block, from where all the bioinformatics magic will work on.

Usually in Perl a string is represented by the dollar sign in front of the variable name, like this $sequence. Python is dynamically typed, meaning variable types are assigned/discovered by the interpreter at run time. This means that the value after the equal sign will tell the interpreter what variable type you are declaring. So in Python if you want to store a DNA sequence you can just enter

<syntax type=python>myDNA = "ACGTACGTACGTACGTACGTACGT"</syntax>

a quick note: Python can be used with the interpreter command line or by previously saved scripts. I will try to use the latter in the code examples.

OK, we are ready to create our first Bioinformatics Python Hello World Script. Let's get the sequence above and print it on the screen. The first line of our code will tell the operating system what to use and where to find it

<syntax type=python>#! /usr/bin/env python</syntax>

Next we will create the variable myDNA and assign the corresponding sequence

<syntax type=python>myDNA = "ACGTACGTACGTACGTACGTACGT"</syntax>

And finally, we will print the contents of the variable to the screen:

<syntax type=python> print myDNA </syntax>

As mentioned above, Python mandates that you have your code indented, so our final code will look like this:

! /usr/bin/env python

   myDNA = "ACGTACGTACGTACGTACGTACGT"
   print myDNA

</syntax>

04

Another important task for many biologists is to merge/concatenate different strings of DNA in one unique sequence. We modify the previous script in order to have two distinct DNA sequences in one.

We start using code_01 structure:

! /usr/bin/env python

myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print myDNA, myDNA2 </syntax>

So far, we added a new string containing an extra DNA sequence and we print both sequences. In Python print statement automatically adds a new line at the end of the string to be printed, unless you add a comma (,) at the end. The comma is also needed if you are going to print more than one string in order to separate them (try removing the comma from the code above).

Now, how do we merge myDNA and myDNA2? Easy in Python: just sum them with a plus signal:

<syntax type=python> myDNA3 = myDNA + myDNA2 print myDNA3 </syntax>

Notice that in Python strings are immutable, meaning they cannot be changed. This immutability confer some advantages to the code where strings (in Python strings are not variables) cannot be modified anywhere in the program and also allowing some performance gain in the interpreter. So, in order to have our sequences merged we created a third sequence that received both strings. Finally our code will be (some captions were added):

! /usr/bin/env python

myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print "First and Second sequences" print myDNA, myDNA2 myDNA3 = myDNA + myDNA2 print "Concatenated sequence" print myDNA3 </syntax>

05

Just an apart from the bioinformatics aspect of programming: Python's print statement.

As in most computer languages Python allows an easy way to write to the standard output. Python's print only accepts output of strings, and if the variable sent to it is not a string it is first converted and then output.

The print always put a line-break ('\n' or "\n") at the end of the expression to be output, except when the print statement ends with a comma. For example:

<syntax type=python>print "This is a" print "test"</syntax>

will print

This is a

test

while

<syntax type=python>print "This is a", print "test",</syntax>

will print

This is a test

Of course Python's print statement allows any programming escape character, such as '\n' and '\t'.

Concatenating strings on output

To concatenate two strings on output there are two possible ways in Python. You can either separate the strings with a comma, like we did here

<syntax type=python>print myDNA, myDNA2</syntax>

or you can use the "+" sign in order to obtain almost the same result. This is similar to what was used here

<syntax type=python>myDNA3 = myDNA + myDNA2</syntax>

but instead we would use the print command as

<syntax type=python>print myDNA3 + myDNA</syntax>

In the latter case, both strings will not be separated by a space and will be merged. To get the same result you would have to concatenate an extra space between the strings like

<syntax type=python>print myDNA3 + " " + myDNA</syntax>

06

Tisdal's book on Perl introduces next the ability to transcripts DNA sequences to RNA. In order to do that we need to check a different aspect of programming: regular expressions (or regex). Regular expressions is a pattern/string expression that works matching/describing/filtering other strings.

Let's say you want to examine or extract all vowels contained in one phrase, one page, one word. Another example would be to remove all HTML tags from a downloaded webpage. As HTML tags are encapsulated between < and > signs we can create a regex that will search for any characters in between the signs and remove (parse) them from our page. We will deal very briefly with regex, and if you are interested in learning more about it you can search for countless references on the Internet (such as this one).

In order to use regular expression in Python we need to check another concept present in the language: importing modules. Python core functionality provides most of the usual operations and also comes with a built-in library of functions that "access to operations that are not part of the core of the language but are nevertheless built in". One of this operations is the ability to interpret regular expression that in Python is located in the re module. Apart from the language core, built in modules, Python can be further extended by using third-party modules imported into the language. Anyone can create a module and distribute to every Python user and programmer.

So, in order to have regex capabilities we literally have to import the regex module. We do that by entering the line:

<syntax type=python> import re</syntax>

Python's code style guide suggests that import statements should be on separate lines

<syntax type=python>import sys import re ...</syntax>

and to be put on the top of the file (usually below the line that tells your OS to use Python's interpreter on the script).

So the first two lines of our new script would be

! /usr/bin/env python

import re </syntax>

Now that we have the ability to use regex, we need to create one expression that will transcribe our DNA sequences to RNA.

07

As mentioned above, regex in Python are provided by the re module, which provides an interface for the regular expression engine. First thing we have to do is to tell the interpreter what to do and what expression to use.

Let's start with a DNA sequence.

<syntax type=python>myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'</syntax>

How to transcribe it to RNA? Transcription creates a single-strand RNA molecule from the double-strand DNA; basically the final result is a similar sequence, with all T's changed to U's. So our regular expression has to find all T nucleotides in the above sequence and then replace them.

Regular expressions in Python need to be compiled into a RegexObject, that contains all possible regular expression operations. In our case we need to search and replace, what can be done by using the sub() method. According to the Python's Regular Expression HOWTO sub() returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement replacement. If the pattern isn't found, string is returned unchanged.

Let's put everything above in real code. First we need to compile a new RegexObject that will search for all thymines in our sequence. It can be achieved by using this:

<syntax type=python>regexp = re.compile('T')</syntax>

Simple as that. This line of code tells the Python interpreter that our "regular expression" is every T in our string. Now, we have to make replace those Ts with Us. In order to do that we just tell the interpreter:

<syntax type=python>myRNA = regexp.sub('U', myDNA)M</syntax>

Let's look at the last two lines of code. On the first line we created a new RegexObject, regexp (that could have any name, as any variable) and compiled it, making our regular expression to be every T in our string. On the second line, we assigned our soon to be created RNA sequence to a new string (remember that strings in Python are immutable) and used the command sub to replace in the Ts by Us present in our original DNA string. Putting all together our transcription code will be

! /usr/bin/env python

import re myDNA = 'ACGTTGCAACGTTGCAACGTTGCA' regexp = re.compile('T') myRNA = regexp.sub('U', myDNA) print myRNA </syntax>

You can download the resulting script here.

08

We have seen how to transcribe DNA using regular expression, even though the regex we have used cannot be considered a real one. Now we are going to simplify our small script even more and take advantage of some string capabilities of Python. Instead of using two lines, we are going to use only one. And also we won't need to import anything.

Let's start again with the same DNA sequence

<codemultiple>myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'</codemultiple>

This time we are going to use replace. This is one of the Python's methods to manipulate strings. Basically, you are asking the interpreter to get a certain string by another. The method returns a new copy of your string. We add this line

<syntax type=python>myRNA - myDNA.replace('T', 'U')</syntax>

This tells Python: myRNA will receive a copy of myDNA where all Ts were changed by Us. the "dot" after myDNA means that the method replace will get that variable as input on that variable.

So our code from above would like this

<syntax type=python>#! /usr/bin/env python myDNA = 'ACGTTGCAACGTTGCAACGTTGCA' myRNA = myDNA.replace('T', 'U') print myRNA</syntax>

Simple and efficient. Next we will use the same approach on generating the reverse complement of a DNA sequence, with no regex pattern.

09

We are currently following Chapter 4 of Beginning Perl for Bioinformatics, which is the first chapter of the book that actually has code snippets and real programming. The last exercises in this chapter deal with the ability to read files and operate with information extracted from these files, to create arrays and scalar list in Perl.

We are going to check how to read files in python. The book tells you how to read protein sequences. Here we are going to read DNA and protein sequences from files and change them.

Let say you have a file with a DNA sequence in some directory in your hard disk. The file cannot be a FASTA type (we will see later how to handle FASTA files), just pure sequence, something like this:

<syntax type=python>GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG</syntax>

You can download the file here. This is a partial sequence of a mitochondrial gene from a South American frog species called Hylodes ornatus. For our purposes you can save the file in the same directory you are going to run the script from, or if you are using the Python interpreter start it in the directory that contains the file.

The file name is not important but I will use AY162388.seq from now on. The first thing we have to do is to open the file for reading. We define a string variable that will contain the file name.

<syntax type=python>dnafile = "AY162388.seq"</syntax>

In order to open the file, we can use the command open, that receives two strings: the first is the file name (it can be the whole location too) to be opened and the mode to be used, which is what you want to do with the file. The mode can be one or more letters that tell the interpreter what to do. For now we are going to use the r mode , which tells Python to read the file, and only do that. So our code is

<syntax type=python> file = open(dnafile, 'r')</syntax>

file is a file object that contains the directives to read our DNA sequence. Now, we have actually read the contents of the file but they are stored in a file object and we did not accessed it yet. We can achieve that by using a myriad of commands. We will start with the commonest one: we are going to read the file line by line. We will present different ways of improving our "reading performance" later.

So, file is our file object. We have just opened it, but Python already knows that any file contains lines (remember that this is a regular ASCII file). We are going to use a loop to read each line of the file, one by one

<syntax type=python>for line in file:

  print line</syntax>

In Python, the for loop/statement iterates over items in a sequence of items, usually a string or a list (we will see Python's list soon), instead of iterating over a progression of numbers. Basically, our for above will iterate over each line in the file until EOF (end-of-file) is reached. Our simple script to read a DNA sequence from a file and output to the screen is

<syntax type=python>#! /usr/bin/env python dnafile = "AY162388.seq" file = open(dnafile, 'r') for line in file:

   print line</syntax>

You can download the above script here. To run it have the AY162388.seq in the same directory.

10

Let's improve our previous script and put the contents of the file in a variable similar to an array. Python understands different formats of compound data types, and list is the most versatile. A list in Python can be assigned by a series of elements (or values) separated by a comma and surrounded by square brackets

<syntax type=python> shoplist = ['milk', 1, 'lettuce', 2, 'coffee', 3]</syntax>

Now, we are going to read the same file and store the DNA sequence in a list and output this variable. The beginning of the script is the same, where we basically tell Python that the file name is AY162388.seq.

<syntax type=python>#! /usr/bin/env python dnafile = "AY162388.seq"</syntax>

We are going to change the way we read the file. Instead of just opening and then reading line-by-line, we are going to open it a read all the lines at once, by using this

<syntax type=python>file = open(dnafile, 'r').readlines()</syntax>

Notice the part in bold? In the previous script, we open and store the contents of the file in a file object. Now , we are opening the file and just after it is opened, we are reading all the lines of the file at once and storing them in file</code, which in this case is not a file object, but is a Python's list of strings.

Before, if we wanted to manipulate our DNA sequence, we would had to read it, and then in the loop store in a variable of our choice. In this script, we do that all at once, and the result is a variable that we can change the way we wanted. The code without the output part is

<syntax type=python>#! /usr/bin/env python dnafile = "AY162388.seq" file = open(dnafile, 'r').readlines()</syntax>

Try putting a print statement after the last line to print the file list. You will get something like this

<syntax type=python>['GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC\n', 'TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC\n', 'AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG\n', 'GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC\n', 'ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA\n', 'AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC\n', 'AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA\n', 'TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG\n']</syntax>

which is exactly the description of a Python's list. You see all lines, separated by comma and surrounded by square brackets. Notice that each line has a carriage return (\n) symbol at the end.

Let's make the output a little nicer including a loop. Remember when I introduced loop I wrote that Python iterates over "items in a sequence of items", what is a good synonym for list. So the loop should be as straightforward as

<syntax type=python>for line in file:

   print line</syntax>

Putting everything together gives us
<syntax type=python>#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()
print file
for line in file:

   print line

</syntax>
that can be downloaded here. Next we will work on improving the output again and maybe modify/convert the list.

11

Now, I want to manipulate my DNA sequence, extract some nucleotides, check lines, etc. Simple things. We start with the same basic code to read the file:
<syntax type=python>#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()</syntax>
Our nucleotides are stored in the variable file. Remember that each line is one item of the list and the lines still contain the carriage return present in the ASCII file. Let's get the first and the last lines of the sequence. The first line is easy to get, as Python's lists start at 0. To access one list item just add square brackets with the index number of the item you want to get (this is also known as slicing). Something like this
<syntax type=python>file[0]</syntax>
will return the item 0 from the list, that in our case is the firs line of the sequence. If you add a print command 
<syntax type=python>print file[0]</syntax>
you should expect
<syntax type=python>GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC</syntax>
The last line is a little bit trickier. Let's assume that we don't know the number of lines in the list, and here we want to make our script as general as possible, so it can handle some simple files later. It is also good code practice to think ahead and plan what you want, first to have a detailed project to follow, and second it allows you to be prepared to errors/bugs that your code might have or situations not expected in your original plan.
In our file, we have eight lines of DNA, so it would be just adding this print file[7] and we would output the last line. But, the right way to do it is to check the length of the list and output the item which has an index equal to the list length. In Python, you can check the length of a list by adding the built-in function len before the list name, like this
<syntax type=python>len(file)</syntax>
So who do we print the last line of our sequence? Simple. len(file) should return an integer of value 8, which is the actual number of elements in our list. We already know that to access any item in a list we just add its index (that has to be an integer) to the list name. One idea then would be to use len(file) as the index, like this
<syntax type=python>print file[len(file)]</syntax>
Why would that be wrong? Our list has eight items, but the indexes are from 0 to 7. So eight would be one index over the list length, which is not accessible because it does not exist. Solution? Let's use the list length minus one:
<syntax type=python>print file[len(file)-1]</syntax>
and there you are, the last line of the sequence.
Putting everything together, we have
<syntax type=python>#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()
print 'I want the first line'
print file[0]
print 'now the last line'
print file[len(file)-1]</syntax>
Two points worth mentioning: differently of strings, Python's lists are mutable, items can be removed, deleted, changed, and strings also can be sliced by using indexes that access characters. 
Next we will see some more features of lists and strings, and how to manipulate them. It will probably be the last entry in the first section as we finish the Chapter 4 in the book. As you may have noticed some items in the Perl book will not be covered, at least not immediately. We will jump back and forth sometimes.

12

As mentioned we will see in this entry some other features of Python lists. We will start with a similar example to the one in the book and then use our DNA file. So let's assume we have this simple list
<syntax type=python>nucleotides =  [ 'A', 'C', 'G'. 'T']</syntax>
If we print it directly we would get something like this
<syntax type=python>['A', 'C', 'G', 'T']</syntax>

which is fine for now, as we are not worried (yet) with the output (what we will do further below). Let's remove the last nucleotide. To accomplish that, we use pop

with no specific index, like this

<syntax type=python>nucleotides.pop()</syntax>
which gives me this when printed
<syntax type=python>['A', 'C', 'G']</syntax>
Remember that lists are mutable, so the removed item is lost. We can also remove any other in the list, let's say 'C'. First, we reassign the original list items and then remove the second item
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.pop(1)</syntax>
The list when printed will return this
<syntax type=python>['A', 'G', 'T']</syntax>
pop accepts any valid index of the list. Any index larger that the length of the list will return an error. For future reference, remember that when any item is removed (and inserted) the indexes change and the length also. It may seems obvious but mistakes are common.
Shifting from our 'destructive' mode, we cal also add elements to the list. Adding to the end of the list is trivial, by using append
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.append('A')</syntax>
that returns
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T', 'A']</syntax>
Adding to any position is also very straightforward with insert, like this
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.insert(0, 'A')</syntax>
where insert takes two arguments: first is the index of the element before which to insert and second the element to be inserted. So our line above will insert an 'A' just before the 'A' at position zero. We can try this
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.insert(0, 'A1')
nucleotides.insert(2, 'C1')
nucleotides.insert(4, 'G1')
nucleotides.insert(6, 'T1')</syntax>
that will result in 
<syntax type=python>['A1', 'A', 'C1', 'C', 'T1', 'T', 'G1', 'G']</syntax>
Notice that we add every new item at an even position, due to the fact that for every insertion the list's length and indexes change. 
And for last, we will take care of the output. Of course if are creating a script that requires a nicer output, printing a list is not the best way. We could create a loop and merge all entries in the list, but that would be a couple of lines and we ought to have an easier way (otherwise we could be using C++ instead). There is a way, by using the method join. This method will take all the elements in a list into a single string, and even a delimiter can be used. So something like this
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
"".join(nucleotides)</syntax>
will generate this output
<syntax type=python>ACGT</syntax>
Join is a method that applies to strings. So the first "item" is a string, that could be anything (in our case is an empty one). The code line tells Python to get the empty string an join it to the list of strings that we call nucleotides. We could replace the line for something easier to understand
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
myresult = 
myresult.join(nucleotides)
print myresult
</syntax>
So we initialize an empty string, join it with the list and print in the end with identical results. Try changing myresult initialization and see what happens.
With this we finish the first section of the site and we are moving to chapter 5 in the book.

1314

We start following the fifth chapter of BPB. The first item is about flow control and code layout, which are very relevant for our tutorial. I already introduced briefly both aspects in past entries on the site, but it is always good to check. As the book, I will start with flow control.
Flow control is the way your instructions are executed by the program. Most languages have a linear flow control, meaning every line is executed from top to bottom. This linear flow control can be disrupted by two types of statements: looping and branching. Looping statements tell the computer to execute a determined set of commands until certain condition is met. This can be a numeric value (ie from 1 to 100) or the number of items in a list (like our shop list from before). Branching statements are also known as conditional statements, tell the computer to execute/or not determined lines depending on certain conditions.
The for loop was shown before. In its for loop Python iterates over the elements in a list like this
<syntax type=python>for lines in text:

   do something</syntax>

Notice that the first line of the loop ends in a colon. This and the word for in the line> tell the interpreter that this a for loop and the indented block below is the code to be executed repeatedly until the last element in the list is reached. How Python knows where the loop ends? Indentation. Many languages use curly braces, parentheses, etc. In Python the loop ends by checking the indentation level of lines (this will help us a lot when discussing code layout).
We've already seen one example of loop in Python, for, but Python accepts other types of loop structures, such as while, that uses the same indented properties to execute the commands.
<syntax type=python>mycounter = 0
while mycounter == 0:

   do something</syntax>

Take a closer look at the while line. See something different? Maybe not, if you are used to programming. In Python equality is tested with a double equal sign (==), while a sole equal sign (=) assigns a value to a variable. You can also test for inequality, greater and less than, with !=, < and > respectively.

Attention: is quite common to generate errors by substituting == by =, so pay attention when coding.

Branching statements are the conditional commands in a computer language, usually governed by if ... then ... else. In Python a branching statement would look like
<syntax type=python>if value == 1:

   do something

elif value == 2:

   do something else

else:

   do even more</syntax>

Notice the colon ending each line of the conditions and again the indented code, telling the interpreter where the corresponding code for each condition ends.
We are going to use a lot of conditions and loops, but as you might have noticed Python has some tricks that make us avoid these statements. 
About code layout just one word: indentation. No brackets, parentheses, curly braces, etc. Yes, we have seen brackets and parentheses, but not to tell the interpreter where loops and conditions start and end. This is taken care by indentation, making our life easier and the code more beautiful.
Next we will see Python's ability to find motifs in words, mainly on DNA sequences.

Open writing projects/Beginning Python for Bioinformatics: Difference between revisions

Revision as of 18:44, 23 July 2008

Contents

01

02

03

04

05

06

07

08

09

10

11

12

13

14

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools

@@ Line 133: / Line 133: @@
 Tisdal's book on Perl introduces next the ability to transcripts DNA sequences to RNA. In order to do that we need to check a different aspect of programming: regular expressions (or regex). Regular expressions is a pattern/string expression that works matching/describing/filtering other strings.
-Let's say you want to examine or extract all vowels contained in one phrase, one page, one word. Another example would be to remove all HTML tags from a downloaded webpage. As HTML tags are encapsulated between <code><</code> and <code>></code> signs we can create a regex that will search for any characters in between the signs and remove (parse) them from our page. We will deal very briefly with regex, and if you are interested in learning more about it you can search for countless references on the Internet (such as <a href="http://www.regular-expressions.info/"> this one</a>).
+Let's say you want to examine or extract all vowels contained in one phrase, one page, one word. Another example would be to remove all HTML tags from a downloaded webpage. As HTML tags are encapsulated between <code><</code> and <code>></code> signs we can create a regex that will search for any characters in between the signs and remove (parse) them from our page. We will deal very briefly with regex, and if you are interested in learning more about it you can search for countless references on the Internet (such as [http://www.regular-expressions.info/ this one]).
 In order to use regular expression in Python we need to check another concept present in the language: importing modules. Python core functionality provides most of the usual operations and also comes with a built-in library of functions that "access to operations that are not part of the core of the language but are nevertheless built in". One of this operations is the ability to interpret regular expression that in Python is located in the <code>re</code> module. Apart from the language core, built in modules, Python can be further extended by using third-party modules imported into the language. Anyone can create a module and distribute to every Python user and programmer.
@@ Line 157: / Line 157: @@
 Now that we have the ability to use regex, we need to create one expression that will transcribe our DNA sequences to RNA.
+==07==
+As mentioned above, regex in Python are provided by the <code>re</code> module, which provides an interface for the regular expression engine. First thing we have to do is to tell the interpreter what to do and what expression to use.
+Let's start with a DNA sequence.
+<syntax type=python>myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'</syntax>
+How to transcribe it to RNA? Transcription creates a single-strand RNA molecule from the double-strand DNA; basically the final result is a similar sequence, with all <code>T</code>'s changed to <code>U</code>'s. So our regular expression has to find all <code>T</code> nucleotides in the above sequence and then replace them.
+Regular expressions in Python need to be compiled into a RegexObject, that contains all possible regular expression operations. In our case we need to search and replace, what can be done by using the <code>sub()</code> method. According to the Python's [http://www.amk.ca/python/howto/regex/regex.html Regular Expression HOWTO] <code>sub()</code> returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement replacement. If the pattern isn't found, string is returned unchanged.
+Let's put everything above in real code. First we need to compile a new RegexObject that will search for all thymines in our sequence. It can be achieved by using this:
+<syntax type=python>regexp = re.compile('T')</syntax>
+Simple as that. This line of code tells the Python interpreter that our "regular expression" is every T in our string. Now, we have to make replace those Ts with Us. In order to do that we just tell the interpreter:
+<syntax type=python>myRNA = regexp.sub('U', myDNA)M</syntax>
+Let's look at the last two lines of code. On the first line we created a new RegexObject, <code>regexp</code> (that could have any name, as any variable) and compiled it, making our regular expression to be every T in our string. On the second line, we assigned our soon to be created RNA sequence to a new string (remember that strings in Python are immutable) and used the command <code>sub</code> to replace in the Ts by Us present in our original DNA string. Putting all together our transcription code will be
+<syntax type=python>
+#! /usr/bin/env python
+import re
+myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'
+regexp = re.compile('T')
+myRNA = regexp.sub('U', myDNA)
+print myRNA
+</syntax>
+You can download the resulting script [http://python.genedrift.org/code/code_03.py here].
+==08==
+We have seen how to transcribe DNA using regular expression, even though the regex we have used cannot be considered a real one. Now we are going to simplify our small script even more and take advantage of some string capabilities of Python. Instead of using two lines, we are going to use only one. And also we won't need to import anything.
+Let's start again with the same DNA sequence
+<codemultiple>myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'</codemultiple>
+This time we are going to use <code>replace</code>. This is one of the Python's methods to manipulate strings. Basically, you are asking the interpreter to get a certain string by another. The method returns a new copy of your string. We add this line
+<syntax type=python>myRNA - myDNA.replace('T', 'U')</syntax>
+This tells Python: <code>myRNA</code> will receive a copy of <code>myDNA</code> where all Ts were changed by Us. the "dot" after <code>myDNA</code> means that the method <code>replace</code> will get that variable as input on that variable.
+So our code from above would like this
+<syntax type=python>#! /usr/bin/env python
+myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'
+myRNA = myDNA.replace('T', 'U')
+print myRNA</syntax>
+Simple and efficient. Next we will use the same approach on generating the reverse complement of a DNA sequence, with no regex pattern.
+==09==
+We are currently following Chapter 4 of Beginning Perl for Bioinformatics, which is the first chapter of the book that actually has code snippets and real programming. The last exercises in this chapter deal with the ability to read files and operate with information extracted from these files, to create arrays and scalar list in Perl.
+We are going to check how to read files in python. The book tells you how to read protein sequences. Here we are going to read DNA and protein sequences from files and change them.
+Let say you have a file with a DNA sequence in some directory in your hard disk. The file cannot be a FASTA type (we will see later how to handle FASTA files), just pure sequence, something like this:
+<syntax type=python>GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC
+TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC
+AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG
+GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC
+ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA
+AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC
+AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA
+TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG</syntax>
+You can download the file [http://python.genedrift.org/codepy/AY162388.seq here]. This is a partial sequence of a mitochondrial gene from a South American frog species called ''[http://calphotos.berkeley.edu/imgs/128x192/0000_0000/1101/0201.jpeg Hylodes ornatus]''. For our purposes you can save the file in the same directory you are going to run the script from, or if you are using the Python interpreter start it in the directory that contains the file.
+The file name is not important but I will use AY162388.seq from now on. The first thing we have to do is to open the file for reading. We define a string variable that will contain the file name.
+<syntax type=python>dnafile = "AY162388.seq"</syntax>
+In order to open the file, we can use the command <code>open</code>, that receives two strings: the first is the file name (it can be the whole location too) to be opened and the <em>mode</em> to be used, which is what you want to do with the file. The <em>mode</em> can be one or more letters that tell the interpreter what to do. For now we are going to use the <code>r</code> <em>mode</em> , which tells Python to read the file, and only do that. So our  code is
+<syntax type=python> file = open(dnafile, 'r')</syntax>
+<code>file</code> is a file object that contains the directives to read our DNA sequence. Now, we have actually read the contents of the file but they are stored in a file object and we did not accessed it yet. We can achieve that by using a myriad of commands. We will start with the commonest one: we are going to read the file line by line. We will present different ways of improving our "reading performance" later.
+So, <code>file</code> is our file object. We have just opened it, but Python already knows that any file contains lines (remember that this is a regular ASCII file). We are going to use a loop to read each line of the file, one by one
+<syntax type=python>for line in file:
+   print line</syntax>
+In Python, the <code>for</code> loop/statement iterates over items in a sequence of items, usually a string or a list (we will see Python's <code>list</code> soon), instead of iterating over a progression of numbers. Basically, our <code>for</code> above will iterate over each line in the file until EOF (end-of-file) is reached. Our simple script to read a DNA sequence from a file and output to the screen is
+<syntax type=python>#! /usr/bin/env python
+dnafile = "AY162388.seq"
+file = open(dnafile, 'r')
+for line in file:
+    print line</syntax>
+You can download the above script [http://python.genedrift.org/codepy/code_04.py here]. To run it have the <code>AY162388.seq</code> in the same directory.
+==10==
+Let's improve our previous script and put the contents of the file in a variable similar to an array. Python understands different formats of compound data types, and <code>list</code> is the most versatile. A <code>list</code> in Python can be assigned by a series of elements (or values) separated by a comma and surrounded by square brackets
+<syntax type=python> shoplist = ['milk', 1, 'lettuce', 2, 'coffee', 3]</syntax>
+Now, we are going to read the same file and store the DNA sequence in a <code>list</code> and output this variable. The beginning of the script is the same, where we basically tell Python that the file name is AY162388.seq.
+<syntax type=python>#! /usr/bin/env python
+dnafile = "AY162388.seq"</syntax>
+We are going to change the way we read the file. Instead of just opening and then reading line-by-line, we are going to open it a read all the lines at once, by using this
+<syntax type=python>file = open(dnafile, 'r').readlines()</syntax>
+Notice the part in bold? In the previous script, we open and store the contents of the file in a <code>file object</code>. Now , we are opening the file and just after it is opened, we are reading all the lines of the file at once and storing them in <code>file</code, which in this case is not a <code>file object</code>, but is a Python's <code>list</code> of strings.
+Before, if we wanted to manipulate our DNA sequence, we would had to read it, and then in the loop store in a variable of our choice. In this script, we do that all at once, and the result is a variable that we can change the way we wanted. The code without the output part is
+<syntax type=python>#! /usr/bin/env python
+dnafile = "AY162388.seq"
+file = open(dnafile, 'r').readlines()</syntax>
+Try putting a <code>print</code> statement after the last line to print the <code>file</code> list. You will get something like this
+<syntax type=python>['GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC\n',
+'TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC\n',
+'AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG\n',
+'GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC\n',
+'ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA\n',
+'AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC\n',
+'AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA\n',
+'TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG\n']</syntax>
+which is exactly the description of a Python's list. You see all lines, separated by comma and surrounded by square brackets. Notice that each line has a carriage return (<code>\n</code>) symbol at the end.
+Let's make the output a little nicer including a loop. Remember when I introduced loop I wrote that Python iterates  over "items in a sequence of items", what is a good synonym for <code>list</code>. So the loop should be as straightforward as
+<syntax type=python>for line in file:
+    print line</syntax>
+Putting everything together gives us
+<syntax type=python>#! /usr/bin/env python
+dnafile = "AY162388.seq"
+file = open(dnafile, 'r').readlines()
+print file
+for line in file:
+    print line
+</syntax>
+that can be downloaded [http://python.genedrift.org/codepy/code_05.py here]. Next we will work on improving the output again and maybe modify/convert the <code>list</code>.
+==11==
+Now, I want to manipulate my DNA sequence, extract some nucleotides, check lines, etc. Simple things. We start with the same basic code to read the file:
+<syntax type=python>#! /usr/bin/env python
+dnafile = "AY162388.seq"
+file = open(dnafile, 'r').readlines()</syntax>
+Our nucleotides are stored in the variable <code>file</code>. Remember that each line is one item of the list and the lines still contain the carriage return present in the ASCII file. Let's get the first and the last lines of the sequence. The first line is easy to get, as Python's lists start at 0. To access one list item just add square brackets with the index number of the item you want to get (this is also known as slicing). Something like this
+<syntax type=python>file[0]</syntax>
+will return the item 0 from the list, that in our case is the firs line of the sequence. If you add a <code>print</code> command
+<syntax type=python>print file[0]</syntax>
+you should expect
+<syntax type=python>GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC</syntax>
+The last line is a little bit trickier. Let's assume that we don't know the number of lines in the list, and here we want to make our script as general as possible, so it can handle some simple files later. It is also good code practice to think ahead and plan what you want, first to have a detailed project to follow, and second it allows you to be prepared to errors/bugs that your code might have or situations not expected in your original plan.
+In our file, we have <strong>eight</strong> lines of DNA, so it would be just adding this <code>print file[<strong>7</strong>]</code> and we would output the last line. But, the right way to do it is to check the length of the list and output the item which has an index equal to the list length. In Python, you can check the length of a list by adding the built-in function <code>len</code> before the list name, like this
+<syntax type=python>len(file)</syntax>
+So who do we print the last line of our sequence? Simple. <code>len(file)</code> should return an integer of value 8, which is the actual number of elements in our list. We already know that to access any item in a list we just add its index (that has to be an integer) to the list name. One idea then would be to use <code>len(file)</code> as the index, like this
+<syntax type=python>print file[len(file)]</syntax>
+Why would that be wrong? Our list has eight items, but the indexes are from 0 to '''7'''. So eight would be one index over the list length, which is not accessible because it does not exist. Solution? Let's use the list length minus one:
+<syntax type=python>print file[len(file)-1]</syntax>
+and there you are, the last line of the sequence.
+Putting everything together, we have
+<syntax type=python>#! /usr/bin/env python
+dnafile = "AY162388.seq"
+file = open(dnafile, 'r').readlines()
+print 'I want the first line'
+print file[0]
+print 'now the last line'
+print file[len(file)-1]</syntax>
+Two points worth mentioning: differently of strings, Python's lists are mutable, items can be removed, deleted, changed, and strings also can be sliced by using indexes that access characters.
+Next we will see some more features of lists and strings, and how to manipulate them. It will probably be the last entry in the first section as we finish the Chapter 4 in the book. As you may have noticed some items in the Perl book will not be covered, at least not immediately. We will jump back and forth sometimes.
+==12==
+As mentioned we will see in this entry some other features of Python lists. We will start with a similar example to the one in the book and then use our DNA file. So let's assume we have this simple list
+<syntax type=python>nucleotides =  [ 'A', 'C', 'G'. 'T']</syntax>
+If we print it directly we would get something like this
+<syntax type=python>['A', 'C', 'G', 'T']</syntax>
+which is fine for now, as we are not worried (yet) with the output (what we will do further below). Let's remove the last nucleotide. To accomplish that, we use <code>pop</p> with no specific index, like this
+<syntax type=python>nucleotides.pop()</syntax>
+which gives me this when printed
+<syntax type=python>['A', 'C', 'G']</syntax>
+Remember that lists are mutable, so the removed item is lost. We can also remove any other in the list, let's say 'C'. First, we reassign the original list items and then remove the second item
+<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
+nucleotides.pop(1)</syntax>
+The list when printed will return this
+<syntax type=python>['A', 'G', 'T']</syntax>
+<code>pop</code> accepts any valid index of the list. Any index larger that the length of the list will return an error. For future reference, remember that when any item is removed (and inserted) the indexes change and the length also. It may seems obvious but mistakes are common.
+Shifting from our 'destructive' mode, we cal also add elements to the list. Adding to the end of the list is trivial, by using <code>append</code>
+<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
+nucleotides.append('A')</syntax>
+that returns
+<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T', 'A']</syntax>
+Adding to any position is also very straightforward with <code>insert</code>, like this
+<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
+nucleotides.insert(0, 'A')</syntax>
+where <code>insert</code> takes two arguments: first is the index of the element before which to insert and second the element to be inserted. So our line above will insert an 'A' just before the 'A' at position zero. We can try this
+<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
+nucleotides.insert(0, 'A1')
+nucleotides.insert(2, 'C1')
+nucleotides.insert(4, 'G1')
+nucleotides.insert(6, 'T1')</syntax>
+that will result in
+<syntax type=python>['A1', 'A', 'C1', 'C', 'T1', 'T', 'G1', 'G']</syntax>
+Notice that we add every new item at an even position, due to the fact that for every insertion the list's length and indexes change.
+And for last, we will take care of the output. Of course if are creating a script that requires a nicer output, printing a list is not the best way. We could create a loop and merge all entries in the list, but that would be a couple of lines and we ought to have an easier way (otherwise we could be using C++ instead). There is a way, by using the method <code>join</code>. This method will take all the elements in a list into a single string, and even a delimiter can be used. So something like this
+<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
+"".join(nucleotides)</syntax>
+will generate this output
+<syntax type=python>ACGT</syntax>
+Join is a method that applies to strings. So the first "item" is a string, that could be anything (in our case is an empty one). The code line tells Python to get the empty string an join it to the list of strings that we call nucleotides. We could replace the line for something easier to understand
+<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
+myresult = ''
+myresult.join(nucleotides)
+print myresult
+</syntax>
+So we initialize an empty string, join it with the list and print in the end with identical results. Try changing <code>myresult</code> initialization and see what happens.
+With this we finish the first section of the site and we are moving to chapter 5 in the book.
+==13==
+==14==
+We start following the fifth chapter of BPB. The first item is about flow control and code layout, which are very relevant for our tutorial. I already introduced briefly both aspects in past entries on the site, but it is always good to check. As the book, I will start with flow control.
+<strong>Flow control</strong> is the way your instructions are executed by the program. Most languages have a linear flow control, meaning every line is executed from top to bottom. This linear flow control can be disrupted by two types of statements: looping and branching. Looping statements tell the computer to execute a determined set of commands until certain condition is met. This can be a numeric value (ie from 1 to 100) or the number of items in a list (like our shop list from before). Branching statements are also known as conditional statements, tell the computer to execute/or not determined lines depending on certain conditions.
+The <code>for</code> loop was shown before. In its for loop Python iterates over the elements in a list like this
+<syntax type=python>for lines in text:
+    do something</syntax>
+Notice that the first line of the loop ends in a colon. This and the word <code>for</code> in the line> tell the interpreter that this a for loop and the <strong>indented</strong> block below is the code to be executed repeatedly until the last element in the list is reached. How Python knows where the loop ends? Indentation. Many languages use curly braces, parentheses, etc. In Python the loop ends by checking the indentation level of lines (this will help us a lot when discussing code layout).
+We've already seen one example of loop in Python, <code>for</code>, but Python accepts other types of loop structures, such as <code>while</code>, that uses the same indented properties to execute the commands.
+<syntax type=python>mycounter = 0
+while mycounter == 0:
+    do something</syntax>
+Take a closer look at the <code>while</code> line. See something different? Maybe not, if you are used to programming. In Python equality is tested with a double equal sign (<code>==</code>), while a sole equal sign (<code>=</code>) assigns a value to a variable. You can also test for inequality, greater and less than, with <code>!=</code>, <code><</code> and <code>></code> respectively.
+<blockquote>Attention: is quite common to generate errors by substituting <code>==</code> by <code>=</code>, so pay attention when coding.</blockquote>
+Branching statements are the conditional commands in a computer language, usually governed by <code>if ... then ... else</code>. In Python a branching statement would look like
+<syntax type=python>if value == 1:
+    do something
+elif value == 2:
+    do something else
+else:
+    do even more</syntax>
+Notice the colon ending each line of the conditions and again the indented code, telling the interpreter where the corresponding code for each condition ends.
+We are going to use a lot of conditions and loops, but as you might have noticed Python has some tricks that make us avoid these statements.
+About <strong>code layout</strong> just one word: indentation. No brackets, parentheses, curly braces, etc. Yes, we have seen brackets and parentheses, but not to tell the interpreter where loops and conditions start and end. This is taken care by indentation, making our life easier and the code more beautiful.
+Next we will see Python's ability to find motifs in words, mainly on DNA sequences.