Open writing projects/Beginning Python for Bioinformatics

Beginning the begin
This site is based on the book Beginning Perl for Bioinformatics by James Tisdal which was published in 2001. My idea here is to follow the structure of the book, analysing each chapter and converting the Perl scripts into Python. The original book is very well written and an excellent starting point for any aspiring bioinformatician, either if you are a biologist that does not understand programming or a computer scientist that does not know a lot of biology and maybe even Perl.

In no way this site tries to plagiarize the book, as it is only used as an starting point (a very good one indeed) to this journey into Python. Here you will not find biological concept explanations and criticisms towards Perl. Making this clear, I will start from the beginning.

Why Python (and not Perl)?
According to the official Python website:

''Python and Perl come from a similar background (Unix scripting, which both have long outgrown) [to learn more about that check this tutorial], and sport many similar features, but have a different philosophy. Perl emphasizes support for common application-oriented tasks, e.g. by having built-in regular expressions, file scanning and report generating features. Python emphasizes support for common programming methodologies such as data structure design and object-oriented programming, and encourages programmers to write readable (and thus maintainable) code by providing an elegant but not overly cryptic notation. As a consequence, Python comes close to Perl but rarely beats it in its original application domain; however Python has an applicability well beyond Perl's niche.''

I couldn't explain better than that. But still I have to give my take on why I prefer Python over Perl, and why I decided to use it in my day-to-day programming. First I have to admit that I am lousy Perl programmer (not even close to an apprentice monger) and I always get confused by its syntax. Second I come from a Basic/Pascal/C++ background, all of them having slightly better syntax than Perl. Thus, it was natural to get on the Python bandwagon, and as the paragraph above states Python code is "extremely" readable (emphasis are mine); in no-time you can grasp it completely. OK, I admit that it has at least one odd feature for the non-computer savvy: the "mandatory" indentation. In Python you have to tabulate (using tabs or space <- recommended) loops, if clauses, functions, anything. Maybe this is the first and only hard step to get, but after a couple of hours of coding you will be satisfied on how good your code look.

Hands on code: Sequences and strings - part I
As pointed in Beginning Perl for Bioinformatics, a large percentage of bioinformatics procedures deals with strings, especially DNA and amino acids sequence data. It is largely known that DNA is composed of four different nucleotides: A, C, T and G and proteins can contain up to 20 amino acids. Each one of these elements have one letter of the alphabet assigned to them. In the DNA case some letters represent one or more nucleotides that can be present at some sequence position (click here for more).

So as the amino acid is the basic building block of proteins (AKA polypeptides), strings containing sequence is our most basic building block, from where all the bioinformatics magic will work on.

Usually in Perl a string is represented by the dollar sign in front of the variable name, like this. Python is dynamically typed, meaning variable types are assigned/discovered by the interpreter at run time. This means that the value after the equal sign will tell the interpreter what variable type you are declaring. So in Python if you want to store a DNA sequence you can just enter

myDNA = "ACGTACGTACGTACGTACGTACGT"

&rarr; '' a quick note: Python can be used with the interpreter command line or by scripts edited and saved in any text editor. I will try to use the latter in the code examples.''

OK, we are ready to create our first Bioinformatics Python Hello World Script (pompous name for a simple thing). Let's get the sequence above and print it on the screen. The first line of our code will tell the operating system what to use and where to find it


 * 1) ! /usr/bin/env python

Next we will create the variable myDNA and assign the corresponding sequence

myDNA = "ACGTACGTACGTACGTACGTACGT"

And finally, we will print the contents of the variable to the screen:

print myDNA

As mentioned above, Python mandates that you have your code indented, so our final code will look like this:

myDNA = "ACGTACGTACGTACGTACGTACGT" print myDNA
 * 1) ! /usr/bin/env python

Hands on code: Sequences and Strings - part II
Another important task for many biologists is to merge/concatenate different strings of DNA in one unique sequence. We modify the previous script in order to have two distinct DNA sequences in one.

We start using code_01 structure:

myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print myDNA, myDNA2
 * 1) ! /usr/bin/env python

So far, we added a new string containing an extra DNA sequence and we  both sequences. In Python the  statement automatically adds a new line at the end of the string to be printed, unless you add a comma  at the end. The comma is also needed if you are going to print more than one string in order to separate them (try removing the comma from the code above).

Now, how do we merge myDNA and myDNA2? Easy in Python: just sum them with a plus signal:

myDNA3 = myDNA + myDNA2 print myDNA3

Notice that in Python strings are immutable, meaning they cannot be changed. This immutability confer some advantages to the code where strings (in Python strings are not variables) cannot be modified anywhere in the program and also allowing some performance gain in the interpreter. So, in order to have our sequences merged we created a third sequence that received both strings. Finally our code will be (some captions were added):

myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print "First and Second sequences" print myDNA, myDNA2 myDNA3 = myDNA + myDNA2 print "Concatenated sequence" print myDNA3
 * 1) ! /usr/bin/env python

Importing and regular expressions
Tisdall's book on Perl introduces next the ability to transcripts DNA sequences into RNA. In order to do that we need to check a different aspect of programming: regular expressions (or regex). Regular expressions is a pattern/string expression that works matching/describing/filtering other strings.

Let's say you want to examine or extract all vowels contained in one phrase, one page, one word. Another example would be to remove all html tags from a downloaded webpage. As HTML tags are encapsulated between  and   signs we can create a regex that will search for any characters in between the signs and remove (parse) them from our page. We will deal very briefly with regex, and if you are interested in learning more about it you can search for countless references on the internet (such as this one).

In order to use regular expression in Python we need to check another concept present in the language: importing modules. Python core functionality provides most of the usual operations and also comes with a built-in library of functions that "access to operations that are not part of the core of the language but are nevertheless built in". One of this operations is the ability to interpret regular expression that in Python is located in the  module. Apart from the language core, built in modules, Python can be further extended by using third-party modules imported into the language. Anyone can create a module and distribute to every Python user and programmer.

So, in order to have regex capabilities we literally have to  the regex module. We do that by entering the line:

import re

Python's code style guide suggests that  statements should be on separate lines

import sys import re ...

and to be put on the top of the file (usually below the line that tells your OS to use Python's interpreter on the script).

So the first two lines of our new script would be

import re
 * 1) ! /usr/bin/env python

Now that we have the ability to use regex, we need to create one expression that will transcribe our DNA sequences into RNA.

Python's print statement
Just an apart from the bioinformatics aspect of programming: Python's  statement.

As in most computer languages Python allows an easy way to write to the standard output. Python's  only accepts output of strings, and if the variable sent to it is not a string it is first converted and then output.

The  always put a line-break (  or  ) at the end of the expression to be output, except when the   statement ends with a comma. For example:

print "This is a" print "test"

will print

This is a

test

while

print "This is a", print "test",

will print

This is a test

Of course Python's  statement allows any programming escape character, such as   and.

Concatenating strings on output

To concatenate two strings on output there are two possible ways in Python. You can either separate the strings with a comma, like we did here

print myDNA, myDNA2

or you can use the "+" sign in order to obtain almost the same result. This is similar to what was used here

myDNA3 = myDNA + myDNA2

but instead we would use the  command as

print myDNA3 + myDNA

In the latter case, both strings will not be separated by a space and will be merged. To get the same result you would have to concatenate an extra space between the strings like

print myDNA3 + " " + myDNA

The Regular Expression
As mentioned above, regex in Python are provided by the  module, which provides an interface for the regular expression engine. First thing we have to do is to tell the interpreter what to do and what expression to use.

Let's start with a DNA sequence.

myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'

How to transcribe it to RNA? Transcription creates a single-strand RNA molecule from the double-strand DNA; basically the final result is a similar sequence, with all 's changed to  's. So our regular expression has to find all   nucleotides in the above sequence and then replace them.

Regular expressions in Python need to be compiled into a RegexObject, that contains all possible regular expression operations. In our case we need to search and replace, what can be done by using the  method. According to the Python's Regular Expression HOWTO  returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement replacement. If the pattern isn't found, string is returned unchanged.

Let's put everything above in real code. First we need to compile a new RegexObject that will search for all thymines in our sequence. It can be achieved by using this:

regexp = re.compile('T')

Simple as that. This line of code tells the Python interpreter that our "regular expression" is every T in our string. Now, we have to make replace those Ts with Us. In order to do that we just tell the interpreter:

myRNA = regexp.sub('U', myDNA)M

Let's look at the last two lines of code. On the first line we created a new RegexObject,  (that could have any name, as any variable) and compiled it, making our regular expression to be every T in our string. On the second line, we assigned our soon to be created RNA sequence to a new string (remember that strings in Python are immutable) and used the command  to replace in the Ts by Us present in our original DNA string. Putting all together our transcription code will be

import re myDNA = 'ACGTTGCAACGTTGCAACGTTGCA' regexp = re.compile('T') myRNA = regexp.sub('U', myDNA) print myRNA
 * 1) ! /usr/bin/env python

You can download the resulting script here.

Transcribing: the "other" way
We have seen how to transcribe DNA using regular expression, even though the regex we have used cannot be considered a real one. Now we are going to simplify our small script even more and take advantage of some string capabilities of Python. Instead of using two lines, we are going to use only one. And also we won't need to import anything.

Let's start again with the same DNA sequence

myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'

This time we are going to use. This is one of the Python's methods to manipulate strings. Basically, you are asking the interpreter to get a certain string by another. The method returns a new copy of your string. We add this line

myRNA - myDNA.replace('T', 'U')

This tells Python:  will receive a copy of   where all Ts were changed by Us. the "dot" after  means that the method   will get that variable as input on that variable.

So our code from above would like this

myDNA = 'ACGTTGCAACGTTGCAACGTTGCA' myRNA = myDNA.replace('T', 'U') print myRNA
 * 1) ! /usr/bin/env python

Simple and efficient. Next we will use the same approach on generating the reverse complement of a DNA sequence, with no regex pattern.

Reading files in Python
We are currently following Chapter 4 of Beginning Perl for Bioinformatics, which is the first chapter of the book that actually has code snippets and real programming. The last exercises in this chapter deal with the ability to read files and operate with information extracted from these files, to create arrays and scalar list in Perl.

We are going to check how to read files in python. The book tells you how to read protein sequences. Here we are going to read DNA and protein sequences from files and change them.

Let say you have a file with a DNA sequence in some directory in your hard disk. The file cannot be a FASTA type (we will see later how to handle FASTA files), just pure sequence, something like this:

GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG

You can download the file here. This is a partial sequence of a mitochondrial gene from a South American frog species called Hylodes ornatus. For our purposes you can save the file in the same directory you are going to run the script from, or if you are using the Python interpreter start it in the directory that contains the file.

The file name is not important but I will use AY162388.seq from now on. The first thing we have to do is to open the file for reading. We define a string variable that will contain the file name.

dnafile = "AY162388.seq"

In order to open the file, we can use the command, that receives two strings: the first is the file name (it can be the whole location too) to be opened and the mode to be used, which is what you want to do with the file. The mode can be one or more letters that tell the interpreter what to do. For now we are going to use the  mode, which tells Python to read the file, and only do that. So our code is

file = open(dnafile, 'r')

is a file object that contains the directives to read our DNA sequence. Now, we have actually read the contents of the file but they are stored in a file object and we did not accessed it yet. We can achieve that by using a myriad of commands. We will start with the commonest one: we are going to read the file line by line. We will present different ways of improving our "reading performance" later.

So,  is our file object. We have just opened it, but Python already knows that any file contains lines (remember that this is a regular ASCII file). We are going to use a loop to read each line of the file, one by one

for line in file: print line

In Python, the  loop/statement iterates over items in a sequence of items, usually a string or a list (we will see Python's   soon), instead of iterating over a progression of numbers. Basically, our  above will iterate over each line in the file until EOF (end-of-file) is reached. Our simple script to read a DNA sequence from a file and output to the screen is

dnafile = "AY162388.seq" file = open(dnafile, 'r') for line in file: print line
 * 1) ! /usr/bin/env python

You can download the above script here. To run it have the  in the same directory.

Reading files in Python: using lists
Let's improve our previous script and put the contents of the file in a variable similar to an array. Python understands different formats of compound data types, and  is the most versatile. A  in Python can be assigned by a series of elements (or values) separated by a comma and surrounded by square brackets

shoplist = ['milk', 1, 'lettuce', 2, 'coffee', 3]

Now, we are going to read the same file and store the DNA sequence in a  and output this variable. The beginning of the script is the same, where we basically tell Python that the file name is AY162388.seq.

dnafile = "AY162388.seq"
 * 1) ! /usr/bin/env python

We are going to change the way we read the file. Instead of just opening and then reading line-by-line, we are going to open it a read all the lines at once, by using this

file = open(dnafile, 'r').readlines

Notice the part in bold? In the previous script, we open and store the contents of the file in a. Now, we are opening the file and just after it is opened, we are reading all the lines of the file at once and storing them in , but is a Python's   of strings.

Before, if we wanted to manipulate our DNA sequence, we would had to read it, and then in the loop store in a variable of our choice. In this script, we do that all at once, and the result is a variable that we can change the way we wanted. The code without the output part is

dnafile = "AY162388.seq" file = open(dnafile, 'r').readlines
 * 1) ! /usr/bin/env python

Try putting a  statement after the last line to print the   list. You will get something like this

['GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC\n', 'TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC\n', 'AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG\n', 'GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC\n', 'ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA\n', 'AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC\n', 'AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA\n', 'TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG\n']

which is exactly the description of a Python's list. You see all lines, separated by comma and surrounded by square brackets. Notice that each line has a carriage return symbol at the end.

Let's make the output a little nicer including a loop. Remember when I introduced loop I wrote that Python iterates over "items in a sequence of items", what is a good synonym for. So the loop should be as straightforward as

for line in file: print line

Putting everything together gives us

dnafile = "AY162388.seq" file = open(dnafile, 'r').readlines print file for line in file: print line
 * 1) ! /usr/bin/env python

that can be downloaded here. Next we will work on improving the output again and maybe modify/convert the.

"Manipulating" Python lists I
Now, we want to manipulate the DNA sequence, extract some nucleotides, check lines, etc. Simple things. We start with the same basic code to read the file:

dnafile = "AY162388.seq" file = open(dnafile, 'r').readlines
 * 1) ! /usr/bin/env python

Our nucleotides are stored in the variable. Remember that each line is one item of the list and the lines still contain the carriage return present in the ASCII file. Let's get the first and the last lines of the sequence. The first line is easy to get, as Python's lists start at 0. To access one list item just add square brackets with the index number of the item you want to get (this is also known as slicing). Something like this

file[0]

will return the item 0 from the list, that in our case is the firs line of the sequence. If you add a  command

print file[0]

you should expect

GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC

The last line is a little bit trickier. Let's assume that we don't know the number of lines in the list, and here we want to make our script as general as possible, so it can handle some simple files later. It is also good code practice to think ahead and plan what you want, first to have a detailed project to follow, and second it allows you to be prepared to errors/bugs that your code might have or situations not expected in your original plan.

In our file, we have eight lines of DNA, so it would be just adding this  and we would output the last line. But, the right way to do it is to check the length of the list and output the item which has an index equal to the list length. In Python, you can check the length of a list by adding the built-in function  before the list name, like this

len(file)

So who do we print the last line of our sequence? Simple. should return an integer of value 8, which is the actual number of elements in our list. We already know that to access any item in a list we just add its index (that has to be an integer) to the list name. One idea then would be to use  as the index, like this

print file[len(file)]

Why would that be wrong? Our list has eight items, but the indexes are from 0 to 7. So eight would be one index over the list length, which is not accessible because it does not exist. Solution? Let's use the list length minus one:

print file[len(file)-1]

and there you are, the last line of the sequence.

Putting everything together, we have

dnafile = "AY162388.seq" file = open(dnafile, 'r').readlines print 'I want the first line' print file[0] print 'now the last line' print file[len(file)-1]
 * 1) ! /usr/bin/env python

Two points worth mentioning: differently of strings, Python's lists are mutable, items can be removed, deleted, changed, and strings also can be sliced by using indexes that access characters.

Next we will see some more features of lists and strings, and how to manipulate them. It will probably be the last entry in the first section as we finish the Chapter 4 in the book. As you may have noticed some items in the Perl book will not be covered, at least not immediately. We will jump back and forth sometimes.

"Manipulating" Python lists I
As mentioned we will see in this entry some other features of Python lists. We will start with a similar example to the one in the book and then use our DNA file. So let's assume we have this simple list

nucleotides = [ 'A', 'C', 'G'. 'T']

If we print it directly we would get something like this

['A', 'C', 'G', 'T']

which is fine for now, as we are not worried (yet) with the output (what we will do further below). Let's remove the last nucleotide. To accomplish that, we use  accepts any valid index of the list. Any index larger that the length of the list will return an error. For future reference, remember that when any item is removed (and inserted) the indexes change and the length also. It may seems obvious but mistakes are common.

Shifting from our 'destructive' mode, we cal also add elements to the list. Adding to the end of the list is trivial, by using

nucleotides = [ 'A', 'C', 'G'. 'T'] nucleotides.append('A')

that returns

nucleotides = [ 'A', 'C', 'G'. 'T', 'A']

Adding to any position is also very straightforward with, like this

nucleotides = [ 'A', 'C', 'G'. 'T'] nucleotides.insert(0, 'A')

where  takes two arguments: first is the index of the element before which to insert and second the element to be inserted. So our line above will insert an 'A' just before the 'A' at position zero. We can try this

nucleotides = [ 'A', 'C', 'G'. 'T'] nucleotides.insert(0, 'A1') nucleotides.insert(2, 'C1') nucleotides.insert(4, 'G1') nucleotides.insert(6, 'T1')

that will result in

['A1', 'A', 'C1', 'C', 'T1', 'T', 'G1', 'G']

Notice that we add every new item at an even position, due to the fact that for every insertion the list's length and indexes change.

And for last, we will take care of the output. Of course if are creating a script that requires a nicer output, printing a list is not the best way. We could create a loop and merge all entries in the list, but that would be a couple of lines and we ought to have an easier way (otherwise we could be using C++ instead). There is a way, by using the method. This method will take all the elements in a list into a single string, and even a delimiter can be used. So something like this

nucleotides = [ 'A', 'C', 'G'. 'T'] "".join(nucleotides)

will generate this output

ACGT

Join is a method that applies to strings. So the first "item" is a string, that could be anything (in our case is an empty one). The code line tells Python to get the empty string an join it to the list of strings that we call nucleotides. We could replace the line for something easier to understand

nucleotides = [ 'A', 'C', 'G'. 'T'] myresult = '' myresult.join(nucleotides) print myresult

So we initialize an empty string, join it with the list and print in the end with identical results. Try changing  initialization and see what happens.

With this we finish the first section of the site and we are moving to chapter 5 in the book.

Flow control in Python
We start following the fifth chapter of BPB. The first item is about flow control and code layout, which are very relevant for our tutorial. I already introduced briefly both aspects in past entries on the site, but it is always good to check. As the book, I will start with flow control.

Flow control is the way your instructions are executed by the program. Most languages have a linear flow control, meaning every line is executed from top to bottom. This linear flow control can be disrupted by two types of statements: looping and branching. Looping statements tell the computer to execute a determined set of commands until certain condition is met. This can be a numeric value (ie from 1 to 100) or the number of items in a list (like our shop list from before). Branching statements are also known as conditional statements, tell the computer to execute/or not determined lines depending on certain conditions.

The  loop was shown before. In its for loop Python iterates over the elements in a list like this

for lines in text: do something

Notice that the first line of the loop ends in a colon. This and the word  in the line> tell the interpreter that this a for loop and the indented block below is the code to be executed repeatedly until the last element in the list is reached. How Python knows where the loop ends? Indentation. Many languages use curly braces, parentheses, etc. In Python the loop ends by checking the indentation level of lines (this will help us a lot when discussing code layout).

We've already seen one example of loop in Python,, but Python accepts other types of loop structures, such as  , that uses the same indented properties to execute the commands.

mycounter = 0 while mycounter == 0: do something

Take a closer look at the  line. See something different? Maybe not, if you are used to programming. In Python equality is tested with a double equal sign, while a sole equal sign assigns a value to a variable. You can also test for inequality, greater and less than, with,   and   respectively.

"Attention: is quite common to generate errors by substituting by, so pay attention when coding."

Branching statements are the conditional commands in a computer language, usually governed by. In Python a branching statement would look like

if value == 1: do something elif value == 2: do something else else: do even more

Notice the colon ending each line of the conditions and again the indented code, telling the interpreter where the corresponding code for each condition ends.

We are going to use a lot of conditions and loops, but as you might have noticed Python has some tricks that make us avoid these statements.

About code layout just one word: indentation. No brackets, parentheses, curly braces, etc. Yes, we have seen brackets and parentheses, but not to tell the interpreter where loops and conditions start and end. This is taken care by indentation, making our life easier and the code more beautiful.

Next we will see Python's ability to find motifs in words, mainly on DNA sequences.

Finding motifs in DNA sequences
As you might have noticed, BPB generally uses protein sequences. I am a "DNA guy", and basically in our simple examples either type of sequence (except the example on transcribing) could have been used. I will stick with this molecule for a while, or until I can.

We are going to use our good old  file, still assigning the file name inside the script there will be a twist in the end. Also, remember the Regular Expression module? We are going to use it too. I am going to do just as the book: I will paste the code below with comments in the middle and explain it afterwards.

import re import string dnafile = "AY162388.seq" seqlist = open(dnafile, 'r').readlines temp = ''.join(seqlist) sequence = temp.replace('\n', '') inputfromuser = True while inputfromuser: #raw_input received the user input as string inmotif = raw_input('Enter motif to search: ') #now we check for the size of the input if len(inmotif) >= 1: #we compile a regex with the input given motif = re.compile('%s' % inmotif) #looking to see if the entered motif is in the sequence if re.search(motif, sequence): print 'Yep, I found it' else: print 'Sorry, try another one' else: print 'Done, thanks for using motif_search' inputfromuser = False
 * 1) !/usr/bin/env python
 * 2) finding motifs in DNA sequences
 * 3) we use the RegEx module
 * 1) still keep the file fixed
 * 1) opening the file, reading the sequence and storing in a list
 * 1) let's join the the lines in a temporary string
 * 1) assigning our sequence, with no carriage returns to our
 * 2) final variable/object
 * 1) we start to deal with user input
 * 2) first we use a boolean variable to check for valid input
 * 1) while loop: while there is an motif larger than 0
 * 2) the loop continues

Now let's dissect the code, in a biological way. We already seen everything up to the part the list's lines are joined. First new lines for us is this

sequence = temp.replace('\n', '')

Most of the methods in Python have very intuitive names (ok, most languages do), so it is easy to deduce that  actually replaces something. And why do we need to use this method? First, we joined the lines in one temporary string (yep, strings are immutable), but the lines come with everything, including carriage returns that we need to get rid of. So our "final" string  receives the value in   and we apply the method   to modify it. has two arguments, the first is the string we want to change from, while the second is the string we want to replace with (in fact  accepts three arguments, the last being the number of times we want to do the change). In our case, we are telling the interpreter to get every substring  and put an empty one in their places.

The next line is a simple value assignment:

inputfromuser = True

and the variable will manage the  that checks input from the user. Basically we will run the loop until a certain type of input is given, that will make the variable value become. That's why we have the line

while inputfromuser

with no check. Yep, notice that we don't need to check for the variable's value, Python assumes that it is True. To understand better, imagine that  is a flag that appears when True and disappears when False. Python can only "see" it when it appears, and when it does disappear it needs to check for it. We will elaborate more later.

Another different line awaits us

inmotif = raw_input('Enter motif to search: ')

is a function that takes a line input by the user and returns a string. In our case, we assign the value returned by the function to a new string called. Also,  has one prompt argument which is written to the screen with no trailing line, waiting for the user to input something. This takes us to the  loop

if len(inmotif) >= 1: #we compile a regex with the input given motif = re.compile(r'%s' % inmotif) #looking to see if the entered motif is in the sequence if re.search(motif, sequence): print 'Yep, I found it' else: print 'Sorry, try another one' else: print 'Done, thanks for using motif_search' inputfromuser = False

After getting the input from the user, we need to check the size of such input: if it is greater or equal to one, we will search it in our sequence, otherwise we will finish the loop. So if we get a valid input (valid in the sense of size, for now) we will compile the regex and search for it.

There is a difference in regex compilation. In the DNA transcribing we assigned a string to the regex directly, now we have a string coming from a variable/object

motif = re.compile(r'%s' % inmotif)

Notice the difference in the argument that is passed to the  function. In this case the regex to be compiled is coming from the string entered by the user, and we have to pass it by using Python's string formatting operator, noted as a. This formatting operator uses two parameters: a string with formatting characters and the data to be formatted, separated by the percent sign. In our case the formatting character will receive a string, hence the  (s for string), and the data to be formatted that is the input.

Previously, we used the regex function to replace characters/substrings in a sequence. This time, we are interested to know if the motif entered by the user is in our sequence. So we use the search method instead

if re.search(motif, sequence):

Notice that we do the search in at the same time we are testing for its result. The same case as the checking we do at the while loop. If there is a positive result from the regex search a True flag will be raised and the interpreter will execute the code of the initial branch, not testing for the  and

print 'Yep, I found it' else: print 'Sorry, try another one'

This condition is nested inside another condition, the one that tests for the size of the input entered. So, if our search returns a regex object, we print, otherwise the user will receive. Now back to our upper, if the user input length is equal to zero (just pressing the Enter key) the interpreter will process the line

print 'Done, thanks for using motif_search'

and then

inputfromuser = False

that will tell the while loop that there is no True variable anymore, ending the loop and consequently our script.

Let's review the script and its flow: 1) assign a filename to be opened 2) read the file 3) join the lines 4) remove carriage returns 5) ask for user input, while is valid 6) if input length is greater or equal to 1, process it    6a) compile regex with input     6b) search for it          6b1) if it is found print yes          6b2) if it is not found print no 7) if input length is equal to zero, end while loop, end string

I know it is a lot of information, take your time. The code can be downloaded here. Next I will change a bit this code, using other methods to find the motifs, and making the promised twist in the method that reads the file.

Finding motifs in DNA sequences: twist
As promised, let's change a bit our previous code, and make it more effective. If you are reading this tutorial in one-entry mode, let's check the code import re import string dnafile = "AY162388.seq" seqlist = open(dnafile, 'r').readlines temp = ''.join(seqlist) sequence = temp.replace('n', '') inputfromuser = True while inputfromuser: inmotif = raw_input('Enter motif to search: ') if len(inmotif) &gt;= 1: motif = re.compile('%s' % inmotif) if re.search(motif, sequence): print 'Yep, I found it' else: print 'Sorry, try another one' else: print 'Done, thanks for using motif_search' inputfromuser = False In order to make it more effective, let's allow the input of any file, maybe asking for the file as soon as the script is started. To accomplish that we need to "protect" the script and make it error proof (almost at least). Yes, you thought it right: we need to check if the input file exists before opening it.
 * 1) !/usr/bin/env python

We can use  statements to do that. Pretty handy. This is called an exception handler, so basically we try the validity of some command/method and depending on the result we continue our program flow or we catch the exception and do something else. Under  we put the expression to evaluate, and under   what to do in case of failure. Like this try: opening a file except: as the file does not exist, print something So, let's put in our code above. We remove this dnafile = "AY162388.seq" seqlist = open(dnafile, 'r').readlines and include this lines fileinput = True while fileinput == True: filename = raw_input('Enter file name:') if len(filename) &gt; 0: try: dnafile = open(filename, 'r') fileinput = False except: print 'File does not exist, please try again' else: fileinput = False sys.exit I know, a lot of new code. But if you take a closer look, there is only three lines we have never seen:  and the last line with. Here, the script tries to open the file provided as input, if it does find the file normal operation resumes, if does not, the script asks for another input. As in the other while loop, we control it with a boolean variable, and in the case of empty input we end the loop and the script, using a system command, in the last line of the new code.

Our script is better now, not bulletproof, but a little bit more efficient. There still a "flaw", that you can only check one file for each run of the script. At least we not stuck to our usual DNA sequence.

Download the commented code here.

Commenting in Python
As you might have noticed from the previous topics, comments in Python are defined mainly by the  sign. This is the signal used for single line comment like If you are used to C++, this would be equivalent to.
 * 1) this is a single line comment

On the other hand, multi line comments are defined by triple double quotes, opening and closing, similar to C++  , like this """this is a multi line comment""" Just a note. Resuming bioinformatics mode.

Counting nucleotides
We are going to jump the regular expression explanation that you can find in the book, and we will go directly to the section that introduces string/list/array manipulation and adapt it to Python.

We are going to see two different methods: a "long" and a "short" one. In many places and computer languages you will see that there are different ways of doing the same thing, with advantages and disadvantages. It is up to you to define which methods are better or worse, as this is a very personal matter. Some people prefer the longer way because it might be clearer and easier to understand, or it might be necessary to use it due to code maintainability. Other people would rather use the shorter path because they want to generate code faster.

It does not matter the path you select, as long as you get your task done. That's the key: focus on the end product not on how exactly got there. If you used 10 lines of code more, or 10 less, that's irrelevant as long as you did what you wanted. Live and learn and someday you will use the even shorter way.

So we start with the long way. We want to count the individual number of nucleotides in a sequence, determining they relative frequency. Remember that we read the lines of a sequence file into a list, using. But on the long method we will read the file and store the data into a string like file = open(filename, 'r') sequence = '' for line in file sequence += line This way it will be easier to "explode" the sequence in separated items. After the "explosion" we can check each item in the list and get our result. The code is below, I will be back after it. import string dnafile = "AY162388.seq" file = open(dnafile, 'r') sequence = '' for line in file: sequence += line.strip #notice the strip, to remove n seqlist = list(sequence) totalA = 0 totalC = 0 totalG = 0 totalT = 0 for base in seqlist: if base == 'A': totalA += 1 elif base == 'C': totalC += 1 elif base == 'G': totalG += 1 elif base == 'T': totalT += 1 print str(totalA) + ' As found' print str(totalC) + ' Cs found' print str(totalG) + ' Gs found' print str(totalT) + ' Ts found' Most of the above code was already covered before. Let's look at the different stuff, like the "explosion line" seqlist = list(sequence) Here we basically transform our string  into a list, by putting the object type we want before the object we want converted, like we do here print str(totalA) + ' As found' This construction  tells Python to get whatever is inside the parentheses and transform into whatever is outside. Converting the string to a list will get GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGT GCGAAGGTAGCGTAATCACTTGTTCTTTAAATAAGGACTAGTATG AATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATCAGTGAA ... and transform it into ['G', 'T', 'G', 'A', 'C', 'T', 'T', 'T', 'G', 'T', 'T', 'C', 'A', 'A', 'C', 'G', 'G', 'C', 'C', 'G', 'C', 'G', 'G', 'T', 'A', 'T', 'C', 'C', 'T', 'A', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'G', 'A', 'A', 'G', 'G', 'T', 'A', 'G', 'C', 'G', 'T', 'A', 'A', 'T', 'C', 'A', 'C', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'T', 'T', 'A', 'A', 'A', 'T', 'A', 'A', 'G', 'G', 'A', 'C', 'T', 'A', 'G', 'T', 'A', 'T', 'G', 'A', 'A', 'T', 'G', 'G', 'C', 'A', 'T',...] After "exploding", we use a  loop to iterate over every item in the list and use conditional statements to do the counts. Regarding the counts, we use this operator totalT += 1 that, in C/C++, tells the interpreter to get the value of  and add 1 to it.
 * 1) !/usr/bin/env python
 * 1) let's keep the file fixed for now
 * 1) opening the file, reading the sequence and storing in a list
 * 1) initialize a string to receive the data
 * 1) "exploding" the sequence in a list
 * 1) initializing integers to store the counts
 * 1) checking each item in the list and updating counts
 * 1) printing results

On the next post, we will see the short way and further modify the above script, that can be downloaded here.

Counting nucleotides: part II
Let's check the "short" way, that is basically a method that avoid the "explosion" of the string. Instead of transforming the sequence from the file from a string to a list, we go and use the string directly, applying one of the methods available to manipulate strings. This time we read the file at once and convert the list to a string using. To count we simply use the method  on our string. Pretty nice. import string dnafile = "AY162388.seq" seqlist = open(dnafile, 'r').readlines temp = ''.join(seqlist) totalA = temp.count('A') totalC = temp.count('C') totalG = temp.count('G') totalT = temp.count('T') print str(totalA) + ' As found' print str(totalC) + ' Cs found' print str(totalG) + ' Gs found' print str(totalT) + ' Ts found' That's the short way: using. This is a method that when applied on a string counts the number of times the substring appears in our string. We can even set a start and ending point to count. Notice one difference in this script to the previous examples: after we join the items of the list into a string we do not remove the carriage returns. Because in this case we don't need it, as it will be an extra character there that won't make any harm.
 * 1) !/usr/bin/env python
 * 1) still keep the file fixed
 * 1) opening the file, reading the sequence and storing in a list
 * 1) let's join the the lines in a temporary string
 * 1) counting
 * 1) printing results

Download the above code here.

Manipulating strings: again
I am going to finish the book's chapter 5 (our section 2) in the next topic, where I will give a small introduction on how to output data to a file in Python. On this post we will check some of the methods that can be used to manipulate strings. The book gives only a couple of methods to be used in Perl on string, but here I will show a longer list of Python methods that can be used on its immutable strings.

A full list of the methods can be found here and I will will give brief explanations on the ones I think are key for bioinformatics. Remember that string cannot be changed in Python, so we will always going to use a buffer/temp variable to store our changed string when needed.

- count this method returns the number of times you see a substring (a letter/number, a word, etc) in another string. You can also specify a start and an end positions to look for. This is handy if you are counting nucleotides/aminoacids in a sequence. This method returns an integer

totalA = sequence.count('A')

- endwith this method checks the end of your string for a determined substring. Very handy of you need to check the tail end of your sequence right away.

- find returns the position of the substring being searched, and -1 if it is not found. It is similar to the re.search we used before. This is very useful if you are looking for a determined motif/subsequence in a hurry.

- islower returns a true bool (true or false) if all characters in the string are lowercase. False if at least one of the characters is uppercase. Very handy if you need to convert lowercase to UPPERCASE files for input in some application. Works in conjunction with the isupper which is basically the opposite.

- join. We have seen this before: it concatenates strings using a determined separator.

- lower and upper, that as their name might indicate return the string converted to lowercase/uppercase. This is also good for the conversion of sequence format in input files.

I will stop here. There are many other methods that can be used. Some of the above were already covered here and in the next topics we will take a look at the other ones, creating an application that actually performs some useful function.

Writing to files
We already know how to read from files, now we are going to see how to write to them. We will start with a simple example, writing some content to a "fresh" file that does not exist in the system.

Sometime ago, we have seen the  statement in Python, that prints to the system standard output (usually the screen). This output can be redirected using  to a stream/file. But if we are going to create really professional applications (even to our own use), usually stream redirection is not really the nicest approach. In some cases the best alternative is to save a file.

Also, some posts ago, we covered the methodology to  a file. We are going to use here the same command,  to (in our case) create the file. Remember that to read the file we used

file = open(dnafile, 'r')

where the  is the mode we are using to open the file. In that case we used the read mode, now we are going to use the write mode. When we open a file with this command in write mode if the file (with the name with pass) does not exist it is created. If it does exist, all the contents of the current file will be erased, so be careful when opening a file from your script. Check for the location, file name, etc before opening the file.

So, basically we have

file = open(output, 'w')

and the file is open and ready to receive data.

Let's cheat and use the previous script that counts nucleotides and modify it to save a  file wit the results:

import string dnafile = "AY162388.seq" file = open(dnafile, 'r') sequence = '' for line in file: sequence += line.strip #notice the strip, to remove n seqlist = list(sequence) totalA = 0 totalC = 0 totalG = 0 totalT = 0 for base in seqlist: if base == 'A': totalA += 1 elif base == 'C': totalC += 1 elif base == 'G': totalG += 1 elif base == 'T': totalT += 1 resultfile = open('counts.txt', 'w') resultfile.write(str(totalA) + ' As found \n') resultfile.write(str(totalC) + ' Cs found \n') resultfile.write(str(totalG) + ' Gs found \n') resultfile.write(str(totalT) + ' Ts found \n')
 * 1) !/usr/bin/env python
 * 1) let's keep the file fixed for now
 * 1) opening the file, reading the sequence and storing in a list
 * 1) initialize a string to receive the data
 * 1) "exploding" the sequence in a list
 * 1) initializing integers to store the counts
 * 1) checking each item in the list and updating counts
 * 1) printing results

The only difference is at the end of the script. Here instead of  we use. Notice that  is a method of the opened file.(in our case called  . This method only accepts strings, so we need to convert everything to string before writing in the file. Notice also that we need to add a carriage return/newline at the end of the string to be written. Differently of ,   does not automatically puts a new line at the end of the output.

A nice summary script
Closing section two, let's use everything we saw before and write a nice script that will read a sequence file (DNA) and report us of any "errors" and the number of different nucleotides. It is very simple, but a good exercise. Depending on your needs you can easily modify it to check for other characteristics of sequences, even change it to read amino acids sequences. We start with the code, comments coming after it.


 * 1) !/usr/bin/env python

import string import sys import re

fileentered = True while fileentered == True: filename = raw_input('Please enter a file to check: ') if len(filename) >= 1: try: seqlist = open(filename, 'r').readlines sequence = ''.join(seqlist) sequence = sequence.replace('\n', '') totalA = sequence.count('A') totalC = sequence.count('C') totalG = sequence.count('G') totalT = sequence.count('T') otherletter = re.compile('[BDEFHIJKLMNOPQRSUVXZ]') extra = re.findall(otherletter, sequence) output = open(filename+'.count', 'w') output.write('Count report for file ' + filename + '\n') output.write('A = ' + str(totalA) + '\n') output.write('C = ' + str(totalC) + '\n') output.write('G = ' + str(totalG) + '\n') output.write('T = ' + str(totalT) + '\n') if len(extra) > 0: output.write('Also were found ' + str(len(extra)) + ' errors\n') for i in extra: output.write(i + ' ') else: output.write('No error found') output.close print 'Result file saved on ' + filename + '.count' except: print 'File not found. Please try again.' else: fileentered = False sys.exit

The main change here is that we use a  loop to control de program flow. Notice that we import  (not really necessary though),   and. Basically we ask for an user input, the filename, and depending on the input given we process the file or exit the program. The early exit is done with the  method which is a shortcut to get out of the script processing. Very handy. If the input is valid we  to open it. If the operation is successful, great, we read the file, count the nucleotides and use a quite scary regular expression to search all the "errors" in our sequence.

The regex is compiled with the pattern  which means "match any character in this range". Take a closer look at the pattern and you will see that is every letter in the alphabet, except for ACGT. And when searching for these "errors" instead of using  we use , which conveniently returns a list with all the substrings found. That's even more handy.

On the final part of the script we take care of the output, opening a file called  where we print the counts and the errors, if they actually exist. One thing I left from the previous post, is that we need to close the file opened to write. In some cases if the file is not properly closed, errors might occur. So always remember to close the files used for writing/appending, using.

Debugging in Python
Beginning the third section in our tutorial/guide, we are going to see the chapter six of BPB. This chapter discusses the topics of creating subroutines (in Python's case functions) and debugging the code.

We are going to start by the end. In Python, code debugging can be done as in any other programming language: Perl has pdb, C/C++ has gdb, etc. Python also has a pdb module that can be imported and run to check for errors in your code. Using this command line:

$> python -m pdb myscript

will start the debug module and this will run your script. If you are an experienced programmer, who is just starting Python, pdb usage might look simple and straightforward. On the other hand, if you don't have a lot of experience in programming I would suggest a different approach, as you become more comfortable with the language. Python has a great advantage over some other interpreted languages, allowing you to interactively code using the interpreter. So if your code is not working properly, maybe a wrong output or a value that is not being correctly calculated you have the options of coding the part of your script that is not working using the interpreter or use the first rule of debugging: include  statements that output the value of variables/objects.

Another option is to use a Python code editor, what will also help you with highlight your code. I have little experience with Python code editors, as I normally code in Linux and use Kate. Lately I have been trying Komodo edit which is a cross-platform freeware from Active State. It looks pretty good but I never tried debugging my code with it.

So, these are my advices if you are just starting to program. Maybe because of the age of Beginning Perl for Bioinformatics (published in 2001), Perl's pdb was the only option back then. Thanks to major advances on open-source and free software there are many other options nowadays to debug your code.

Python functions: simple example
Python subroutines do not exist. Everything is a function, all functions return a value (even if it's None), and all functions start with. This statement is from Dive into Python, a book on Python programming available for free. As mentioned Python functions start with the word, which is followed by the function name that is followed by the arguments the function receives in between parentheses. Something like

def my_first_function(somevalue):

Usually Python coders (sometime called Pythonistas, among others), following the Python coding style (that states: Function names should be lowercase, with words separated by underscores as necessary to improve readability.) name their functions with words separated by underscores. And we are going to use this style here, whenever a function becomes handy. The parameters passed to the function (above ) do not have a data type, Python should handle it whatever is being passed. It is also attribute of your code to handle the parameter/value passed inside the function and avoid errors. Functions also follow the same indentation of normal programming and the line after the declaration should be indented with four spaces

def my_first_function(somevalue): do_something

So, let's warm-up with functions. The following script is just the start: it adds a poly-T tail to a DNA sequence. We are going to use our old friend AY162388.seq. I will be back after the script


 * 1) ! /usr/bin/env python

def add_tail(seq): result = seq + 'TTTTTTTTTTTTTTTTTTTTT' return seq

dnafile = 'AY162388.seq' file = open(dnafile, 'r')

sequence = '' for line in file: sequence += line.strip

print sequence sequence = add_tail(sequence) print sequence

Not very useful, at first sight, but gives us an impression of what a function looks like. Basically we define a function  that receives   as a parameter. Don't worry about variable scope now, we will see it later. The rest of the script is just like things we saw before, except for the line. Here we are saving memory (yep, not that much and not even impressive) by assigning the return value of the function to the same string where we have the sequence stored. Run the script and get ready for the command line arguments.

Command line arguments and a second take on functions
We have seen, briefly, how to define and use a function in Python. Now we are going to jump forward a bit and create a new function and at the same time take a look on command line parameters that can be passed to the script.

If you have used command line applications before, you might have encountered programs that asks for a file name, a calculation parameter, etc to be passed in the command line. Python scripts are no different, they accept such parameters. For this we have the  module that has system specific parameters and functions. We have used before the, imported as an extra module function. Every operating system (even Windows) has arguments in its command line, and programming languages usually call such arguments  (in the C/C++ you have argv in the parameters of the main function). Lists in Python start at 0 (zero), and for the argument list the first item is the script/program name. Basically if we have this

$> python myscript.py DNA.txt

is the argument 0 in the list and DNA.txt is the argument 1. So whenever we create a script that receives arguments in the command line, we have to start (in most cases, be aware) from 1. In Python using system arguments in the CLI will look like

import sys

filename = sys.argv[1] valueone = sys.argv[2] ...

We will a variation of our previous script that counts the bases, now with command line arguments and a function (with no "error" checking at first)

import string import sys
 * 1) !/usr/bin/env python

def count_nucleotide_types(seq): result = [] totalA = seq.count('A') totalC = seq.count('C') totalG = seq.count('G') totalT = seq.count('T')

result.append(totalA) result.append(totalC) result.append(totalG) result.append(totalT)

return result

sequencefile = open(sys.argv[1], 'r').readlines sequence = ''.join(sequencefile) sequence = sequence.replace('\n', '') values = count_nucleotide_types(sequence) print "Found " + str(result[0] + "As" print "Found " + str(result[0] + "Cs" print "Found " + str(result[0] + "Gs" print "Found " + str(result[0] + "Ts"

Few new things here. We created a function  that should receive a string containing the sequence. The "real" first line of the program flow is the one that gets the name of the file from the command line argument, open and read it. We then convert the list to a string, modify it a but and throw it to the function. Get the result back, and done.

With functions we actually don't save coding time/length (at least here), we make out code more organized, easier to read and somewhat easier to someone else read and understand it. It is not a good coding practice to have long programs/scripts with no functions, no subdivision, no structure. Functions are sometimes good program nuggets that can be reused in the same application or even ported/copied to other applications and reused indefinitely. Soon we will see a function and class that reads a FASTA file in Python that can be used anywhere in any program that needs such feature. Try the code and come back later for more.

Ramdomization
We start section 4 with a very short introduction to randomization. Chapter 7 of BPB discuss the use of randomization to obtain mutations in DNA and protein sequences. The example given in the book is at the same time simple and interesting, as it creates a paragraph from random selections of noums, adjectives, verbs and other grammar elements. Here we are going to to create a very (stress on very) simple dice game, where each time you run the script it will throw two dices for you and two dices for the computer. It then prints the sum of the dices and tells who won the match.

Randomization is an important feature of computer languages. There are different researchers involved in the creation of the best approaches to generate random number in computers. Random number are important in the simulation of different natural processes, such as genetic mutation, gene drift, epidemiology, weather forecast, etc. The better the generator, the better the simulation.

Python has a random module included, with a myriad of functions that can perform different randomization. In this post we will see the  randomization, and in later entries we will see some other powerful functions.

Our script is quite simple, and the only new aspect for us here is the random module and the  function.


 * 1) !/usr/bin/env python

import random

dice1 = random.randint(1,6) dice2 = random.randint(1,6)

computerdice1 = random.randint(1,6) computerdice2 = random.randint(1,6)

mine = dice1 + dice2 his = computerdice1 + computerdice2

print 'mine = ' + str(mine) + ' vs. computer = ' + str(his)

if mine > his: print "I won" elif mine < his: print "Computer won" else: print "Tie. Try again"

is a function that generates an integer random number between a range specified by the number between parentheses. In the above case, we are using a dice of 6 values. So, for every call of  a random number between 1 and 6 will be generated. We store this number in a variable, sum the user's dices and the computer's and check with a  clause to see the winner. A good exercise would be to make this script interactive, allowing multiple matches.

Next we will see another random module function and then we will generate mutations on DNA sequences.

Random sequences in Python
There is a reason to say that Python has batteries included. On the last post we have seen a small example of randomization in Python, generating random integer values for a (extremely) simple dice game. We move to another example, still simple which will allow us to generate random DNA sequences. The script is


 * 1) !/usr/bin/env python

import random import sys

length = int(sys.argv[1])

dnaseq = list('ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT')

result = '' for i in range(length): result += random.choice(dnaseq)

print result

Not fancy at all, just plain simple (yet again). We import a couple of modules,  and , and ask for the sequence length as a script parameter. is a list containing a tandem repeat of ACGT, and from there we will extract our random nucleotides. In fact  could have been 'ACGT' only. The fact that we create a string and convert it to a list, is just for convenience of writing  easier than. We then declare an empty string that will be used to store the random sequence. And inside the loop, the command that does all the magic:. This command will return a random element from the list passed as subject. Using it inside a loop we will get a random nucleotide on each iteration and add it to our string.

This is a very simple command, but at the same time extremely powerful and easy to implement. A good exercise from this would be to modify the  string and see if there is any change in the final random sequence.

A script to simulate DNA sequence sets
In Beginning Perl for Bioinformatics the chapter that covers simulated mutations on a DNA sequence is quite verbose and the code examples employ some subroutines to do what we have done on the last post. Basically the code example that generates a random DNA sequence is the last one on the chapter, but it was the first one we covered. Also this code example has a twist that our code from the last post does not have, which is it allows you to generate a set of sequences with different length instead of one sequence with fixed length that our script does.

So, the script that generates a user defined simulated DNA sequence set is


 * 1) !/usr/bin/env python

import random import sys

def simulate_sequence(length): dna = ['A', 'C', 'G', 'T'] sequence = '' for i in range(length): sequence += random.choice(dna) return sequence

setsize = int(sys.argv[1]) minlength = int(sys.argv[2]) maxlength = int(sys.argv[3])

sequenceset = [] for i in range(setsize): rlength = random.randint(minlength, maxlength) sequenceset.append(simulate_sequence(rlength))

for sequence in sequenceset: print sequence

Simple and efficient. First we define a function that generates a simulated DNA sequence from the four nucleotides, again using the. The sequence length is based on the parameter received by the function. The sequence length is another random number defined in the main body of the script. This random number is generated by  with a range based on the arguments given by the user when running the script. And finally, the number of sequences to be simulated is define by the first parameter. Seventeen lines.

Next we will see how to draw some scientific information about the sequences, such as sequence identity and nucleotide frequency.

Still simulations
We use our last code as a starting point in order to generate some real information from our simulated sequence sets. Our approach here will be the same: functions to do all the work for us and a very simple main code. We also reuse some code with applied before to count the nucleotides.

Let's see the code, discussion just after it.


 * 1) !/usr/bin/env python

import random import sys

def simulate_sequence(length): dna = ['A', 'C', 'G', 'T'] sequence = '' for i in range(length): sequence += random.choice(dna) return sequence

def nucleotide_percentage(sequence): print str(sequence.count('A')) + ' As ', print str(sequence.count('C')) + ' Cs ', print str(sequence.count('G')) + ' Gs ', print str(sequence.count('T')) + ' Ts'

def sequence_identity(set): iden = [] count = 0.0 for x in range(len(set)-1): print str(x), str(x+1) for n in range(len(set[x])): if set[x][n] == set[x+1][n]: count += 1 iden.append(count/len(set[x])) count = 0.0 return iden

setsize = int(sys.argv[1]) minlength = int(sys.argv[2]) maxlength = int(sys.argv[3])

sequenceset = [] for i in range(setsize): rlength = random.randint(minlength, maxlength) sequenceset.append(simulate_sequence(rlength))

identity = sequence_identity(sequenceset)

for i in range(len(sequenceset)): print sequenceset[i] if i < len(sequenceset)-1: print 'sequence identity to next sequence : ' + str(identity[i]) nucleotide_percentage(sequenceset[i]) print

Well, not many new things here. The main one might the be the assignment of the variable, which receives initially a value of  , which in this case is a float. This variable is used the calculate the sequence identity which is a percentage, or a relative value. If we had initialized  with zero, our division   would always be zero, due to the fact that count would have been an integer.

As is pointed out in BPB, this example is more an indication that we are able to use our Python skills to actually make some real code, with some real output. One issue with this example is the fact that we only calculate sequence identity of two sequences at a time. It would be ideal to have sequence identity between all simulated sequences. This would require much more code, of course a good educational step, but it is something that can be easily obtained with classes and we will see this later on.

With this entry, we finished our Section 4 and we will start Section 5 with Python's dictionaries, moving to fasta files and classes next.

Genetic code: part I
The Python dictionary data-type is like hash in Perl. It is a very similar structure, where each element in the variable is composed of two values, more specifically a key-value pair. This is the ideal data type to store the genetic code. As you might know the genetic code governs the translation of DNA into proteins, where each codon (3 bases or nucleotides in the DNA sequence) correspond to an amino acid in the protein sequence. So for every sequence of 3 nucleotides (key) will represent an amino acid (value). Important things: dictionaries do not accept duplicated key values, and every time a new value is assigned to a key the old value is erased. To create a new dictionary use the curly brackets

first_dictionary = {}

inside the curly braces we first assign a key and separated by a colon, while multiple pairs should be separated by comma. Both key and value have to be between single or double quotes. Let's see how we will represent the genetic code in a Python dictionary, assigning values to keys

gencode = { 'ATA':'I',   #Isoleucine 'ATC':'I',   #Isoleucine 'ATT':'I',   # Isoleucine 'ATG':'M',   # Methionine 'ACA':'T',   # Threonine 'ACC':'T',   # Threonine 'ACG':'T',   # Threonine 'ACT':'T',   # Threonine 'AAC':'N',   # Asparagine 'AAT':'N',   # Asparagine 'AAA':'K',   # Lysine 'AAG':'K',   # Lysine 'AGC':'S',   # Serine 'AGT':'S',   # Serine 'AGA':'R',   # Arginine 'AGG':'R',   # Arginine 'CTA':'L',   # Leucine 'CTC':'L',   # Leucine 'CTG':'L',   # Leucine 'CTT':'L',   # Leucine 'CCA':'P',   # Proline 'CCC':'P',   # Proline 'CCG':'P',   # Proline 'CCT':'P',   # Proline 'CAC':'H',   # Histidine 'CAT':'H',   # Histidine 'CAA':'Q',   # Glutamine 'CAG':'Q',   # Glutamine 'CGA':'R',   # Arginine 'CGC':'R',   # Arginine 'CGG':'R',   # Arginine 'CGT':'R',   # Arginine 'GTA':'V',   # Valine 'GTC':'V',   # Valine 'GTG':'V',   # Valine 'GTT':'V',   # Valine 'GCA':'A',   # Alanine 'GCC':'A',   # Alanine 'GCG':'A',   # Alanine 'GCT':'A',   # Alanine 'GAC':'D',   # Aspartic Acid 'GAT':'D',   # Aspartic Acid 'GAA':'E',   # Glutamic Acid 'GAG':'E',   # Glutamic Acid 'GGA':'G',   # Glycine 'GGC':'G',   # Glycine 'GGG':'G',   # Glycine 'GGT':'G',   # Glycine 'TCA':'S',   # Serine 'TCC':'S',   # Serine 'TCG':'S',   # Serine 'TCT':'S',   # Serine 'TTC':'F',   # Phenylalanine 'TTT':'F',   # Phenylalanine 'TTA':'L',   # Leucine 'TTG':'L',   # Leucine 'TAC':'Y',   # Tyrosine 'TAT':'Y',   # Tyrosine 'TAA':'_',   # Stop 'TAG':'_',   # Stop 'TGC':'C',   # Cysteine 'TGT':'C',   # Cysteine 'TGA':'_',   # Stop 'TGG':'W',   # Tryptophan } Simple, yet efficient. But this is the type of functionality that would be great to have at hand every time you write a script to translate DNA into proteins. And it is not something that you would like to type (or even copy-and-paste) all the time. On the next post we will create the translation script and will also create our first Python module.