Open writing projects/Beginning Python for Bioinformatics
This page is part of the Open Writing Project Beginning_Python_for_Bioinformatics. (More Open Writing Projects.) |
Beginning at the Beginning
This site is based on the book Beginning Perl for Bioinformatics by James Tisdal which was published in 2001. My idea here is to follow the structure of the book, analysing each chapter and converting the Perl scripts into Python. The original book is very well written and an excellent starting point for any aspiring bioinformatician. It is helpful if you are a biologist that does not understand programming or a computer scientist that does not know a lot of biology.
In no way this site tries to plagiarize the book, as it is only used as an starting point (a very good one indeed) to this journey into Python. Here you will not find biological concept explanations and criticisms towards Perl. Making this clear, I will start from the beginning.
Why Python (and not Perl)?
According to the official Python website:
Python and Perl come from a similar background (Unix scripting, which both have long outgrown) [to learn more about that check this tutorial], and sport many similar features, but have a different philosophy. Perl emphasizes support for common application-oriented tasks, e.g. by having built-in regular expressions, file scanning and report generating features. Python emphasizes support for common programming methodologies such as data structure design and object-oriented programming, and encourages programmers to write readable (and thus maintainable) code by providing an elegant but not overly cryptic notation. As a consequence, Python comes close to Perl but rarely beats it in its original application domain; however Python has an applicability well beyond Perl's niche.
I couldn't explain better than that. But still I have to give my take on why I prefer Python over Perl, and why I decided to use it in my day-to-day programming. Python code is "extremely" readable; in no-time you can grasp it completely. OK, I admit that it has at least one odd feature for the non-computer savvy: the "mandatory" indentation. In Python you have to indent loops, if clauses, function definitions, etc. Maybe this is the first and only hard step to get, but after a couple of hours of coding you will be pleased how pretty your code looks.
Setting Up Python on Your Computer
Python is freely available for all types of computers (Windows, Macintosh, Linux, etc). Macs and Linux machines have a version of Python installed as part of the standard operating system. Windows users will have to install Python as an Application. Python is frequently updated, and the update to to version 3.0 has many significant changes. All users are encouraged to install a current version of Python from python.org [1]
There are three basic ways to work with Python on your computer.
1) You can open a terminal window and start up Python as an interactive command line application. On most systems the command to launch Python is python3
. Working in interactive mode has the advantage that commands are executed as soon as you type them (and press the enter/return key). However, it is more difficult to make changes and debug mistakes. It is very difficult to develop programs that are more than a few lines long interactively.
2) You can type a set of Python commands in a text file, then save that text file and covert it to an executable script (see below).
3) You can work in an Interactive Development Environment (an application that works as an editor for Python code). All of the downloadable packages from python.org contain the IDE called "IDLE".
Hands on code: Sequences and strings - part I
As pointed out in Beginning Perl for Bioinformatics, a large percentage of bioinformatics methods deal with data as strings of text, especially DNA and amino acids sequence data. DNA is composed of four different nucleotide bases: A, C, T and G; while proteins contain 20 amino acids. Each one of these elements have one letter of the alphabet assigned to them. In the DNA case an additional set of letters are used as ambiguity codes to represent positions which may be occupied by one or more different nucleotides. [2] .
Perl and many other programming languages use special symbols to represent variables that contain strings, like this:$sequence= "GATC"
. Python is dynamically typed, meaning variable types are assigned/discovered by the interpreter at run time. This means that the value after the equal sign will tell the interpreter what variable type you are declaring. So in Python if you want to store a DNA sequence you can just enter:
myDNA = "ACGTACGTACGTACGTACGTACGT"
OK, you are ready to write your first Bioinformatics Python script. A script is a fancy name for a simple text file that contains code in a programming language. Python can be used with the interpreter command line or by scripts edited and saved in any text editor. On Unix systems (including Mac OSX), you need to make your text file executable in order to use it as a script, which usually means that after creating and saving the file, you will need to change the mode. Type the change mode command in the terminal as follows (where filename is the name of your script file):
chmod u+x filename
Let's get the sequence above and print it on the screen. The first line of code will tell the operating system to use Python to read this program and where to find Python on your computer.
#! /usr/bin/env python
Next, create the variable myDNA and assign the corresponding sequence to it:
myDNA = "ACGTACGTACGTACGTACGTACGT"
And finally, print the contents of the variable to the screen:
print myDNA
As mentioned above, Python mandates that you have your code indented, so our final code will look like this:
#! /usr/bin/env python myDNA = "ACGTACGTACGTACGTACGTACGT" print myDNA
Hands on code: Sequences and Strings - part II
Another important task for many biologists is to merge/concatenate different strings of DNA in one unique sequence. We modify the previous script in order to have two distinct DNA sequences in one.
We start using code_01 structure:
<syntax type=python>
- ! /usr/bin/env python
myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print myDNA, myDNA2 </syntax>
So far, we added a new string containing an extra DNA sequence and we print
both sequences. In Python the print
statement automatically adds a new line at the end of the string to be printed, unless you add a comma (,) at the end. The comma is also needed if you are going to print more than one string in order to separate them (try removing the comma from the code above).
Now, how do we merge myDNA and myDNA2? Easy in Python: just sum them with a plus signal:
<syntax type=python> myDNA3 = myDNA + myDNA2 print myDNA3 </syntax>
Notice that in Python strings are immutable, meaning they cannot be changed. This immutability confer some advantages to the code where strings (in Python strings are not variables) cannot be modified anywhere in the program and also allowing some performance gain in the interpreter. So, in order to have our sequences merged we created a third sequence that received both strings. Finally our code will be (some captions were added):
<syntax type=python>
- ! /usr/bin/env python
myDNA = "ACGTACGTACGTACGTACGTACGT" myDNA2 = "TCGATCGATCGATCGATCGA" print "First and Second sequences" print myDNA, myDNA2 myDNA3 = myDNA + myDNA2 print "Concatenated sequence" print myDNA3 </syntax>
Importing and regular expressions
Tisdall's book on Perl introduces next the ability to "transcribe" DNA sequences into RNA. In order to do that we need to study a different aspect of programming: regular expressions (or regex). Regular expressions are a pattern/string expression that are used for matching/describing/filtering other strings.
Let's say you want to examine or extract all vowels contained in one phrase, one page, one word. Another example would be to remove all html tags from a downloaded webpage. As HTML tags are encapsulated between <
and >
signs we can create a regex that will search for any characters in between the signs and remove (parse) them from our page. We will deal very briefly with regex, and if you are interested in learning more about it you can search for countless references on the internet (such as this one).
In order to use regular expression in Python we need to also learn about another concept present in the language: importing modules. Python core functionality provides most of the usual operations and also comes with a built-in library of functions that "access to operations that are not part of the core of the language but are nevertheless built in". One of this operations is the ability to interpret regular expression that in Python is located in the re
module. Apart from the language core, built in modules, Python can be further extended by using third-party modules imported into the language. Anyone can create a module and distribute to every Python user and programmer.
So, in order to have regex capabilities we literally have to import
the regex module. We do that by entering the line:
<syntax type=python>import re</syntax>
Python's code style guide suggests that import
statements should be on separate lines
<syntax type=python>import sys import re
...</syntax>
and to be put on the top of the file (usually below the line that tells your OS to use Python's interpreter on the script).
So the first two lines of our new script would be
<syntax type=python>#! /usr/bin/env python import re</syntax>
Now that we have the ability to use regex, we need to create one expression that will transcribe our DNA sequences into RNA.
Python's print statement
Just an apart from the bioinformatics aspect of programming: Python's print
statement.
As in most computer languages Python allows an easy way to write to the standard output. Python's print
only accepts output of strings, and if the variable sent to it is not a string it is first converted and then output.
The print
always put a line-break ('\n'
or "\n"
) at the end of the expression to be output, except when the print
statement ends with a comma. For example:
<syntax type=python>print "This is a" print "test"</syntax>
will print
This is a
test
while
<syntax type=python>print "This is a", print "test",</syntax>
will print
This is a test
Of course Python's print
statement allows any programming escape character, such as '\n'
and '\t'
.
Concatenating strings on output
To concatenate two strings on output there are two possible ways in Python. You can either separate the strings with a comma, like we did here
<syntax type=python>print myDNA, myDNA2</syntax>
or you can use the "+" sign in order to obtain almost the same result. This is similar to what was used here
<syntax type=python>myDNA3 = myDNA + myDNA2</syntax>
but instead we would use the print
command as
<syntax type=python>print myDNA3 + myDNA</syntax>
In the latter case, both strings will not be separated by a space and will be merged. To get the same result you would have to concatenate an extra space between the strings like
<syntax type=python>print myDNA3 + " " + myDNA</syntax>
The Regular Expression
As mentioned above, regex in Python are provided by the re
module, which provides an interface for the regular expression engine. First thing we have to do is to tell the interpreter what to do and what expression to use.
Let's start with a DNA sequence.
<syntax type=python>myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'</syntax>
How to transcribe it to RNA? Transcription creates a single-strand RNA molecule from the double-strand DNA; basically the final result is a similar sequence, with all T
's changed to U
's. So our regular expression has to find all T
nucleotides in the above sequence and then replace them.
Regular expressions in Python need to be compiled into a RegexObject, that contains all possible regular expression operations. In our case we need to search and replace, what can be done by using the sub()
method. According to the Python's Regular Expression HOWTO sub()
returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement replacement. If the pattern isn't found, string is returned unchanged.
Let's put everything above in real code. First we need to compile a new RegexObject that will search for all thymines in our sequence. It can be achieved by using this:
<syntax type=python>regexp = re.compile('T')</syntax>
Simple as that. This line of code tells the Python interpreter that our "regular expression" is every T in our string. Now, we have to make replace those Ts with Us. In order to do that we just tell the interpreter:
<syntax type=python>myRNA = regexp.sub('U', myDNA)M</syntax>
Let's look at the last two lines of code. On the first line we created a new RegexObject, regexp
(that could have any name, as any variable) and compiled it, making our regular expression to be every T in our string. On the second line, we assigned our soon to be created RNA sequence to a new string (remember that strings in Python are immutable) and used the command sub
to replace in the Ts by Us present in our original DNA string. Putting all together our transcription code will be
<syntax type=python>
- ! /usr/bin/env python
import re myDNA = 'ACGTTGCAACGTTGCAACGTTGCA' regexp = re.compile('T') myRNA = regexp.sub('U', myDNA) print myRNA </syntax>
You can download the resulting script here.
Transcribing: the "other" way
We have seen how to transcribe DNA using regular expression, even though the regex we have used cannot be considered a real one. Now we are going to simplify our small script even more and take advantage of some string capabilities of Python. Instead of using two lines, we are going to use only one. And also we won't need to import anything.
Let's start again with the same DNA sequence
<syntax type=python>myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'</syntax>
This time we are going to use replace
. This is one of the Python's methods to manipulate strings. Basically, you are asking the interpreter to get a certain string by another. The method returns a new copy of your string. We add this line
<syntax type=python>myRNA - myDNA.replace('T', 'U')</syntax>
This tells Python: myRNA
will receive a copy of myDNA
where all Ts were changed by Us. the "dot" after myDNA
means that the method replace
will get that variable as input on that variable.
So our code from above would like this
<syntax type=python>#! /usr/bin/env python myDNA = 'ACGTTGCAACGTTGCAACGTTGCA' myRNA = myDNA.replace('T', 'U') print myRNA</syntax>
Simple and efficient. Next we will use the same approach on generating the reverse complement of a DNA sequence, with no regex pattern.
Reading files in Python
We are currently following Chapter 4 of Beginning Perl for Bioinformatics, which is the first chapter of the book that actually has code snippets and real programming. The last exercises in this chapter deal with the ability to read files and operate with information extracted from these files, to create arrays and scalar list in Perl.
We are going to check how to read files in python. The book tells you how to read protein sequences. Here we are going to read DNA and protein sequences from files and change them.
Let say you have a file with a DNA sequence in some directory in your hard disk. The file cannot be a FASTA type (we will see later how to handle FASTA files), just pure sequence, something like this:
<syntax type=python>GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG</syntax>
You can download the file here. This is a partial sequence of a mitochondrial gene from a South American frog species called Hylodes ornatus. For our purposes you can save the file in the same directory you are going to run the script from, or if you are using the Python interpreter start it in the directory that contains the file.
The file name is not important but I will use AY162388.seq from now on. The first thing we have to do is to open the file for reading. We define a string variable that will contain the file name.
<syntax type=python>dnafile = "AY162388.seq"</syntax>
In order to open the file, we can use the command open
, that receives two strings: the first is the file name (it can be the whole location too) to be opened and the mode to be used, which is what you want to do with the file. The mode can be one or more letters that tell the interpreter what to do. For now we are going to use the r
mode , which tells Python to read the file, and only do that. So our code is
<syntax type=python> file = open(dnafile, 'r')</syntax>
file
is a file object that contains the directives to read our DNA sequence. Now, we have actually read the contents of the file but they are stored in a file object and we did not accessed it yet. We can achieve that by using a myriad of commands. We will start with the commonest one: we are going to read the file line by line. We will present different ways of improving our "reading performance" later.
So, file
is our file object. We have just opened it, but Python already knows that any file contains lines (remember that this is a regular ASCII file). We are going to use a loop to read each line of the file, one by one
<syntax type=python>for line in file:
print line</syntax>
In Python, the for
loop/statement iterates over items in a sequence of items, usually a string or a list (we will see Python's list
soon), instead of iterating over a progression of numbers. Basically, our for
above will iterate over each line in the file until EOF (end-of-file) is reached. Our simple script to read a DNA sequence from a file and output to the screen is
<syntax type=python>#! /usr/bin/env python dnafile = "AY162388.seq" file = open(dnafile, 'r') for line in file:
print line</syntax>
You can download the above script here. To run it have the AY162388.seq
in the same directory.
Reading files in Python: using lists
Let's improve our previous script and put the contents of the file in a variable similar to an array. Python understands different formats of compound data types, and list
is the most versatile. A list
in Python can be assigned by a series of elements (or values) separated by a comma and surrounded by square brackets
<syntax type=python> shoplist = ['milk', 1, 'lettuce', 2, 'coffee', 3]</syntax>
Now, we are going to read the same file and store the DNA sequence in a list
and output this variable. The beginning of the script is the same, where we basically tell Python that the file name is AY162388.seq.
<syntax type=python>#! /usr/bin/env python dnafile = "AY162388.seq"</syntax>
We are going to change the way we read the file. Instead of just opening and then reading line-by-line, we are going to open it a read all the lines at once, by using this
<syntax type=python>file = open(dnafile, 'r').readlines()</syntax>
Notice the part in bold? In the previous script, we open and store the contents of the file in a file object
. Now , we are opening the file and just after it is opened, we are reading all the lines of the file at once and storing them in file</code, which in this case is not a
file object
, but is a Python's list
of strings.
Before, if we wanted to manipulate our DNA sequence, we would had to read it, and then in the loop store in a variable of our choice. In this script, we do that all at once, and the result is a variable that we can change the way we wanted. The code without the output part is
<syntax type=python>#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()</syntax>
Try putting a
print
statement after the last line to print the file
list. You will get something like this
<syntax type=python>['GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC\n',
'TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC\n',
'AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG\n',
'GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC\n',
'ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA\n',
'AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC\n',
'AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA\n',
'TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG\n']</syntax>
which is exactly the description of a Python's list. You see all lines, separated by comma and surrounded by square brackets. Notice that each line has a carriage return (
\n
) symbol at the end.
Let's make the output a little nicer including a loop. Remember when I introduced loop I wrote that Python iterates over "items in a sequence of items", what is a good synonym for
list
. So the loop should be as straightforward as
<syntax type=python>for line in file:
print line</syntax>
Putting everything together gives us
<syntax type=python>#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()
print file
for line in file:
print line
</syntax>
that can be downloaded here. Next we will work on improving the output again and maybe modify/convert the list
.
"Manipulating" Python lists I
Now, we want to manipulate the DNA sequence, extract some nucleotides, check lines, etc. Simple things. We start with the same basic code to read the file:
<syntax type=python>#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()</syntax>
Our nucleotides are stored in the variable file
. Remember that each line is one item of the list and the lines still contain the carriage return present in the ASCII file. Let's get the first and the last lines of the sequence. The first line is easy to get, as Python's lists start at 0. To access one list item just add square brackets with the index number of the item you want to get (this is also known as slicing). Something like this
<syntax type=python>file[0]</syntax>
will return the item 0 from the list, that in our case is the firs line of the sequence. If you add a print
command
<syntax type=python>print file[0]</syntax>
you should expect
<syntax type=python>GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC</syntax>
The last line is a little bit trickier. Let's assume that we don't know the number of lines in the list, and here we want to make our script as general as possible, so it can handle some simple files later. It is also good code practice to think ahead and plan what you want, first to have a detailed project to follow, and second it allows you to be prepared to errors/bugs that your code might have or situations not expected in your original plan.
In our file, we have eight lines of DNA, so it would be just adding this print file[7]
and we would output the last line. But, the right way to do it is to check the length of the list and output the item which has an index equal to the list length. In Python, you can check the length of a list by adding the built-in function len
before the list name, like this
<syntax type=python>len(file)</syntax>
So who do we print the last line of our sequence? Simple. len(file)
should return an integer of value 8, which is the actual number of elements in our list. We already know that to access any item in a list we just add its index (that has to be an integer) to the list name. One idea then would be to use len(file)
as the index, like this
<syntax type=python>print file[len(file)]</syntax>
Why would that be wrong? Our list has eight items, but the indexes are from 0 to 7. So eight would be one index over the list length, which is not accessible because it does not exist. Solution? Let's use the list length minus one:
<syntax type=python>print file[len(file)-1]</syntax>
and there you are, the last line of the sequence.
Putting everything together, we have
<syntax type=python>#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()
print 'I want the first line'
print file[0]
print 'now the last line'
print file[len(file)-1]</syntax>
Two points worth mentioning: differently of strings, Python's lists are mutable, items can be removed, deleted, changed, and strings also can be sliced by using indexes that access characters.
Next we will see some more features of lists and strings, and how to manipulate them. It will probably be the last entry in the first section as we finish the Chapter 4 in the book. As you may have noticed some items in the Perl book will not be covered, at least not immediately. We will jump back and forth sometimes.
"Manipulating" Python lists I
As mentioned we will see in this entry some other features of Python lists. We will start with a similar example to the one in the book and then use our DNA file. So let's assume we have this simple list
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']</syntax>
If we print it directly we would get something like this
<syntax type=python>['A', 'C', 'G', 'T']</syntax>
which is fine for now, as we are not worried (yet) with the output (what we will do further below). Let's remove the last nucleotide. To accomplish that, we use
pop
with no specific index, like this
<syntax type=python>nucleotides.pop()</syntax>
which gives me this when printed
<syntax type=python>['A', 'C', 'G']</syntax>
Remember that lists are mutable, so the removed item is lost. We can also remove any other in the list, let's say 'C'. First, we reassign the original list items and then remove the second item
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.pop(1)</syntax>
The list when printed will return this
<syntax type=python>['A', 'G', 'T']</syntax>
pop
accepts any valid index of the list. Any index larger that the length of the list will return an error. For future reference, remember that when any item is removed (and inserted) the indexes change and the length also. It may seems obvious but mistakes are common.
Shifting from our 'destructive' mode, we cal also add elements to the list. Adding to the end of the list is trivial, by using append
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.append('A')</syntax>
that returns
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T', 'A']</syntax>
Adding to any position is also very straightforward with insert
, like this
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.insert(0, 'A')</syntax>
where insert
takes two arguments: first is the index of the element before which to insert and second the element to be inserted. So our line above will insert an 'A' just before the 'A' at position zero. We can try this
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.insert(0, 'A1')
nucleotides.insert(2, 'C1')
nucleotides.insert(4, 'G1')
nucleotides.insert(6, 'T1')</syntax>
that will result in
<syntax type=python>['A1', 'A', 'C1', 'C', 'T1', 'T', 'G1', 'G']</syntax>
Notice that we add every new item at an even position, due to the fact that for every insertion the list's length and indexes change.
And for last, we will take care of the output. Of course if are creating a script that requires a nicer output, printing a list is not the best way. We could create a loop and merge all entries in the list, but that would be a couple of lines and we ought to have an easier way (otherwise we could be using C++ instead). There is a way, by using the method join
. This method will take all the elements in a list into a single string, and even a delimiter can be used. So something like this
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
"".join(nucleotides)</syntax>
will generate this output
<syntax type=python>ACGT</syntax>
Join is a method that applies to strings. So the first "item" is a string, that could be anything (in our case is an empty one). The code line tells Python to get the empty string an join it to the list of strings that we call nucleotides. We could replace the line for something easier to understand
<syntax type=python>nucleotides = [ 'A', 'C', 'G'. 'T']
myresult =
myresult.join(nucleotides)
print myresult
</syntax>
So we initialize an empty string, join it with the list and print in the end with identical results. Try changing myresult
initialization and see what happens.
With this we finish the first section of the site and we are moving to chapter 5 in the book.
Flow control in Python
We start following the fifth chapter of BPB. The first item is about flow control and code layout, which are very relevant for our tutorial. I already introduced briefly both aspects in past entries on the site, but it is always good to check. As the book, I will start with flow control.
Flow control is the way your instructions are executed by the program. Most languages have a linear flow control, meaning every line is executed from top to bottom. This linear flow control can be disrupted by two types of statements: looping and branching. Looping statements tell the computer to execute a determined set of commands until certain condition is met. This can be a numeric value (ie from 1 to 100) or the number of items in a list (like our shop list from before). Branching statements are also known as conditional statements, tell the computer to execute/or not determined lines depending on certain conditions.
The for
loop was shown before. In its for loop Python iterates over the elements in a list like this
<syntax type=python>for lines in text:
do something</syntax>
Notice that the first line of the loop ends in a colon. This and the word for
in the line> tell the interpreter that this a for loop and the indented block below is the code to be executed repeatedly until the last element in the list is reached. How Python knows where the loop ends? Indentation. Many languages use curly braces, parentheses, etc. In Python the loop ends by checking the indentation level of lines (this will help us a lot when discussing code layout).
We've already seen one example of loop in Python, for
, but Python accepts other types of loop structures, such as while
, that uses the same indented properties to execute the commands.
<syntax type=python>mycounter = 0
while mycounter == 0:
do something</syntax>
Take a closer look at the while
line. See something different? Maybe not, if you are used to programming. In Python equality is tested with a double equal sign (==
), while a sole equal sign (=
) assigns a value to a variable. You can also test for inequality, greater and less than, with !=
, <
and >
respectively.
Attention: is quite common to generate errors by substituting ==
by =
, so pay attention when coding.
Branching statements are the conditional commands in a computer language, usually governed by if ... then ... else
. In Python a branching statement would look like
<syntax type=python>if value == 1:
do something
elif value == 2:
do something else
else:
do even more</syntax>
Notice the colon ending each line of the conditions and again the indented code, telling the interpreter where the corresponding code for each condition ends.
We are going to use a lot of conditions and loops, but as you might have noticed Python has some tricks that make us avoid these statements.
About code layout just one word: indentation. No brackets, parentheses, curly braces, etc. Yes, we have seen brackets and parentheses, but not to tell the interpreter where loops and conditions start and end. This is taken care by indentation, making our life easier and the code more beautiful.
Next we will see Python's ability to find motifs in words, mainly on DNA sequences.
Finding motifs in DNA sequences
As you might have noticed, BPB generally uses protein sequences. I am a "DNA guy", and basically in our simple examples either type of sequence (except the example on transcribing) could have been used. I will stick with this molecule for a while, or until I can.
We are going to use our good old AY162388.seq
file, still assigning the file name inside the script there will be a twist in the end. Also, remember the Regular Expression module? We are going to use it too. I am going to do just as the book: I will paste the code below with comments in the middle and explain it afterwards.
<syntax type=python>#!/usr/bin/env python
- finding motifs in DNA sequences
- we use the RegEx module
import re
import string
- still keep the file fixed
dnafile = "AY162388.seq"
- opening the file, reading the sequence and storing in a list
seqlist = open(dnafile, 'r').readlines()
- let's join the the lines in a temporary string
temp = .join(seqlist)
- assigning our sequence, with no carriage returns to our
- final variable/object
sequence = temp.replace('\n', )
- we start to deal with user input
- first we use a boolean variable to check for valid input
inputfromuser = True
- while loop: while there is an motif larger than 0
- the loop continues
while inputfromuser:
#raw_input received the user input as string
inmotif = raw_input('Enter motif to search: ')
#now we check for the size of the input
if len(inmotif) >= 1:
#we compile a regex with the input given
motif = re.compile('%s' % inmotif)
#looking to see if the entered motif is in the sequence
if re.search(motif, sequence):
print 'Yep, I found it'
else:
print 'Sorry, try another one'
else:
print 'Done, thanks for using motif_search'
inputfromuser = False</syntax>
Now let's dissect the code, in a biological way. We already seen everything up to the part the list's lines are joined. First new lines for us is this
<syntax type=python>sequence = temp.replace('\n', )</syntax>
Most of the methods in Python have very intuitive names (ok, most languages do), so it is easy to deduce that replace
actually replaces something. And why do we need to use this method? First, we joined the lines in one temporary string (yep, strings are immutable), but the lines come with everything, including carriage returns that we need to get rid of. So our "final" string sequence
receives the value in temp
and we apply the method replace
to modify it. replace
has two arguments, the first is the string we want to change from, while the second is the string we want to replace with (in fact replace
accepts three arguments, the last being the number of times we want to do the change). In our case, we are telling the interpreter to get every substring '\n'
and put an empty one in their places.
The next line is a simple value assignment:
<syntax type=python>inputfromuser = True</syntax>
and the variable will manage the while
that checks input from the user. Basically we will run the loop until a certain type of input is given, that will make the variable value become False
. That's why we have the line
<syntax type=python>while inputfromuser</syntax>
with no check. Yep, notice that we don't need to check for the variable's value, Python assumes that it is True. To understand better, imagine that inputfromuser
is a flag that appears when True and disappears when False. Python can only "see" it when it appears, and when it does disappear it needs to check for it. We will elaborate more later.
Another different line awaits us
<syntax type=python>inmotif = raw_input('Enter motif to search: ')</syntax>
raw_input
is a function that takes a line input by the user and returns a string. In our case, we assign the value returned by the function to a new string called inmotif
. Also, raw_input
has one prompt argument which is written to the screen with no trailing line, waiting for the user to input something. This takes us to the while
loop
<syntax type=python>if len(inmotif) >= 1:
#we compile a regex with the input given
motif = re.compile(r'%s' % inmotif)
#looking to see if the entered motif is in the sequence
if re.search(motif, sequence):
print 'Yep, I found it'
else:
print 'Sorry, try another one'
else:
print 'Done, thanks for using motif_search'
inputfromuser = False</syntax>
After getting the input from the user, we need to check the size of such input: if it is greater or equal to one, we will search it in our sequence, otherwise we will finish the loop. So if we get a valid input (valid in the sense of size, for now) we will compile the regex and search for it.
There is a difference in regex compilation. In the DNA transcribing we assigned a string to the regex directly, now we have a string coming from a variable/object
<syntax type=python>motif = re.compile(r'%s' % inmotif)</syntax>
Notice the difference in the argument that is passed to the compile
function. In this case the regex to be compiled is coming from the string entered by the user, and we have to pass it by using Python's string formatting operator, noted as a %
. This formatting operator uses two parameters: a string with formatting characters and the data to be formatted, separated by the percent sign. In our case the formatting character will receive a string, hence the %s
(s for string), and the data to be formatted that is the input.
Previously, we used the regex function to replace characters/substrings in a sequence. This time, we are interested to know if the motif entered by the user is in our sequence. So we use the search method instead
<syntax type=python>if re.search(motif, sequence):</syntax>
Notice that we do the search in at the same time we are testing for its result. The same case as the checking we do at the while loop. If there is a positive result from the regex search a True flag will be raised and the interpreter will execute the code of the initial branch, not testing for the elif
and else
<syntax type=python> print 'Yep, I found it'
else:
print 'Sorry, try another one'</syntax>
This condition is nested inside another condition, the one that tests for the size of the input entered. So, if our search returns a regex object, we print Yep, I found it
, otherwise the user will receive Sorry, try another one
. Now back to our upper if
, if the user input length is equal to zero (just pressing the Enter key) the interpreter will process the line
<syntax type=python>print 'Done, thanks for using motif_search'</syntax>
and then
<syntax type=python>inputfromuser = False</syntax>
that will tell the while loop that there is no True variable anymore, ending the loop and consequently our script.
Let's review the script and its flow:
<syntax type=python>
1) assign a filename to be opened
2) read the file
3) join the lines
4) remove carriage returns
5) ask for user input, while is valid
6) if input length is greater or equal to 1, process it
6a) compile regex with input
6b) search for it
6b1) if it is found print yes
6b2) if it is not found print no
7) if input length is equal to zero, end while loop,
end string
</syntax>
I know it is a lot of information, take your time. The code can be downloaded here. Next I will change a bit this code, using other methods to find the motifs, and making the promised twist in the method that reads the file.
Finding motifs in DNA sequences: twist
As promised, let's change a bit our previous code, and make it more effective. If you are reading this tutorial in one-entry mode, let's check the code
<syntax type=python>#!/usr/bin/env python
import re
import string
dnafile = "AY162388.seq"
seqlist = open(dnafile, 'r').readlines()
temp = .join(seqlist)
sequence = temp.replace('n', )
inputfromuser = True
while inputfromuser:
inmotif = raw_input('Enter motif to search: ')
if len(inmotif) >= 1:
motif = re.compile('%s' % inmotif)
if re.search(motif, sequence):
print 'Yep, I found it'
else:
print 'Sorry, try another one'
else:
print 'Done, thanks for using motif_search'
inputfromuser = False</syntax>
In order to make it more effective, let's allow the input of any file, maybe asking for the file as soon as the script is started. To accomplish that we need to "protect" the script and make it error proof (almost at least). Yes, you thought it right: we need to check if the input file exists before opening it.
We can use try ... except
statements to do that. Pretty handy. This is called an exception handler, so basically we try the validity of some command/method and depending on the result we continue our program flow or we catch the exception and do something else. Under try
we put the expression to evaluate, and under excepts
what to do in case of failure. Like this
<syntax type=python>try:
opening a file
except:
as the file does not exist, print something</syntax>
So, let's put in our code above. We remove this
<syntax type=python>dnafile = "AY162388.seq"
seqlist = open(dnafile, 'r').readlines()</syntax>
and include this lines
<syntax type=python>fileinput = True
while fileinput == True:
filename = raw_input('Enter file name:')
if len(filename) > 0:
try:
dnafile = open(filename, 'r')
fileinput = False
except:
print 'File does not exist, please try again'
else:
fileinput = False
sys.exit()</syntax>
I know, a lot of new code. But if you take a closer look, there is only three lines we have never seen: try except
and the last line with sys.exit()
. Here, the script tries to open the file provided as input, if it does find the file normal operation resumes, if does not, the script asks for another input. As in the other while loop, we control it with a boolean variable, and in the case of empty input we end the loop and the script, using a system command exit
, in the last line of the new code.
Our script is better now, not bulletproof, but a little bit more efficient. There still a "flaw", that you can only check one file for each run of the script. At least we not stuck to our usual DNA sequence.
Download the commented code here.
Commenting in Python
As you might have noticed from the previous topics, comments in Python are defined mainly by the #
sign. This is the signal used for single line comment like
<syntax type=python>#this is a single line comment</syntax>
If you are used to C++, this would be equivalent to //
.
On the other hand, multi line comments are defined by triple double quotes """
, opening and closing, similar to C++ /* ... */
, like this
<syntax type=python>"""this is a multi
line comment"""</syntax>
Just a note. Resuming bioinformatics mode.
Counting nucleotides
We are going to jump the regular expression explanation that you can find in the book, and we will go directly to the section that introduces string/list/array manipulation and adapt it to Python.
We are going to see two different methods: a "long" and a "short" one. In many places and computer languages you will see that there are different ways of doing the same thing, with advantages and disadvantages. It is up to you to define which methods are better or worse, as this is a very personal matter. Some people prefer the longer way because it might be clearer and easier to understand, or it might be necessary to use it due to code maintainability. Other people would rather use the shorter path because they want to generate code faster.
It does not matter the path you select, as long as you get your task done. That's the key: focus on the end product not on how exactly got there. If you used 10 lines of code more, or 10 less, that's irrelevant as long as you did what you wanted. Live and learn and someday you will use the even shorter way.
So we start with the long way. We want to count the individual number of nucleotides in a sequence, determining they relative frequency. Remember that we read the lines of a sequence file into a list, using readlines
. But on the long method we will read the file and store the data into a string like
<syntax type=python>file = open(filename, 'r')
sequence =
for line in file
sequence += line</syntax>
This way it will be easier to "explode" the sequence in separated items. After the "explosion" we can check each item in the list and get our result. The code is below, I will be back after it.
<syntax type=python>#!/usr/bin/env python
import string
- let's keep the file fixed for now
dnafile = "AY162388.seq"
- opening the file, reading the sequence and storing in a list
file = open(dnafile, 'r')
- initialize a string to receive the data
sequence =
for line in file:
sequence += line.strip() #notice the strip, to remove n
- "exploding" the sequence in a list
seqlist = list(sequence)
- initializing integers to store the counts
totalA = 0
totalC = 0
totalG = 0
totalT = 0
- checking each item in the list and updating counts
for base in seqlist:
if base == 'A':
totalA += 1
elif base == 'C':
totalC += 1
elif base == 'G':
totalG += 1
elif base == 'T':
totalT += 1
- printing results
print str(totalA) + ' As found'
print str(totalC) + ' Cs found'
print str(totalG) + ' Gs found'
print str(totalT) + ' Ts found'</syntax>
Most of the above code was already covered before. Let's look at the different stuff, like the "explosion line"
<syntax type=python>seqlist = list(sequence)</syntax>
Here we basically transform our string sequence
into a list, by putting the object type we want before the object we want converted, like we do here
<syntax type=python>print str(totalA) + ' As found'</syntax>
This construction type_to_convert(to_be_converted)
tells Python to get whatever is inside the parentheses and transform into whatever is outside. Converting the string to a list will get
<syntax type=python>GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGT
GCGAAGGTAGCGTAATCACTTGTTCTTTAAATAAGGACTAGTATG
AATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATCAGTGAA
...</syntax>
and transform it into
<syntax type=python>['G', 'T', 'G', 'A', 'C', 'T', 'T', 'T', 'G', 'T', 'T', 'C', 'A', 'A', 'C', 'G', 'G', 'C', 'C', 'G', 'C', 'G',
'G', 'T', 'A', 'T', 'C', 'C', 'T', 'A', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'G', 'A', 'A', 'G', 'G', 'T', 'A', 'G', 'C', 'G', 'T',
'A', 'A', 'T', 'C', 'A', 'C', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'T', 'T', 'A', 'A', 'A', 'T', 'A', 'A', 'G', 'G', 'A', 'C', 'T', 'A',
'G', 'T', 'A', 'T', 'G', 'A', 'A', 'T', 'G', 'G', 'C', 'A', 'T',...]</syntax>
After "exploding", we use a for
loop to iterate over every item in the list and use conditional statements to do the counts. Regarding the counts, we use this operator
<syntax type=python>totalT += 1</syntax>
that, in C/C++, tells the interpreter to get the value of totalT
and add 1 to it.
On the next post, we will see the short way and further modify the above script, that can be downloaded here.
Counting nucleotides: part II
Let's check the "short" way, that is basically a method that avoid the "explosion" of the string. Instead of transforming the sequence from the file from a string to a list, we go and use the string directly, applying one of the methods available to manipulate strings. This time we read the file at once and convert the list to a string using join
. To count we simply use the method count
on our string. Pretty nice.
<syntax type=python>#!/usr/bin/env python
import string
- still keep the file fixed
dnafile = "AY162388.seq"
- opening the file, reading the sequence and storing in a list
seqlist = open(dnafile, 'r').readlines()
- let's join the the lines in a temporary string
temp = .join(seqlist)
- counting
totalA = temp.count('A')
totalC = temp.count('C')
totalG = temp.count('G')
totalT = temp.count('T')
- printing results
print str(totalA) + ' As found'
print str(totalC) + ' Cs found'
print str(totalG) + ' Gs found'
print str(totalT) + ' Ts found'</syntax>
That's the short way: using count
. This is a method that when applied on a string counts the number of times the substring appears in our string. We can even set a start and ending point to count. Notice one difference in this script to the previous examples: after we join the items of the list into a string we do not remove the carriage returns. Because in this case we don't need it, as it will be an extra character there that won't make any harm.
Download the above code here.
Manipulating strings: again
I am going to finish the book's chapter 5 (our section 2) in the next topic, where I will give a small introduction on how to output data to a file in Python. On this post we will check some of the methods that can be used to manipulate strings. The book gives only a couple of methods to be used in Perl on string, but here I will show a longer list of Python methods that can be used on its immutable strings.
A full list of the methods can be found here and I will will give brief explanations on the ones I think are key for bioinformatics. Remember that string cannot be changed in Python, so we will always going to use a buffer/temp variable to store our changed string when needed.
- count this method returns the number of times you see a substring (a letter/number, a word, etc) in another string. You can also specify a start and an end positions to look for. This is handy if you are counting nucleotides/aminoacids in a sequence. This method returns an integer
<syntax type=python>totalA = sequence.count('A')</syntax>
- endwith this method checks the end of your string for a determined substring. Very handy of you need to check the tail end of your sequence right away.
- find returns the position of the substring being searched, and -1 if it is not found. It is similar to the re.search we used before. This is very useful if you are looking for a determined motif/subsequence in a hurry.
- islower returns a true bool (true or false) if all characters in the string are lowercase. False if at least one of the characters is uppercase. Very handy if you need to convert lowercase to UPPERCASE files for input in some application. Works in conjunction with the isupper which is basically the opposite.
- join. We have seen this before: it concatenates strings using a determined separator.
- lower and upper, that as their name might indicate return the string converted to lowercase/uppercase. This is also good for the conversion of sequence format in input files.
I will stop here. There are many other methods that can be used. Some of the above were already covered here and in the next topics we will take a look at the other ones, creating an application that actually performs some useful function.
Writing to files
We already know how to read from files, now we are going to see how to write to them. We will start with a simple example, writing some content to a "fresh" file that does not exist in the system.
Sometime ago, we have seen the print
statement in Python, that prints to the system standard output (usually the screen). This output can be redirected using >
to a stream/file. But if we are going to create really professional applications (even to our own use), usually stream redirection is not really the nicest approach. In some cases the best alternative is to save a file.
Also, some posts ago, we covered the methodology to open
a file. We are going to use here the same command, open
to (in our case) create the file. Remember that to read the file we used
<syntax type=python>file = open(dnafile, 'r')</syntax>
where the 'r'
is the mode we are using to open the file. In that case we used the read mode, now we are going to use the write mode. When we open a file with this command in write mode if the file (with the name with pass) does not exist it is created. If it does exist, all the contents of the current file will be erased, so be careful when opening a file from your script. Check for the location, file name, etc before opening the file.
So, basically we have
<syntax type=python>file = open(output, 'w')</syntax>
and the file is open and ready to receive data.
Let's cheat and use the previous script that counts nucleotides and modify it to save a count.txt
file wit the results:
<syntax type=python>#!/usr/bin/env python
import string
- let's keep the file fixed for now
dnafile = "AY162388.seq"
- opening the file, reading the sequence and storing in a list
file = open(dnafile, 'r')
- initialize a string to receive the data
sequence =
for line in file:
sequence += line.strip() #notice the strip, to remove n
- "exploding" the sequence in a list
seqlist = list(sequence)
- initializing integers to store the counts
totalA = 0
totalC = 0
totalG = 0
totalT = 0
- checking each item in the list and updating counts
for base in seqlist:
if base == 'A':
totalA += 1
elif base == 'C':
totalC += 1
elif base == 'G':
totalG += 1
elif base == 'T':
totalT += 1
- printing results
resultfile = open('counts.txt', 'w')
resultfile.write(str(totalA) + ' As found \n')
resultfile.write(str(totalC) + ' Cs found \n')
resultfile.write(str(totalG) + ' Gs found \n')
resultfile.write(str(totalT) + ' Ts found \n')</syntax>
The only difference is at the end of the script. Here instead of print
we use write
. Notice that write
is a method of the opened file.(in our case called resultfile
. This method only accepts strings, so we need to convert everything to string before writing in the file. Notice also that we need to add a carriage return/newline at the end of the string to be written. Differently of print
, write
does not automatically puts a new line at the end of the output.
A nice summary script
Closing section two, let's use everything we saw before and write a nice script that will read a sequence file (DNA) and report us of any "errors" and the number of different nucleotides. It is very simple, but a good exercise. Depending on your needs you can easily modify it to check for other characteristics of sequences, even change it to read amino acids sequences. We start with the code, comments coming after it.
<syntax type=python>
- !/usr/bin/env python
import string
import sys
import re
fileentered = True
while fileentered == True:
filename = raw_input('Please enter a file to check: ')
if len(filename) >= 1:
try:
seqlist = open(filename, 'r').readlines()
sequence = .join(seqlist)
sequence = sequence.replace('\n', )
totalA = sequence.count('A')
totalC = sequence.count('C')
totalG = sequence.count('G')
totalT = sequence.count('T')
otherletter = re.compile('[BDEFHIJKLMNOPQRSUVXZ]')
extra = re.findall(otherletter, sequence)
output = open(filename+'.count', 'w')
output.write('Count report for file ' + filename + '\n')
output.write('A = ' + str(totalA) + '\n')
output.write('C = ' + str(totalC) + '\n')
output.write('G = ' + str(totalG) + '\n')
output.write('T = ' + str(totalT) + '\n')
if len(extra) > 0:
output.write('Also were found ' + str(len(extra)) + ' errors\n')
for i in extra:
output.write(i + ' ')
else:
output.write('No error found')
output.close()
print 'Result file saved on ' + filename + '.count'
except:
print 'File not found. Please try again.'
else:
fileentered = False
sys.exit()
</syntax>
The main change here is that we use a while
loop to control de program flow. Notice that we import string
(not really necessary though), sys
and re
. Basically we ask for an user input, the filename, and depending on the input given we process the file or exit the program. The early exit is done with the sys.exit
method which is a shortcut to get out of the script processing. Very handy. If the input is valid we try
to open it. If the operation is successful, great, we read the file, count the nucleotides and use a quite scary regular expression to search all the "errors" in our sequence.
The regex is compiled with the pattern '[BDEFHIJKLMNOPQRSUVXZ]'
which means "match any character in this range". Take a closer look at the pattern and you will see that is every letter in the alphabet, except for ACGT. And when searching for these "errors" instead of using re.search
we use re.findall
, which conveniently returns a list with all the substrings found. That's even more handy.
On the final part of the script we take care of the output, opening a file called <whatever>.count
where we print the counts and the errors, if they actually exist. One thing I left from the previous post, is that we need to close the file opened to write. In some cases if the file is not properly closed, errors might occur. So always remember to close the files used for writing/appending, using <filename>.close()
.
Debugging in Python
Beginning the third section in our tutorial/guide, we are going to see the chapter six of BPB. This chapter discusses the topics of creating subroutines (in Python's case functions) and debugging the code.
We are going to start by the end. In Python, code debugging can be done as in any other programming language: Perl has pdb, C/C++ has gdb, etc. Python also has a pdb module that can be imported and run to check for errors in your code. Using this command line:
<syntax type=bash>$> python -m pdb myscript</syntax>
will start the debug module and this will run your script. If you are an experienced programmer, who is just starting Python, pdb usage might look simple and straightforward. On the other hand, if you don't have a lot of experience in programming I would suggest a different approach, as you become more comfortable with the language. Python has a great advantage over some other interpreted languages, allowing you to interactively code using the interpreter. So if your code is not working properly, maybe a wrong output or a value that is not being correctly calculated you have the options of coding the part of your script that is not working using the interpreter or use the first rule of debugging: include print
statements that output the value of variables/objects.
Another option is to use a Python code editor, what will also help you with highlight your code. I have little experience with Python code editors, as I normally code in Linux and use Kate. Lately I have been trying Komodo edit which is a cross-platform freeware from Active State. It looks pretty good but I never tried debugging my code with it.
So, these are my advices if you are just starting to program. Maybe because of the age of Beginning Perl for Bioinformatics (published in 2001), Perl's pdb was the only option back then. Thanks to major advances on open-source and free software there are many other options nowadays to debug your code.
Python functions: simple example
Python subroutines do not exist. Everything is a function, all functions return a value (even if it's None), and all functions start with def
. This statement is from Dive into Python, a book on Python programming available for free. As mentioned Python functions start with the word def
, which is followed by the function name that is followed by the arguments the function receives in between parentheses. Something like
<syntax type=python>def my_first_function(somevalue):</syntax>
Usually Python coders (sometime called Pythonistas, among others), following the Python coding style (that states: Function names should be lowercase, with words separated by underscores as necessary to improve readability.) name their functions with words separated by underscores. And we are going to use this style here, whenever a function becomes handy. The parameters passed to the function (above somevalue
) do not have a data type, Python should handle it whatever is being passed. It is also attribute of your code to handle the parameter/value passed inside the function and avoid errors. Functions also follow the same indentation of normal programming and the line after the declaration should be indented with four spaces
<syntax type=python>def my_first_function(somevalue):
do_something</syntax>
So, let's warm-up with functions. The following script is just the start: it adds a poly-T tail to a DNA sequence. We are going to use our old friend AY162388.seq. I will be back after the script
<syntax type=python>#! /usr/bin/env python
def add_tail(seq):
result = seq + 'TTTTTTTTTTTTTTTTTTTTT'
return seq
dnafile = 'AY162388.seq'
file = open(dnafile, 'r')
sequence =
for line in file:
sequence += line.strip()
print sequence
sequence = add_tail(sequence)
print sequence</syntax>
Not very useful, at first sight, but gives us an impression of what a function looks like. Basically we define a function add_tail
that receives seq
as a parameter. Don't worry about variable scope now, we will see it later. The rest of the script is just like things we saw before, except for the line sequence = add_tail(sequence)
. Here we are saving memory (yep, not that much and not even impressive) by assigning the return value of the function to the same string where we have the sequence stored. Run the script and get ready for the command line arguments.
Command line arguments and a second take on functions
We have seen, briefly, how to define and use a function in Python. Now we are going to jump forward a bit and create a new function and at the same time take a look on command line parameters that can be passed to the script.
If you have used command line applications before, you might have encountered programs that asks for a file name, a calculation parameter, etc to be passed in the command line. Python scripts are no different, they accept such parameters. For this we have the sys
module that has system specific parameters and functions. We have used before the sys.exit
, imported as an extra module function. Every operating system (even Windows) has arguments in its command line, and programming languages usually call such arguments argv
(in the C/C++ you have argv in the parameters of the main function). Lists in Python start at 0 (zero), and for the argument list the first item is the script/program name. Basically if we have this
<syntax type=bash>$> python myscript.py DNA.txt</syntax>
myscript.py
is the argument 0 in the list and DNA.txt is the argument 1. So whenever we create a script that receives arguments in the command line, we have to start (in most cases, be aware) from 1. In Python using system arguments in the CLI will look like
<syntax type=python>import sys
filename = sys.argv[1]
valueone = sys.argv[2]
...</syntax>
We will a variation of our previous script that counts the bases, now with command line arguments and a function (with no "error" checking at first)
<syntax type=python>
- !/usr/bin/env python
import string
import sys
def count_nucleotide_types(seq):
result = []
totalA = seq.count('A')
totalC = seq.count('C')
totalG = seq.count('G')
totalT = seq.count('T')
result.append(totalA)
result.append(totalC)
result.append(totalG)
result.append(totalT)
return result
sequencefile = open(sys.argv[1], 'r').readlines()
sequence = .join(sequencefile)
sequence = sequence.replace('\n', )
values = count_nucleotide_types(sequence)
print "Found " + str(result[0] + "As"
print "Found " + str(result[0] + "Cs"
print "Found " + str(result[0] + "Gs"
print "Found " + str(result[0] + "Ts"
</syntax>
Few new things here. We created a function count_nucleotide_types
that should receive a string containing the sequence. The "real" first line of the program flow is the one that gets the name of the file from the command line argument, open and read it. We then convert the list to a string, modify it a but and throw it to the function. Get the result back, and done.
With functions we actually don't save coding time/length (at least here), we make out code more organized, easier to read and somewhat easier to someone else read and understand it. It is not a good coding practice to have long programs/scripts with no functions, no subdivision, no structure. Functions are sometimes good program nuggets that can be reused in the same application or even ported/copied to other applications and reused indefinitely. Soon we will see a function and class that reads a FASTA file in Python that can be used anywhere in any program that needs such feature. Try the code and come back later for more.
Ramdomization
We start section 4 with a very short introduction to randomization. Chapter 7 of BPB discuss the use of randomization to obtain mutations in DNA and protein sequences. The example given in the book is at the same time simple and interesting, as it creates a paragraph from random selections of noums, adjectives, verbs and other grammar elements. Here we are going to to create a very (stress on very) simple dice game, where each time you run the script it will throw two dices for you and two dices for the computer. It then prints the sum of the dices and tells who won the match.
Randomization is an important feature of computer languages. There are different researchers involved in the creation of the best approaches to generate random number in computers. Random number are important in the simulation of different natural processes, such as genetic mutation, gene drift, epidemiology, weather forecast, etc. The better the generator, the better the simulation.
Python has a random module included, with a myriad of functions that can perform different randomization. In this post we will see the integer
randomization, and in later entries we will see some other powerful functions.
Our script is quite simple, and the only new aspect for us here is the random module and the randint
function.
<syntax type=python>
- !/usr/bin/env python
import random
dice1 = random.randint(1,6)
dice2 = random.randint(1,6)
computerdice1 = random.randint(1,6)
computerdice2 = random.randint(1,6)
mine = dice1 + dice2
his = computerdice1 + computerdice2
print 'mine = ' + str(mine) + ' vs. computer = ' + str(his)
if mine > his:
print "I won"
elif mine < his:
print "Computer won"
else:
print "Tie. Try again"
</syntax>
random.randint
is a function that generates an integer random number between a range specified by the number between parentheses. In the above case, we are using a dice of 6 values. So, for every call of random.randint(1,6)
a random number between 1 and 6 will be generated. We store this number in a variable, sum the user's dices and the computer's and check with a if
clause to see the winner. A good exercise would be to make this script interactive, allowing multiple matches.
Next we will see another random module function and then we will generate mutations on DNA sequences.
Random sequences in Python
There is a reason to say that Python has batteries included. On the last post we have seen a small example of randomization in Python, generating random integer values for a (extremely) simple dice game. We move to another example, still simple which will allow us to generate random DNA sequences. The script is
<syntax type=python>
- !/usr/bin/env python
import random
import sys
length = int(sys.argv[1])
dnaseq = list('ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT')
result =
for i in range(length):
result += random.choice(dnaseq)
print result</syntax>
Not fancy at all, just plain simple (yet again). We import a couple of modules, sys
and random
, and ask for the sequence length as a script parameter. dnaseq
is a list containing a tandem repeat of ACGT, and from there we will extract our random nucleotides. In fact dnaseq
could have been 'ACGT' only. The fact that we create a string and convert it to a list, is just for convenience of writing 'ACGT...'
easier than ['A', 'C' ...]
. We then declare an empty string that will be used to store the random sequence. And inside the loop, the command that does all the magic: random.choice(<list>)
. This command will return a random element from the list passed as subject. Using it inside a loop we will get a random nucleotide on each iteration and add it to our string.
This is a very simple command, but at the same time extremely powerful and easy to implement. A good exercise from this would be to modify the dnaseq
string and see if there is any change in the final random sequence.
A script to simulate DNA sequence sets
In Beginning Perl for Bioinformatics the chapter that covers simulated mutations on a DNA sequence is quite verbose and the code examples employ some subroutines to do what we have done on the last post. Basically the code example that generates a random DNA sequence is the last one on the chapter, but it was the first one we covered. Also this code example has a twist that our code from the last post does not have, which is it allows you to generate a set of sequences with different length instead of one sequence with fixed length that our script does.
So, the script that generates a user defined simulated DNA sequence set is
<syntax type=python>#!/usr/bin/env python
import random
import sys
def simulate_sequence(length):
dna = ['A', 'C', 'G', 'T']
sequence =
for i in range(length):
sequence += random.choice(dna)
return sequence
setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])
sequenceset = []
for i in range(setsize):
rlength = random.randint(minlength, maxlength)
sequenceset.append(simulate_sequence(rlength))
for sequence in sequenceset:
print sequence</syntax>
Simple and efficient. First we define a function that generates a simulated DNA sequence from the four nucleotides, again using the random.choice
. The sequence length is based on the parameter received by the function. The sequence length is another random number defined in the main body of the script. This random number is generated by random.randint
with a range based on the arguments given by the user when running the script. And finally, the number of sequences to be simulated is define by the first parameter. Seventeen lines.
Next we will see how to draw some scientific information about the sequences, such as sequence identity and nucleotide frequency.
Still simulations
We use our last code as a starting point in order to generate some real information from our simulated sequence sets. Our approach here will be the same: functions to do all the work for us and a very simple main code. We also reuse some code with applied before to count the nucleotides.
Let's see the code, discussion just after it.
<syntax type=python>#!/usr/bin/env python
import random
import sys
def simulate_sequence(length):
dna = ['A', 'C', 'G', 'T']
sequence =
for i in range(length):
sequence += random.choice(dna)
return sequence
def nucleotide_percentage(sequence):
print str(sequence.count('A')) + ' As ',
print str(sequence.count('C')) + ' Cs ',
print str(sequence.count('G')) + ' Gs ',
print str(sequence.count('T')) + ' Ts'
def sequence_identity(set):
iden = []
count = 0.0
for x in range(len(set)-1):
print str(x), str(x+1)
for n in range(len(set[x])):
if set[x][n] == set[x+1][n]:
count += 1
iden.append(count/len(set[x]))
count = 0.0
return iden
setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])
sequenceset = []
for i in range(setsize):
rlength = random.randint(minlength, maxlength)
sequenceset.append(simulate_sequence(rlength))
identity = sequence_identity(sequenceset)
for i in range(len(sequenceset)):
print sequenceset[i]
if i < len(sequenceset)-1:
print 'sequence identity to next sequence : ' + str(identity[i])
nucleotide_percentage(sequenceset[i])
print</syntax>
Well, not many new things here. The main one might the be the assignment of the variable count
, which receives initially a value of 0.0
, which in this case is a float. This variable is used the calculate the sequence identity which is a percentage, or a relative value. If we had initialized count
with zero, our division count/len(set(x)
would always be zero, due to the fact that count would have been an integer.
As is pointed out in BPB, this example is more an indication that we are able to use our Python skills to actually make some real code, with some real output. One issue with this example is the fact that we only calculate sequence identity of two sequences at a time. It would be ideal to have sequence identity between all simulated sequences. This would require much more code, of course a good educational step, but it is something that can be easily obtained with classes and we will see this later on.
With this entry, we finished our Section 4 and we will start Section 5 with Python's dictionaries, moving to fasta files and classes next.
Genetic code: part I
The Python dictionary data-type is like hash in Perl. It is a very similar structure, where each element in the variable is composed of two values, more specifically a key-value pair. This is the ideal data type to store the genetic code. As you might know the genetic code governs the translation of DNA into proteins, where each codon (3 bases or nucleotides in the DNA sequence) correspond to an amino acid in the protein sequence. So for every sequence of 3 nucleotides (key) will represent an amino acid (value). Important things: dictionaries do not accept duplicated key values, and every time a new value is assigned to a key the old value is erased. To create a new dictionary use the curly brackets
<syntax type=python>first_dictionary = {}</syntax>
inside the curly braces we first assign a key and separated by a colon (:), while multiple pairs should be separated by comma. Both key and value have to be between single or double quotes. Let's see how we will represent the genetic code in a Python dictionary, assigning values to keys
<syntax type=python>gencode = {
'ATA':'I', #Isoleucine
'ATC':'I', #Isoleucine
'ATT':'I', # Isoleucine
'ATG':'M', # Methionine
'ACA':'T', # Threonine
'ACC':'T', # Threonine
'ACG':'T', # Threonine
'ACT':'T', # Threonine
'AAC':'N', # Asparagine
'AAT':'N', # Asparagine
'AAA':'K', # Lysine
'AAG':'K', # Lysine
'AGC':'S', # Serine
'AGT':'S', # Serine
'AGA':'R', # Arginine
'AGG':'R', # Arginine
'CTA':'L', # Leucine
'CTC':'L', # Leucine
'CTG':'L', # Leucine
'CTT':'L', # Leucine
'CCA':'P', # Proline
'CCC':'P', # Proline
'CCG':'P', # Proline
'CCT':'P', # Proline
'CAC':'H', # Histidine
'CAT':'H', # Histidine
'CAA':'Q', # Glutamine
'CAG':'Q', # Glutamine
'CGA':'R', # Arginine
'CGC':'R', # Arginine
'CGG':'R', # Arginine
'CGT':'R', # Arginine
'GTA':'V', # Valine
'GTC':'V', # Valine
'GTG':'V', # Valine
'GTT':'V', # Valine
'GCA':'A', # Alanine
'GCC':'A', # Alanine
'GCG':'A', # Alanine
'GCT':'A', # Alanine
'GAC':'D', # Aspartic Acid
'GAT':'D', # Aspartic Acid
'GAA':'E', # Glutamic Acid
'GAG':'E', # Glutamic Acid
'GGA':'G', # Glycine
'GGC':'G', # Glycine
'GGG':'G', # Glycine
'GGT':'G', # Glycine
'TCA':'S', # Serine
'TCC':'S', # Serine
'TCG':'S', # Serine
'TCT':'S', # Serine
'TTC':'F', # Phenylalanine
'TTT':'F', # Phenylalanine
'TTA':'L', # Leucine
'TTG':'L', # Leucine
'TAC':'Y', # Tyrosine
'TAT':'Y', # Tyrosine
'TAA':'_', # Stop
'TAG':'_', # Stop
'TGC':'C', # Cysteine
'TGT':'C', # Cysteine
'TGA':'_', # Stop
'TGG':'W', # Tryptophan
}
</syntax>
Simple, yet efficient. But this is the type of functionality that would be great to have at hand every time you write a script to translate DNA into proteins. And it is not something that you would like to type (or even copy-and-paste) all the time. On the next post we will create the translation script and will also create our first Python module.