Harvard:Biophysics 101/2007/Notebook:Michael Wang/2007-2-20
For anyone still trying to get ClustalW working on a PC after reading the link here: the key seems to be making sure that clustalw runs from the command line first. Even if the installation is set up properly, any problem in the actual call gives the same error you would see if it were not set up at all, because the only thing Python checks is whether the output file was created.
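To separate "ClustalW is not installed correctly" from "the call itself is wrong", it can help to build the command by hand and check for the output file directly. The following is only a minimal sketch under stated assumptions: `run_clustalw` and its `runner` parameter are hypothetical names, and the executable is assumed to be called `clustalw` and to be on the PATH. `-INFILE=` and `-OUTFILE=` are ClustalW's own command-line options.

```python
import os
import subprocess


def run_clustalw(infile, outfile, exe="clustalw", runner=subprocess.call):
    """Run ClustalW on infile and report whether outfile was created.

    Hypothetical helper: exe is assumed to be on the PATH, and runner is
    injectable so the check can be tried without ClustalW installed.
    """
    cmd = [exe, "-INFILE=" + infile, "-OUTFILE=" + outfile]
    if os.path.exists(outfile):
        os.remove(outfile)  # a stale file would hide a failed run
    return_code = runner(cmd)
    return cmd, return_code, os.path.exists(outfile)
```

If the third returned value is False, pasting the same command into a command prompt should show ClustalW's own error message, which is more informative than the generic failure seen from Python.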
The current version of my code is not very intelligent on the analysis side. It currently reads all the FASTA files in the ./input folder, merges them into a single file (all.fasta), and passes that file to ClustalW for alignment.
Features I'm still working on implementing:
- Use regular expressions to identify ORFs for all sequences:
  - Will be implemented by capturing sequences between AUG and UAA, UGA, or UAG (ATG and TAA, TGA, or TAG in the DNA sequences)
- Using the ORFs of the reference sequence as a well...reference...translate the sequences and find relevant mutations.
  - Call standard_translate to translate the sequences
  - Align the translated sequences and look for differences in the translated product
- Detect different types of mutations, hopefully all implemented using regex:
  - Mismatch: in the alignment, search for a *, blank, *
  - Insertion/deletion: a string of blanks
    - Differentiating the two will depend on identifying a gap in the original sequence by comparing it to the reference
  - Frameshift: a string of blanks that extends to the end of the ORF in the protein alignment
  - Silent: ORF remains untouched
I hope to get this implemented sometime on the 22nd, as I'm still trying to figure out the intricacies of regex in Python and the Seq class...
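The ORF bullet in the list above could be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: `find_orfs` and `ORF_PATTERN` are hypothetical names, the input is assumed to be DNA (so ATG, TAA, TGA, and TAG stand in for AUG, UAA, UGA, and UAG), only the forward strand is scanned, and overlapping ORFs are not reported.

```python
import re

# ATG, then whole codons (non-greedy), then the first in-frame stop codon.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TGA|TAG)")


def find_orfs(dna):
    """Return (start, end, sequence) for each forward-strand ORF found."""
    dna = dna.upper()
    return [(m.start(), m.end(), m.group(0)) for m in ORF_PATTERN.finditer(dna)]
```

For example, `find_orfs("CCATGAAATAGGG")` returns `[(2, 11, "ATGAAATAG")]`. Matching in whole codons is what keeps the out-of-frame TGA that overlaps the start codon from ending the ORF early.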
#!/usr/bin/env python
import os
from Bio import Clustalw

# This first section of code merges all FASTA files located in the input folder of curdir
# into a single file called all.fasta
input_dir = os.path.join(os.curdir, 'input')
input_list = os.listdir(input_dir)
print input_list
merged_path = os.path.join(os.curdir, 'all.fasta')
merged_file = open(merged_path, "w")
print merged_path
for i in input_list:
    # os.path.join supplies the separator, so no hard-coded '\\' is needed
    current_path = os.path.join(input_dir, i)
    print "loading ", current_path
    current_file = open(current_path, "r")
    all_lines = current_file.readlines()
    merged_file.writelines(all_lines)
    current_file.close()
    # blank lines keep the last record of one file apart from the next header
    merged_file.write("\n\n")
print "done making file"
merged_file.close()

# Once the merged file has been created, it is passed into the alignment program
cline = Clustalw.MultipleAlignCL(merged_path)
cline.set_output('test.aln')
alignment = Clustalw.do_alignment(cline)
all_records = alignment.get_all_seqs()
print alignment
I have yet to write code that counts, say, how many frameshift mutations there are. For now it just prints the raw alignment.
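One possible shape for that counting step is sketched below. It is a simplification, not the approach planned in the feature list: it works on two nucleotide rows taken from the alignment rather than on the protein alignment, it ignores ORF boundaries, and it flags a frameshift whenever a gap run is not a multiple of three bases long. The name `count_differences` and the `-` gap character are assumptions.

```python
import re


def count_differences(ref_row, seq_row):
    """Compare two aligned rows (same length, '-' for gaps) and tally differences."""
    if len(ref_row) != len(seq_row):
        raise ValueError("aligned rows must have the same length")
    counts = {"mismatch": 0, "insertion": 0, "deletion": 0, "frameshift": 0}
    # Substitutions: columns where both rows have a base and the bases differ.
    for r, s in zip(ref_row, seq_row):
        if r != "-" and s != "-" and r != s:
            counts["mismatch"] += 1
    # A run of gaps in the sequence row is a deletion relative to the reference;
    # a run of gaps in the reference row is an insertion.
    for kind, row in (("deletion", seq_row), ("insertion", ref_row)):
        for run in re.finditer(r"-+", row):
            counts[kind] += 1
            if len(run.group(0)) % 3 != 0:
                counts["frameshift"] += 1
    return counts
```

Applied to the fragment `GGAGCAAGCGG` versus `GGAG--GGCGG` from the third block of the output, this reports one mismatch, one deletion, and one frameshift.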
Using test files Media:apoemod.fasta and Media:Copy of apoe.fasta, the following output is generated.
loading .\input\apoe.fasta
loading .\input\Copy of apoe.fasta
done making file
CLUSTAL X (1.81) multiple sequence alignment
gi|178350|gb|K00296.1|HUMAPOE3 CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|189350|gb|K10296.1|HUMAPOE3 CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|178850|gb|K00396.1|HUMAPOE3 CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|178843|gb|K06396.1|HUMAPOE3 CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|189350|gb|K10296.1|HUMAPOE3 AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|178850|gb|K00396.1|HUMAPOE3 AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|178843|gb|K06396.1|HUMAPOE3 AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
gi|189350|gb|K10296.1|HUMAPOE3 CAGGATGCCAGGCCAAGGTGGAG--GGCGGTGGAGACAGAGCCGGAGCCC
gi|178850|gb|K00396.1|HUMAPOE3 CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
gi|178843|gb|K06396.1|HUMAPOE3 CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
***********************   ************************
gi|178350|gb|K00296.1|HUMAPOE3 GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|189350|gb|K10296.1|HUMAPOE3 GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|178850|gb|K00396.1|HUMAPOE3 GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|178843|gb|K06396.1|HUMAPOE3 GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 ACTGGGTCGCTTTTGGGATTAATCCTGCGCTGGGTGCAGACACTGTCTGA
gi|189350|gb|K10296.1|HUMAPOE3 ACTGGGTCGCTTTTGGGATTAATCCTGCGCTGGGTGCAGACACTGTCTGA
gi|178850|gb|K00396.1|HUMAPOE3 ACTGGGTCGCTTTTGGGATTA--CCTGCGCTGGGTGCAGACACTGTCTGA
gi|178843|gb|K06396.1|HUMAPOE3 ACTGGGTCGCTTTTGGGATTA--CCTGCGCTGGGTGCAGACACTGTCTGA
*********************  ***************************
gi|178350|gb|K00296.1|HUMAPOE3 GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|189350|gb|K10296.1|HUMAPOE3 GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|178850|gb|K00396.1|HUMAPOE3 GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|178843|gb|K06396.1|HUMAPOE3 GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|189350|gb|K10296.1|HUMAPOE3 CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|178850|gb|K00396.1|HUMAPOE3 CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|178843|gb|K06396.1|HUMAPOE3 CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|189350|gb|K10296.1|HUMAPOE3 GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|178850|gb|K00396.1|HUMAPOE3 GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|178843|gb|K06396.1|HUMAPOE3 GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|189350|gb|K10296.1|HUMAPOE3 GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|178850|gb|K00396.1|HUMAPOE3 GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|178843|gb|K06396.1|HUMAPOE3 GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|189350|gb|K10296.1|HUMAPOE3 GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|178850|gb|K00396.1|HUMAPOE3 GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|178843|gb|K06396.1|HUMAPOE3 GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|189350|gb|K10296.1|HUMAPOE3 AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|178850|gb|K00396.1|HUMAPOE3 AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|178843|gb|K06396.1|HUMAPOE3 AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|189350|gb|K10296.1|HUMAPOE3 TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|178850|gb|K00396.1|HUMAPOE3 TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|178843|gb|K06396.1|HUMAPOE3 TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|189350|gb|K10296.1|HUMAPOE3 ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|178850|gb|K00396.1|HUMAPOE3 ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|178843|gb|K06396.1|HUMAPOE3 ACCAT-------CCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
****        **************************************
gi|178350|gb|K00296.1|HUMAPOE3 GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|189350|gb|K10296.1|HUMAPOE3 GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|178850|gb|K00396.1|HUMAPOE3 GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|178843|gb|K06396.1|HUMAPOE3 GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|189350|gb|K10296.1|HUMAPOE3 GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|178850|gb|K00396.1|HUMAPOE3 GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|178843|gb|K06396.1|HUMAPOE3 GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|189350|gb|K10296.1|HUMAPOE3 AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|178850|gb|K00396.1|HUMAPOE3 AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|178843|gb|K06396.1|HUMAPOE3 AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|189350|gb|K10296.1|HUMAPOE3 CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|178850|gb|K00396.1|HUMAPOE3 CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|178843|gb|K06396.1|HUMAPOE3 CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|189350|gb|K10296.1|HUMAPOE3 GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|178850|gb|K00396.1|HUMAPOE3 GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|178843|gb|K06396.1|HUMAPOE3 GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|189350|gb|K10296.1|HUMAPOE3 AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|178850|gb|K00396.1|HUMAPOE3 AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|178843|gb|K06396.1|HUMAPOE3 AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|189350|gb|K10296.1|HUMAPOE3 CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|178850|gb|K00396.1|HUMAPOE3 CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|178843|gb|K06396.1|HUMAPOE3 CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|189350|gb|K10296.1|HUMAPOE3 CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|178850|gb|K00396.1|HUMAPOE3 CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|178843|gb|K06396.1|HUMAPOE3 CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|189350|gb|K10296.1|HUMAPOE3 CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|178850|gb|K00396.1|HUMAPOE3 CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|178843|gb|K06396.1|HUMAPOE3 CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|189350|gb|K10296.1|HUMAPOE3 CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|178850|gb|K00396.1|HUMAPOE3 CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|178843|gb|K06396.1|HUMAPOE3 CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
**************************************************
gi|178350|gb|K00296.1|HUMAPOE3 TTTCACGT
gi|189350|gb|K10296.1|HUMAPOE3 TTTCACGT
gi|178850|gb|K00396.1|HUMAPOE3 TTTCACGC
gi|178843|gb|K06396.1|HUMAPOE3 TTTCACGC
*******
Each of the two files contains two sequences (I made fake changes to each). Yes...I realize that this is not particularly interesting.