Harvard:Biophysics 101/2007/Notebook:Michael Wang/2007-2-20

For anyone still trying to get clustalw working on a PC after reading the link here, the key seems to be making sure that clustalw works from the command line first. Even if you set it up properly, any problem in the actual call will give you the same error as if you hadn't, because the only thing Python cares about is whether or not the output file was created.
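A minimal sketch of that check, done outside of Biopython so that clustalw's own error text is visible. The helper name run_and_check is made up for illustration, and the clustalw options in the usage comment (-INFILE, -OUTFILE) are from memory and should be checked against clustalw's own help on your machine.

```python
import os
import shutil
import subprocess


def run_and_check(command, expected_output):
    """Run a command line and report whether the expected output file appeared.

    Returns (ok, message). The message carries whatever the program printed,
    so a setup problem and a bad call can be told apart.
    """
    # First question: can the executable be found at all?
    if shutil.which(command[0]) is None:
        return False, "executable not found on PATH: " + command[0]

    result = subprocess.run(command, capture_output=True, text=True)

    # Second question: did the call actually produce the output file?
    if not os.path.exists(expected_output):
        return False, result.stdout + result.stderr
    return True, ""


# Hypothetical usage (option spelling is an assumption, see above):
# ok, message = run_and_check(
#     ["clustalw", "-INFILE=all.fasta", "-OUTFILE=test.aln"], "test.aln")
# if not ok:
#     print(message)
```

If this reports that the executable is missing, the problem is the setup (PATH); if it runs but no file appears, the printed message from clustalw usually says what was wrong with the call.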

The current version of my code is not very intelligent on the analysis side. It currently sucks up all the FASTA files in the ./input folder of the current directory and compiles them into a single file, all.fasta. This file is passed into clustalw for alignment.

Features I'm still working on implementing:


 * Use regular expressions to identify ORFs for all sequences:
   * Will be implemented by capturing sequences between AUG and UAA, UGA, or UAG (ATG and TAA, TGA, or TAG in the DNA sequences)
 * Using the ORFs of the reference sequence as a, well...reference...translate the sequences and find relevant mutations.
   * Call standard_translate to translate the sequences
   * Align translated sequences and look for differences in translated product
 * Detect different types of mutations: hopefully all implemented using regex
   * Mismatch- in the alignment search for a *, blank, *
   * Insertion/deletions- a string of blanks
     * Differentiation will depend on identifying a gap in the original sequence by comparing it to reference
   * Frameshift- a string of blanks that extends to the end of the ORF in the protein alignment
   * Silent- ORF remains untouched
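The ORF step in the list above could be sketched roughly as follows. This is only an illustration under some assumptions: it scans the forward strand only, it expects an ungapped sequence (strip any "-" characters from aligned rows first), and the names ORF_PATTERN and find_orfs are invented here.

```python
import re

# ATG, then as few whole codons as possible, then the first in-frame stop codon.
# The pattern sits inside a lookahead so that overlapping candidates are all reported.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")


def find_orfs(sequence):
    """Return (start_index, orf_string) for every ATG that reaches an in-frame stop."""
    # Accept RNA spelling too by mapping U to T before searching.
    dna = sequence.upper().replace("U", "T")
    return [(match.start(), match.group(1)) for match in ORF_PATTERN.finditer(dna)]


print(find_orfs("CCATGAAATAGGG"))  # [(2, 'ATGAAATAG')]
```

Because the codon group is non-greedy, each candidate ends at the first stop codon in its own reading frame. An ATG inside a longer ORF is reported as a separate, shorter candidate, so some filtering against the reference ORFs would still be needed.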

I hope to get this implemented sometime on the 22nd, as I'm still trying to figure out the intricacies of regex in Python and the Seq class...


#!/usr/bin/env python

import os
from Bio import Clustalw

# This first section of code merges all fasta files located in the input folder of curdir
# into a single file called all.fasta
input_dir = os.path.join(os.curdir, 'input')
input_list = os.listdir(input_dir)
print(input_list)
merged_path = os.path.join(os.curdir, 'all.fasta')
merged_file = open(merged_path, "w")
print(merged_path)
for i in input_list:
    current_path = os.path.join(input_dir, i)
    print("loading " + current_path)
    current_file = open(current_path, "r")
    all_lines = current_file.readlines()  # the parentheses matter: without them nothing is read
    merged_file.writelines(all_lines)
    current_file.close()
    merged_file.write("\n\n")  # blank lines keep records from different files apart
print("done making file")
merged_file.close()

# Once the merged file has been created, it is passed into the alignment program
cline = Clustalw.MultipleAlignCL(merged_path)
cline.set_output('test.aln')
alignment = Clustalw.do_alignment(cline)
all_records = alignment.get_all_seqs()

print(alignment)

I have yet to write code that counts, say, how many frameshift mutations there are. It just prints the raw alignment for now.
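As a rough idea of what that counting might look like, here is a sketch that compares one aligned row against the reference row. It is a simplification and not the plan described earlier: instead of looking at the protein alignment, it flags any nucleotide gap run whose length is not a multiple of 3 as a frameshift candidate. The function name summarize_differences is made up, and both rows are assumed to be the full concatenated alignment rows (a gap run split across display blocks would otherwise be counted twice).

```python
import re


def summarize_differences(ref_row, other_row):
    """Compare two equal-length aligned rows, using '-' as the gap character."""
    if len(ref_row) != len(other_row):
        raise ValueError("aligned rows must have the same length")

    # Columns where both rows have a base and the bases differ.
    mismatches = [
        i for i, (a, b) in enumerate(zip(ref_row, other_row))
        if a != b and a != "-" and b != "-"
    ]
    # Gap runs in the other row: bases missing relative to the reference.
    deletions = [(m.start(), len(m.group())) for m in re.finditer(r"-+", other_row)]
    # Gap runs in the reference row: bases inserted relative to the reference.
    insertions = [(m.start(), len(m.group())) for m in re.finditer(r"-+", ref_row)]
    # A gap run that is not a whole number of codons shifts the reading frame.
    frameshift_candidates = [gap for gap in deletions + insertions if gap[1] % 3 != 0]

    return {
        "mismatches": mismatches,
        "deletions": deletions,
        "insertions": insertions,
        "frameshift_candidates": frameshift_candidates,
    }


result = summarize_differences("GGAGCAAGCGG", "GGAG--GGCGG")
print(result["mismatches"], result["deletions"], result["frameshift_candidates"])
```

Counting is then just a matter of taking len() of each list for every non-reference sequence.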

Using test files [[Media:apoemod.fasta]] and [[Media:Copy of apoe.fasta]], the following output is generated.

loading .\input\apoe.fasta
loading .\input\Copy of apoe.fasta
done making file
CLUSTAL X (1.81) multiple sequence alignment

gi|178350|gb|K00296.1|HUMAPOE3     CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|189350|gb|K10296.1|HUMAPOE3     CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|178850|gb|K00396.1|HUMAPOE3     CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|178843|gb|K06396.1|HUMAPOE3     CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|189350|gb|K10296.1|HUMAPOE3     AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|178850|gb|K00396.1|HUMAPOE3     AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|178843|gb|K06396.1|HUMAPOE3     AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
gi|189350|gb|K10296.1|HUMAPOE3     CAGGATGCCAGGCCAAGGTGGAG--GGCGGTGGAGACAGAGCCGGAGCCC
gi|178850|gb|K00396.1|HUMAPOE3     CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
gi|178843|gb|K06396.1|HUMAPOE3     CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
                                   ***********************  ************************

gi|178350|gb|K00296.1|HUMAPOE3     GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|189350|gb|K10296.1|HUMAPOE3     GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|178850|gb|K00396.1|HUMAPOE3     GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|178843|gb|K06396.1|HUMAPOE3     GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     ACTGGGTCGCTTTTGGGATTAATCCTGCGCTGGGTGCAGACACTGTCTGA
gi|189350|gb|K10296.1|HUMAPOE3     ACTGGGTCGCTTTTGGGATTAATCCTGCGCTGGGTGCAGACACTGTCTGA
gi|178850|gb|K00396.1|HUMAPOE3     ACTGGGTCGCTTTTGGGATTA--CCTGCGCTGGGTGCAGACACTGTCTGA
gi|178843|gb|K06396.1|HUMAPOE3     ACTGGGTCGCTTTTGGGATTA--CCTGCGCTGGGTGCAGACACTGTCTGA
                                   ********************* ***************************

gi|178350|gb|K00296.1|HUMAPOE3     GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|189350|gb|K10296.1|HUMAPOE3     GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|178850|gb|K00396.1|HUMAPOE3     GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|178843|gb|K06396.1|HUMAPOE3     GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|189350|gb|K10296.1|HUMAPOE3     CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|178850|gb|K00396.1|HUMAPOE3     CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|178843|gb|K06396.1|HUMAPOE3     CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|189350|gb|K10296.1|HUMAPOE3     GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|178850|gb|K00396.1|HUMAPOE3     GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|178843|gb|K06396.1|HUMAPOE3     GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|189350|gb|K10296.1|HUMAPOE3     GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|178850|gb|K00396.1|HUMAPOE3     GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|178843|gb|K06396.1|HUMAPOE3     GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|189350|gb|K10296.1|HUMAPOE3     GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|178850|gb|K00396.1|HUMAPOE3     GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|178843|gb|K06396.1|HUMAPOE3     GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|189350|gb|K10296.1|HUMAPOE3     AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|178850|gb|K00396.1|HUMAPOE3     AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|178843|gb|K06396.1|HUMAPOE3     AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|189350|gb|K10296.1|HUMAPOE3     TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|178850|gb|K00396.1|HUMAPOE3     TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|178843|gb|K06396.1|HUMAPOE3     TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|189350|gb|K10296.1|HUMAPOE3     ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|178850|gb|K00396.1|HUMAPOE3     ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|178843|gb|K06396.1|HUMAPOE3     ACCAT---CCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
                                   ****       **************************************

gi|178350|gb|K00296.1|HUMAPOE3     GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|189350|gb|K10296.1|HUMAPOE3     GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|178850|gb|K00396.1|HUMAPOE3     GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|178843|gb|K06396.1|HUMAPOE3     GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|189350|gb|K10296.1|HUMAPOE3     GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|178850|gb|K00396.1|HUMAPOE3     GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|178843|gb|K06396.1|HUMAPOE3     GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|189350|gb|K10296.1|HUMAPOE3     AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|178850|gb|K00396.1|HUMAPOE3     AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|178843|gb|K06396.1|HUMAPOE3     AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|189350|gb|K10296.1|HUMAPOE3     CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|178850|gb|K00396.1|HUMAPOE3     CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|178843|gb|K06396.1|HUMAPOE3     CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|189350|gb|K10296.1|HUMAPOE3     GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|178850|gb|K00396.1|HUMAPOE3     GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|178843|gb|K06396.1|HUMAPOE3     GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|189350|gb|K10296.1|HUMAPOE3     AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|178850|gb|K00396.1|HUMAPOE3     AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|178843|gb|K06396.1|HUMAPOE3     AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|189350|gb|K10296.1|HUMAPOE3     CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|178850|gb|K00396.1|HUMAPOE3     CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|178843|gb|K06396.1|HUMAPOE3     CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|189350|gb|K10296.1|HUMAPOE3     CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|178850|gb|K00396.1|HUMAPOE3     CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|178843|gb|K06396.1|HUMAPOE3     CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|189350|gb|K10296.1|HUMAPOE3     CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|178850|gb|K00396.1|HUMAPOE3     CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|178843|gb|K06396.1|HUMAPOE3     CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|189350|gb|K10296.1|HUMAPOE3     CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|178850|gb|K00396.1|HUMAPOE3     CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|178843|gb|K06396.1|HUMAPOE3     CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
                                   **************************************************

gi|178350|gb|K00296.1|HUMAPOE3     TTTCACGT
gi|189350|gb|K10296.1|HUMAPOE3     TTTCACGT
gi|178850|gb|K00396.1|HUMAPOE3     TTTCACGC
gi|178843|gb|K06396.1|HUMAPOE3     TTTCACGC
                                   *******

Each of the two files contains two sequences (I made fake changes to each). Yes...I realize that this is not particularly interesting.
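For any later analysis, the blocks of a CLUSTAL listing like this one have to be stitched back into one full row per sequence. A small sketch of that step, working on the plain text of the .aln file rather than on the Biopython alignment object; the function name parse_clustal_text is made up, and it assumes the consensus lines start with whitespace, as they do in the file clustalw writes.

```python
def parse_clustal_text(text):
    """Stitch the blocks of a CLUSTAL-format listing into (name, full_row) pairs."""
    chunks = {}
    order = []
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and the star/consensus lines
        # (which begin with whitespace instead of a sequence name).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        if name not in chunks:
            chunks[name] = []
            order.append(name)
        chunks[name].append(chunk)
    # Keep the sequences in the order they first appear in the file.
    return [(name, "".join(chunks[name])) for name in order]


sample = (
    "CLUSTAL X (1.81) multiple sequence alignment\n"
    "\n"
    "seqA     ACGT\n"
    "seqB     AC-T\n"
    "         ** *\n"
    "\n"
    "seqA     GG\n"
    "seqB     GG\n"
    "         **\n"
)
print(parse_clustal_text(sample))  # [('seqA', 'ACGTGG'), ('seqB', 'AC-TGG')]
```

The first pair returned can then be treated as the reference row and each remaining row compared against it.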