User talk:Anugraha Raman

Final Thoughts
Check out [our page] for a summary of our final thoughts and our final presentation!

Over the course of this project I've worked on finding primary literature to help identify SNPs, reading primary literature to understand what potential models could look like, meeting with the biology group to discuss future directions, helping create data sets, creating a small tool to help sieve through primary literature for SNPs relevant to various characteristics, and documenting what we've done.

It's been a really fun interdisciplinary project, and I've enjoyed getting to know some wonderful people!

Thoughts between Nov 9 and Final Days
Meeting

We met to discuss who to contact about getting more real data sets, and then we met again to discuss how to create the data set for the modeling group to use. See explanation on the biology group page.

Increasing Target Audience

By altering the trait-o-matic user interface we can expand the target audience. One idea was to have an interface where you could search by traits in drop-down menus; by working backwards from these phenotypes you could determine different potential genotypes, each with different probabilities. This way you could potentially identify someone in a forensic-type setting, or you could piece together an image of someone usable in a dating-service-type application (similar to SNP Cupid, except you're not making predictions about the children).

Allowing for a 3D visualization in which the user could click on a portion of the person and examine traits in that particular area would let users with less genetics background use this tool easily. It would also open it up to a younger audience.

Tool

This tool (Useful Tool Script) uses the [[Media: Aggregate_gwas_studies.xlsx| Aggregate GWAS Studies Spreadsheet]] from the [| cumulative GWAS study] to:
 * 1) Identify all traits studied by these GWAS studies and group them together
 * 2) List rsids linked to information from dbSNP about these SNPs (including, in particular, population diversity data, some of which is linked to HapMap info)
 * 3) Link rsids to corresponding primary literature
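The three steps above could be sketched roughly as follows. This is a minimal sketch, not the actual tool script: the tab-delimited column layout, the placeholder rsids/PMIDs, and the exact NCBI URL patterns are all assumptions for illustration.

```python
from collections import defaultdict

def group_gwas_rows(rows):
    """Group (rsid, pmid) pairs by trait and attach dbSNP/PubMed links."""
    by_trait = defaultdict(list)
    for trait, rsid, pmid in rows:
        by_trait[trait].append({
            "rsid": rsid,
            # dbSNP refSNP page, where the population diversity data lives
            "dbsnp": "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=" + rsid.lstrip("rs"),
            # link back to the primary literature
            "pubmed": "http://www.ncbi.nlm.nih.gov/pubmed/" + pmid,
        })
    return dict(by_trait)

# Placeholder rows standing in for parsed lines of Gwas.txt
rows = [("Height", "rs0000001", "10000001"),
        ("Height", "rs0000002", "10000002"),
        ("Eye color", "rs0000003", "10000003")]
grouped = group_gwas_rows(rows)
```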

Important Files: [[Media:Gwas.txt| GWAS text file]], [[Media:Frames.css| Frame style sheet]]
 * Tool web page frame
 * Tool web page header
 * Tool web page lhs
 * Tool web page main

Here are some screenshots:

Basic listing of traits found in the study:

Link to population diversity data found on NCBI's dbSNP:

Link to primary literature in PubMed:

Wish List

When asked for Wish List ideas previously, something that struck me was understanding protein-protein interactions. When reading about blood type, I found it interesting that sometimes an individual can be type O even when neither parent carries an O allele. This is because inheritance of the H antigen found on the surface of red blood cells plays an important role in determining blood type. An individual who is homozygous recessive at the H locus will still be type O, since the H antigen is a precursor to the A and B antigens.
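This H-locus masking is a textbook case of epistasis, and it can be captured in a few lines. A toy sketch, not part of any tool here; the two-letter genotype encoding is my own simplification.

```python
def blood_type(abo, h):
    """Toy model of ABO blood type with H-locus epistasis.

    abo: pair of ABO alleles, e.g. ('A', 'O'); h: pair of H alleles,
    e.g. ('H', 'h').  With no functional H allele there is no H-antigen
    precursor, so no A or B antigen can be built and the individual
    types as O regardless of the ABO genotype.
    """
    if 'H' not in h:  # hh: homozygous recessive at the H locus
        return 'O'
    alleles = set(abo)
    if 'A' in alleles and 'B' in alleles:
        return 'AB'
    if 'A' in alleles:
        return 'A'
    if 'B' in alleles:
        return 'B'
    return 'O'
```

So two AB parents each carrying an h allele can have a type O child, which is the surprise described above.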

November 9, 2009
We met earlier today to discuss the ideal data set. Here is an interesting modeling paper about a two-locus model: [http://www.biomedcentral.com/1471-2156/9/17]
It discusses the four common models found:
 * additive × additive
 * additive × dominant
 * dominant × additive
 * dominant × dominant

It seems that the ideal data set would map two alleles --> 1 trait.
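Those four model combinations could be encoded like this. A rough sketch only: genotypes are coded as 0/1/2 copies of a risk allele, and I assume a simple multiplicative interaction between the two locus effects, which may differ from the paper's exact parameterization.

```python
def locus_effect(g, model):
    """Single-locus effect for genotype g = 0, 1, or 2 risk-allele copies."""
    if model == 'additive':
        return g / 2.0                  # 0.0, 0.5, 1.0
    if model == 'dominant':
        return 1.0 if g >= 1 else 0.0   # any copy gives the full effect
    raise ValueError("unknown model: %s" % model)

def two_locus_risk(g1, g2, model1, model2):
    """Relative risk under a multiplicative two-locus interaction model."""
    return locus_effect(g1, model1) * locus_effect(g2, model2)
```

Here `two_locus_risk(1, 2, 'dominant', 'additive')` covers the dominant×additive case, and similarly for the other three.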

Our first example just has to be biologically inspired, not necessarily related to humans. Jackie suggested a data set currently known for yeast, in which knockouts were done to figure out the epistatic interactions. The idea is that we could look for homologues of these genes in humans, just to show a proof of principle.

Looking at height: break the body up into three parts, measured against a baseline of the average for each part, so that we have additive properties. E.g., if the torso is two inches below the mean and the legs are three inches above the mean, the person is one inch taller than the mean.
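In other words, the additive model is just a sum of per-segment deviations from the mean. A toy sketch with made-up numbers matching the example above:

```python
# Deviation of each body segment from its population mean, in inches
# (segment names and values are made up for illustration)
segment_deviation = {'head': 0.0, 'torso': -2.0, 'legs': 3.0}

# Under a purely additive model, the total height deviation is the sum
total_deviation = sum(segment_deviation.values())  # -2 + 3 = +1 inch taller
```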

Ideally we want to see all-or-nothing epistasis, but we need to start with the simplest example, i.e., total dominance.

Pharmacogenetics
 * Looking at pharmacogenetics, it seems that we would have a very simple project (one SNP --> one phenotype)
 * I finally realized that the reason for this simplicity is that we have yet to go through metabolic pathways in order to find how multiple SNPs work together to yield a certain response
 * This data would be much more difficult to obtain from the PGP because we need:
 * enough people with the same response to a certain product
 * the ability to study metabolic pathways in these people beyond their genome

Questions that I had:
 * How exactly would this learning work? Can we input an extremely simple example with just two genes leading to one trait, and then feed in multiple genes?
 * For learning, do you have to feed in every single example separately? What exactly is the least amount of information needed?

November 3, 2009

 * When first hunting for literature on polygenic traits, I found many articles about the steps to take towards identifying and characterizing polygenic traits. As it turns out, identifying QTLs is very important for identifying polygenic traits.


 * However, in our project, we are primarily interested in figuring out which polygenic traits a particular individual possesses. Then we can move on to cooler applications, like finding novel polygenic traits.


 * I found this cool article, Pharmacogenomics: Translating Functional Genomics into Rational Therapeutics, that would really help out with the pharmacogenetics project :D


 * If we look at Table 1, we find a list of polymorphisms of genes important to drug metabolism, and how they would affect different phenotypes. We could immediately start searching for these polymorphisms in the genomes entered as input and scan for these specific mutations, thus being able to readily output a phenotype
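That scan could be as simple as a dictionary lookup over the individual's SNPs. A sketch only: the (gene, rsid) keys and phenotype strings below are placeholders, not values taken from the paper's Table 1.

```python
# Placeholder lookup table of known pharmacogenetic polymorphisms.
# In practice this would be transcribed from Table 1 of the review.
KNOWN_POLYMORPHISMS = {
    ('CYP2D6', 'rs0000001'): 'poor metabolizer of drug X',
    ('CYP3A4', 'rs0000002'): 'altered clearance of drug Y',
}

def scan_genome(genome_snps):
    """Report known drug-metabolism phenotypes in an individual's SNPs.

    genome_snps: iterable of (gene, rsid) pairs from the input genome.
    """
    hits = []
    for gene, rsid in genome_snps:
        phenotype = KNOWN_POLYMORPHISMS.get((gene, rsid))
        if phenotype is not None:
            hits.append((gene, rsid, phenotype))
    return hits
```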


 * Perhaps, in order to make our searching method more efficient, we could first look for genes involved in the largest number of pathways, such as CYP3A4, look for mutations in those, and then work our way from most common to least common. It is nice that in this picture we can start looking at genes in terms of frequency


 * Another interesting find in this article was that pharmacogenetic polymorphisms differ in frequency among ethnic and racial groups. So now we know to include these as a primary criterion when we choose to look at external factors

October 29, 2009
Thoughts on Tuesday's discussion.

The first step in either project would involve developing a method to analyze polygenic traits.

This would involve:

a) developing a method

b) testing the method on pre-existing polygenic traits

c) verifying that this method works with accuracy

Though this could be introduced directly into trait-o-matic, it could also be the basis for SNP Cupid or the metabolism project.

I would feel comfortable helping out with some programming aspects in this first step. Overall, though, I would feel more comfortable thinking about the questions that need to be answered, how the results would be presented to the user, what other factors to take into account when thinking about either polygenic traits, or genetic inheritance, or metabolic pathways, and what is and isn't biologically realistic.

I think a second step after identifying polygenic traits, would be to work in the effects of epistasis into our predictions.

With SNP Cupid, I envision the different possible outcomes for the child's traits at different genes being ranked, with probabilities attached to them.

October 27, 2009
Thoughts on last week's discussion

After last week's discussion it seems that SNP Cupid would be a wonderful class project. It is broad enough that it can incorporate many different sub-projects, and by creating a base now, many tools can be added on later. The whole polygenic trait subproject could be incorporated into SNP Cupid results to increase interest. Furthermore, SNP Cupid allows work to be divided up into different areas that are not all very programming intensive. Much thought would need to go into what kinds of questions people would need answers to, and the best and most reliable sources to consult in order to obtain this information before feeding it to a program.

October 13/20 2009
Step 1: We would first identify a phenotype of interest in our PGP population. Say, for example, people who do not gain weight from high-fat diets.

Step 2: We would use a tool like BLAST to look at sequence alignment and find portions that were the same. Then we would use OMIM to look for known genes, and weed out similar sequences that belonged to known genes. (At this point we could also use HGMD to look for known mutations, and GeneTests to see if any clinical testing had been done on our sequence.) Additionally, in the weeding-out step we could go to SNPedia and GeneTests to find common SNPs and mutations of the known genes to further our weeding-out process.

Step 3: We would look at a larger genome dataset to confirm our new phenotype-genotype finding, and add to the GeneTests database.
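The weeding-out logic in Step 2 boils down to a set difference over candidate regions. A hypothetical sketch: the helper below assumes the BLAST/OMIM lookups have already produced a mapping from candidate region to known gene (or None if no known gene matched).

```python
def weed_out(candidate_regions, known_genes):
    """Keep candidate regions that do not map to an already-known gene.

    candidate_regions: dict of region id -> matched gene name, or None if
    the BLAST/OMIM lookup found no known gene for that region.
    known_genes: set of gene names collected from OMIM/HGMD/GeneTests/SNPedia.
    """
    return [region for region, gene in candidate_regions.items()
            if gene is None or gene not in known_genes]
```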

Checking the validity of our tool:

When looking under the genetics wiki page under gene regulation, I found a small paragraph on epigenetic factors influencing DNA and DNA inheritance. I thought it would be interesting to add that these changes, such as methylation or acetylation, occur post-translationally and silence genes from being expressed. It is also interesting to note that by removing methylation patterns one can more easily reprogram a cell, which would be useful in the line of therapeutics development. Furthermore, by looking for hyper-/hypo-methylated promoter CpG islands, we may be able to more easily identify potential cancerous tumor sites.
 * Existing information on Wikipedia

October 8 2009
Our idea was to use OMIM, GeneTests, and SNPedia in order to find new linkages between genotype sequences and their corresponding phenotypes. We also wanted to attempt to find the minimal number of genotypic sequences that would correspond to a complex phenotypic trait. In order to show our working system we would first show our program working with known genotype-phenotype linkages. See the Project Talk Page under projects for a more detailed summary.

September 29 Due Assignment


The premise of the major idea is that the environment and our lifestyle play major roles in disease onset probability due to their effect on our epigenetic patterns. The Homo Sapiens toolkit should include the means for our species to test itself for specific diseases due to these epigenetic factors.

PBS Nova aired a show titled “Tale of Two Mice” that focused on epigenetics. It featured genetically identical mice of the same sex and age that were phenotypically distinct due to a methyl-rich diet. Specific DNA regions becoming hyper-methylated can lead to the onset of cancer. Among humans, one twin getting cancer while the other does not can be explained by diet and environmental factors that resulted in methylation and eventually cancer.

Today the only known natural modification of human DNA is via DNA methylation. This methylation affects the cytosine base (C) when it is followed by a guanine (G), i.e., only at CpG sites. When promoter CpG islands become methylated, the associated gene becomes permanently silenced.
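Counting CpG sites and flagging candidate CpG islands is easy to prototype. A sketch using the commonly cited Gardiner-Garden and Frommer criteria (length >= 200 bp, GC content > 50%, observed/expected CpG ratio > 0.6); the thresholds vary between studies, so treat them as assumptions.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count('C')
    g = seq.count('G')
    cpg = seq.count('CG')          # CpG dinucleotides cannot overlap
    gc_fraction = (c + g) / float(n)
    # expected CpG count if C and G occurred independently
    expected = (c * g) / float(n)
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq):
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= 200 and gc_fraction > 0.5 and obs_exp > 0.6
```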

Wet-lab methylation ‘‘profiling’’ studies have shown characteristic sets of aberrantly methylated genes with varying CpG island methylation patterns in specific cancer tumors. One of the challenges faced by these lab techniques is degradation of 90% of the incubated DNA. The conditions necessary for complete conversion, such as long incubation times, elevated temperature, and high bisulphite concentration, can lead to this degradation.

An immediate small step towards the Homo Sapiens 2.0 goal of self-testing for epigenetic-factor-based diseases is trying to predict algorithmically whether a specific gene is methylation prone or resistant.

References


 * 1) Paper0 pmid=17284773

// example environment and lifestyle linkage to epigenetics

// Small Step 1: Predict algorithmically if a specific gene is Methylation prone or resistant


 * 1) Paper1 pmid=11782440
 * 2) Paper2 pmid=16837523
 * 3) Paper3 pmid=14519846

// Medium sized Step 2: ‘Count’ and curate Methylation levels for specific genes which are normal and diseased

// Lung cancer example: CDKN2A gene showing normal methylation ~0% and diseased methylation around ~40%
// Another lung cancer example: DAPK1 gene showing normal methylation ~4% and diseased methylation around ~40%
 * 1) Paper4 pmid=17932060
 * 2) Paper5 pmid=11106248
 * 3) Paper6 pmid=12912953

// Large sized Step 3: Predict Methylation levels based on variables - tbd

// Much larger sized Step 4: Create in vivo logic based “counter” that will light up when it detects biomarkers within range of disease based on Methylation levels


 * 1) Paper7 pmid=19478183
 * 2) Paper8 pmid=17515909

// Final large sized Step 5: Make the step 4 setup into a kit and let Homo Sapiens test themselves


 * 1) Paper9 pmid=19458720
 * 2) Paper10 pmid=19424153

Useful links
 * [Tale of Two Mice]
 * [Personal Genome Project]
 * [Computational Epigenetics]

Week 3 Assignment
Problems 1, 2, and 3 were done within one Python script and problem 4 in a separate script. The Biopython Tutorial helped me understand how to proceed with the functions that had to be written for this assignment!


 * Quick Summary
 * Problems 1, 2, and 3 Code


 * Problem 4 Code

The fourth problem was very interesting! I had a blast working on it. Hopefully it is done correctly :)

The length of the given sequence is 1020 base pairs. For every 100 base pairs, the script randomly tries to mutate a to t/g/c, t to a/g/c, etc., with a probability of 0.01. I then made the script run this 100 times and called it a simulation. The script did 800 such simulations and aggregated the results in the plot shown below:



As you can see, the output plot after 800 simulations of 100 sets of single base pair evolutionary mutations (as described in assignment 3b, problem 4) shows about four to five (4.65) premature terminations for every 1020 mutations.

I created a simple FASTA-style text file to read in the p53seg sequence that was provided.
 * Reading in the Input Sequence

p53 seg sequence text file

input_file = open('p53seg.txt', 'r')
for cur_record in SeqIO.parse(input_file, "fasta"):
    my_seq = cur_record.seq

The GC content % required that a float be used in the denominator to get results.
 * Problem 1 (GC Content)

g_count = cur_record.seq.count('g')
 * 1) GC count done explicitly, i.e. problem #1 in this assignment set
 * 2) Get the number of Guanines in the sequence

c_count = cur_record.seq.count('c')
 * 1) Get the number of Cytosines in the sequence

seq_count = len(cur_record)
 * 1) Get the length of the sequence

gc_percent = ((g_count + c_count) / float(seq_count)) * 100
print 'GC % is: ' + str(gc_percent)
 * 1) use float in denominator to get the decimal answer for GC%

The reverse complement was obtained by simply using the Seq.reverse_complement function.
 * Problem 2 (Reverse Complement)

rev_seq = my_seq.reverse_complement()
print 'DNA reverse complement of p53seg is: '
output_file.write('DNA reverse complement of p53seg is: ')
print rev_seq
output_file.write(str(rev_seq))
 * 1) get the reversed complement of the sequence, i.e. problem #2 in this assignment set

The third problem was done two ways. The first way was to use the Standard table provided by Harris in the 3b word document. The second was to use the standard definition in the Bio.Data CodonTable.py file using the default implementation Seq.translate(sequence).
 * Problem 3 (Frame Translation)


 * 1) Standard translation from Biophys101_assign3b.doc

standard3b = {
    'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
    'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
    'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
    'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',

    'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
    'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
    'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
    'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',

    'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
    'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
    'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
    'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',

    'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
    'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
    'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
    'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
}

def translate_dna(seq):
    """Translates tri-nucleotide sequences (codons) to their one-letter amino acids."""
    aa_translation = ""
    for codon_loc in xrange(0, len(seq), 3):
        # if the codon translation is not found (i.e. a partial codon
        # or something else), replace it with ?
        aa_translation = aa_translation + standard3b.get(str(seq[codon_loc:codon_loc+3]), "?")
    return aa_translation

Frames +1, +2, -1, and -2 could be computed simply as shown below.

# +1 frame: the original sequence
plusone_seq = seq
amino_seq1 = translate_dna(plusone_seq)
print '(+1) frame translation is: '
print amino_seq1

# +2 frame: original sequence minus the first nucleic acid in the sequence
plustwo_seq = seq[1:]
amino_seq2 = translate_dna(plustwo_seq)

# -1 frame: the reverse complement of the original sequence
r_seq = seq.reverse_complement()
minusone_seq = r_seq
amino_seq4 = translate_dna(minusone_seq)

# -2 frame: reversed sequence minus the first nucleic acid
minustwo_seq = r_seq[1:]
amino_seq5 = translate_dna(minustwo_seq)
 * 1) +1 Frame
 * 2) +2 Frame
 * 3) -1 Frame
 * 4) -2 Frame

Way 2, using the standard table predefined in Biopython, is shown below.

 * 1) Method 2 ===> Using the standard table defined in Bio.Data's CodonTable.py, by using the default Seq.translate

# +1 Frame
# using the translate method in Bio.Seq
# implemented in Lib/site-packages/Bio/Seq.py
plusone_seq = seq
amino_seq1 = Seq.translate(plusone_seq)

Many functions were defined for this script; however, the two that are key are mutatesinglebp and findstops.
 * Problem 4 (Single bp mutation simulation to detect early terminations)

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from random import *
import os


 * 1) Functions defined in this script file are as follows:
 * 2) writeheader(myfile)     : Writes a specific HTML header using the myfile handle
 * 3) writefooter(myfile)     : Writes a specific HTML footer using the myfile handle
 * 4) writerunsummary(myfile) : Writes a specific set of summary information using the myfile handle
 * 5) mutatesinglebp(seq, random_seed, forevery, prob) : Mutates a single base pair for every forevery-location range using a probability of prob; returns the mutated sequence
 * 6) writeAA(myfile, seq, stop_locs) : Writes the amino acids using the myfile handle
 * 7) findstops(seq)          : Finds the stop locations in the given DNA sequence

 * 1) mutate_singlebp mutates a single base pair for every forevery-location range using a
 * 2) probability of prob

def mutate_singlebp(seq, random_seed=0, forevery=100, prob=0.01):
    # reset the random seed
    if random_seed == 0:
        seed()
    else:
        seed(random_seed)
    for x in range(0, forevery):
        r = randrange(0, int(1 / prob))
        if r == 0:
            mutate_pos = randrange(0, len(seq) - 1)
            old_base = seq[mutate_pos]
            if old_base == 'a':
                mutated_base = choice(['c', 't', 'g'])
            elif old_base == 'c':
                mutated_base = choice(['a', 't', 'g'])
            elif old_base == 't':
                mutated_base = choice(['a', 'c', 'g'])
            else:  # 'g'
                mutated_base = choice(['a', 'c', 't'])
            # the mutable sequence seq is updated in place
            seq[mutate_pos] = mutated_base
        # end if
    # end for

 * 1) end of function mutate_singlebp

def findstops(seq):
    stop_array = []
    start = 0
    stop_pos = 0
    while stop_pos != -1:
        stop_pos = Seq.translate(seq).find('*', start)
        start = stop_pos + 1
        stop_array = stop_array + [stop_pos]
    # end of while
    stop_array.remove(-1)
    return stop_array

 * 1) end of function findstops

Calling the functions:

# I have to create a mutable sequence to use the mutate_singlebp function
rand_seq = my_seq.tomutable()

# calling the function that will do most of the heavy lifting
mutate_singlebp(rand_seq)

# convert the mutable sequence back to a normal DNA sequence in order to use the translate method
new_seq = Seq(rand_seq.toseq().tostring(), IUPAC.unambiguous_dna)
b = findstops(new_seq)

Week 2 Assignment
With the first graph (exponential), with larger values of k the graph increases much faster. In fact, as you can see when you compare the "red triangles" to the "red circles", the exponential curve associated with the red triangles (k=4.03) dwarfs the exponential curve associated with the red circles (k=0.9) so much that the red circles curve appears almost linear in the top graph.



In the second graph (logistic), with larger values of k the graph not only grows faster, but also starts leveling off sooner. If we relate this to population growth, larger k values result in the population reaching its carrying capacity sooner.
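That leveling-off behavior can be checked against the closed-form logistic curve P(t) = C / (1 + A e^(-kt)) with A = (C - P0)/P0. A sketch; the carrying capacity C and initial population P0 below are arbitrary values, not the ones from my plots.

```python
import math

def logistic(t, k, C=1000.0, P0=10.0):
    """Closed-form logistic growth P(t) = C / (1 + A*exp(-k*t))."""
    A = (C - P0) / P0
    return C / (1.0 + A * math.exp(-k * t))

def time_to_fraction(k, frac=0.9, C=1000.0, P0=10.0):
    """Time at which the population reaches frac * C (the curve solved for t)."""
    A = (C - P0) / P0
    return math.log(A * frac / (1.0 - frac)) / k
```

With the k values from the plots, time_to_fraction(4.03) comes out much smaller than time_to_fraction(0.9), matching the observation that larger k levels off sooner.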



The last graph (not shown) is meant to correct for "negative population." I am still having trouble using the max function to evaluate individual values in my array.
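One way around that trouble: Python's built-in max takes the largest of its arguments, so max(array) returns a single number, but max(value, 0) applied inside a list comprehension clamps each element individually (with NumPy, numpy.maximum(arr, 0) does the same thing element-wise). A sketch with made-up values:

```python
population = [5.0, -2.3, 0.0, 7.1, -0.4]

# max(value, 0) clamps each individual value at zero;
# max(population) would instead return only the single largest element
clipped = [max(value, 0) for value in population]
```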