# Chris Rhodes Week 8

For today's lab we will be working from the Bioinformatics for Dummies 2nd edition book, performing selected activities from Chapters 2, 4, 5, and 6 but modifying the protocols to fit the current website formats and to use HIV-1 gp120.

## In Class Activities

Retrieving Protein Sequences

• The protein retrieved in this exercise is HIV-1 gp120. I went to ExPASy and searched the UniProtKB database for "HIV gp120 envelope protein", but a verified, stand-alone gp120 entry could not be found. The gp120 protein sequence was instead taken from a gp160 entry, which contains the gp120 sequence. The UniProtKB entry of the gp160 protein used is found here and the sequence of the gp120 protein, shown as the highlighted residues within the gp160 protein sequence, is found here: http://www.uniprot.org/blast/?about=P04578[33-511] (this address could not be properly hyperlinked because the [33-511] text breaks the linking format).
• The FASTA form of the gp120 protein sequence was retrieved from the entry page and is shown here:
>sp|P04578|33-511
KLWVTVYYGVPVWKEATTTLFCASDAKAYDTEVHNVWATHACVPTDPNPQEVVLVNVTENFNMWKNDMVEQMHEDIISL
WDQSLKPCVKLTPLCVSLKCTDLKNDTNTNSSSGRMIMEKGEIKNCSFNISTSIRGKVQKEYAFFYKLDIIPIDNDTTSY
KLTSCNTSVITQACPKVSFEPIPIHYCAPAGFAILKCNNKTFNGTGPCTNVSTVQCTHGIRPVVSTQLLLNGSLAEEEVVI
QFGNNKTIIFKQSSGGDPEIVTHSFNCGGEFFYCNSTQLFN

• From the list of gp160 proteins found when searching for gp120 in the first step, 5 sequences were chosen for the multiple retrieval exercise. The UniProt ID numbers of the 5 sequences are P04578, P03377, P03375, P35961, and P05877. From the options for downloading the sequences I chose the FASTA format; the txt version of the combined sequence FASTA file can be found here
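
The retrieval steps above can also be sketched in code. This is only a minimal illustration, not the procedure used in the lab: the URL pattern is an assumption about UniProt's REST service, the helper names (`fasta_urls`, `parse_fasta`, `subsequence`) are hypothetical, and nothing is downloaded here. It shows how the five accessions map to FASTA addresses and how a 1-based range such as 33-511 would be cut out of a longer gp160 sequence.

```python
from typing import Dict, List

# Assumption: UniProt serves one FASTA record per accession at this address.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"

# The five gp160 entries chosen for the multiple retrieval exercise.
ACCESSIONS = ["P04578", "P03377", "P03375", "P35961", "P05877"]


def fasta_urls(accessions: List[str]) -> List[str]:
    """Build one FASTA download address per accession (no network access)."""
    return [UNIPROT_FASTA_URL.format(accession=a) for a in accessions]


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are wrapped, so join them back together.
            records[header] += line
    return records


def subsequence(sequence: str, start: int, end: int) -> str:
    """Return residues start..end using 1-based, inclusive coordinates."""
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("range falls outside the sequence")
    return sequence[start - 1:end]


if __name__ == "__main__":
    for url in fasta_urls(ACCESSIONS):
        print(url)
    toy = parse_fasta(">toy|demo\nMKLV\nTAYG\n")
    print(subsequence(toy["toy|demo"], 3, 6))
```

With the full gp160 sequence of P04578 loaded by `parse_fasta`, `subsequence(gp160, 33, 511)` would return the gp120 region described above.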

• As with the first activity, a UniProt verified gp120 protein could not be found, so I will be working with the gp160 entry instead.
• The UniProt entry of the gp160 protein used can be found here
• The entry itself is very in-depth and contains a lot of information. Some of the major features of the entry include:
• The protein name along with the names of the proteins that result from the cleavage of the original protein
• The protein sequence, source organism, and in this case the viral host.
• In-depth description of known functions and mechanisms of function.
• Known Secondary Structure and ontologies
• References to the studies and labs used to create all the information found in the UniProt protein entry.

• The NCBI ORF Finder can be found here
• The sequence used for this exercise was found by searching the NCBI nucleotide database for gp120 of HIV-1. The NCBI entry page for the sequence chosen can be found here and the FASTA form is shown below:
>gi|328550457|gb|JF701706.1| HIV-1 isolate gp120_Oct_10 from USA vpu protein (vpu) and envelope glycoprotein (env) genes, partial cds
CAGAAAGAGCAGAAGACAGTGGCAATGAGAGTGAAGGGGATCAGGAAGAATTATCAGCACTTGTGGACAT
GGGGCATCATGCTCCTTGGGATGTTAATGATCTGTAGTGCTGCAGGAAATTGGTGGGTCACAGTCTATTA
TGGAGTRCCTGTGTGGAAAGAAGCAACCACCACTCTATTTTGTGCATCAGATGCTAAAGCATATGATACA
GAGGTACATAATGTTTGGGCCACACATGCCTGTGTACCCACAGACCCCAACCCACAAGAAATATTATTGA
AAAATGTGACAGAAAATTTTAACATGTGGAAAAATGGCATGGTAGAACAAATGCATGAGGATATAATCAG
TTTATGGGATCAAAGCCTAAAGCCATGTGTGAAATTAACCCCACTCTGTGTTACTTTAAATTGCACTARC
TTGAATGTTACTAATACCACTGCTACTAACACAACGAATAATGGCGGGACAACAATGGCGGGAGAAATGA
GAAACTGCTCTTTCAATGTCACCACAAGCATAGGAAATAGGAGACAAAAAGAATATGCGCTTTTGTATAA
ACATGATATAGTACCAATAGATAATAGTACYAACTATATACTAATAAGTTGTAACACCTCAGTCATTACA
CAGGCCTGTCCAAAGATATCCTTTGAACCAATTCCCATACATTATTGTGCCCCAGCTGGTTTTGCGATTC
TAAAGTGTAAYGAGAAGAAGTTCAATGGCACAGGACCATGTAAAAATGTCAGCACAGTACAATGTACACA
TGGAATTAAGCCAGTAGTATCAACTCAACTGTTGTTAAATGGCAGTCTAGCAGAAGAAGAGGTAGTAATT
AGATCTGAAAATTTCACAAACAATGCTAAAACCATAATAGTACAGCTAAACAGTCCTGTATTAATTAATT
GTACAAGACCCAACAACAATACAAGAAAAGGTATACGGATAGGACCAGGGAGAACATTCWTTGCAACAGA
AAGAATAATAGGAGATATAAGACAAGCACATTGYAATCTTAGTAGAGAACAATGGAATAACACTTTAGAA
AAGGTAGCTGCAAAATTAAGAGAACAATTTGAAAATAAGACAATAATCTTTAATCACTCCTCAGGAGGGG
ACCCAGAAATTGTAATGCACAGTTTTAATTGTGGAGGRGAATTTTTCTATTGTAATACAACACAGCTGTT
TAATAGTACTTGGAATAGTACAGGGTCAAATAACRCTAAAGGAGATGAMGTTATCACACTCCCATGCAGA
ATAAAACAAATTGTAAATATGTGGCAGGAAGTAGGAAAAGCAATGTATGCCCCTCCCATCAGWGGACAAA
TTAATTGTTCGTCAAATATTACAGGGCTGCTATTAACAAGAGACGGYGGTAATAATAATAACMTCCAAAA
TGAGACCTTCAGACCTGGAGGAGGAAATATGAAGGACAATTGGAGAAGTGAATTATATAAATATAAAGTA
GTAAARATACAACCATTAGGA


• The ORFs for the gp120 sequence were analyzed by pasting the gp120 sequence into the ORF Finder box and pressing OrfFind
• The ORF Finder output for the gp120 sequence is shown below:

• The results of the ORF Finder tell us the amino acid sequences that would be made by translating the sequence in each of the six reading frames shown. In this case, since the gp120 sequence used was determined from a gp120 protein isolate, the ORF containing the longest or most representative amino acid sequence can be assumed to be the correct ORF, or the ORF most likely to be biologically relevant.
• Based on the results of the ORF Finder for the gp120 sequence, it can be assumed that the ORF in the +1 reading frame is the most likely to be biologically relevant for the sequence.
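
The idea behind the six-frame comparison can be sketched as follows. This is a simplified illustration and not the ORF Finder's own algorithm: it uses the standard genetic code, ranks each frame by its longest stretch without a stop codon (it does not require a start codon, since the sequence is a partial cds), and translates any codon containing an ambiguity code such as R, Y, W, or M to `X`. All function names are hypothetical.

```python
from typing import List, Tuple

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Complement table that also covers the IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; ambiguous codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_open_stretch(protein: str) -> str:
    """Return the longest run of residues between stop codons ('*')."""
    return max(protein.split("*"), key=len)


def rank_frames(dna: str) -> List[Tuple[str, int]]:
    """Return (frame label, longest open stretch length), longest first."""
    dna = dna.upper()
    results = []
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(strand[offset:])
            label = f"{sign}{offset + 1}"
            results.append((label, len(longest_open_stretch(protein))))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for frame, length in rank_frames("ATGAAATTTGGGTAA"):
        print(frame, length)
```

Running `rank_frames` on the nucleotide sequence above would list the six frames by their longest open stretch, which is one rough way to check the frame choice made from the ORF Finder output.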

ProtParam

• The ProtParam tool can be found here
• The sequence used for the ProtParam analysis is the gp120 protein sequence found in the first activity.
• The protein parameters were calculated by pasting the protein sequence into the ProtParam box and clicking Compute Parameters
• The ProtParam output is shown below

• The ProtParam results report several computed properties of the protein sequence analyzed, including:
• Molecular Weight, Amino Acid Composition, Sequence Length, Molecular Formula, and Theoretical pI
• Extinction Coefficient: Predicted absorbance at 280 nm in water, also reported as the expected A280 reading at a protein concentration of 1 g/L
• Instability Index: An estimate of the stability of the protein in a test tube; a value below 40 predicts a stable protein
• Half-Life: Predicted amount of time for half of the protein to degrade after synthesis, estimated for mammalian reticulocytes, yeast, and E. coli
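
Two of the parameters listed above are simple enough to sketch by hand. The example below is a rough illustration rather than ProtParam's implementation: it computes the amino acid composition as percentages and the extinction coefficient at 280 nm using the per-residue values documented for ProtParam (Tyr 1490, Trp 5500, cystine 125, in M^-1 cm^-1). The function names are hypothetical, and molecular weight and the 1 g/L absorbance are left out.

```python
from collections import Counter
from typing import Dict

# Molar extinction contributions at 280 nm in water (M^-1 cm^-1).
EXT_TYR = 1490
EXT_TRP = 5500
EXT_CYSTINE = 125  # one disulfide-bonded pair of Cys residues


def composition(sequence: str) -> Dict[str, float]:
    """Return the percentage of each residue present in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}


def extinction_coefficient(sequence: str, cystines: bool = True) -> int:
    """Estimate the extinction coefficient at 280 nm.

    With cystines=True, every pair of Cys residues is assumed to form a
    disulfide bond; with cystines=False, all Cys are assumed reduced.
    """
    counts = Counter(sequence.upper())
    value = counts["Y"] * EXT_TYR + counts["W"] * EXT_TRP
    if cystines:
        value += (counts["C"] // 2) * EXT_CYSTINE
    return value


if __name__ == "__main__":
    toy = "WYCC"  # toy peptide, not the gp120 sequence
    print(composition(toy))
    print(extinction_coefficient(toy), extinction_coefficient(toy, cystines=False))
```

Applying these two functions to the gp120 sequence from the first activity should give values that can be compared against the ProtParam output.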

Looking for Transmembrane Segments

• This activity uses two different analysis programs: Protscale and TMHMM
• The sequence used in these two exercises is the gp120 protein sequence found in the first activity
• Protscale
• In the Protscale site the gp120 sequence was placed in the box, the window size (located at the bottom of the page) was changed from 9 to 19, and then Submit was clicked.
• The Protscale output is shown below

The threshold for a transmembrane segment is a hydropathy value of 1.6 or greater. Based on the results of Protscale, we can see that there are no transmembrane segments present in the gp120 sequence.

• TMHMM
• In the TMHMM site the gp120 sequence was pasted into the box and without changing any parameters the submit button was pressed
• The TMHMM output is shown below

• The results agree with the results from Protscale and tell us there are no transmembrane segments present in the gp120 sequence.
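
The Protscale step described above amounts to a sliding-window average of hydropathy values. The sketch below is a minimal version under stated assumptions: it uses the Kyte-Doolittle scale with a plain (unweighted) mean over a 19-residue window and flags window centers that reach the 1.6 threshold. The function names are hypothetical, and residues outside the 20 standard amino acids raise a `KeyError`.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19) -> List[Tuple[int, float]]:
    """Return (center position, mean hydropathy) for every full window.

    Positions are 1-based and refer to the residue at the window center.
    """
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        score = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        profile.append((start + half + 1, score))
    return profile


def candidate_tm_positions(sequence: str, window: int = 19,
                           threshold: float = 1.6) -> List[int]:
    """Return window centers whose mean hydropathy meets the threshold."""
    return [pos for pos, score in hydropathy_profile(sequence, window)
            if score >= threshold]


if __name__ == "__main__":
    # Toy sequence: a 19-residue leucine stretch flanked by lysines.
    toy = "K" * 10 + "L" * 19 + "K" * 10
    print(candidate_tm_positions(toy))
```

An empty list from `candidate_tm_positions` for the gp120 sequence would match the conclusion drawn from the Protscale plot.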

## HIV Structure Project

General Outline:

For my project I thought it would be interesting to look at whether or not the predictions of immune response patterns concluded in the previous project coincide with the newly acquired protein sequence data and how the functionality of the protein sequences relates to the observed immune response pattern. This will be done by comparing the phylogenetic trees of DNA used in the previous project with newly generated phylogenetic trees of the protein sequences of certain subjects to establish any changes in tree patterns. The structure of the protein sequences will then be analyzed and compared to each other in an attempt to either confirm or deny the presence of a functional change in the structure of the proteins. This data will then be used to draw conclusions about the immune response pattern of the subjects and these conclusions compared to the predictions of immune response patterns from the previous project.

Subjects Used:

Due to time constraints I am unsure whether or not I will be able to look at all of the subjects used in the previous projects. Depending on how much time I have available I may only be able to look at one subject or one subject from each progressor type.

• In the case of only one subject, I will use Subject 7.
• In the case of one subject from each progressor type, I will use Subjects 3, 7, and 13.
• If time allows I will use all the subjects from the previous project: Subjects 3, 7, 8, 11, 12, and 13.

Explanation of Subject Choice:

Since my current project aims to compare results with my previous project, it is logical to use the same subjects as the previous project so that the comparisons are reasonable and relevant. I chose Subject 7 as my primary specimen because its phylogenetic tree from the previous project showed a very interesting pattern and progression of divergence and diversity. If nothing else, I would like to look at the structure of the proteins from visits 2, 3, 4, and 5 of Subject 7, which seem to represent areas of immense divergence and possible functional significance. Subject 3 also had an interesting phylogenetic pattern, with the DNA sequences of each visit clustering together in distinct branches. This could signify radical but highly conserved changes in protein sequence, and therefore functional differences between the visits. Subject 13 has a simple, easy-to-interpret phylogenetic tree and would essentially be used as a control to see whether the pattern of the protein trees corresponds to the predictions made from the genetic trees. It would also be interesting to see whether the idea of "negative evolution", or evolution back towards the root of the tree, seen in Subject 13's phylogenetic tree, also applies to Subject 13's protein sequences and their functionality.