# Angela A. Garibaldi Week 8

## Retrieving Protein Sequences

1. Go to UniProt
2. Enter dUTPase in the search window. This returns more than 3 relevant sequences; we found DUT_ECOLI (P06968) on page 4 of the results
3. Scroll down for the FASTA format of the amino acid sequence
• If your starting information is not enough to find the protein sequence you seek,
1. Find the advanced search option. The advanced search page no longer exists. Instead, click the add and search button; a drop-down menu then gives you the same search options as described in Figure 2-16 of Bioinformatics for Dummies

### Retrieving a List of Related Protein Sequences

1. Go to the UniProt advanced search as described above
2. Because the advanced search is completely different, TrEMBL cannot be deselected. Instead, select Reviewed - Yes as an alternative
3. Enter dUTPase in the search again. There is no longer a "description" field. This yields many possibilities
4. Since there were more than 211 total possibilities, we selected the entire first page of sequences (25)
5. In the newer version, click retrieve at the bottom right corner instead of the french button.
6. Once retrieved, the sequences are put into a list that you can add to; below the list, choose the format you want the sequences in. You no longer have to copy and paste into a document. FASTA format is available.
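The retrieval steps above can also be scripted. The following is a minimal sketch, assuming the current UniProt REST service lives at `https://rest.uniprot.org/uniprotkb` and that `reviewed:true` is its spelling of "Reviewed - Yes"; neither detail comes from this page, and the helper names are made up for illustration. The block only builds and prints the URLs; `fetch_text` needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the UniProt REST service (not named on this page).
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def entry_fasta_url(accession: str) -> str:
    """URL for the FASTA record of a single entry, e.g. P06968."""
    return f"{UNIPROT_REST}/{accession}.fasta"


def search_fasta_url(query: str, size: int = 25) -> str:
    """URL for a search returning up to `size` hits in FASTA format."""
    params = {"query": query, "format": "fasta", "size": size}
    return f"{UNIPROT_REST}/search?{urlencode(params)}"


def fetch_text(url: str) -> str:
    """Download a URL and return its body as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


single = entry_fasta_url("P06968")
# `reviewed:true` is an assumed equivalent of selecting Reviewed - Yes.
related = search_fasta_url("dUTPase AND reviewed:true")
print(single)
print(related)
```

Calling `fetch_text(related)` would then return the first page of 25 sequences as one FASTA text, which matches the "first page" selection in step 4.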

This time we skipped the example and did the activity using HIV gp120.

1. Select Reviewed - Yes. Our overall query to achieve these results: HIV gp120 AND reviewed:yes
2. We selected the first option in the list
Entry Name: ENV_HV1H2
Accession Number: P04578

• Scroll down to Sequence Annotation - Region to look at the V3 sequence specifically.

1. Go to NCBI ORF Finder
2. Input a DNA sequence for practice
I input the following sequence: >S7V1-1
GAGATAGTAATTAGATCTGCCAATTTCACGGACAATACTAAGACCATAATAGTACAGCTGAATGTATCTG
TAGAAATTAATTGTACGAGACCCAACAACAATACAAGAAAAAGTATACCTATAGGACCAGGGAGAGCATT
TTATGCTACAGGAGAAATAATAGGGAATATAAGACAAGCACATTGTAACATTAGTAGAGCAAAATGGAAT
AACACTTTAAAACAGATAGCTACAAAATTAAGAAAACAATTTGAGAATAAAACAATAGTCTTTAATCAAT
CCTCA


Compare your results with the SWISS-PROT entry you found for the protein above to decipher what the output means. ExPASy also has a translation tool you can use.

• Based on the ExPASy tool, the following amino acid sequence was the only viable ORF. All other frames had stop codons within the first few codons.

E I V I R S A N F T D N T K T I I V Q L N V S V E I N C T R P N N N T R K S I P I G P G R A F Y A T G E I I G N I R Q A H C N I S R A K W N N T L K Q I A T K L R K Q F E N K T I V F N Q S S
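The translation above can be checked by hand with a short script. This is a minimal sketch using the standard genetic code, not the ExPASy implementation; the function names are chosen for illustration. It translates the S7V1-1 sequence in all six reading frames and marks stop codons with `*`.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# The practice sequence S7V1-1 entered into ORF Finder.
S7V1_1 = (
    "GAGATAGTAATTAGATCTGCCAATTTCACGGACAATACTAAGACCATAATAGTACAGCTGAATGTATCTG"
    "TAGAAATTAATTGTACGAGACCCAACAACAATACAAGAAAAAGTATACCTATAGGACCAGGGAGAGCATT"
    "TTATGCTACAGGAGAAATAATAGGGAATATAAGACAAGCACATTGTAACATTAGTAGAGCAAAATGGAAT"
    "AACACTTTAAAACAGATAGCTACAAAATTAAGAAAACAATTTGAGAATAAAACAATAGTCTTTAATCAAT"
    "CCTCA"
)


def translate(dna: str, frame: int = 0) -> str:
    """Translate one strand starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons with unexpected characters are reported as X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


for strand_name, strand in (("forward", S7V1_1), ("reverse", reverse_complement(S7V1_1))):
    for frame in range(3):
        print(strand_name, frame + 1, translate(strand, frame))
```

Only the first forward frame reads through without a `*`, which agrees with the single viable ORF reported above.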

## Working with a single protein sequence

Utilizing Bioinformatics for Dummies pages 159-195

1. Go to Expasy
2. Click ProtParam near the top of the page
3. Enter the sequence in the space provided, or paste the accession number. DO NOT INCLUDE THE FASTA FORMAT FIRST LINE, ONLY RAW DATA.
4. Compute parameters

I saved this file on my personal computer since WetWare does not allow html files. The output gives information about the protein: amino acid composition, theoretical pI, stability, etc.
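Two of the points above, dropping the FASTA header line and getting the amino acid composition, are easy to reproduce locally. The following is a rough sketch, not ProtParam itself; the function names and the short example record (the first 20 residues of the translated S7V1-1 fragment) are chosen for illustration.

```python
from collections import Counter


def raw_sequence(fasta_text: str) -> str:
    """Strip the '>' header line(s) and whitespace, leaving only residues."""
    lines = (line.strip() for line in fasta_text.splitlines())
    residues = "".join(line for line in lines if line and not line.startswith(">"))
    return residues.replace(" ", "").upper()


def composition(sequence: str) -> dict:
    """Map each residue to (count, percent of the sequence)."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: (n, 100.0 * n / total) for aa, n in sorted(counts.items())}


# Illustrative record only: the start of the translated practice fragment.
example = """>S7V1-1 translated fragment
EIVIRSANFT
DNTKTIIVQL
"""

sequence = raw_sequence(example)
for residue, (count, percent) in composition(sequence).items():
    print(f"{residue}\t{count}\t{percent:.1f}%")
```

The string returned by `raw_sequence` is what should be pasted into the ProtParam box.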

• For a tool to simulate cutting of your protein, use: [1]

## Looking for transmembrane segments

1. Go to ProtScale
2. Enter your sequence in raw format or its Swiss-Prot accession number.
4. Choose a window size of 19 in the pull-down menu because this number is best for looking for transmembrane helices. 7-11 would be better for globular proteins.

• Strong signals are not sensitive to parameters. The recommended threshold for Kyte and Doolittle is 1.6. If you forget this number, do the following:
1. Place paper over your results.
2. Lower the paper until the tips of the strongest peaks appear
3. Keep lowering this threshold as long as you can see nice sharp peaks.
• 6 of the 7 transmembrane regions are easy to find.
1. Go to TMHMM. Only FASTA format is recognized.
2. Keep the Output Format radio buttons at their default value.

• This predicts segments that are inside the cell and segments that are outside of the cell. It fails to predict the segment in the middle, but gives a good estimate for 5 of the segments.
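The ProtScale part of this section, a sliding window of 19 with the Kyte and Doolittle threshold of 1.6, can be sketched in a few lines. This is a simplified illustration, not the ProtScale code: it uses a plain unweighted average per window, and the function names and toy sequence are made up for the example.

```python
# Kyte and Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19) -> list:
    """Average hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def peaks_above(sequence: str, window: int = 19, threshold: float = 1.6) -> list:
    """1-based positions of window centres whose average exceeds threshold."""
    profile = hydropathy_profile(sequence, window)
    half = window // 2
    return [i + half + 1 for i, score in enumerate(profile) if score > threshold]


# Toy sequence: a 19-residue leucine stretch flanked by charged residues.
toy = "D" * 10 + "L" * 19 + "D" * 10
print(peaks_above(toy))
```

Lowering `threshold` step by step is the scripted version of sliding the paper down over the plot.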

## Predicting post-translational modifications

1. Go to PROSITE to compare your protein against its collection of patterns.
2. Paste your sequence or accession number into the left box (Proteins to be scanned)
3. Uncheck Exclude Motifs with High Probability of Occurrence box.
4. Check the Do not scan profiles box
5. Scan

Results:

• Each pattern family has its own color code, with its accompanying details
• PS##### leads to the pattern documentation and information about its biological function
• PDB = Protein Data Bank, which contains the 3D structures.
• Click on one of these links for a static GIF picture
• The list shows the segments of your protein that match a pattern. The numbers indicate the position of the match within your sequence; capital letters = residues specified by the pattern; lowercase letters = residues that weren't specified by the pattern
• NOTE: not everything is in PROSITE! Visit expasy tools!
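To see what a pattern match means in practice, a PROSITE pattern can be turned into a regular expression and run against a sequence. This is a minimal sketch, not the PROSITE scanner: it handles only `x`, `[...]`, `{...}` and repeat counts such as `x(2,4)`, and ignores the `<` and `>` anchors. The example pattern is the N-glycosylation site (PS00001), `N-{P}-[ST]-{P}`; the function names are chosen for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into Python regex syntax."""
    pieces = []
    for element in pattern.strip(".").split("-"):
        # Split an element like "x(2,4)" into its core and repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        pieces.append(core)
    return "".join(pieces)


def find_matches(pattern: str, sequence: str) -> list:
    """Return (start, end, matched text) with 1-based positions, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [
        (m.start() + 1, m.start() + len(m.group(1)), m.group(1))
        for m in regex.finditer(sequence)
    ]


# Start of the translated S7V1-1 fragment from earlier on this page.
fragment = "EIVIRSANFTDNTKTIIVQLNVSVEINCTRPNN"
for start, end, text in find_matches("N-{P}-[ST]-{P}", fragment):
    print(start, end, text)
```

The start and end numbers correspond to the match positions shown in the PROSITE results list.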



## Finding domains with InterProScan

1. Go to InterProScan
2. Enter your sequence in the search box. Accession numbers do not work
3. Choose the domain databases you are interested in. TMHMM and SignalPHMM are options that give transmembrane predictions and signal peptide predictions, respectively (a signal peptide is a short N-terminal segment that causes your protein to reach its destination in the cell).
4. Submit job
5. Save results via the "Raw Output" button

Results:

1. First entry in each column indicates the type of diagnosis provided: FAMILY OR DOMAIN
2. IPR#### points to the InterPro documentation
3. PS#### link will take you to that entry where you can find individual PROSITE documentation
4. Colored boxes show you where the match occurred on your sequence

Strengths: Searches multiple databases

Weaknesses: Domain databases do not agree exactly on the boundaries of the matches.

• See: A common mistake when scanning domains on PAGE 187

## Finding domains with the CD Server

1. Go to CD server or BLAST and click Search the Conserved Domain Database using RPS-BLAST.
2. Paste sequence or its identifier in the box
3. Deselect the Apply Low Complexity Filter option in case there is an over-represented amino acid; the filter simplifies such regions of the sequence, and domains within them can be lost.
4. Set the Expect Value Threshold to 1
5. Submit (for this I used a different sequence, provided on the CD server site, for simplicity)

Results:

• Graphic shows the regions of your protein that match the domain.
• Red domains are from SMART
• Ragged ends indicate partial matches
• E-value = the number of hits this good that you can expect by sheer chance. The lower the E-value, the better (look for values below 0.01).
• The hit list shows the domains that match your sequence, sorted by E-value. Links lead to documentation.
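The reading of the hit list above (keep hits with a low E-value, sorted from best to worst) can be written as a small filter. This is only a sketch: the `DomainHit` type is hypothetical, and the hit names and numbers are invented placeholders, not results from the CD server.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    """One row of a domain hit list (hypothetical structure)."""
    name: str
    start: int
    end: int
    e_value: float


def significant_hits(hits, max_e_value: float = 0.01) -> list:
    """Keep hits below the E-value cutoff, best (lowest) first."""
    kept = [hit for hit in hits if hit.e_value < max_e_value]
    return sorted(kept, key=lambda hit: hit.e_value)


# Placeholder values for illustration only.
hits = [
    DomainHit("domain_A", 10, 120, 3e-5),
    DomainHit("domain_B", 200, 260, 0.4),
    DomainHit("domain_C", 130, 190, 2e-12),
]

for hit in significant_hits(hits):
    print(hit.name, hit.start, hit.end, hit.e_value)
```

With the Expect Value Threshold set to 1 on the server, weak hits like the second placeholder still appear in the output, so a stricter cutoff is applied afterwards when reading the list.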

## 3D structure

1. Use Cn3D from the Cn3D software site