Angela A. Garibaldi Week 8

Retrieving Protein Sequences

 * 1) Go to UniProt
 * 2) Enter dUTPase in the search window. This returns more than 3 relevant sequences; we found DUT_ECOLI (P06968) on page 4.
 * 3) Scroll down to get the amino acid sequence in FASTA format.
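
The same FASTA record can also be pulled without the browser. This is only a minimal sketch: it assumes UniProt's REST service serves an entry as FASTA at a URL of the form rest.uniprot.org/uniprotkb/ACCESSION.fasta, and the helper names are made up for illustration. Check the current UniProt help pages before relying on it.

```python
from urllib.request import urlopen

# Assumed URL pattern for UniProt's REST service (not taken from the notes above).
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def uniprot_fasta_url(accession):
    """Build the (assumed) FASTA download URL for one accession, e.g. P06968."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def fetch_fasta(accession, timeout=30):
    """Download one entry as FASTA text. Requires network access."""
    with urlopen(uniprot_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (not run here because it needs the network):
#     print(fetch_fasta("P06968"))   # DUT_ECOLI, the entry found in step 2
```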


 * If your starting information is not enough to find the protein sequence you are looking for:
 * 1) Find the advanced search option. This no longer exists as described. Instead, click the add and search button; a drop-down menu then gives you the same search options as shown in Figure 2-16 of Bioinformatics for Dummies.

Retrieving a List of Related Protein Sequences

 * 1) Go to the UniProt advanced search as described above.
 * 2) Because the advanced search is completely different, you cannot deselect TrEMBL. Instead, select Reviewed - Yes as an alternative.
 * 3) Enter dUTPase in the search again. There is no longer a "description" field. This yields many possibilities.
 * 4) Since there are more than 211 total possibilities, we selected the entire first page of sequences (25).
 * 5) In the newer version, click retrieve at the bottom right corner instead of the french button.
 * 6) Once you retrieve these, they are put into a list that you can add to; below it, choose the format you want the sequences in. You no longer have to copy and paste into a document. FASTA format is available.
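
Once the list is downloaded as one FASTA file, it usually needs to be split back into individual sequences. The sketch below does that with plain string handling. It assumes headers of the form db|ACCESSION|ENTRY_NAME followed by a description; the two records are made up for illustration and are not real UniProt entries.

```python
def parse_fasta(text):
    """Split multi-record FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession_of(header):
    """Return ACCESSION from 'db|ACCESSION|ENTRY_NAME ...', else the first word."""
    words = header.split()
    first_word = words[0] if words else ""
    fields = first_word.split("|")
    return fields[1] if len(fields) >= 3 else first_word


# Made-up records for illustration only (hypothetical IDs and sequences).
demo = """>sp|X00001|DEMO1_TEST Made-up protein one
MKTAY
IAKQR
>sp|X00002|DEMO2_TEST Made-up protein two
GSHMLE
"""

for header, sequence in parse_fasta(demo):
    print(accession_of(header), len(sequence))
```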



Reading a Swiss-Prot Entry
This time we skipped the example and did the activity using HIV gp120. Entry Name: ENV_HV1H2 Accession Number: P04578
 * 1) Select Reviewed - Yes. Our overall query to get these results: HIV gp120 AND reviewed:yes
 * 2) We selected the first option in the list.
 * 3) Scroll down to Sequence Annotation - Region to look at the V3 sequence specifically.

ORFing your DNA Sequences
I input the following sequence:

>S7V1-1
GAGATAGTAATTAGATCTGCCAATTTCACGGACAATACTAAGACCATAATAGTACAGCTGAATGTATCTG
TAGAAATTAATTGTACGAGACCCAACAACAATACAAGAAAAAGTATACCTATAGGACCAGGGAGAGCATT
TTATGCTACAGGAGAAATAATAGGGAATATAAGACAAGCACATTGTAACATTAGTAGAGCAAAATGGAAT
AACACTTTAAAACAGATAGCTACAAAATTAAGAAAACAATTTGAGAATAAAACAATAGTCTTTAATCAAT
CCTCA

Compare your results with the SWISS-PROT entry you found for the protein above to decipher what the output means. ExPASy also has a translation tool you can use here
 * 1) Go to NCBI ORF Finder
 * 2) Input a DNA sequence for practice

 * Based on the ExPASy tool, the following amino acid sequence was the only viable ORF. All other frames had stop codons within the first few codons.

E I V I R S A N F T D N T K T I I V Q L N V S V E I N C T R P N N N T R K S I P I G P G R A F Y A T G E I I G N I R Q A H C N I S R A K W N N T L K Q I A T K L R K Q F E N K T I V F N Q S S
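
To see what the translation tools are doing, the six reading frames can be worked out by hand. The sketch below is not the ORF Finder or ExPASy implementation; it assumes the standard genetic code and simply translates every complete codon in each frame, writing * for a stop codon. The function names are made up for illustration.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at offset; unknown codons become X."""
    protein = []
    for i in range(offset, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frames(dna):
    """Translate the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna, offset)
        frames["-%d" % (offset + 1)] = translate(reverse, offset)
    return frames


# The S7V1-1 sequence entered above.
S7V1_1 = (
    "GAGATAGTAATTAGATCTGCCAATTTCACGGACAATACTAAGACCATAATAGTACAGCTGAATGTATCTG"
    "TAGAAATTAATTGTACGAGACCCAACAACAATACAAGAAAAAGTATACCTATAGGACCAGGGAGAGCATT"
    "TTATGCTACAGGAGAAATAATAGGGAATATAAGACAAGCACATTGTAACATTAGTAGAGCAAAATGGAAT"
    "AACACTTTAAAACAGATAGCTACAAAATTAAGAAAACAATTTGAGAATAAAACAATAGTCTTTAATCAAT"
    "CCTCA"
)

for name, protein in six_frames(S7V1_1).items():
    print(name, protein)
```

Frame +1 reproduces the amino acid sequence listed above with no stop codon; the other frames can be inspected for * characters.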

Working with a single protein sequence
Utilizing Bioinformatics for Dummies pages 159-195
 * 1) Go to ExPASy
 * 2) Click ProtParam near the top of the page
 * 3) Enter the sequence in the space provided, or paste the accession number. DO NOT INCLUDE THE FASTA FORMAT FIRST LINE, ONLY RAW DATA.
 * 4) Compute parameters



I saved this file on my personal computer since WetWare does not allow html files. The output gives information about the protein: composition, theoretical pI, stability, etc.
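
Two of the points above are easy to check by hand: stripping the FASTA first line to get raw data, and the amino acid composition. The sketch below is not ProtParam's implementation; it only counts residues, and the helper names are made up for illustration. The demo record uses the first 20 residues of the ORF translation from the previous section.

```python
from collections import Counter


def raw_sequence(fasta_text):
    """Drop FASTA header lines and whitespace, leaving only the raw residues."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">")).upper()


def composition(sequence):
    """Map each residue to (count, percent of the sequence)."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: (n, 100.0 * n / total) for aa, n in sorted(counts.items())}


demo_fasta = """>demo first 20 residues of the S7V1-1 translation
EIVIRSANFT
DNTKTIIVQL
"""

sequence = raw_sequence(demo_fasta)
for aa, (count, percent) in composition(sequence).items():
    print(aa, count, "%.1f%%" % percent)
```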


 * For a tool to simulate cutting of your protein, use:

Looking for transmembrane segments

 * 1) Go to ProtScale
 * 2) Enter your sequence in raw format or its Swiss-Prot accession number.
 * 3) Select the radio button.
 * 4) Choose a window size of 19 in the pull-down menu because this number is best for looking for transmembrane helices. 7-11 would be better for globular proteins.




 * Strong signals are not sensitive to the parameters. The recommended threshold for Kyte and Doolittle is 1.6. If you forget this number, do the following:
 * 1) Place paper over your results.
 * 2) Lower the paper until the tips of the strongest peaks appear
 * 3) Keep lowering this threshold as long as you can see nice sharp peaks.
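
The window and threshold idea can be sketched in a few lines. This is an assumption-laden illustration, not ProtScale's code: it uses the published Kyte and Doolittle hydropathy values, a plain unweighted mean over a 19-residue window, and the 1.6 threshold mentioned above. The demo sequence is artificial (a run of leucines flanked by aspartates), and the function names are made up.

```python
# Kyte and Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return (center_index, mean_score) for each full window (0-based index)."""
    sequence = sequence.upper()
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        score = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        profile.append((start + half, score))
    return profile


def regions_above(profile, threshold=1.6):
    """Group consecutive window centers scoring at or above the threshold."""
    regions = []
    current = None
    for center, score in profile:
        if score >= threshold:
            if current is None:
                current = [center, center]
            else:
                current[1] = center
        elif current is not None:
            regions.append(tuple(current))
            current = None
    if current is not None:
        regions.append(tuple(current))
    return regions


# Artificial demo: a hydrophobic stretch flanked by charged residues.
demo = "D" * 20 + "L" * 25 + "D" * 20
print(regions_above(hydropathy_profile(demo)))
```

Lowering the threshold argument step by step mimics sliding the paper down the plot.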


 * 6 of the 7 transmembrane regions are easy to find.
 * 1) Go to TMHMM. Only FASTA format is recognized.
 * 2) Keep the Output Format radio buttons at their default values.


 * This predicts segments that are inside the cell and segments that are outside of the cell. It fails to predict the segment in the middle but gives a good estimate for 5 of the segments.

Coiled Coil regions

 * Go to COILS

Predicting post translational Modifications

 * 1) Go to PROSITE to compare your protein against the collection of patterns in PROSITE.
 * 2) Paste your sequence or accession number into the left box (Proteins to be scanned)
 * 3) Uncheck Exclude Motifs with High Probability of Occurrence box.
 * 4) Check the Do not scan profiles box
 * 5) Scan

Results:
 * Each pattern family has its own color code, with its accompanying details
 * PS##### leads to the pattern documentation and information about its biological function
 * PDB = Protein Data Bank, which contains the 3D structures.
 * Click on one of these links for a static GIF picture
 * The list shows the segments of your protein that contain patterns. The numbers indicate the position of the match within your sequence; capital letters = residues specified by the pattern; lowercase letters = residues that weren't specified by the pattern.
 * NOTE: not everything is in PROSITE! Visit the ExPASy tools!
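
A PROSITE pattern is essentially a compact regular expression, so a small scan can be sketched locally. The code below is not the PROSITE scanner: it handles only a subset of the pattern syntax (single residues, x, [..], {..}, and repeat counts), and the function names are made up. As an example it assumes the N-glycosylation site pattern N-{P}-[ST]-{P} (PROSITE PS00001, as far as I know one of the frequently occurring patterns) and scans the ORF translation from earlier on this page.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, with optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a (subset of) PROSITE pattern syntax to a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Translation of S7V1-1 from the ORF section above.
GP120_FRAGMENT = (
    "EIVIRSANFTDNTKTIIVQLNVSVEINCTRPNNNTRKSIPIGPGRAFYAT"
    "GEIIGNIRQAHCNISRAKWNNTLKQIATKLRKQFENKTIVFNQSS"
)

for start, text in scan(GP120_FRAGMENT, "N-{P}-[ST]-{P}"):
    print(start, text)
```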

Finding domains with InterProScan

 * 1) Go to InterProScan
 * 2) Enter your sequence in the search box. Accession numbers do not work.
 * 3) Choose the domain databases you are interested in. TMHMM and SignalPHMM are options that provide transmembrane predictions and signal peptide (short N-terminal segment that causes your protein to reach its destination in the cell) predictions, respectively.
 * 4) Submit job
 * 5) Save results via the "Raw Output" button

Results:
 * Strengths: Searches multiple databases.
 * Weaknesses: Domain databases do not agree exactly on the boundaries of the matches.
 * 1) First entry in each column indicates the type of diagnosis provided: FAMILY OR DOMAIN
 * 2) IPR#### points to the InterPro documentation
 * 3) PS#### link will take you to that entry where you can find individual PROSITE documentation
 * 4) Colored boxes show you where the match occurred on your sequence
 * See "A common mistake when scanning domains" on page 187.

Finding domains with the CD Server

 * 1) Go to the CD server, or go to BLAST and click Search the Conserved Domain Database using RPS-BLAST.
 * 2) Paste your sequence or its identifier in the box.
 * 3) Deselect Apply Low Complexity Filter in case there is an over-represented amino acid; the filter treats such a region as simple sequence, and domains within it would be lost.
 * 4) Set the Expect Value Threshold to 1.
 * 5) Submit (for this I used a different sequence, provided on the CD server site, for simplicity).



Results:
 * Graphic shows the regions of your protein that match the domain.
 * Red domains are from SMART
 * Ragged ends indicate partial matches
 * E-value = how many times you can expect a hit this good by sheer chance. The lower the E-value, the better (below 0.01).
 * The hit list shows the domains that match your sequence, sorted by E-value. Links lead to documentation.

3d structure

 * 1) Use the Cn3D software site