Wikiomics:Cloning in silico: Difference between revisions

From OpenWetWare
m (1 revision(s))
(rewrite for 2007)
Line 7: Line 7:




==ESTs based==
==databases of pre-clustered ESTs==
A shortcut to obtain either consensus sequence (TIGR) or a set of ESTs (Unigene) derived from a gene of interest.


* STACKdb (limited access, tissue specific splice forms) [http://ww2.sanbi.ac.za/Dbases.html]
* Unigene (no consensus sequence) [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene]
* TIGR [http://www.tigr.org/tdb/tgi/index.shtml]


Procedures for cloning new genes from ESTs (Expressed Sequence Tags).
==Search of EST databases using BLAST==
# Depending on a level of homology we can use:
* blastn program, cDNA sequence as query,  EST DB from the same species (== novel splice forms discovery in the same species)
* tblastn program, protein sequence as query,  EST DB from the same (==paralogue discovery) or other species (== cloning any homologues)
If possible, use a protein sequence from a related species (e.g. a zebrafish protein when looking for a homologue in salmon), but for a large number of proteins one can detect homology even between human and C. elegans.


Getting ESTs sequences/traces for gene assembly 2.1 getting sequences from multiple ESTs
# Restrict blast output with species, i.e search only porcine ESTs to simplify the output
 
# On the BLAST output page select reasonable hits by checking box on the left in the alignment section.  
Sometimes you have no other option but working with plain EST sequences with no traces. To get them easily from the blast output we use file containing Acc. Nos. for Batch Entrez and process the output with manualy
# Retrieve all checked results as FASTA file (i.e. pig_Xgene_ESTs_date_round1.fasta
 
# check how many sensible hits you got, i.e. using grep on Unix/Linux
* open nedit using nc command  
<pre>
* mark hits in the blast output window (firefox etc.)  
grep '>' pig_Xgene_ESTs_date_round1.fasta | wc
* copy it to the editor
</pre>
# assemble all your EST sequences using phrap (on Unix command line):


<pre>
<pre>
gb|AA449543|AA449543 zx08a09.r1 Soares total fetus Nb2HF8 9w Ho... 194
phrap pig_Xgene_ESTs_date_round1.fasta
 
</pre> 
gb|AA007668|AA007668 zh99g06.r1 Soares fetal liver spleen 1NFLS... 100


gb|W88626|W88626 zh73b12.r1 Soares fetal liver spleen 1NFLS S1 ... 101
you should get file: pig_Xgene_ESTs_date_round1.fasta.contigs


gb|AA465253|AA465253 aa33a08.r1 NCI\_CGAP\_GCB1 Homo sapiens cDNA... 62
</pre>


* replace pipe symbol "|" with spaces:
If you do not have phrap you may use [http://alggen.lsi.upc.es/recerca/essem/frame-essem.html ESSEM] (Est's aSSEmbly using Malig) from the Technical University of Catalonia.
You may download the sequences of human SYNGR4 ESTs [http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=221005&TAXID=9606&SEARCH= here], save them as a FASTA file and then feed
ESSEM with it to check how it works. Use the "Suggested assembly sequence" output:


<pre>
<pre>
>assembly: gnl|UG|Hs# -> gnl|UG|Hs# (R)
TTTTTTTTTTTTTTTGTTTTTAGAAACCCTTCTGGAGGGAGGATTCTCTCTTTATTGATTTGGATAAGGATATTTAGTTG
TCAGGCATCATAGCAAGCCGGGGGGACTTTGGAGCGGTCAGACAGGGGGACAGGGCAGAGCTAGCATAACTCAGGCTGTT
GGGGCCAGTGGTGGGCATGTTCACAGGGCTGTTGGCAGAGGGCAAGGGGAGGGTGGTCAGCACCATGCCACCCTCATCCA
GGAAGCGCTTGTAAGGGACTGGAGCATCATTTCGGAGGTCCTGGAATGCCAGGTAGGCCTGGAATATCCAGACAAGGATG
GAGAAGAAGGTGAAGGCGATGGCTGCCTGGCACTGCTGCTCCCCAGGAGGAACTCTTTGGGCGGCGAATGCTGCCATTGG
TTGGCCAGGAAGCAGAAACCCATGAACCAGACAACTGCCCAGAGAACAGCCAGGATGAAGTCCAGGAGCTGGAAGGCTGT
CTTGAAGCGGGTGCCGGCAATGCGGGTCTCCTGTGTGTCCAGGACGAGGAAGGCCAGCCACGCTGAGGAAGGCCAGGAAG
CCGGCTCCCACGGCAAAGCTGCAGGCCACGCTGTTGCTGTTGAGAATGCAGTGGAGCTGCGGAGACTCCATCTTGTTCTG
GTAGCCGTCGGTCAGCAGGGAGGAGAAGACGATCAGGGAGAAGACCCCTGCCTCCCCCACACTCTCCTTCTGCCACCAAA
CC
</pre>


# mask possible repeats using the [http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker RepeatMasker server]. EST libraries are notorious for containing unspliced ESTs and contaminations.


gb AA449543 AA449543 zx08a09.r1 Soares total fetus Nb2HF8 9w
# use masked consensus sequence (MCS) from step above in next round of BLAST search:
in  blastn program, MCS as query,  EST DB from the same species


gb AA007668 AA007668 zh99g06.r1 Soares fetal liver spleen
check how many sensible hits you got.  


gb W88626 W88626 zh73b12.r1 Soares fetal liver spleen 1NFLS
# repeat EST assembly and repeat masking, comparing the new EST contigs with the contigs from the previous step, until you get no new hits in the EST database.
</pre>
<!-- (save maybe for the v. highly expressed genes with thousands of ESTs and tens of artifacts) -->
# after every assembly step make sure that the contig you use contains sequence of interest (== compare it with the first cDNA or protein sequence)


* copy column containing Acc. Nos. to the final file: AA449543


AA007668
===Genome annotation using ESTs assembly===


W88626
* PASA http://www.tigr.org/tdb/e2k1/ath1/pasa_annot_updates/pasa_annot_updates.shtml


* Start firefox go to Batch Entrez page:  [http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide]


Retrieve all sequences from the file of GIs/Accessions using Format: Fasta. You have to select: Browse -> final_file.txt
* save the result file as ESTs.current_date.fasta
* assemble the sequences using phrap:
<pre>
phrap ESTs.current_date.fasta
</pre>
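
The manual steps of the Batch Entrez procedure (replacing "|" with spaces and copying the accession column into a file) could also be scripted. The following is only a sketch, assuming the selected BLAST hit lines have been saved to a plain-text file in the gb|ACCESSION|ACCESSION layout shown above; the function name extract_accessions and the file name blast_hits.txt are hypothetical.

```shell
# Sketch only: extract_accessions is a hypothetical helper, not part of BLAST or Entrez.
# Input: BLAST one-line hit descriptions such as
#   gb|AA449543|AA449543 zx08a09.r1 Soares total fetus Nb2HF8 9w Ho... 194
# Output: the accession number (second "|"-separated field), one per line.
# Lines without at least three "|"-separated fields (e.g. blank lines) are skipped.
extract_accessions() {
    awk -F'|' 'NF >= 3 { print $2 }' "$@"
}

# Example (hypothetical input file name):
#   extract_accessions blast_hits.txt > final_file.txt
```

The resulting final_file.txt can then be uploaded on the Batch Entrez page as described above.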




Line 65: Line 80:
* sequences in GenBank are usually shorter than the original trace files
* there is no way you can detect a sequencing error in a plain text/FASTA file without looking at the trace file
* you can get sequence from the other end of the clone (also possible with some ESTs)
ESTs for which we may get trace files have a naming convention:
* human ESTs start with "a", "y" or "z" (like aa09h01.r1, ye12c01.s1, ze34c06.r1)
* mouse ESTs start with "m", "u" or "v". You can get 5'-ends only.
* zebrafish ESTs start with "f"
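
The naming convention above can be turned into a quick first-letter check before searching the trace server. This is a rough sketch that encodes only the prefixes listed in the text; the function name guess_est_species is hypothetical, and a matching first letter is only a hint, not proof that a trace file exists.

```shell
# Sketch only: guess_est_species is a hypothetical helper based solely on the
# first-letter convention described in the text.
# Argument: an EST clone read name such as aa09h01.r1
guess_est_species() {
    case "$1" in
        a*|y*|z*) echo human ;;      # human ESTs start with "a", "y" or "z"
        m*|u*|v*) echo mouse ;;      # mouse ESTs start with "m", "u" or "v"
        f*)       echo zebrafish ;;  # zebrafish ESTs start with "f"
        *)        echo unknown ;;    # anything else is not covered by the convention
    esac
}

# Example:
#   guess_est_species ze34c06.r1
```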


In order to get them one can search for relevant trace files using Sanger's Trace server:


http://trace.ensembl.org/cgi-bin/tracesearch
===Genome based===
* based on homology
* de novo
This will be covered in genome annotation guide.


{{stub}}

Revision as of 12:57, 11 December 2007

Cloning in silico is the procedure of obtaining the full or partial cDNA sequence of a gene using a computer only.

There are several variants:

  • discovery of new splice forms of a known gene
  • cloning a novel orthologue in a new species
  • cloning new gene(s) using an EST database alone (EST clustering)


ESTs based

databases of pre-clustered ESTs

A shortcut to obtain either a consensus sequence (TIGR) or a set of ESTs (Unigene) derived from a gene of interest.

  • STACKdb (limited access, tissue specific splice forms) [1]
  • Unigene (no consensus sequence) [2]
  • TIGR [3]

Search of EST databases using BLAST

Depending on the level of homology, we can use:
  • blastn program, cDNA sequence as query, EST DB from the same species (for discovering novel splice forms in the same species)
  • tblastn program, protein sequence as query, EST DB from the same species (for paralogue discovery) or from another species (for cloning any homologues)

If possible, use a protein sequence from a related species (e.g. a zebrafish protein when looking for a homologue in salmon), but for a large number of proteins one can detect homology even between human and C. elegans.

  1. Restrict the BLAST output by species, e.g. search only porcine ESTs, to simplify the output.
  2. On the BLAST output page select reasonable hits by checking the box on the left in the alignment section.
  3. Retrieve all checked results as a FASTA file (e.g. pig_Xgene_ESTs_date_round1.fasta).
  4. Check how many sensible hits you got, e.g. using grep on Unix/Linux:
grep '>' pig_Xgene_ESTs_date_round1.fasta | wc 
  5. Assemble all your EST sequences using phrap (on the Unix command line):
phrap pig_Xgene_ESTs_date_round1.fasta

You should get the file: pig_Xgene_ESTs_date_round1.fasta.contigs
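
As a quick sanity check after assembly, the number of input ESTs can be compared with the number of contigs. The following is a minimal sketch, assuming both files are plain FASTA with one ">" header per record; the helper name count_fasta_records is hypothetical.

```shell
# Sketch only: count_fasta_records is a hypothetical helper.
# Prints the number of FASTA records, i.e. lines beginning with ">".
# "|| true" keeps the exit status at 0 when the file contains no records
# (grep -c still prints 0 in that case).
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Example, using the file names from the steps above:
#   count_fasta_records pig_Xgene_ESTs_date_round1.fasta
#   count_fasta_records pig_Xgene_ESTs_date_round1.fasta.contigs
```

If the second number is not much smaller than the first, most ESTs did not assemble together, and the selected hits are worth a second look.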


If you do not have phrap you may use ESSEM (Est's aSSEmbly using Malig) from the Technical University of Catalonia. You may download the EST sequences of human SYNGR4 from http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=221005&TAXID=9606&SEARCH=, save them as a FASTA file and then feed ESSEM with it to check how it works. Use the "Suggested assembly sequence" output:

>assembly: gnl|UG|Hs# -> gnl|UG|Hs# (R)
TTTTTTTTTTTTTTTGTTTTTAGAAACCCTTCTGGAGGGAGGATTCTCTCTTTATTGATTTGGATAAGGATATTTAGTTG
TCAGGCATCATAGCAAGCCGGGGGGACTTTGGAGCGGTCAGACAGGGGGACAGGGCAGAGCTAGCATAACTCAGGCTGTT
GGGGCCAGTGGTGGGCATGTTCACAGGGCTGTTGGCAGAGGGCAAGGGGAGGGTGGTCAGCACCATGCCACCCTCATCCA
GGAAGCGCTTGTAAGGGACTGGAGCATCATTTCGGAGGTCCTGGAATGCCAGGTAGGCCTGGAATATCCAGACAAGGATG
GAGAAGAAGGTGAAGGCGATGGCTGCCTGGCACTGCTGCTCCCCAGGAGGAACTCTTTGGGCGGCGAATGCTGCCATTGG
TTGGCCAGGAAGCAGAAACCCATGAACCAGACAACTGCCCAGAGAACAGCCAGGATGAAGTCCAGGAGCTGGAAGGCTGT
CTTGAAGCGGGTGCCGGCAATGCGGGTCTCCTGTGTGTCCAGGACGAGGAAGGCCAGCCACGCTGAGGAAGGCCAGGAAG
CCGGCTCCCACGGCAAAGCTGCAGGCCACGCTGTTGCTGTTGAGAATGCAGTGGAGCTGCGGAGACTCCATCTTGTTCTG
GTAGCCGTCGGTCAGCAGGGAGGAGAAGACGATCAGGGAGAAGACCCCTGCCTCCCCCACACTCTCCTTCTGCCACCAAA
CC
  6. Mask possible repeats using the RepeatMasker server. EST libraries are notorious for containing unspliced ESTs and contaminations.
  7. Use the masked consensus sequence (MCS) from the step above in the next round of BLAST searching: blastn program, MCS as query, EST DB from the same species. Check how many sensible hits you got.
  8. Repeat EST assembly and repeat masking, comparing the new EST contigs with the contigs from the previous step, until you get no new hits in the EST database.
  9. After every assembly step make sure that the contig you use contains the sequence of interest (i.e. compare it with the first cDNA or protein sequence).
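
Whether a new BLAST round produced any new hits can be checked by comparing the record IDs of two rounds. This is only a sketch, assuming each round was saved as a FASTA file whose headers start with a unique ID; the helper names fasta_ids and new_hits and the round2 file name are hypothetical.

```shell
# Sketch only: fasta_ids and new_hits are hypothetical helpers.

# Print the sorted, unique record IDs (first word after ">") of a FASTA file.
fasta_ids() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort -u
}

# Print the IDs present in the current round ($2) but absent from the previous round ($1).
# Empty output means the search has converged and the iteration can stop.
new_hits() {
    prev_ids=$(mktemp)
    curr_ids=$(mktemp)
    fasta_ids "$1" > "$prev_ids"
    fasta_ids "$2" > "$curr_ids"
    # comm -13 suppresses lines unique to the first file and lines common to both
    comm -13 "$prev_ids" "$curr_ids"
    rm -f "$prev_ids" "$curr_ids"
}

# Example (the round2 file name is hypothetical):
#   new_hits pig_Xgene_ESTs_date_round1.fasta pig_Xgene_ESTs_date_round2.fasta
```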


Genome annotation using ESTs assembly



importing human, mouse and zebrafish EST trace files

Trace files, and even experiment files, are available for a significant subset of human, mouse and zebrafish ESTs. For reliable gene cloning we need them because:

  • sequences in GenBank are usually shorter than the original trace files
  • there is no way you can detect a sequencing error in a plain text/FASTA file without looking at the trace file

To get them, one can search for the relevant trace files using Sanger's Trace server:

http://trace.ensembl.org/cgi-bin/tracesearch

Genome based

  • based on homology
  • de novo

This will be covered in the genome annotation guide.