Wikiomics:Cloning in silico
Latest revision as of 02:52, 4 March 2008
Cloning in silico is the procedure of obtaining the full or partial cDNA sequence of a gene using only a computer.
There are several variants:
- discovery of new splice forms of a known gene
- cloning a novel orthologous gene in a new species
- cloning new gene(s) using an EST database alone (EST clustering)
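EST-based
Databases of pre-clustered ESTs
These databases are a shortcut to obtaining either a consensus sequence (TIGR) or a set of ESTs (Unigene) derived from a gene of interest.
- STACKdb (limited access, tissue-specific splice forms): http://ww2.sanbi.ac.za/Dbases.html
- Unigene (no consensus sequence): http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene
- TIGR: http://www.tigr.org/tdb/tgi/index.shtml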
Searching EST databases using BLAST
- Depending on the level of homology, use:
  - the blastn program with a cDNA sequence as query against an EST database from the same species (to discover novel splice forms in the same species)
  - the tblastn program with a protein sequence as query against an EST database from the same species (to discover paralogues) or from another species (to clone any homologues)
If possible, use a protein sequence from a related species (e.g. a zebrafish protein when looking for a homologue in salmon), although for a large number of proteins homology can be detected even between human and C. elegans.
- Restrict the BLAST search by species, e.g. search only porcine ESTs, to simplify the output.
- On the BLAST output page, select reasonable hits by ticking the checkbox to the left of each hit in the alignment section.
- Retrieve all checked results as a FASTA file (e.g. pig_Xgene_ESTs_date_round1.fasta).
- Check how many sensible hits you got, e.g. using grep on Unix/Linux:
grep -c '^>' pig_Xgene_ESTs_date_round1.fasta
- Assemble all your EST sequences using phrap (on the Unix command line):
phrap pig_Xgene_ESTs_date_round1.fasta
You should get the file pig_Xgene_ESTs_date_round1.fasta.contigs.
If you do not have phrap you may use:
  - CAP3: http://pbil.univ-lyon1.fr/cap3.php
  - ESSEM (Est's aSSEmbly using Malig) from the Technical University of Catalonia: http://alggen.lsi.upc.es/recerca/essem/frame-essem.html
To check how these tools work, you may download the human SYNGR4 ESTs from http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=221005&TAXID=9606&SEARCH= , save them as a FASTA file and feed it to CAP3 or ESSEM. Use the "Suggested assembly sequence":
>assembly: gnl|UG|Hs# -> gnl|UG|Hs# (R)
TTTTTTTTTTTTTTTGTTTTTAGAAACCCTTCTGGAGGGAGGATTCTCTCTTTATTGATTTGGATAAGGATATTTAGTTG
TCAGGCATCATAGCAAGCCGGGGGGACTTTGGAGCGGTCAGACAGGGGGACAGGGCAGAGCTAGCATAACTCAGGCTGTT
GGGGCCAGTGGTGGGCATGTTCACAGGGCTGTTGGCAGAGGGCAAGGGGAGGGTGGTCAGCACCATGCCACCCTCATCCA
GGAAGCGCTTGTAAGGGACTGGAGCATCATTTCGGAGGTCCTGGAATGCCAGGTAGGCCTGGAATATCCAGACAAGGATG
GAGAAGAAGGTGAAGGCGATGGCTGCCTGGCACTGCTGCTCCCCAGGAGGAACTCTTTGGGCGGCGAATGCTGCCATTGG
TTGGCCAGGAAGCAGAAACCCATGAACCAGACAACTGCCCAGAGAACAGCCAGGATGAAGTCCAGGAGCTGGAAGGCTGT
CTTGAAGCGGGTGCCGGCAATGCGGGTCTCCTGTGTGTCCAGGACGAGGAAGGCCAGCCACGCTGAGGAAGGCCAGGAAG
CCGGCTCCCACGGCAAAGCTGCAGGCCACGCTGTTGCTGTTGAGAATGCAGTGGAGCTGCGGAGACTCCATCTTGTTCTG
GTAGCCGTCGGTCAGCAGGGAGGAGAAGACGATCAGGGAGAAGACCCCTGCCTCCCCCACACTCTCCTTCTGCCACCAAA
CC
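Before moving on, it can help to see how many contigs the assembler produced and how long they are. The following is only a minimal sketch, assuming the assembler output is an ordinary FASTA file such as the .contigs file written by phrap; the function names read_fasta and contig_lengths are hypothetical helpers, not part of phrap, CAP3 or ESSEM.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def contig_lengths(path):
    """Return (contig name, length) pairs, longest contig first."""
    pairs = []
    for header, sequence in read_fasta(path):
        fields = header.split()
        name = fields[0] if fields else ""  # first word of the header
        pairs.append((name, len(sequence)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

For example, contig_lengths("pig_Xgene_ESTs_date_round1.fasta.contigs") would list the contigs of the first round. A single long contig is the easiest case to carry forward; several short ones suggest that the hits do not overlap or belong to different genes.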
- Mask possible repeats using the RepeatMasker server (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker). EST libraries are notorious for containing unspliced ESTs and contaminations.
- Use the masked consensus sequence (MCS) from the step above in the next round of BLAST search: run blastn with the MCS as query against the EST database from the same species, then check how many sensible hits you got.
- Repeat the EST assembly and repeat masking, comparing the new EST contigs with the contigs from the previous round, until you get no new hits in the EST database.
- After every assembly step, make sure that the contig you use contains the sequence of interest (i.e. compare it with the first cDNA or protein sequence).
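The stopping rule and the sanity check in the last two steps can be sketched in a few lines. This is an illustration under stated assumptions, not a fixed part of the protocol: the hits of each round are assumed to be saved as FASTA files, ESTs are identified by the first word of the header line, and new_hits, reverse_complement and contig_contains are hypothetical helper names. The containment test is an exact string match on both strands, so it is only a quick first check; with sequencing errors or a protein query a real alignment (e.g. another BLAST run) is still needed.

```python
def fasta_ids(path):
    """Collect the first word of every FASTA header line in a file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def new_hits(previous_fasta, current_fasta):
    """Return IDs present in the current round but not in the previous one."""
    return fasta_ids(current_fasta) - fasta_ids(previous_fasta)


# Translation table for complementing DNA; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def contig_contains(contig, fragment):
    """Exact-match check for a cDNA fragment on either strand of a contig."""
    contig, fragment = contig.upper(), fragment.upper()
    return fragment in contig or reverse_complement(fragment) in contig
```

When new_hits(round_n_file, round_n_plus_1_file) returns an empty set, the search has converged. Checking both strands matters because an assembler may report the contig in reverse orientation, as in the SYNGR4 example, which is marked (R) and starts with a poly-T stretch.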
Genome annotation using EST assembly
- PASA: http://www.tigr.org/tdb/e2k1/ath1/pasa_annot_updates/pasa_annot_updates.shtml
Importing human, mouse and zebrafish EST trace files
For a significant subset of human, mouse and zebrafish ESTs, trace files and even experiment files are available. Sound gene cloning needs them because:
- sequences in GenBank are usually shorter than the original trace files
- there is no way to detect a sequencing error in a plain text/FASTA file without looking at the trace file
To get them, one can search for the relevant trace files using Sanger's Trace server:
http://trace.ensembl.org/cgi-bin/tracesearch
or NCBI: http://www.ncbi.nlm.nih.gov/blast/mmtrace.shtml
After BLASTing, one can retrieve the trace files as a compressed tar archive in SCF or RCF format. RCF is an encoded and compressed form of SCF. If you plan to download a large number of trace files, request RCF to shorten transfer times, then obtain and compile the rcf2scf program (ftp://ftp.ncbi.nih.gov/pub/TraceDB/misc/rcf2scf/rcf2scf.tgz) to convert the files back to SCF.
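Unpacking the downloaded archive can be scripted. The sketch below uses only the Python standard library and rests on assumptions not stated above: that the trace files inside the archive carry the extensions .scf or .rcf, and that flattening any directory structure is acceptable. The function name extract_traces is a hypothetical helper.

```python
import tarfile
from pathlib import Path


def extract_traces(archive, out_dir, suffixes=(".scf", ".rcf")):
    """Copy trace files from a (possibly compressed) tar archive into out_dir.

    The suffixes are an assumption about how trace files are named.
    Directory parts of member names are dropped, so files with the same
    base name would overwrite each other.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(archive, "r:*") as tar:  # "r:*" auto-detects compression
        for member in tar.getmembers():
            name = Path(member.name).name  # keep only the base name
            if not member.isfile() or not name.lower().endswith(suffixes):
                continue
            target = out / name
            target.write_bytes(tar.extractfile(member).read())
            written.append(target)
    return sorted(written)
```

Files extracted in RCF format would still need to be converted with rcf2scf before they can be opened in a trace viewer.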
Genome-based
- based on homology
- de novo
This will be covered in the genome annotation guide.