Wikiomics:Cloning in silico

Cloning in silico is the procedure of obtaining the full or partial cDNA sequence of a gene using a computer only.

There are several variants:
 * discovery of new splice forms of a known gene
 * cloning a novel orthologous gene in a new species
 * cloning new gene(s) using an EST database alone (EST clustering)

Databases of pre-clustered ESTs
A shortcut to obtain either a consensus sequence (TIGR) or a set of ESTs (Unigene) derived from a gene of interest.


 * STACKdb (limited access, tissue specific splice forms)
 * Unigene (no consensus sequence)
 * TIGR

Searching EST databases using BLAST
If possible, use a protein sequence from a related species (e.g. a zebrafish protein when looking for a homologue in salmon). For a large number of proteins, however, homology can be detected even between species as distant as human and C. elegans.
 * 1) Depending on the level of homology, use:
 * blastn program, cDNA sequence as query, EST DB from the same species (== discovery of novel splice forms in the same species)
 * tblastn program, protein sequence as query, EST DB from the same species (== paralogue discovery) or from another species (== cloning any homologue)

 * 1) Restrict the BLAST search by species, e.g. search only porcine ESTs, to simplify the output.
 * 2) On the BLAST output page, select reasonable hits by checking the box on the left in the alignment section.
 * 3) Retrieve all checked results as a FASTA file (e.g. pig_Xgene_ESTs_date_round1.fasta).
 * 4) Check how many sensible hits you got, e.g. by counting the FASTA header lines with grep on Unix/Linux:

grep -c '^>' pig_Xgene_ESTs_date_round1.fasta
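If grep is not at hand, or the retrieved file may contain the same EST more than once, the count can also be done with a few lines of Python. This is only a minimal sketch: the function name `read_fasta` and the toy records are made up for illustration, and it assumes that the first word after `>` identifies an EST.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {record id: sequence}.

    If an id occurs more than once, only the first copy is kept.
    """
    records: Dict[str, str] = {}
    current = None
    keep = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the id is the first whitespace-delimited token.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            keep = current not in records
            if keep:
                records[current] = ""
        elif current is not None and keep:
            records[current] += line
    return records


# Toy input standing in for pig_Xgene_ESTs_date_round1.fasta
example = """>EST1 pig clone
ACGT
ACGT
>EST2
GGCC
>EST1 downloaded twice
ACGTACGT
""".splitlines()

records = read_fasta(example)
print(len(records), "unique ESTs")
for name, seq in records.items():
    print(name, len(seq))
```

For a real file, pass an open file handle instead of the toy list, e.g. `read_fasta(open("pig_Xgene_ESTs_date_round1.fasta"))`.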
 * 1) Assemble all your EST sequences using phrap (on the Unix command line):

phrap pig_Xgene_ESTs_date_round1.fasta

You should get the file: pig_Xgene_ESTs_date_round1.fasta.contigs

If you do not have phrap you may use:
 * CAP3
 * ESSEM (Est's aSSEmbly using Malig) from Technical University of Catalonia.

You may download the ESTs of human SYNGR4 from http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=221005&TAXID=9606&SEARCH= , save them as a FASTA file and then feed that file to CAP3 or ESSEM to check how it works. Suggested assembly sequence:

>assembly: gnl|UG|Hs# -> gnl|UG|Hs# (R)
TTTTTTTTTTTTTTTGTTTTTAGAAACCCTTCTGGAGGGAGGATTCTCTCTTTATTGATTTGGATAAGGATATTTAGTTG
TCAGGCATCATAGCAAGCCGGGGGGACTTTGGAGCGGTCAGACAGGGGGACAGGGCAGAGCTAGCATAACTCAGGCTGTT
GGGGCCAGTGGTGGGCATGTTCACAGGGCTGTTGGCAGAGGGCAAGGGGAGGGTGGTCAGCACCATGCCACCCTCATCCA
GGAAGCGCTTGTAAGGGACTGGAGCATCATTTCGGAGGTCCTGGAATGCCAGGTAGGCCTGGAATATCCAGACAAGGATG
GAGAAGAAGGTGAAGGCGATGGCTGCCTGGCACTGCTGCTCCCCAGGAGGAACTCTTTGGGCGGCGAATGCTGCCATTGG
TTGGCCAGGAAGCAGAAACCCATGAACCAGACAACTGCCCAGAGAACAGCCAGGATGAAGTCCAGGAGCTGGAAGGCTGT
CTTGAAGCGGGTGCCGGCAATGCGGGTCTCCTGTGTGTCCAGGACGAGGAAGGCCAGCCACGCTGAGGAAGGCCAGGAAG
CCGGCTCCCACGGCAAAGCTGCAGGCCACGCTGTTGCTGTTGAGAATGCAGTGGAGCTGCGGAGACTCCATCTTGTTCTG
GTAGCCGTCGGTCAGCAGGGAGGAGAAGACGATCAGGGAGAAGACCCCTGCCTCCCCCACACTCTCCTTCTGCCACCAAA
CC
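The suggested assembly starts with a long run of T, and its header carries an "(R)". Both would be consistent with a contig reported on the reverse strand (a reverse-complemented poly-A tail), although that reading is an assumption and not something the header documents. A small sketch for spotting such a run and flipping the contig follows; the helper names and the toy sequence are made up for illustration.

```python
# Translation table for complementing DNA, keeping case and N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def leading_poly_t(seq: str) -> int:
    """Count how many T bases the sequence starts with."""
    count = 0
    for base in seq.upper():
        if base != "T":
            break
        count += 1
    return count


# Toy contig, not the SYNGR4 assembly itself
toy_contig = "TTTTTTGCATG"
run = leading_poly_t(toy_contig)
flipped = reverse_complement(toy_contig)
print(run, flipped)
```

After flipping, a leading poly-T run becomes a trailing poly-A run, which is the orientation in which most cDNA records are shown.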


 * 1) Mask possible repeats using the RepeatMasker server. EST libraries are notorious for containing unspliced ESTs and contaminations.
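Before reusing a masked consensus as a query, it can help to see how much of it survived masking. The sketch below assumes that masked bases appear as N, X or lowercase letters, which depends on the masking options used; adjust it to what your run actually produces. The function names and the toy sequence are illustrative only.

```python
import re
from typing import List, Tuple


def masked_fraction(seq: str, soft: bool = True) -> float:
    """Fraction of bases that are masked (N/X, plus lowercase if soft=True)."""
    if not seq:
        return 0.0
    masked = sum(1 for b in seq if b in "NXnx" or (soft and b.islower()))
    return masked / len(seq)


def unmasked_segments(seq: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) of uppercase A/C/G/T stretches of at least min_len."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[ACGT]+", seq)
        if m.end() - m.start() >= min_len
    ]


# Toy masked consensus: 6 hard-masked and 4 soft-masked bases out of 24
toy_mcs = "ACGTACNNNNNNGGCCaattTTGA"
print(masked_fraction(toy_mcs))
print(unmasked_segments(toy_mcs, min_len=4))
```

A consensus that is almost entirely masked is a hint that the contig consists mostly of repeats and is a poor query for the next round.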

 * 1) Use the masked consensus sequence (MCS) from the step above in the next round of BLAST search:

blastn program, MCS as query, EST DB from the same species

Check how many sensible hits you got.


 * 1) Repeat the EST assembly and repeat masking, comparing the new EST contigs with the contigs from the previous step, until you get no new hits in the EST database.
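The search, assemble, mask cycle is a loop that stops once a round brings in no EST that was not already seen. A minimal sketch of that control flow is given here; `search` and `assemble_and_mask` are hypothetical callables standing in for the BLAST search and for phrap/CAP3 plus RepeatMasker, and the toy data only imitates their behaviour.

```python
from typing import Callable, Iterable, List, Set, Tuple


def iterate_until_no_new_hits(
    seed_query: str,
    search: Callable[[str], Iterable[str]],
    assemble_and_mask: Callable[[List[str]], str],
    max_rounds: int = 10,
) -> Tuple[str, Set[str], int]:
    """Repeat search and assembly until a round yields no new EST ids."""
    seen: Set[str] = set()
    query = seed_query
    rounds = 0
    while rounds < max_rounds:
        new_hits = set(search(query)) - seen
        if not new_hits:
            break  # nothing new in the EST database: stop
        seen |= new_hits
        # Hypothetical step: assemble all ESTs so far and mask the consensus.
        query = assemble_and_mask(sorted(seen))
        rounds += 1
    return query, seen, rounds


# Toy stand-ins: each query "finds" a fixed list of EST ids.
toy_hits = {
    "seed": ["est1", "est2"],
    "est1+est2": ["est1", "est2", "est3"],
    "est1+est2+est3": ["est1", "est2", "est3"],
}
final_query, all_ests, n_rounds = iterate_until_no_new_hits(
    "seed",
    search=lambda q: toy_hits[q],
    assemble_and_mask=lambda ids: "+".join(ids),
)
print(final_query, sorted(all_ests), n_rounds)
```

The `max_rounds` cap is a safety net added for this sketch, for example against a repeat that escaped masking and keeps pulling in unrelated ESTs.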


 * 1) After every assembly step, make sure that the contig you use contains the sequence of interest (== compare it with the initial cDNA or protein sequence).
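For a quick sanity check against the initial cDNA, one can test whether the contig shares exact k-mers with it on either strand. This is a crude stand-in for a proper alignment (e.g. with BLAST) and does not cover comparison with a protein sequence; the function names, the k-mer length and the toy sequences are all illustrative assumptions.

```python
from typing import Set

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """All substrings of length k (empty set if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_sequence(contig: str, reference: str, k: int = 12,
                    min_shared: int = 1) -> bool:
    """True if contig shares >= min_shared exact k-mers with reference,
    checking both strands of the contig."""
    contig = contig.upper()
    ref_kmers = kmers(reference.upper(), k)
    for candidate in (contig, reverse_complement(contig)):
        if len(kmers(candidate, k) & ref_kmers) >= min_shared:
            return True
    return False


# Toy sequences, not real gene data
reference = "ATGGCCAAGTTTGACCTGAAC"
contig_fwd = "GG" + reference + "TT"
contig_rev = reverse_complement(contig_fwd)
unrelated = "ACACACACACACACACACACAC"

print(shares_sequence(contig_fwd, reference))
print(shares_sequence(contig_rev, reference))
print(shares_sequence(unrelated, reference))
```

Exact k-mer matching fails on diverged sequences, so a negative result between different species means little; within one species it is a fast way to notice that an assembly has drifted to another gene.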

Genome annotation using EST assembly

 * PASA http://www.tigr.org/tdb/e2k1/ath1/pasa_annot_updates/pasa_annot_updates.shtml

Importing human, mouse and zebrafish EST trace files
For a significant subset of human, mouse and zebrafish ESTs, trace files and even experiment files are available. For reliable gene cloning we need them because:


 * sequences in GenBank are usually shorter than the reads in the original trace files
 * there is no way to detect a sequencing error in a plain text/FASTA file without looking at the trace file

In order to get them one can search for relevant trace files using Sanger's Trace server:

http://trace.ensembl.org/cgi-bin/tracesearch

or NCBI http://www.ncbi.nlm.nih.gov/blast/mmtrace.shtml

After BLASTing, one can retrieve the trace files as a compressed tar archive in SCF or RCF format. RCF is an encoded and compressed form of SCF: if you plan to retrieve a large number of trace files, obtain and compile the rcf2scf program so that you can download RCF and speed up transfer times.

Genome based

 * based on homology
 * de novo

This will be covered in the genome annotation guide.