Proportal ReleaseNotes

From OpenWetWare
Revision as of 12:03, 25 June 2012

June 25, 2012

New COG clusters (potential version 4) have been generated. The script in the COG cluster pipeline has been modified to address the following issues, which previously prevented GenBank files of genomes from being uploaded properly into the Proportal database:

    1. GenBank files contain multiple contigs; 
    2. Genomes do not have tax_id;
    3. Proteins do not have unique gb_tag/locus_tag.

The modification allows smooth and successful uploading of GenBank files for genomes in single or multiple contigs, with or without unique protein identifiers. The number of proteins nearly doubled, from 63540 to the current 114321. The pairwise BLAST search took about 12 hours, and generating COG clusters from the identified pairs of proteins took more than one hour.
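One way to handle proteins without a unique gb_tag/locus_tag is to derive a fallback identifier and to add a suffix to repeated tags. The sketch below is an illustration only, not the pipeline's actual script; the function name `assign_unique_tags` and the record keys `locus_tag` and `contig` are assumptions.

```python
def assign_unique_tags(proteins):
    """Return one distinct identifier per protein, in input order.

    `proteins` is a list of dicts with the hypothetical keys 'locus_tag'
    (may be None or repeated) and 'contig'.
    """
    used = set()
    tags = []
    for index, protein in enumerate(proteins, start=1):
        # Fall back to "<contig>_<running number>" when no tag is present.
        base = protein.get("locus_tag") or "{}_{:05d}".format(protein["contig"], index)
        tag = base
        suffix = 1
        # Repeated tags get ".2", ".3", ... until the identifier is unused.
        while tag in used:
            suffix += 1
            tag = "{}.{}".format(base, suffix)
        used.add(tag)
        tags.append(tag)
    return tags
```

The first occurrence of a tag is kept unchanged, so genomes that already have unique identifiers are not affected.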

A total of 8181 mixed COG clusters were identified when Pro and Phage genomes are processed together. This is 84 fewer than the 8265 clusters identified when Pro and Phage genomes are processed separately, because clusters shared by both are combined.

Next, all singletons will be scanned against HMM profiles and compared to the OrthoMCL result.

June 25, 2012

Integrate single cell Biome database into Proportal. For the single cell paper, visitors are able to download contigs and raw reads as well as look at annotations of single cells. Details are to be discussed.

June 11, 2012

Retrieve single cell genomes from the Biome database. Extract contigs/scaffolds, not proteins, from biome.mit.edu for each of the single cell genomes (e.g. W2, W3, ...). Each genome needs its own multi-FASTA file. Some contigs in the database may be from contaminants. Look in the 'admin_scaffold' table: there is a category for each contig/scaffold called 'is_proch', not to be confused with 'is_pro_core'. All contigs to be extracted should have 1 for this value; contaminant contigs have 0.

The header for the contig in the fasta is the tag "<the single cell tag>_<contig number>".
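The extraction rule above can be sketched as follows. This is a minimal illustration, assuming the rows have already been fetched from the 'admin_scaffold' table as (contig number, is_proch, sequence) tuples; the function name, the tuple layout and the demo sequences are hypothetical.

```python
import io


def write_multifasta(cell_tag, scaffolds, handle, width=60):
    """Write the non-contaminant contigs of one single cell genome.

    `scaffolds` yields (contig_number, is_proch, sequence) tuples.
    Returns the number of contigs written.
    """
    written = 0
    for contig_number, is_proch, sequence in scaffolds:
        if is_proch != 1:
            continue  # contaminant contigs have is_proch = 0
        # Header format: <the single cell tag>_<contig number>
        handle.write(">{}_{}\n".format(cell_tag, contig_number))
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        written += 1
    return written


# Demonstration with made-up sequences; contig 2 is flagged as a contaminant.
buffer = io.StringIO()
kept = write_multifasta("W2", [(1, 1, "ACGT"), (2, 0, "TTTT"), (3, 1, "GG")], buffer)
```

In practice `handle` would be one open file per genome, so that each single cell gets its own multi-FASTA file.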

March 23, 2012

Deadline for DOE proposal.

March 3, 2012

Finished the comparison between the single-end and paired-end RNA-Seq data.

February 27, 2012

Released the latest Version-2012-02-26 of host and phage mixed COG clusters, which is prepared in a spreadsheet format.

From such mixed COG clusters, one can identify genes that co-exist in both Prochlorococcus/Synechococcus and cyanophage genomes. For instance, we find 25 co-clustered genes for the NATL2A/P-SSM2 genomes, including PsbA, Hli03, Hli04, Hli05, PetE and PurM, which is helpful for studying phage infection using RNA-Seq.

February 5, 2012

P-SSP5 instead of P-SSP3

An immediate change has been made to ProPortal: the podovirus genome that is identified as P-SSP3 needs to be renamed. This genome was apparently incorrectly named; it is P-SSP5 (also called 9515-10a, which agrees with the currently posted metadata on ProPortal). There is no lysate available for this phage, so perhaps we should also try to figure out if we want it to be on ProPortal at all.

http://proportal.mit.edu/genome/id=58/

We need to get the real P-SSP3 integrated into ProPortal as well. The CAMERA genome annotated as P-SSP3 appears to be correct (according to the cyanophage inventory Excel table, P-SSP3 is also named G2087, which agrees with the CAMERA directory name). However, while we have FASTA files for it, there is no GenBank file available for it on the CAMERA FTP site. CAMERA has been contacted to get the file:

ftp://ftp.camera.calit2.net/20100928_Prochloro_phage_P_SSP3_G2087/phage/annotation/

Solution: P-SSP5 has replaced P-SSP3 in data_project table.

Following the memo, the old P-SSP3 has been renamed P-SSP5 in my Proportal development database. However, the correct P-SSP3 from CAMERA has not been integrated into my DB yet, so it would help a lot if you could find the GenBank (gbk) file for P-SSP3.

September 26, 2011

Search Page

Issue: some functions, e.g. using keywords to search for specific genes, weren't very useful. For example, a search for phoA produces lots of hits to cysQ.

Reason: the keyword "phoA" was searched in "gene_name", "gb_tag", "locus_tag" and the protein description "defline", and the query on "defline" generates the abundant extra output.

Fix: return results from searching "gene_name", "gb_tag" and "locus_tag" only, if there are any. Otherwise, search "defline" as a last resort. The following modification was made in queryProtein.py, from,

   if keywords.find("COG") == 0:
       keywords = keywords.replace("COG","")
       qset = Q(cog__tag=keywords)
   else:
       qset = (
       Q(defline__icontains=keywords) | Q(locus_tag__icontains=keywords) | Q(gb_tag__icontains=keywords) |Q(gene_name__icontains=keywords) | Q(xrefs__xref__icontains=keywords) 
       )

to

       # Search the identifier fields first.
       qset = (
       Q(locus_tag__icontains=keywords) | Q(gb_tag__icontains=keywords) | Q(gene_name__icontains=keywords) | Q(xrefs__xref__icontains=keywords)
       )
       mylist = Protein.objects.filter(qset).distinct().order_by('id')[:2000]
       # Fall back to the description only when the identifier search finds nothing.
       if len(mylist) == 0:
           qset = Q(defline__icontains=keywords)
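The two-stage lookup can be illustrated without Django. The sketch below works on plain dictionaries and is not the code in queryProtein.py; the function name `search_proteins` is an assumption, and the xrefs lookup is left out for brevity.

```python
ID_FIELDS = ("gene_name", "gb_tag", "locus_tag")


def search_proteins(proteins, keyword):
    """Case-insensitive substring search over a list of protein dicts.

    Identifier fields are searched first; the free-text 'defline' is
    consulted only when no identifier matches.
    """
    needle = keyword.lower()

    def matches(protein, fields):
        return any(needle in (protein.get(field) or "").lower() for field in fields)

    hits = [p for p in proteins if matches(p, ID_FIELDS)]
    if hits:
        return hits
    # Last resort: search the protein description.
    return [p for p in proteins if matches(p, ("defline",))]
```

With this ordering, a query such as phoA returns the proteins named phoA and no longer returns every protein whose description happens to mention it.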

September 22, 2011

Syn26 or P-SSP2

Simon: If I am correct, Syn26 should be P-SSP2 based on PsbA sequence? Then, I think we should change the metadata linked to the NCBI public entry. Is there anything underway for this? http://www.ncbi.nlm.nih.gov/nuccore/GU071107.1

Syn26 is the alternative name of P-SSP2. Its Taxonomy ID should be 444876 instead of 66501170, which is what was defined in the DB. A number of other strains were also defined with wrong Taxonomy IDs in the Proportal DB, due to the previous test run of the cluster analysis. Their Taxonomy IDs have been updated as follows,

   START TRANSACTION;
   update data_project set tax_id=445700 where tax_id=66501154;
   update data_project set tax_id=445696 where tax_id=66501155;
   update data_project set tax_id=444860 where tax_id=66501159;
   update data_project set tax_id=444861 where tax_id=66501160;
   update data_project set tax_id=444864 where tax_id=66501161;
   update data_project set tax_id=444878 where tax_id=66501163;
   update data_project set tax_id=445683 where tax_id=66501164;
   update data_project set tax_id=445684 where tax_id=66501165;
   update data_project set tax_id=445685 where tax_id=66501166;
   update data_project set tax_id=445686 where tax_id=66501167;
   update data_project set tax_id=445688 where tax_id=66501169;
   update data_project set tax_id=444876 where tax_id=66501170;
   update data_project set tax_id=444859 where tax_id=66506501;
   COMMIT;
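The same all-or-nothing update can be scripted. The sketch below uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the Proportal DB; the table layout beyond the tax_id column, the function name and the demo row are assumptions, and only two of the corrections above are shown.

```python
import sqlite3

# Two of the corrections listed above, as {wrong_tax_id: correct_tax_id}.
TAX_ID_FIXES = {
    66501154: 445700,
    66501170: 444876,  # Syn26 / P-SSP2
}


def apply_tax_id_fixes(conn, fixes):
    """Apply every correction in one transaction; return the rows changed."""
    changed = 0
    with conn:  # commits on success, rolls back if any statement raises
        for wrong_id, correct_id in fixes.items():
            cursor = conn.execute(
                "UPDATE data_project SET tax_id = ? WHERE tax_id = ?",
                (correct_id, wrong_id),
            )
            changed += cursor.rowcount
    return changed


# Stand-in table with hypothetical columns, for demonstration only.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE data_project (id INTEGER PRIMARY KEY, description TEXT, tax_id INTEGER)"
)
demo.execute("INSERT INTO data_project (description, tax_id) VALUES ('P-SSP2', 66501170)")
rows_changed = apply_tax_id_fixes(demo, TAX_ID_FIXES)
```

Keeping the mapping in one dictionary makes it easy to review against the NCBI Taxonomy IDs before anything is committed.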

September 8, 2011

Wrong phage genome ID

Phage genome VIMSS IDs (MicrobesOnline IDs) don't work. For instance, go to http://proportal.mit.edu/protein/70464/0/ and select VIMSS.

However, the ones for the cyanobacterial genes work.

Solution:

The VIMSS IDs for proteins are defined in the data_dna table, in which many proteins do not have a VIMSS ID defined. These need to be updated from the MicrobesOnline website.

Broken link for data download

The link to Supplementary Data on expression pages such as http://proportal.mit.edu/expression/18/ fails. For instance, it points to http://proportal.mit.edu/download/pubmed_17016519/.

Solution:

The supplementary data is available on the Proportal Download page http://proportal.mit.edu/download/, where the download links point to a different location. For instance, the link for the Nitrogen Availability Gene Expression Project is http://chisholmlab.mit.edu/kat/download/pubmed_17016519/.

Make the following change on templates/gene_express_project.html from,

       Download: <a href="/download/pubmed_{{pub.pubmed.pubmed_id}}/">Supplementary Data</a>

to,

       Download: <a href="http://chisholmlab.mit.edu/kat/download/pubmed_{{pub.pubmed.pubmed_id}}/">Supplementary Data</a>

The supplementary data should be made available directly from the Proportal website in the next release.

September 7, 2011

P-SSP5 instead of P-SSP3

An immediate change has been made to ProPortal: the podovirus genome that is identified as P-SSP3 needs to be renamed. This genome was apparently incorrectly named; it is P-SSP5 (also called 9515-10a, which agrees with the currently posted metadata on ProPortal). There is no lysate available for this phage, so perhaps we should also try to figure out if we want it to be on ProPortal at all.

http://proportal.mit.edu/genome/id=58/

We need to get the real P-SSP3 integrated into ProPortal as well. The CAMERA genome annotated as P-SSP3 appears to be correct (according to the cyanophage inventory Excel table, P-SSP3 is also named G2087, which agrees with the CAMERA directory name). However, while we have FASTA files for it, there is no GenBank file available for it on the CAMERA FTP site. CAMERA has been contacted to get the file:

ftp://ftp.camera.calit2.net/20100928_Prochloro_phage_P_SSP3_G2087/phage/annotation/

Solution:

P-SSP5 has replaced P-SSP3 in data_project table.

August 15, 2011

Zero or null values

For the microarray experiments Nitrogen Limitation and Light Sensing, check if all of the genes have 0 in the mean(T) and mean(C) fields in each experiment. If they do, alter the mean(T) and mean(C) fields to read "NA" or some other entry indicating that there is no data for these fields.

Solution:

a. Set the mean(T) and mean(C) values in DB to be null if there is no measurement in the experiment;

b. Display the null value as "NA" on the page by modifying ${OCEAN}/templates/expression/probeset.html from,

    e.t_mean
    e.c_mean

to,

    e.t_mean|default_if_none:"NA"
    e.c_mean|default_if_none:"NA"
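The reason for storing null rather than 0 is that a null can be told apart from a genuine measurement of zero. A plain-Python sketch of what the `default_if_none` filter does, written here as a stand-alone function for illustration:

```python
def default_if_none(value, default="NA"):
    """Return `default` only when `value` is None; keep every other value.

    A measured mean of 0.0 is therefore still displayed as 0.0, while a
    missing measurement (None, i.e. NULL in the DB) is shown as "NA".
    """
    return default if value is None else value
```

A falsy check would not be suitable here, because it would also replace real zeros with "NA".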

Verification:

   * http://proportal.mit.edu/probeset/MED4_ARR_1107_x_at/

Genbank ID

Link each GenBank ID on the Genome page for each genome to its GenBank web page so users can download the genome if they are interested.

Solution:

Modify the page: ${OCEAN}/templates/genome/genome.html

   GenBank ID: <a class="external" href="http://www.ncbi.nlm.nih.gov/nucleotide/genome.gb_id">genome.gb_id</a>

To be done:

Another solution is to store the address in data_url table.

Verification:

   * http://proportal.mit.edu/genome/id=1/

Update publications

Update Publications for the following phage genomes to reflect Matt Sullivan's 2010 phage paper, which described 16 phage genomes: SSM1, SSSM5, SSSM7, SSM2, SShM2, Syn1, Syn33, Syn19, Syn9, SPM2, PSSM7, PRSM4, PHM2, PHM1, PSSM4, PSSM2

   Genomic analysis of oceanic cyanobacterial myoviruses compared with T4-like myoviruses from   
   diverse hosts and environments. Sullivan MB, Huang KH, Ignacio-Espinoza JC, Berlin AM, 
   Kelly L, Weigele PR, DeFrancesco AS, Kern SE, Thompson LR, Young S, Yandava C, Fu R, 
   Krastins B, Chase M, Sarracino D, Osburne MS, Henn MR, Chisholm SW. Environ Microbiol. 2010 
   Nov;12(11):3035-56.

and Matt Henn's paper which describes phage genome sequencing, same phage genomes as above:

   Analysis of high-throughput sequencing and annotation strategies for phage genomes. Henn 
   MR, Sullivan MB, Stange-Thomann N, Osburne MS, Berlin AM, Kelly L, Yandava C, Kodira C, 
   Zeng Q, Weiand M, Sparrow T, Saif S, Giannoukos G, Young SK, Nusbaum C, Birren BW, Chisholm 
   SW. PLoS One. 2010

Solution:

a. Add both papers into data_publication table in DB using;

b. Link both papers with related projects/strains in data_projectpub table.

Verification:

   * http://proportal.mit.edu/genome/id=38/

External link

On the External Links page http://proportal.mit.edu/links/, add a link to

   VirMic website: http://www.cs.technion.ac.il/~itaish/VirMic/

Solution:

Add the link in ${OCEAN}/templates/basics/links.html

Verification:

   * http://proportal.mit.edu/links/

Correct mismatched host strain names

On the cyanophage genomes page: P-SSM4 host of isolation was Prochlorococcus NATL2A, not Prochlorococcus MED4.

The following mismatches are found by checking data_project and data_meta_data tables,

Mismatches

   id   description   additional               host (on webpage)      Cyanophage Inventory     id2
   25   P-SSM4        ProNATL2A                Prochlorococcus MED4   Prochlorococcus NATL2A   15
   41   Syn19         SynWH8109                Synechococcus WH8019   Synechococcus WH8109     35
   43   Syn1          SynWH8101                Synechococcus WH8102   Synechococcus WH8101     37
   49   Syn5          SynWH8109                Synechococcus WH8019   SynWH8109                43
   52   S-SSM4        SynWH8101                Synechococcus WH8018   Synechococcus WH8018     46
   54   P-HP1         Prochlorococcus NATL2A   Prochlorococcus MED4   Prochlorococcus NATL2A   48

Solution:

All mismatches are corrected by following the confirmation in the Cyanophage Inventory spreadsheet.
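A check of this kind could be automated along the following lines. The sketch is hypothetical: it assumes each row has been reduced to (id, phage, host shown on the webpage, host in the inventory) and compares only those two host columns after normalizing case and spacing.

```python
def find_host_mismatches(rows):
    """Return the rows whose webpage host differs from the inventory host.

    Each row is a tuple (id, phage, host_on_webpage, host_in_inventory).
    Names are compared case-insensitively with whitespace collapsed.
    """
    def normalize(name):
        return " ".join(name.split()).lower()

    return [row for row in rows if normalize(row[2]) != normalize(row[3])]
```

Abbreviated names such as SynWH8109 would still be reported as differences and need to be resolved by hand.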

Verification:

   * http://proportal.mit.edu/project/cyanophage/

NCBI COGs

NCBI COGs need to be linked to their database entries.

Solution:

Modify ${PROPORTAL}/templates/protein/protein.embed.html from,

   NCBI COG: COG-tag

to,

   NCBI COG: <a class="external" href="http://www.ncbi.nlm.nih.gov/COG/grace/wiew.cgi?COGtag">COGtag</a> 

Modify ${PROPORTAL}/templates/protein/protein_list.embed.html from,

  COGtag 

to,

  <a class="external" href="http://www.ncbi.nlm.nih.gov/COG/grace/wiew.cgi?COG-tag">COGtag</a>

Verification:

   * http://proportal.mit.edu/protein/27641/0/
   * http://proportal.mit.edu/cluster/9407/

Title

On the microarray experiment pages, the table heading “ProCOG”, should read “CyCog”.

Solution:

It is defined in ${OCEAN}/templates/expression/gene_express_data_table_header.html. Simply change "ProCOG" to "CyCog".

Verification:

   * http://proportal.mit.edu/gedata/exp=1&num=50&ig=0&q=10/

Side Menu

On the main page, change "Environ. Cell Distribution" to "Population Dynamics".

Solution:

Modify templates/basics/left_modules.html.

Invisible Genome Names

Make the GOS plots show all of the genome names. Currently only every other genome is shown.

Solution:

Modify templates/metagenome/genome_bar_plot.html from,

  chart.draw(data,
     {legend:'top', width:700, height:300, is3D:true, 
      title:'GOS Reads Recruited By ProPortal Genomes'
     });

to,

  chart.draw(data,
     {legend:'top', width:700, height:300, is3D:true, 
      title:'GOS Reads Recruited By ProPortal Genomes',
      hAxis: {slantedText:true, slantedTextAngle:45, textStyle:{fontSize:14}}
     });

Verification:

   All: http://proportal.mit.edu/gosInfo/
   GS000a: http://proportal.mit.edu/gosSite/GS000a/
   GS120: http://proportal.mit.edu/gosSite/GS120/


Probeset

ProbeSet IDs should also be searchable via the “Search Data” search box which appears on the homepage, the IDs look like this:

  MED4_ARR_0701_x_at, 

The page it would link to would be:

  http://proportal.mit.edu/probeset/MED4_ARR_0701_x_at/

Solution:

On hold. The search page is to be systematically modified for accurate search results.
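When the search page is revised, one possible approach is to recognize a ProbeSet ID in the query and redirect straight to its page. The sketch below is only a suggestion: the pattern is inferred from the two IDs quoted in these notes (MED4_ARR_0701_x_at and MED4_ARR_1107_x_at) and may not cover every array, and the function name is hypothetical.

```python
import re

# Assumed ID shape: <strain>_ARR_<digits>[_<letter>]_at
PROBESET_RE = re.compile(r"[A-Za-z0-9]+_ARR_\d+(?:_[a-z])?_at")


def probeset_url(query):
    """Return the ProbeSet page URL if `query` looks like a ProbeSet ID, else None."""
    query = query.strip()
    if PROBESET_RE.fullmatch(query):
        return "http://proportal.mit.edu/probeset/{}/".format(query)
    return None
```

Queries that do not match would fall through to the normal keyword search.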

Dataset download

From the NAR site, suggestions to submitters to the database issue: Do make data available for bulk download as flat files or relational database tables with associated documentation. We should at least make flat files (or db tables) of all called genes with cluster information available in both DNA and protein format.

Solution:

Datasets for previous publications are available for download from the publication page.

Dataset upload

From the NAR site, suggestions to submitters to the database issue: Do allow users to provide feedback on your data and submit new data. Do respond to user feedback in a timely manner. We have the user feedback option, but submitting new data externally has not been done before. I am not sure if we should bring this up or not. It would be cool to incorporate other people’s data (especially if we are serious about making ProPortal more of a community resource) but also a lot of work. This is a topic for later discussion, but I just wanted to get it out there so we can think about it.

Solution:

On hold. Open for suggestions.

One possible solution is to set up a wiki type page for uploading.