Proportal FAQs: Difference between revisions

From OpenWetWare

Revision as of 12:09, 27 September 2011

Reviewer's Comments on Proportal

Question: ProPortal vs IMG, CAMERA and other websites?

ProPortal appears to be a secondary database making data derived from the Chisholm lab public. The database provides access to a lot of their data that is mostly readily available elsewhere, e.g. IMG or CAMERA. However, there are some very useful additions, e.g. the microarray data and access to environmental data from molecular ecological studies (though the latter is accessed at an external site). Also, it is potentially good to have all the data present in a single place. It is very easy to use and get data from. It looks like a fair bit of work has gone into the backend database that supports the data that it outputs. The fact that it is easy to download the data in a ready-to-use format is good.

Genomes – It's good to have all the genome data in one place to access. But at the same time, other sites offer the ability to collect more useful genome information. Cyanorak & Cyanobase do similar things plus other functions (e.g. genome context, BLAST analysis etc.) and so does IMG. I couldn’t find anything novel on the genome side (though more Prochlorococcus genomes and phage genomes are included than elsewhere) & some functions.

Answer

Yes, there's overlap in terms of the genome information, and many sites provide similar information: NCBI, MicrobesOnline, IMG, to name a few. However, as the reviewer pointed out, it is easier to have all data in one place for large-scale data retrieval and for cross-linking between our different types of data. Additionally, some of the genomes have gone through our re-annotation process (such as SS120 and a few Synechococcus genomes), and we haven't been able to update those annotations in GenBank since we do not own those genomes. In comparison, IMG would only have the genome annotations from GenBank. Essentially, our version provides the complete annotations that were published in Kettler et al. We also have our own way of clustering genes (also described in Kettler et al.) that is perhaps more suitable for the genomes in our database.

We provide external links to MicrobesOnline (from the gene page) if available, and from there users can browse the genomes by KEGG pathways, use MO's comparative genome browser and view the precomputed BLAST results. We also provide a link to the NCBI BLAST page so that users can run a BLAST search on the fly for the gene in view.

Question: Error on the Search page?

Using keywords to search for specific genes wasn’t very useful e.g. a search for phoA produces lots of hits to cysQ.

Answer

Our keyword search engine is the simplest possible implementation, not a comprehensive or sophisticated one. It is probably worth improving if web users start to find it lacking. Lab members find it much easier to query the database directly, so developing a more sophisticated search engine for the website wasn't a good use of anyone's time. The current search engine is pretty aggressive: it searches for any text that matches or partially matches the keyword "phoA", and it is not "smart" enough to know that the user is searching for a gene name.

Updated on Sep. 27, 2011: The keyword search engine has been modified so that the keyword is first used to search gene name, locus tag and gb tag, and only then gene descriptions. The problem is fixed.
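The tiered lookup described in the update could look roughly like the sketch below. This is not ProPortal's actual code: the `Gene` record, the `search_genes` function, and the sample locus tags, gb tags and descriptions are all hypothetical. It also assumes that identifier fields are matched exactly (ignoring case) and that description hits are listed after identifier hits. The update does not say whether either is the case.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    # Hypothetical record; field names are illustrative only.
    gene_name: str
    locus_tag: str
    gb_tag: str
    description: str


def search_genes(genes, keyword):
    """Return identifier matches first, then description matches."""
    kw = keyword.strip().lower()
    if not kw:
        return []

    # Tier 1: exact, case-insensitive match on gene name, locus tag or gb tag.
    id_hits = [
        g for g in genes
        if kw in (g.gene_name.lower(), g.locus_tag.lower(), g.gb_tag.lower())
    ]

    # Tier 2: partial match in the free-text description, skipping tier-1 hits.
    desc_hits = [
        g for g in genes
        if g not in id_hits and kw in g.description.lower()
    ]
    return id_hits + desc_hits


# Made-up example data: a second gene whose description mentions "phoA".
genes = [
    Gene("cysQ", "EX_0002", "gbEX2", "example description mentioning phoA"),
    Gene("phoA", "EX_0001", "gbEX1", "alkaline phosphatase"),
]
print([g.gene_name for g in search_genes(genes, "phoA")])
```

With this ordering, a search for a gene name puts that gene at the top even when other genes mention it in their descriptions.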

Question: More microarray data?

The Microarrays part of the website is a very useful tool: easy to use, and it would benefit many people. It is certainly the most useful part of the website. It would be good, though, if all current data, e.g. Fe limitation and changing O2/CO2 data, were added on publication (rather than in the future).

Answer

We can add the iron data; it is published. The other is in prep, and we do not want to put it out there until the analysis is finished.

Question: Metagenome data

I was a bit unclear about the real function of the Metagenome part. Much of what I could do is also done in CAMERA. It wasn’t clear to me what the relative abundance of specific genes was telling me. Specifically, how is this data normalised? This should be mentioned in detail in the manuscript. Otherwise it isn’t particularly useful.

Answer

We are not trying to become CAMERA; we are only interested in including Pro/Syn/cyanophage metagenomes at ProPortal. The goal of ProPortal is to connect all the Prochlorococcus data in one place. The metagenomic part of ProPortal is still very much under development. The CAMERA dataset is the first metagenomic data set, and we used it as a model for how to store Prochlorococcus metagenomes in ProPortal. We re-mapped the reads ourselves since more host and cyanophage genomes had been sequenced and, at the time, the updated recruitments weren't available at CAMERA.

The bar graphs provided on the website report the direct read counts that are assigned to the currently available host/phage genomes. Perhaps we should also report the read counts normalized to genome size. Reporting the raw read counts is intended to answer the simplest questions, such as "Is this gene/genomic region represented at all in the metagenomes?", and to give users a quick answer on whether it's worth proceeding further.
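As a rough illustration of what normalizing to genome size could mean, the sketch below converts raw recruited-read counts into reads per kilobase of genome. The function name `reads_per_kb`, the genome labels and all numbers are made up for the example, and ProPortal does not currently report this. Other normalizations (for instance, also dividing by the total number of reads per sample) are equally plausible.

```python
def reads_per_kb(read_counts, genome_sizes_bp):
    """Divide each genome's raw read count by its length in kilobases."""
    normalized = {}
    for genome, count in read_counts.items():
        size_bp = genome_sizes_bp[genome]
        if size_bp <= 0:
            raise ValueError(f"genome size must be positive: {genome}")
        normalized[genome] = count / (size_bp / 1000.0)
    return normalized


# Hypothetical counts and genome sizes, for illustration only.
raw_counts = {"genome_A": 4000, "genome_B": 3000}
sizes_bp = {"genome_A": 2_000_000, "genome_B": 1_000_000}

print(reads_per_kb(raw_counts, sizes_bp))
```

In this made-up case genome_A has more raw reads, but genome_B has more reads per kilobase. That kind of difference is invisible in raw counts alone.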

BTW, I believe our UI design is much better than CAMERA's in many ways (but I haven't used CAMERA for a while, so I don't know if this is still the case). And since we are very Prochlorococcus-centric, our database is smaller and faster to query.

For instance, from our UI we can query a specific Pro/Syn/phage read and see which genome it is recruited to and which gene(s) it overlaps with: http://proportal.mit.edu/gosread/JCVI_READ_1105499780090/ (Strangely, though, the fasta report isn't displayed correctly. Will follow up on that?)

For a specific genomic region, we can query how many GOS reads are recruited to that region and where those reads come from. The website reports only the raw counts, with no normalization, but the back-end database allows our lab members to do more sophisticated queries. The web UI is currently still very simple.
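The region query described above can be sketched in a few lines. This is only an illustration under assumed data structures: the `Recruitment` record, the `reads_in_region` function, and the read, genome and site names are hypothetical and do not reflect ProPortal's actual schema. A read is counted if its recruited interval overlaps the region at all.

```python
from collections import Counter
from typing import NamedTuple


class Recruitment(NamedTuple):
    # Hypothetical record: one read recruited to one genome interval.
    read_id: str
    genome: str
    start: int
    end: int
    site: str  # sampling site the read came from


def reads_in_region(recruitments, genome, region_start, region_end):
    """Count reads overlapping [region_start, region_end], grouped by site."""
    by_site = Counter()
    for r in recruitments:
        overlaps = r.start <= region_end and r.end >= region_start
        if r.genome == genome and overlaps:
            by_site[r.site] += 1
    return by_site


# Made-up recruitments for illustration.
recs = [
    Recruitment("read_1", "genome_A", 100, 900, "site_1"),
    Recruitment("read_2", "genome_A", 850, 1600, "site_2"),
    Recruitment("read_3", "genome_A", 2000, 2800, "site_1"),
    Recruitment("read_4", "genome_B", 100, 900, "site_1"),
]
counts = reads_in_region(recs, "genome_A", 800, 1000)
print(sum(counts.values()), dict(counts))
```

The total answers "how many reads are recruited to this region", and the per-site breakdown answers "where do those reads come from".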

Question: Population Dynamics

Population Dynamics: The data is easy to access. Pro and Syn number info is useful, but easier access to environmental metadata (nutrients, light intensity, salinity, temperature, etc.) would be useful, rather than searching an external website.

Answer

This is true. We could get these data if we do not already have them. Each of those papers did statistics with all the environmental data, so there must be a master spreadsheet. (Check with Allison...)

Question: More citations?

The citations also need updating on the Synechococcus side since key genome papers Dufresne et al., 2008 and Scanlan et al., 2009 are missing on both the website and the manuscript.

The manuscript is well written, though the cited literature should extend beyond the Chisholm lab, since non-specialist readers might find it harder to access other papers with excellent datasets on the molecular ecological, microarray, genome and metagenomic side.

Answer

Hmm, not sure how this could be if they are in PubMed. I guess someone could add these into the database manually. We should cite more references outside the lab.

Question: Cluster analysis?

I didn’t find Figure 3 particularly useful.

Answer

We can try to defend it if we think it is useful...

Question: Future development?

Finally, it would be really nice if some of the things that will only appear in a future ProPortal update e.g. phylogenetic trees for gene clusters; linking GOS reads to gene clusters and genomes are actually included at its outset.

Answer