Proportal ToDoList: Difference between revisions

From OpenWetWare

Revision as of 11:50, 27 September 2011

To-do List

ID | Description | Status | Comments
1 | Orphan records in DB | To be confirmed: whether to remove them or fix the wrong links. | Add your comment
2 | Add/update 13 Cyanophage genome strains into production server | To be confirmed: published or not published data? | Add your comment
3 | Modify the search page | On hold: to be systematically modified for accurate results. | Add your comment
4 | Datasets download | On hold: wait for new datasets released or published. | Add your comment
5 | Datasets upload | Open for suggestion: mechanisms for incorporating the community efforts. | Add your comment
6 | Pipeline for cluster analysis | Ongoing. | Add your comment
7 | Dynamic presentation of cluster network | Ongoing. | Add your comment
8 | Annotation pipeline | On hold. | Add your comment

Reviewer's Comments on Proportal

Question: ProPortal vs IMG, CAMERA and other websites?

ProPortal appears to be a secondary database making data derived from the Chisholm lab public. The database provides access to a lot of their data that is mostly readily available elsewhere, e.g. IMG or CAMERA. However, there are some very useful additions, e.g. the microarray data and access to environmental data from molecular ecological studies (though the latter is accessed at an external site). Also, it is potentially good to have all the data present in a single place. It is very easy to use and get data from. It looks like a fair bit of work has gone into the backend database that supports the data that it outputs. The fact that it is easy to download the data in a ready-to-use format is good.

Genomes – It's good to have all the genome data in one place to access. But at the same time other sites offer the ability to collect more useful genome information. Cyanorak and Cyanobase do similar things plus other functions (e.g. genome context, BLAST analysis etc.) and so does IMG. I couldn’t find anything novel on the genome side (though more Prochlorococcus genomes and phage genomes are included than elsewhere) & some functions.

Answer

Yes, there's overlap in terms of the genome information, and many sites provide similar information: NCBI, MicrobesOnline and IMG, to name a few. However, as the reviewer pointed out, it is easier to have all data in one place for large-scale data retrieval and for cross-linking between our different types of data. Additionally, some of our genomes have gone through re-annotation (such as SS120 and a few Synechococcus genomes), and we haven't been able to submit those updates to GenBank since we do not own those genomes. In comparison, IMG would only have the genome annotations from GenBank. Essentially, our version provides the complete annotations that were published in Kettler et al. We also have our own way of clustering genes (also described in Kettler et al.) that is perhaps more suitable for the genomes in our database.

We provide external links to MicrobesOnline (from the gene page) where available, and from there users can browse the genomes by KEGG pathway, use MO's comparative genome browser and view the precomputed BLAST results. We also provide a link to the NCBI BLAST page so users can run a BLAST search on the fly for the gene in view.

Question: Error on the Search page?

Using keywords to search for specific genes wasn’t very useful e.g. a search for phoA produces lots of hits to cysQ.

Answer

Our keyword search engine is the simplest possible implementation, not a comprehensive or sophisticated one. It is probably worth improving if web users start to find it lacking. Lab members find it much easier to query the database directly, so developing a more sophisticated search engine for the website wasn't a good use of anyone's time. The current search engine is pretty aggressive: it returns any text that matches or partially matches the keyword "phoA", and it is not "smart" enough to know that the user is searching for a gene name.

Updated on Sep. 27, 2011: The keyword search engine has been modified so that the keyword is first used to search gene name, locus tag and gb tag, and then separately used to search gene descriptions. The problem is fixed.
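
For illustration only, the identifiers-first lookup described in the update could be sketched roughly as follows. The table name `gene`, its columns, the helper names and the demo rows are all made up for this sketch; they are not the actual ProPortal schema or code. It also assumes identifier matches are exact and description matches are partial, which the update note does not spell out.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table with made-up rows (not real ProPortal data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene "
        "(gene_name TEXT, locus_tag TEXT, gb_tag TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?)",
        [
            ("phoA", "DEMO_0001", "GB_0001", "alkaline phosphatase"),
            ("cysQ", "DEMO_0002", "GB_0002", "hypothetical note mentioning phoA"),
            ("psbA", "DEMO_0003", "GB_0003", "photosystem II protein D1"),
        ],
    )
    return conn


def search_genes(conn, keyword):
    """Return (identifier_hits, description_hits) for a keyword.

    Tier 1: exact, case-insensitive match on gene name, locus tag or gb tag.
    Tier 2: partial match on the description, skipping rows already in tier 1.
    """
    kw = keyword.strip().lower()
    if not kw:
        return [], []

    id_hits = conn.execute(
        "SELECT gene_name, locus_tag FROM gene "
        "WHERE lower(coalesce(gene_name, '')) = ? "
        "OR lower(coalesce(locus_tag, '')) = ? "
        "OR lower(coalesce(gb_tag, '')) = ? "
        "ORDER BY locus_tag",
        (kw, kw, kw),
    ).fetchall()

    desc_hits = conn.execute(
        "SELECT gene_name, locus_tag FROM gene "
        "WHERE instr(lower(coalesce(description, '')), ?) > 0 "
        "AND lower(coalesce(gene_name, '')) <> ? "
        "AND lower(coalesce(locus_tag, '')) <> ? "
        "AND lower(coalesce(gb_tag, '')) <> ? "
        "ORDER BY locus_tag",
        (kw, kw, kw, kw),
    ).fetchall()

    return id_hits, desc_hits


if __name__ == "__main__":
    identifier_hits, description_hits = search_genes(make_demo_db(), "phoA")
    print("identifier hits: ", identifier_hits)
    print("description hits:", description_hits)
```

Returning the two groups separately lets the page list the gene named phoA ahead of any gene whose description merely mentions it.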

Question: More microarray data?

The Microarrays part of the website is a very useful tool. It is easy to use and would benefit many people. It is certainly the most useful part of the website. It would be good, though, if all current data, e.g. the Fe limitation and changing O2/CO2 data, were added on publication (rather than in the future).

Answer

We can add the iron data; it is published. The other dataset is in prep, and we do not want to put it out there until the analysis is finished.

Question: Metagenome data

I was a bit unclear about the real function of the Metagenome part. Much of what I could do is also done in CAMERA. It wasn’t clear to me what the relative abundance of specific genes was telling me. Specifically, how are these data normalised? This should be described in detail in the manuscript. Otherwise it isn’t particularly useful.

Answer

We are not trying to become CAMERA; we are only interested in including Pro/Syn/cyanophage metagenomes at ProPortal. The goal of ProPortal is to connect all the Prochlorococcus data in one place. The metagenomic part of ProPortal was still very much under development when I left the lab. The CAMERA dataset was the first metagenomic dataset, used as a model for how to store Prochlorococcus metagenomes in ProPortal. We re-mapped the reads ourselves since more host and cyanophage genomes had been sequenced and, at the time, the updated recruitments weren't available at CAMERA.

The bar graphs on the website report the raw counts of reads assigned to the currently available host/phage genomes; perhaps we should also report the read counts normalized to genome size. Reporting the raw read counts is intended to answer the simplest questions, such as "Is this gene/genomic region represented at all in the metagenomes?", and to give users a quick answer on whether it's worth proceeding further.
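
As a rough sketch of what normalizing to genome size could look like, the snippet below converts raw read counts into reads per kilobase of genome. The function name, the genome labels and the numbers are invented for illustration, and reads per kb is only one possible choice; nothing here reflects how ProPortal actually computes or stores its counts.

```python
def reads_per_kb(raw_counts, genome_sizes_bp):
    """Convert raw recruited-read counts into reads per kb of genome.

    raw_counts:      {genome label: number of reads recruited to it}
    genome_sizes_bp: {genome label: genome length in base pairs}
    """
    normalized = {}
    for genome, count in raw_counts.items():
        size_bp = genome_sizes_bp[genome]
        if size_bp <= 0:
            raise ValueError(f"genome size must be positive for {genome!r}")
        normalized[genome] = count / (size_bp / 1000.0)
    return normalized


if __name__ == "__main__":
    # Made-up counts and sizes, purely to show the effect of normalization.
    counts = {"host_demo": 500, "phage_demo": 50}
    sizes = {"host_demo": 2_000_000, "phage_demo": 50_000}
    for genome, value in reads_per_kb(counts, sizes).items():
        print(genome, value)
```

In this toy case the phage has a tenth of the raw reads but a higher density per kb, which is the kind of difference the raw bar graphs cannot show.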

BTW, I believe our UI design is much better than CAMERA's in many ways (but I haven't used CAMERA for a while, so I don't know if this is still the case). And since we are very Prochlorococcus-centric, our database is smaller and faster to query.

For instance: from our UI, we can query a specific Pro/Syn/phage read, and see which genome it is recruited to and what gene(s) it overlaps with: http://proportal.mit.edu/gosread/JCVI_READ_1105499780090/

But, strangely, the FASTA report isn't displayed correctly; can Huiming follow up on that?

For a specific genomic region, we can query how many GOS reads are recruited to that region and where those reads come from. The website only reports the raw counts, with no normalization, but the back-end database allows our lab members to do more sophisticated queries. The web UI is currently still very simple.
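
To make the region query concrete, here is a minimal sketch of counting the reads that overlap a genomic region and tallying them by where they come from. The `RecruitedRead` record, its fields and the sample labels are hypothetical, and coordinates are assumed to be 1-based and inclusive; the real back-end database and its schema are not shown on this page.

```python
from collections import Counter
from typing import NamedTuple


class RecruitedRead(NamedTuple):
    """One read placed on a genome (hypothetical record layout)."""
    read_id: str
    genome: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    sample: str  # where the read comes from


def reads_in_region(reads, genome, region_start, region_end):
    """Return (total, per-sample counts) of reads overlapping a region."""
    if region_start > region_end:
        raise ValueError("region_start must not exceed region_end")
    by_sample = Counter()
    for read in reads:
        # Two closed intervals overlap when each starts before the other ends.
        overlaps = read.start <= region_end and read.end >= region_start
        if read.genome == genome and overlaps:
            by_sample[read.sample] += 1
    return sum(by_sample.values()), dict(by_sample)


if __name__ == "__main__":
    demo_reads = [
        RecruitedRead("read_1", "genome_1", 100, 900, "site_A"),
        RecruitedRead("read_2", "genome_1", 950, 1800, "site_B"),
        RecruitedRead("read_3", "genome_1", 5000, 5800, "site_A"),
        RecruitedRead("read_4", "genome_2", 100, 900, "site_A"),
    ]
    print(reads_in_region(demo_reads, "genome_1", 800, 1000))
```

The total returned here is still a raw count; dividing it by the region length would give a density that is comparable across regions.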

Question: Population Dynamics

Population Dynamics: The data is easy to access. The Pro and Syn number info is useful, but easier access to environmental metadata (i.e. nutrients, light intensity, salinity, temperature etc.) would be useful, rather than having to search an external website.

Question: More citations?

The citations also need updating on the Synechococcus side, since the key genome papers Dufresne et al., 2008 and Scanlan et al., 2009 are missing from both the website and the manuscript.

The manuscript is well written, though the cited literature should extend beyond the Chisholm lab, since non-specialist readers might otherwise find it harder to access other papers with excellent datasets on the molecular ecological, microarray, genome and metagenomic side.

Answer

Hmm, not sure how this could be if they are in PubMed. I guess someone could add these to the database manually.


Question: Cluster analysis?

I didn’t find Figure 3 particularly useful.

Answer


Question: Future development?

Finally, it would be really nice if some of the things that will only appear in a future ProPortal update (e.g. phylogenetic trees for gene clusters; linking GOS reads to gene clusters and genomes) were actually included from the outset.

Answer

Cluster Analysis

Coming soon.

Annotation Pipeline

September 9, 2011

katya: Would you guys be available next week to discuss setting up a pipeline for reannotating some of our newer phages, e.g. the strange new siphos, which were pitifully annotated by the Broad pipeline? (I'd also like to revisit a couple of the myos that were annotated by Matt's group once we have a pipeline we're happy with in place.)

September 21, 2011

Simon: I met Matt Henn last Friday and we talked about the phage annotation pipeline. We can send them our sequences for annotation, but both of us would prefer to have the pipeline independent. The problem is that there are in-house dependencies linked to the annotation pipeline, so to make it public we would need to remove or move these. Matt estimates that it could take 3-4 months of work for one person.

Data Download

September 23, 2011

The data posted for the different papers should look much more professional, or we should take it down. The names of the files are hokey and not transparent, for one thing... (that would be easy to fix).

More importantly, the spreadsheets for the temp and light data have those messy graphs on them. We should delete the graphs. There is also no annotation on the spreadsheets, so they would not be useful to anyone; they don't have units, and they have too many significant figures. Just not ready for the public eye. Just too "raw" to have out there for the whole world to see.

The data we have under the different publications is at http://proportal.mit.edu/download/. We should probably take some of it down for now until we can figure out how to clean it up. We should discuss this at the next lab meeting.

Data Upload

A number of new strains should be uploaded into the DB. Refer to the Strain Discussion for more detail.