Proportal ToDoList: Difference between revisions

From OpenWetWare
 
(23 intermediate revisions by the same user not shown)
Line 9: Line 9:
|-
|-
! 2
! 2
| Add/update 13 Cyanophage genome strains into production server || To be confirmed: published or not published data? || Add your comment  
| Add/update 13 Cyanophage genome strains into production server || Complete || Add your comment  
|-
|-
! 3
! 3
| Modify the search page || On hold: to be systematically modified for accurate results. || Add your comment   
| Modify the search page || Complete: systematically modified for accurate results. || Add your comment   
|-
|-
! 4
! 4
| Datasets download || On hold: wait for new datasets released or published. || Add your comment  
| Datasets download || Complete: wait for new datasets released or published. || Add your comment  
|-
|-
! 5  
! 5  
Line 21: Line 21:
|-
|-
! 6  
! 6  
| Pipeline for cluster analysis || On going. || Add your comment  
| Pipeline for cluster analysis || Complete || Add your comment  
|-
|-
! 7  
! 7  
| Dynamic presentation of cluster network || On going.|| Add your comment  
| Uploading new genomes || On going.|| Add your comment  
|-
|-
! 8  
! 8  
| Annotation pipeline || On hold.|| Add your comment  
| Annotation pipeline || On going.|| Add your comment  
|-
! 9
| RNA-Seq pipeline || On going.|| Add your comment
|-
! 10
| Dynamic presentation of cluster network || On going.|| Add your comment 
|}
|}


==Reviewer's Comments on Proportal==
==Cluster Analysis==
===Question: Proportal vs IMG, CAMERA and others websites?===
The current COG clustering pipeline is in review. New COG clusters are being generated on the internal development website and will be updated soon on the public Proportal website.
ProPortal appears to be a secondary database making data derived from the Chisholm lab public. The database provides access to a lot of their data that is mostly ready available elsewhere e.g. IMG or CAMERA. However, there are some very useful additions e.g. the microarray data and access to environmental data from molecular ecological studies (though the latter is accessed at an external site). Also, potentially it is good to have all the data present in a single place. It is very easy to use and get data from. It looks like a fair bit of work has gone into the backend database that supports the data that it outputs. The fact it is easy to download the data in a ready to use format is good.
 
===April 10, 2012===
It's a public genome...
 
https://moore.jcvi.org/moore/SingleOrganism.do?speciesTag=PUH18301&pageAttr=pageMain
 
so it need not be in the private genomes. We should have all publicly available Prochlorococcus genomes in Proportal... Eventually!
 
===April 6, 2012===
A batch of new Prochlorococcus genomes will be loaded into the "private ProPortal" and their genes assigned to the COGs. That is our next highest priority.
 
===March 1, 2012===
Continue on RNA-Seq data analysis and the annotation of genomes.  
 
===February 28, 2012===
Prepare DOE proposal.  
 
===February 27, 2012===
Check both single- and paired-end reads using the genome browser. No obvious problem was found.  
 
===February 23, 2012===
Jen said she was finished with everything on the Linux box. The new Linux server should be installed on the new hard drive.
 
===February 5, 2012===
P-SSP5 instead of P-SSP3
An immediate change has been made to ProPortal: the podovirus genome that was identified as P-SSP3 needed to be renamed. This genome was apparently incorrectly named; it is P-SSP5 (also called 9515-10a, which agrees with the currently posted metadata on ProPortal). There is no lysate available for this phage, so perhaps we should also decide whether we want it on ProPortal at all.
http://proportal.mit.edu/genome/id=58/
 
We need to get the real P-SSP3 integrated into ProPortal as well. The CAMERA genome annotated as P-SSP3 appears to be correct (according to the cyanophage inventory Excel table, P-SSP3 is also named G2087, which agrees with the CAMERA directory name). However, while we have FASTA files for it, there is no GenBank file available for it on the CAMERA FTP site. CAMERA has been contacted to get the file:
ftp://ftp.camera.calit2.net/20100928_Prochloro_phage_P_SSP3_G2087/phage/annotation/
 
Solution:
P-SSP5 has replaced P-SSP3 in data_project table.
 
According to the memo, the old P-SSP3 has now been renamed P-SSP5 in my Proportal development database. However, the correct P-SSP3 from CAMERA hasn't been integrated into my DB yet. It would therefore help me a lot if you could find the GenBank (gbk) file for P-SSP3.
 
===January 30, 2012===


Genomes: It's good to have all the genome data in one place to access. But at the same time, other sites offer the ability to collect more useful genome information. Cyanorak and Cyanobase do similar things plus other functions (e.g. genome context, BLAST analysis etc.), and so does IMG. I couldn’t find anything novel on the genome side (though more Prochlorococcus genomes and phage genomes are included than elsewhere).
SSSM7 is a phage, the rest are Prochlorococcus. Should this be an orphan phage gene in the SSSM7 genome?
& some functions.


<B>Answer</B>  
>PMED4_13831|3728
>P9303_01001|3728
>A9601_14421|3728
>SS120_13441|3728
>SSSM7_186|3728


Yes, there's overlap in terms of the genome information, and many sites provide similar information: NCBI, MicrobesOnline and IMG, to name a few. However, as the reviewer pointed out, it is easier to have all data in one place for large-scale data retrieval and for cross-linking between our different types of data. Additionally, we have re-annotated some of the genomes (such as SS120 and a few Synechococcus genomes), and we haven't been able to push those updates to GenBank since we do not own those genomes. In comparison, IMG would only have the genome annotations from GenBank. Essentially, our version provides the complete annotations that were published in Kettler et al. We also have our own way of clustering genes (also described in Kettler et al.) that is perhaps more suitable for the genomes in our database.
To verify this problem, use the query,


We provide external links to MicrobesOnline (from the gene page) if available, and from there users can browse the genomes by KEGG pathways, use MO's comparative genome browser and view the precomputed BLAST results. We also provide a link to the NCBI BLAST page so users can run a BLAST search on the fly for the gene in view.
SELECT * FROM `ocean-dev`.`data_protein` A
left join data_scaffold B
on A.scaffold_id=B.id
left join data_project C
on C.id = B.project_id
where cluster_id=3728;


===Question: Error on the Search page?===
To find all hmmscan cases, use
Using keywords to search for specific genes wasn’t very useful e.g. a search for phoA produces lots of hits to cysQ.


<B>Answer</B>
SELECT * FROM `ocean-dev`.`data_protein`  A
left join data_scaffold B
on A.scaffold_id=B.id
left join data_project C
on C.id = B.project_id
where cluster_evi like '%hmmscan'
order by cluster_id;


Our keyword search engine is the simplest possible implementation, not a comprehensive or sophisticated one. It is probably worth improving if web users start to find it lacking. For lab members it's much easier to query the database directly, so it wasn't a good use of anyone's time to develop a more sophisticated search engine for the website. The current search engine is pretty aggressive: it searches for any text that matches or partially matches the keyword "phoA", and it is not "smart" enough to know that the user is searching for a gene name.
===January 24, 2012===
Add P-SSP3 genome into Proportal and run cluster pipeline again. Include the following genomes in the output,


Updated on Sep. 27, 2011: The keyword search engine has been modified so that the keyword is first matched against gene name, locus tag and gb tag, and only then against gene descriptions. The problem is fixed.


===Question: More microarray data?===
76 P-GSP1
The Microarrays part of the website is a very useful tool. Easy to use and would benefit many people. Certainly, the most useful part of the website. It would be good though if all current data, e.g. Fe limitation and changing O2/CO2 data, were added on publication (rather than in the future).
54 P-HP1
75 P-RSP2
55 P-RSP5
71 P-SSP10
72 P_SSP6_G2088
24 P_SSP7
57 P-SSP9_G2089
49 SYN5_gp01 [NC_009531]
56 P-SSP2


<B>Answer</B>
58 P-SSP5 (old P-SSP3) should be removed from Proportal?


We can add the iron data. It is published. The other is in prep and we do not want to put it out there until the analysis is finished.
==Annotation Pipeline==
===February, 2012===


===Question: Metagenome data===
Martiny, Chisholm, 2006, parallel genome comparison
The Metagenome part I was a bit unclear of its real function. Much of what I could do is also done in CAMERA. It wasn’t clear to me what relative abundance of specific genes was telling me. Specifically, how is this data normalised etc? This should be mentioned in detail in the manuscript. Otherwise it isn’t particularly useful.


<B>Answer</B>
Rodrigue et al, 2010, RNA-seq pipeline


We are not trying to become CAMERA; we are only interested in including Pro/Syn/cyanophage metagenomes in ProPortal. The goal of ProPortal is to connect all the Prochlorococcus data in one place. The metagenomic part of ProPortal was still very much under development when I left the lab. The CAMERA dataset was the first metagenomic data set, used as a model for how to store Prochlorococcus metagenomes in ProPortal. We re-mapped the reads ourselves since more host and cyanophage genomes had been sequenced and, at the time, the updated recruitments weren't available at CAMERA.
Annotating using RAST (Rapid Annotation using Subsystem Technology)
Aziz et al., 2008, BMC Genomics


The bar graphs provided on the website report the direct read counts that are assigned to the currently available host/phage genomes, perhaps we should also report the read counts normalized to the genome size.  Reporting the raw read counts is intended to answer the simplest questions such as "Is this gene/genomic region represented at all in the metagenomes?"  And to give the users a quick answer on whether it's worth proceeding further.
10 weiniger hill drive,  


BTW, I believe our UI design is much better than CAMERA in many ways (but I haven't used CAMERA for awhile, so don't know if this is still the case).  And since we are very Prochlorococcus-central, our database is smaller and faster to query.
IMG-ACT
annotation collaboration toolkit
two main types: gene and pathway/structure


For instance: from our UI, we can query a specific Pro/Syn/phage read, and see which genome it is recruited to and what gene(s) it overlaps with: http://proportal.mit.edu/gosread/JCVI_READ_1105499780090/
wiki-based
housed at DOE-JGI


But, strangely the fasta report isn't reported correctly, can Huiming follow up on that?


For a specific genomic region, we can query how many GOS reads are recruited to that region and where those reads come from. The website reports only the raw counts, with no normalization, but the back-end database allows our lab members to run more sophisticated queries. The web UI is currently still very simple.
localization
orf
structure
enzymatic
duplication and degradation
horizontal gene transfer
RNA info


===Question: Population Dynamics===
img-act.jgi-psf.org/usr/login
Population Dynamics: The data is easy to access. Pro and Syn number info is useful, but easier access to environmental metadata (nutrients, light intensity, salinity, temperature etc.) would be useful, rather than searching an external website.


===Question: More citations?===
Dufresne et al., 2003, pathway graph
The citations also need updating on the Synechococcus side since key genome papers Dufresne et al., 2008 and Scanlan et al., 2009 are missing on both the website and the manuscript.


The manuscript is well written though the cited literature should encompass beyond the Chisholm lab since non-specialist readers might find it harder to access other papers with excellent datasets on the molecular ecological, microarray, genome and metagenomic side.
KEGG


<B>Answer</B>
MetaCyc


Hmm, not sure how this could be if they are in pubmed.  I guess someone could add these into the database manually.
Pathway Tools version 15


filled pathway tools


===Question: Cluster analysis?===
Using BLAST to identify missing enzymes.
I didn’t find Figure 3 particularly useful.


<B>Answer</B>
Pfam/domain analysis for motifs


===October, 2011===


===Question: Future development?===
Another annotation pipeline.
Finally, it would be really nice if some of the things that will only appear in a future ProPortal update e.g. phylogenetic trees for gene clusters; linking GOS reads to gene clusters and genomes are actually included at its outset.


<B>Answer</B>
B2G4Pipe: the Blast2GO Pipeline Version, which runs Blast2GO without a graphical interface.


==Cluster Analysis==
For more information, refer to http://www.blast2go.com/b2glaunch/resources
Coming soon.


==Annotation Pipeline==
===September 30, 2011===
<B>September 9, 2011</B>
Kat:
Since Matt already offered his pipeline and it sounded like it has been continuously maintained and developed, it does sound like a good option.
However, pay attention to how they train the gene calling program and what program(s) are used.  The old method (described in the T4 paper) was
dependent on a gene calling program, GeneMark.  I think Matt's pipeline's improvement was mostly on the start sites...  But it's perhaps not that critical to get the start sites right depending on the focus of your project.


katya: Would you guys be available next week to discuss setting up a pipeline for reannotating some of our newer phages, e.g. the strange new siphos, which were
The general idea of a pipeline is simple if you'd rather build one yourself:
pitifully annotated by the Broad pipeline? (I'd also like to  revisit a couple of the myos that were annotated by Matt's group once we have a pipeline we're
1. Evaluate the gene calling programs and figure out the best way to train the programs for phage genomes.
happy with in place.)
2. Combine the results into a final set.
3. Filter false positives.  For Prochlorococcus genomes, I filter the short orphan gene models (< 50aa without any homologs in sequenced genomes).


<B>September 21, 2011</B>
Step 1 has to be a continuous effort and is the most time-consuming, since new programs and better algorithms continue to be developed, so any annotation pipeline requires constant maintenance and re-evaluation.


===September 21, 2011===
Simon: I met Matt Henn last Friday and we talked about the phage annotation pipeline. We can send them our sequences for annotation but both of us would prefer to have the pipeline
Simon: I met Matt Henn last Friday and we talked about the phage annotation pipeline. We can send them our sequences for annotation but both of us would prefer to have the pipeline
independent. The problem is (or are) that there are in-house dependencies linked to the annotation pipeline. So to make it public, we would need to remove/move these. Matt
independent. The problem is (or are) that there are in-house dependencies linked to the annotation pipeline. So to make it public, we would need to remove/move these. Matt
estimates that it could be 3-4 months of work for one person.
estimates that it could be 3-4 months of work for one person.
===September 9, 2011===
katya: Would you guys be available next week to discuss setting up a pipeline for reannotating some of our newer phages, e.g. the strange new siphos, which were pitifully annotated by the Broad pipeline? (I'd also like to  revisit a couple of the myos that were annotated by Matt's group once we have a pipeline we're happy with in place.)


==Data Download==
==Data Download==
<B>September 23, 2011</B>
===September 30, 2011===
We should add the iron microarray data since it is published. The Supp Info of the paper does not include the entire microarray dataset, only the differentially expressed genes in MED4/MIT9313.
 
Here's the data as log2 fold change. The 70 (and 72) hour time points come after an iron rescue of the experimental (-Fe) treatment.


===September 23, 2011===
The data posted for the different papers should look much more professional, or be taken down. The names of the files are hokey, and not transparent, for one thing... (that would be easy to fix).
The data posted for the different papers should look much more professional, or be taken down. The names of the files are hokey, and not transparent, for one thing... (that would be easy to fix).


Line 128: Line 209:
==Data Upload==
==Data Upload==
A number of new strains should be uploaded into the DB. Refer to the [[Proportal_Strains | Strain Discussion]] for more detail.
A number of new strains should be uploaded into the DB. Refer to the [[Proportal_Strains | Strain Discussion]] for more detail.
==Broken Links==
===September 30, 2011===
For instance: from our UI, we can query a specific Pro/Syn/phage read, and see which genome it is recruited to and what gene(s) it overlaps with: http://proportal.mit.edu/gosread/JCVI_READ_1105499780090/
But, strangely, the FASTA report isn't displayed correctly.
==Non-coding RNAs==
===February 3, 2012===
150 ncRNAs for prokaryotic genomes
asRNAs: Antisense RNAs, RNA degradation, translation inhibition, mRNA stabilization
Is the TATA-box of asRNAs conserved? not conserved!
Transcription of ORFs is conserved.
Transcription of asRNAs is not conserved. Why?
Which ncRNAs are functional?
Current sequence analysis is good for gene comparison but does not account for differences in regulation (transcriptome, ncRNAs).
The fact is:
the transcription of ORFs is much more conserved than that of asRNAs;
the majority of ORFs have conserved TSS?!
lack of conservation is mostly due to differences in promoter sequence
Comparing transcriptomes is important

Latest revision as of 06:45, 10 April 2012

To-do List

{|
! id !! description !! Status !! Comments
|-
! 1
| Orphan records in DB || To be confirmed: whether to remove them or fix the wrong links. || Add your comment
|-
! 2
| Add/update 13 Cyanophage genome strains into production server || Complete || Add your comment
|-
! 3
| Modify the search page || Complete: systematically modified for accurate results. || Add your comment
|-
! 4
| Datasets download || Complete: waiting for new datasets to be released or published. || Add your comment
|-
! 5
| Datasets upload || Open for suggestion: mechanisms for incorporating the community efforts. || Add your comment
|-
! 6
| Pipeline for cluster analysis || Complete || Add your comment
|-
! 7
| Uploading new genomes || Ongoing. || Add your comment
|-
! 8
| Annotation pipeline || Ongoing. || Add your comment
|-
! 9
| RNA-Seq pipeline || Ongoing. || Add your comment
|-
! 10
| Dynamic presentation of cluster network || Ongoing. || Add your comment
|}

Cluster Analysis

The current COG clustering pipeline is in review. New COG clusters are being generated on the internal development website and will be updated soon on the public Proportal website.

April 10, 2012

It's a public genome...

https://moore.jcvi.org/moore/SingleOrganism.do?speciesTag=PUH18301&pageAttr=pageMain

so it need not be in the private genomes. We should have all publicly available Prochlorococcus genomes in Proportal... Eventually!

April 6, 2012

A batch of new Prochlorococcus genomes will be loaded into the "private ProPortal" and their genes assigned to the COGs. That is our next highest priority.

March 1, 2012

Continue on RNA-Seq data analysis and the annotation of genomes.

February 28, 2012

Prepare DOE proposal.

February 27, 2012

Check both single- and paired-end reads using the genome browser. No obvious problem was found.

February 23, 2012

Jen said she was finished with everything on the Linux box. The new Linux server should be installed on the new hard drive.

February 5, 2012

P-SSP5 instead of P-SSP3

An immediate change has been made to ProPortal: the podovirus genome that was identified as P-SSP3 needed to be renamed. This genome was apparently incorrectly named; it is P-SSP5 (also called 9515-10a, which agrees with the currently posted metadata on ProPortal). There is no lysate available for this phage, so perhaps we should also decide whether we want it on ProPortal at all. http://proportal.mit.edu/genome/id=58/

We need to get the real P-SSP3 integrated into ProPortal as well. The CAMERA genome annotated as P-SSP3 appears to be correct (according to the cyanophage inventory Excel table, P-SSP3 is also named G2087, which agrees with the CAMERA directory name). However, while we have FASTA files for it, there is no GenBank file available for it on the CAMERA FTP site. CAMERA has been contacted to get the file: ftp://ftp.camera.calit2.net/20100928_Prochloro_phage_P_SSP3_G2087/phage/annotation/

Solution: P-SSP5 has replaced P-SSP3 in data_project table.

According to the memo, the old P-SSP3 has now been renamed P-SSP5 in my Proportal development database. However, the correct P-SSP3 from CAMERA hasn't been integrated into my DB yet. It would therefore help me a lot if you could find the GenBank (gbk) file for P-SSP3.

January 30, 2012

SSSM7 is a phage, the rest are Prochlorococcus. Should this be an orphan phage gene in the SSSM7 genome?

>PMED4_13831|3728
>P9303_01001|3728
>A9601_14421|3728
>SS120_13441|3728
>SSSM7_186|3728

To verify this problem, use the query:

SELECT * FROM `ocean-dev`.`data_protein` A
LEFT JOIN data_scaffold B ON A.scaffold_id = B.id
LEFT JOIN data_project C ON C.id = B.project_id
WHERE cluster_id = 3728;

To find all hmmscan cases, use:

SELECT * FROM `ocean-dev`.`data_protein` A
LEFT JOIN data_scaffold B ON A.scaffold_id = B.id
LEFT JOIN data_project C ON C.id = B.project_id
WHERE cluster_evi LIKE '%hmmscan'
ORDER BY cluster_id;
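The cluster 3728 case can also be checked outside the database, directly from headers of the form `>locus_tag|cluster_id` as listed above. The following is a minimal Python sketch, not part of the ProPortal code base. The function names are hypothetical, and the set of phage prefixes is an assumption that contains only SSSM7, the one genome these notes identify as a phage.

```python
from collections import defaultdict

# Assumption: locus-tag prefixes that belong to phage genomes. Only SSSM7 is
# identified as a phage in the notes; extend this set as needed.
PHAGE_PREFIXES = {"SSSM7"}


def parse_header(header):
    """Split a header such as '>PMED4_13831|3728' into (locus_tag, cluster_id)."""
    locus_tag, cluster_id = header.lstrip(">").strip().split("|")
    return locus_tag, int(cluster_id)


def mixed_clusters(headers, phage_prefixes=PHAGE_PREFIXES):
    """Return {cluster_id: (host_tags, phage_tags)} for clusters holding both kinds."""
    hosts = defaultdict(list)
    phages = defaultdict(list)
    for header in headers:
        locus_tag, cluster_id = parse_header(header)
        prefix = locus_tag.split("_")[0]
        target = phages if prefix in phage_prefixes else hosts
        target[cluster_id].append(locus_tag)
    # Keep only clusters that have at least one host and one phage member.
    return {cid: (hosts[cid], phages[cid]) for cid in hosts.keys() & phages.keys()}


headers = [
    ">PMED4_13831|3728",
    ">P9303_01001|3728",
    ">A9601_14421|3728",
    ">SS120_13441|3728",
    ">SSSM7_186|3728",
]
print(mixed_clusters(headers))
```

Run over a full cluster export, a report like this would list every cluster that mixes host and phage genes, which could then be cross-checked against the hmmscan query.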

January 24, 2012

Add P-SSP3 genome into Proportal and run cluster pipeline again. Include the following genomes in the output:

76 P-GSP1
54 P-HP1
75 P-RSP2
55 P-RSP5
71 P-SSP10
72 P_SSP6_G2088
24 P_SSP7
57 P-SSP9_G2089
49 SYN5_gp01 [NC_009531]
56 P-SSP2

58 P-SSP5 (old P-SSP3) should be removed from Proportal?

Annotation Pipeline

February, 2012

Martiny, Chisholm, 2006, parallel genome comparison

Rodrigue et al, 2010, RNA-seq pipeline

Annotating using RAST (Rapid Annotation using Subsystem Technology), Aziz et al., 2008, BMC Genomics

10 weiniger hill drive,

IMG-ACT: annotation collaboration toolkit; two main types: gene and pathway/structure

wiki-based, housed at DOE-JGI


localization, ORF, structure, enzymatic, duplication and degradation, horizontal gene transfer, RNA info

img-act.jgi-psf.org/usr/login

Dufresne et al., 2003, pathway graph

KEGG

MetaCyc

Pathway Tools version 15

filled Pathway Tools

Using BLAST to identify missing enzymes.

Pfam/domain analysis for motifs

October, 2011

Another annotation pipeline.

B2G4Pipe: the Blast2GO Pipeline Version, which runs Blast2GO without a graphical interface.

For more information, refer to http://www.blast2go.com/b2glaunch/resources

September 30, 2011

Kat: Since Matt already offered his pipeline and it sounded like it has been continuously maintained and developed, it does sound like a good option. However, pay attention to how they train the gene calling program and what program(s) are used. The old method (described in the T4 paper) was dependent on a gene calling program, GeneMark. I think Matt's pipeline's improvement was mostly on the start sites... But it's perhaps not that critical to get the start sites right depending on the focus of your project.

The general idea of a pipeline is simple if you'd rather build one yourself:
1. Evaluate the gene calling programs and figure out the best way to train the programs for phage genomes.
2. Combine the results into a final set.
3. Filter false positives. For Prochlorococcus genomes, I filter the short orphan gene models (< 50 aa without any homologs in sequenced genomes).

Step 1 has to be a continuous effort and is the most time-consuming, since new programs and better algorithms continue to be developed, so any annotation pipeline requires constant maintenance and re-evaluation.
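As an illustration of step 3, here is a minimal Python sketch of the short-orphan filter. It is not the actual pipeline code: the `GeneModel` fields and the example records are hypothetical, and it assumes the homolog count comes from a separate similarity search. Only the 50 aa threshold and the "short and without homologs" rule are taken from the description above.

```python
from dataclasses import dataclass

MIN_LENGTH_AA = 50  # threshold quoted in the notes for Prochlorococcus genomes


@dataclass
class GeneModel:
    locus_tag: str   # hypothetical identifier for the predicted gene
    length_aa: int   # predicted protein length in amino acids
    n_homologs: int  # homologs found in other sequenced genomes (assumed precomputed)


def filter_false_positives(models, min_length_aa=MIN_LENGTH_AA):
    """Drop gene models that are both short and orphan; keep everything else."""
    return [
        m for m in models
        if not (m.length_aa < min_length_aa and m.n_homologs == 0)
    ]


# Made-up example records, for illustration only.
models = [
    GeneModel("gene_a", 320, 12),  # long, has homologs: kept
    GeneModel("gene_b", 42, 0),    # short orphan: removed
    GeneModel("gene_c", 45, 3),    # short but has homologs: kept
    GeneModel("gene_d", 180, 0),   # orphan but long enough: kept
]
kept = filter_false_positives(models)
print([m.locus_tag for m in kept])
```

Short genes with homologs and long orphans both survive; only models that fail both criteria are discarded, which keeps the filter conservative.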

September 21, 2011

Simon: I met Matt Henn last Friday and we talked about the phage annotation pipeline. We can send them our sequences for annotation but both of us would prefer to have the pipeline independent. The problem is that there are in-house dependencies linked to the annotation pipeline. So to make it public, we would need to remove/move these. Matt estimates that it could be 3-4 months of work for one person.

September 9, 2011

katya: Would you guys be available next week to discuss setting up a pipeline for reannotating some of our newer phages, e.g. the strange new siphos, which were pitifully annotated by the Broad pipeline? (I'd also like to revisit a couple of the myos that were annotated by Matt's group once we have a pipeline we're happy with in place.)

Data Download

September 30, 2011

We should add the iron microarray data since it is published. The Supp Info of the paper does not include the entire microarray dataset, only the differentially expressed genes in MED4/MIT9313.

Here's the data as log2 fold change. The 70 (and 72) hour time points come after an iron rescue of the experimental (-Fe) treatment.
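For readers unfamiliar with the unit, a log2 fold change is the base-2 logarithm of the ratio between two expression values. The Python sketch below assumes the ratio is experimental (-Fe) over control; the notes do not state the direction, and the numbers are made up for illustration.

```python
import math


def log2_fold_change(treatment, control):
    """Return log2(treatment / control); both values must be positive."""
    if treatment <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(treatment / control)


# Illustrative values only, not taken from the iron microarray dataset.
print(log2_fold_change(200.0, 100.0))  # two-fold higher under -Fe
print(log2_fold_change(50.0, 100.0))   # two-fold lower under -Fe
```

On this scale a value of 1 means a doubling and -1 a halving, so up- and down-regulation are symmetric around 0.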

September 23, 2011

The data posted for the different papers should look much more professional, or be taken down. The names of the files are hokey, and not transparent, for one thing... (that would be easy to fix).

More importantly, the spreadsheets for the temp and light data have those messy graphs on them. We should delete the graphs. There is no annotation on the spreadsheets, so they would not be useful to anyone, and they don't have units. They also have too many significant figures. Just not ready for the public eye. Just too "raw" to have out there for the whole world to see.

The data we have under the different publications: http://proportal.mit.edu/download/ We should probably take some of it down for now until we can figure out how to clean it up. We should discuss this in the next lab meeting.

Data Upload

A number of new strains should be uploaded into the DB. Refer to the Strain Discussion for more detail.

Broken Links

September 30, 2011

From our UI, we can query a specific Pro/Syn/phage read and see which genome it is recruited to and what gene(s) it overlaps with, for instance: http://proportal.mit.edu/gosread/JCVI_READ_1105499780090/

But, strangely, the FASTA report isn't displayed correctly.

Non-coding RNAs

February 3, 2012

150 ncRNAs for prokaryotic genomes

asRNAs: antisense RNAs; RNA degradation, translation inhibition, mRNA stabilization

Is the TATA-box of asRNAs conserved? Not conserved!

Transcription of ORFs is conserved. Transcription of asRNAs is not conserved. Why?

Which ncRNAs are functional? Current sequence analysis is good for gene comparison but does not account for differences in regulation (transcriptome, ncRNAs).

The fact is:
the transcription of ORFs is much more conserved than that of asRNAs;
the majority of ORFs have conserved TSS?!
lack of conservation is mostly due to differences in promoter sequence


Comparing transcriptomes is important