Proportal ToDoList
Latest revision as of 06:45, 10 April 2012
To-do List
ID | Description | Status | Comments |
---|---|---|---|
1 | Orphan records in DB | To be confirmed: whether to remove them or fix the broken links. | Add your comment |
2 | Add/update 13 cyanophage genome strains on the production server | Complete | Add your comment |
3 | Modify the search page | Complete: systematically modified for accurate results. | Add your comment |
4 | Datasets download | Complete: waiting for new datasets to be released or published. | Add your comment |
5 | Datasets upload | Open for suggestions: mechanisms for incorporating community efforts. | Add your comment |
6 | Pipeline for cluster analysis | Complete | Add your comment |
7 | Uploading new genomes | Ongoing. | Add your comment |
8 | Annotation pipeline | Ongoing. | Add your comment |
9 | RNA-Seq pipeline | Ongoing. | Add your comment |
10 | Dynamic presentation of cluster network | Ongoing. | Add your comment |
Cluster Analysis
The current COG clustering pipeline is under review. New COG clusters are being generated on the internal development website and will soon be published on the public ProPortal website.
April 10, 2012
It's a public genome:
https://moore.jcvi.org/moore/SingleOrganism.do?speciesTag=PUH18301&pageAttr=pageMain
so it need not be in the private genomes. We should have all publicly available Prochlorococcus genomes in ProPortal... eventually!
April 6, 2012
A batch of new Prochlorococcus genomes will be loaded into the "private ProPortal" and their genes assigned to the COGs. That is our next highest priority.
March 1, 2012
Continue the RNA-Seq data analysis and the annotation of genomes.
February 28, 2012
Prepare DOE proposal.
February 27, 2012
Checked both single- and paired-end reads using the genome browser. No obvious problems were found.
February 23, 2012
Jen said she was finished with everything on the Linux box. The new Linux server should be installed on the new hard drive.
February 5, 2012
P-SSP5 instead of P-SSP3
An immediate change has been made to ProPortal: the podovirus genome identified as P-SSP3 needs to be renamed. This genome was apparently incorrectly named; it is P-SSP5 (also called 9515-10a, which agrees with the currently posted metadata on ProPortal). There is no lysate available for this phage, so perhaps we should also try to figure out if we want it to be on ProPortal at all.
http://proportal.mit.edu/genome/id=58/
We need to get the real P-SSP3 integrated into ProPortal as well. The CAMERA genome annotated as P-SSP3 appears to be correct (according to the cyanophage inventory Excel table, P-SSP3 is also named G2087, which agrees with the CAMERA directory name). However, while we have FASTA files for it, there is no GenBank file available for it on the CAMERA FTP site. CAMERA has been contacted to get the file:
ftp://ftp.camera.calit2.net/20100928_Prochloro_phage_P_SSP3_G2087/phage/annotation/
Solution:
P-SSP5 has replaced P-SSP3 in the data_project table.
According to the memo, the old P-SSP3 has been renamed to P-SSP5 in my ProPortal development database. However, the correct P-SSP3 from CAMERA hasn't been integrated into my DB yet. It would therefore help me a lot if you could find the gbk file for P-SSP3.
January 30, 2012
SSSM7 is a phage; the rest are Prochlorococcus. Should this be an orphan phage gene in the SSSM7 genome?
>PMED4_13831|3728
>P9303_01001|3728
>A9601_14421|3728
>SS120_13441|3728
>SSSM7_186|3728
To verify this problem, use the query:
SELECT *
FROM `ocean-dev`.`data_protein` A
LEFT JOIN data_scaffold B ON A.scaffold_id = B.id
LEFT JOIN data_project C ON C.id = B.project_id
WHERE cluster_id = 3728;
To find all hmmscan cases, use:
SELECT *
FROM `ocean-dev`.`data_protein` A
LEFT JOIN data_scaffold B ON A.scaffold_id = B.id
LEFT JOIN data_project C ON C.id = B.project_id
WHERE cluster_evi LIKE '%hmmscan'
ORDER BY cluster_id;
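Rather than checking one cluster at a time, the same joins could be used to list every cluster that mixes phage and Prochlorococcus members. The following is only a sketch: it runs against a toy in-memory SQLite database instead of the MySQL `ocean-dev` schema, and the `name`, `locus_tag`, and `is_phage` columns, as well as the rows for clusters 1 and 2, are hypothetical. Only the table names, join keys, and `cluster_id` are taken from the queries above.

```python
import sqlite3

# Toy schema: table names and join keys follow the queries above;
# "name", "locus_tag" and "is_phage" are hypothetical columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_project  (id INTEGER PRIMARY KEY, name TEXT, is_phage INTEGER);
CREATE TABLE data_scaffold (id INTEGER PRIMARY KEY, project_id INTEGER);
CREATE TABLE data_protein  (id INTEGER PRIMARY KEY, locus_tag TEXT,
                            scaffold_id INTEGER, cluster_id INTEGER);
INSERT INTO data_project  VALUES (1, 'MED4', 0), (2, 'MIT9303', 0),
                                 (3, 'S-SSM7', 1), (4, 'P-SSP7', 1);
INSERT INTO data_scaffold VALUES (10, 1), (20, 2), (30, 3), (40, 4);
INSERT INTO data_protein  VALUES
    (100, 'PMED4_13831', 10, 3728),
    (101, 'P9303_01001', 20, 3728),
    (102, 'SSSM7_186',   30, 3728),
    (103, 'host_only_a', 10, 1),   -- made-up rows for contrast
    (104, 'host_only_b', 20, 1),
    (105, 'phage_only',  40, 2);
""")

MIXED_CLUSTERS_SQL = """
SELECT A.cluster_id,
       SUM(C.is_phage)     AS phage_members,
       SUM(1 - C.is_phage) AS host_members
FROM data_protein A
LEFT JOIN data_scaffold B ON A.scaffold_id = B.id
LEFT JOIN data_project  C ON C.id = B.project_id
GROUP BY A.cluster_id
HAVING SUM(C.is_phage) > 0 AND SUM(1 - C.is_phage) > 0
ORDER BY A.cluster_id;
"""


def mixed_clusters(connection):
    """Return (cluster_id, phage_members, host_members) for mixed clusters."""
    return connection.execute(MIXED_CLUSTERS_SQL).fetchall()


for cluster_id, n_phage, n_host in mixed_clusters(conn):
    print(f"cluster {cluster_id}: {n_phage} phage, {n_host} host member(s)")
```

In this toy data only cluster 3728 is reported, since it is the only one containing both a phage protein and host proteins.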
January 24, 2012
Add the P-SSP3 genome to ProPortal and run the cluster pipeline again. Include the following genomes in the output:
76 P-GSP1
54 P-HP1
75 P-RSP2
55 P-RSP5
71 P-SSP10
72 P_SSP6_G2088
24 P_SSP7
57 P-SSP9_G2089
49 SYN5_gp01 [NC_009531]
56 P-SSP2
58 P-SSP5 (old P-SSP3): should it be removed from ProPortal?
Annotation Pipeline
February, 2012
Martiny, Chisholm, 2006, parallel genome comparison
Rodrigue et al., 2010, RNA-Seq pipeline
Annotating using RAST (Rapid Annotation using Subsystem Technology), Aziz et al., 2008, BMC Genomics
10 weiniger hill drive,
IMG-ACT
annotation collaboration toolkit
two main types: gene and pathway/structure
wiki-based
housed at DOE-JGI
localization
ORF
structure
enzymatic
duplication and degradation
horizontal gene transfer
RNA info
img-act.jgi-psf.org/usr/login
Dufresne et al., 2003, pathway graph
KEGG
MetaCyc
Pathway Tools version 15
filled pathway tools
Using BLAST to identify missing enzymes.
Pfam/domain analysis for motifs
October, 2011
Another annotation pipeline:
B2G4Pipe, the Blast2GO pipeline version, runs Blast2GO without a graphical interface.
For more information, refer to http://www.blast2go.com/b2glaunch/resources
September 30, 2011
Kat: Since Matt already offered his pipeline and it sounded like it has been continuously maintained and developed, it does sound like a good option. However, pay attention to how they train the gene calling program and what program(s) are used. The old method (described in the T4 paper) was dependent on a gene calling program, GeneMark. I think Matt's pipeline's improvement was mostly on the start sites... But it's perhaps not that critical to get the start sites right depending on the focus of your project.
The general idea of a pipeline is simple if you'd rather build one yourself:
1. Evaluate the gene calling programs and figure out the best way to train the programs for phage genomes.
2. Combine the results into a final set.
3. Filter false positives. For Prochlorococcus genomes, I filter the short orphan gene models (< 50 aa without any homologs in sequenced genomes).
Step 1 has to be a continuous effort and is the most time-consuming: new programs and better algorithms keep being developed, so any annotation pipeline requires constant maintenance and re-evaluation.
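Steps 2 and 3 above could be sketched roughly as follows. This is a minimal illustration, not Kat's or Matt's actual pipeline. It assumes that predictions from different gene callers that share a stop codon describe the same gene, and that the longest model is kept (a design choice made here, not something stated above). The `GeneModel` class and its `has_homolog` flag are hypothetical; in practice the flag would come from a homology search against sequenced genomes.

```python
from dataclasses import dataclass, replace

MIN_ORPHAN_LENGTH_AA = 50  # threshold quoted in step 3 above


@dataclass(frozen=True)
class GeneModel:
    contig: str
    start: int                 # 1-based, inclusive
    end: int                   # 1-based, inclusive
    strand: str                # "+" or "-"
    has_homolog: bool = False  # hypothetical flag from a homology search

    @property
    def length_aa(self) -> int:
        # Number of codons minus the stop codon.
        return (self.end - self.start + 1) // 3 - 1


def combine(*call_sets):
    """Step 2: merge predictions; models sharing a stop codon are one gene."""
    merged = {}
    for calls in call_sets:
        for gene in calls:
            stop = gene.end if gene.strand == "+" else gene.start
            key = (gene.contig, gene.strand, stop)
            best = merged.get(key)
            if best is None:
                merged[key] = gene
            else:
                # Keep the longer model, but remember any homolog evidence.
                longer = gene if gene.length_aa > best.length_aa else best
                merged[key] = replace(
                    longer, has_homolog=gene.has_homolog or best.has_homolog
                )
    return sorted(merged.values(), key=lambda g: (g.contig, g.start))


def filter_short_orphans(genes, min_aa=MIN_ORPHAN_LENGTH_AA):
    """Step 3: drop models shorter than min_aa that have no homolog."""
    return [g for g in genes if g.has_homolog or g.length_aa >= min_aa]


# Made-up predictions from two hypothetical gene callers.
caller_a = [GeneModel("c1", 1, 300, "+"), GeneModel("c1", 400, 480, "+")]
caller_b = [GeneModel("c1", 31, 300, "+"),
            GeneModel("c1", 601, 690, "-", has_homolog=True)]

final_set = filter_short_orphans(combine(caller_a, caller_b))
for gene in final_set:
    print(gene.contig, gene.start, gene.end, gene.strand, gene.length_aa)
```

Here the two models ending at position 300 collapse into the longer one, the 26 aa model without homologs is filtered out, and the 29 aa model is kept because it has a homolog.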
September 21, 2011
Simon: I met Matt Henn last Friday and we talked about the phage annotation pipeline. We can send them our sequences for annotation, but both of us would prefer to have the pipeline independent. The problem is that there are in-house dependencies linked to the annotation pipeline, so to make it public we would need to remove or move these. Matt estimates that it could take 3-4 months of work for one person.
September 9, 2011
katya: Would you guys be available next week to discuss setting up a pipeline for reannotating some of our newer phages, e.g. the strange new siphos, which were pitifully annotated by the Broad pipeline? (I'd also like to revisit a couple of the myos that were annotated by Matt's group once we have a pipeline we're happy with in place.)
Data Download
September 30, 2011
We should add the iron microarray data since it is published. The Supp Info of the paper does not include the entire microarray dataset, only the differentially expressed genes in MED4/MIT9313.
Here's the data as log2 fold change. The 70 (and 72) hour time points come after an iron rescue of the experimental (-Fe) treatment.
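For anyone reading the posted values, a log2 fold change is the base-2 logarithm of the ratio between the treatment and control signals, so +1 means a doubling and -1 a halving. A small sketch, using made-up intensities rather than values from the iron dataset:

```python
import math


def log2_fold_change(treatment: float, control: float) -> float:
    """Return log2(treatment / control) for two positive expression values."""
    if treatment <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(treatment / control)


# Hypothetical intensities (treatment = -Fe, control = +Fe).
print(log2_fold_change(400.0, 100.0))  # 2.0: four-fold higher in -Fe
print(log2_fold_change(50.0, 100.0))   # -1.0: two-fold lower in -Fe
```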
September 23, 2011
The data posted for the different papers should look much more professional, or be taken down. The names of the files are hokey and not transparent, for one thing... (that would be easy to fix).
More importantly, the spreadsheets for the temp and light data have those messy graphs on them. We should delete the graphs. There is also no annotation on the spreadsheets, so they would not be useful to anyone, and they don't have units. And they have too many significant figures. Just not ready for the public eye. Just too "raw" to have out there for the whole world to see.
The data we have under the different publications: http://proportal.mit.edu/download/ We should probably take some of it down for now until we can figure out how to clean it up. We should discuss this in the next lab meeting.
Data Upload
A number of new strains should be uploaded into the DB. Refer to the Strain Discussion for more detail.
Broken Links
September 30, 2011
For instance: from our UI, we can query a specific Pro/Syn/phage read, and see which genome it is recruited to and what gene(s) it overlaps with: http://proportal.mit.edu/gosread/JCVI_READ_1105499780090/
But, strangely, the FASTA report isn't shown correctly.
Non-coding RNAs
February 3, 2012
150 ncRNAs for prokaryotic genomes
asRNAs: antisense RNAs, RNA degradation, translation inhibition, mRNA stabilization
Is the TATA-box of asRNAs conserved? Not conserved!
Transcription of ORFs is conserved.
Transcription of asRNAs is not conserved. Why?
Which ncRNAs are functional?
Current sequence analysis is good for gene comparison but does not account for differences in regulation (transcriptome, ncRNAs).
The fact is:
the transcription of ORFs is much more conserved than that of asRNAs;
the majority of ORFs have conserved TSS?!
lack of conservation is mostly due to differences in promoter sequence
Comparing transcriptomes is important.