Proportal: Difference between revisions

From OpenWetWare
 
(40 intermediate revisions by the same user not shown)
Line 1: Line 1:
# [http://proportal.mit.edu/ '''ProPortal'''] Prochlorococcus Portal is a web analytical tool for Prochlorococcus, a model system for Integrative Systems Biology.
{{BioMicroCenter}}
 
[http://proportal.mit.edu/ '''ProPortal'''] (Prochlorococcus Portal) is a web-based analysis tool for Prochlorococcus, a model system for integrative systems biology.
 
The information provided here is intended for user contributions and, especially, collaboration. Although lab members may have access to their internal secure wikis, everyone benefits from sharing information and maintaining a collaborative environment for sustainable development.
 
=References=
 
*[http://cyano2.mit.edu/mediawiki/index.php?title=Main_Page '''Proportal Wiki''']
*[https://wikis.mit.edu/confluence/display/CHISHOLMLAB/Bioinformatics '''Chisholm Lab Bioinformatics Wiki''']: Login is required for access.
*[http://www.broadinstitute.org/annotation/viral/Phage/Home.html '''Marine Phage, Viruses and Viromes''']: The Gordon & Betty Moore Foundation's Marine Microbiology Initiative (MMI) at the Broad Institute.
 
=Project Management=
For the project management in previous years, refer to [[Proportal_Project_2011 | Year 2011]].


=Project Timelines=
{| class="wikitable" style="width:90%" border="1"
{| class="wikitable" style="width:90%" border="1"
|+Schedule
|+Schedule
Line 7: Line 19:
! Timelines !! Plan !! Status !! Comments
! Timelines !! Plan !! Status !! Comments
|-
|-
! Jul 22
! August 31, 2012
| Finish the modification of Proportal DB Schema. Re-create all tables in a new database with all the missing foreign keys and their cascade properties. || The new database schema has been created by adding all missing foreign keys in the current version. A number of errors are to be fixed in next week to synchronize the datasets stored in the development and production databases. || Add your comments
| Paper for RNA-seq data analysis || Finalize the data analysis.
|| Add your comments
|-
! August 31, 2012
| Proportal || Single cell genomes, links for the RNA-seq and microarray data are to be added to Proportal
|| Add your comments
|-
! July 31, 2012
| COG pipeline || 96 partial genomes from BATS and 19 from HOT (115 in total) are to be processed.
|| Add your comments
|-
! June 31, 2012
| Paper for overlapped genes in both host and phage genomes || Review the manuscript and prepare the latest version of COG clusters with overlapped genes.
|| 33 new genomes have been processed using the modified COG pipeline. The paper will be based on the final version of COG clusters for shared phage/host genes.
|-
! June 31, 2012
| assemblies for HL III/IV single cell || Access biome.mit.edu and add them to the ProPortal.
|| GenBank files have been prepared for the submission of Rex's paper.
|-
! June 15, 2012
| COG cluster pipeline. || Modify the script for uploading genomes with multiple contigs, which was missing in the current pipeline.
|| The bug in the COG pipeline has been fixed for processing genomes with multiple contigs and undefined unique protein identifiers.  
|-
|-
! Jul 29
! May 31, 2012
| Finish the update of datasets in Proportal public website. Migrate/merge all datasets stored in the development and production databases. Verify the integrity and consistency between the development and production databases and websites. || Ongoing || Add your comments
| RNA-Seq processing and analysis pipeline. || Verify RNA-Seq processing pipeline. Prepare genome annotations. Run downstream analysis on phage infection RNA-Seq data.
|| The RNA-Seq pipeline is working as expected using BWA/DESeq/GSEA. Genome functional annotations have been investigated. A better set of parameters in Cufflinks is to be determined for Prochlorococcus genome. 
|-
|-
! Aug 5
! February 28, 2012
| Finish the modification of cluster analysis using static/manual processing. || Ongoing || Add your comments
| Run COG cluster pipeline on new genomes. || The COG cluster pipeline is applied to both host and phage genomes.
|| The current version of the COG scripts does not work well for genomes that have multiple contigs or whose unique protein identifiers are not defined yet.
|-
|-
! Aug 15
| Due date for submitting the database paper || TBD || Add your comments
|}
|}
=To-do List=
For details of the current issues that have been identified, refer to the [[Proportal_ToDoList | To-do List]].


=Proportal DB Schema=
=Proportal DB Schema=
[[Image:Ocean-DBschema-v2.gif|alt text]]
The new [[Proportal_DBSchema | Proportal DB Schema]] was created by adding all the missing foreign keys back into the database. Some orphan records were identified in the process and will be fixed or removed in the next release of Proportal.
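As a rough illustration of what a foreign key with a cascade property provides, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the actual Proportal database engine, which is not stated on this page. The column name scaffold_id in data_position and the toy rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE data_scaffold (id INTEGER PRIMARY KEY);
    CREATE TABLE data_position (
        id INTEGER PRIMARY KEY,
        -- scaffold_id is an assumed column name for the link to data_scaffold.id
        scaffold_id INTEGER NOT NULL
            REFERENCES data_scaffold(id) ON DELETE CASCADE
    );
    INSERT INTO data_scaffold (id) VALUES (1), (2);
    INSERT INTO data_position (id, scaffold_id) VALUES (100, 1), (101, 1), (102, 2);
""")

# A row that points at a scaffold that does not exist is refused.
try:
    conn.execute("INSERT INTO data_position (id, scaffold_id) VALUES (103, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting a parent row removes its child rows through the cascade.
conn.execute("DELETE FROM data_scaffold WHERE id = 1")
remaining = [row[0] for row in conn.execute("SELECT id FROM data_position ORDER BY id")]

print(rejected, remaining)
```

With the constraint declared, a position that refers to a non-existent scaffold is rejected at insert time, and deleting a scaffold also deletes its positions, so new orphan records cannot accumulate.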
==User Module==
==Project Module==
===Table: data_project===
A list of projects
 
    * 72 projects, as of 07-21-2011 (To be updated: 58 in production DB)
    * Last updated: 2010-12-10
    * No foreign key
Notes
 
The following distinct "type" values could be moved into a separate table for a clearer definition:
    * cpm, Cyanophage genomes part 1 (<B>To be updated: 18 records in PRO DB, 28 in DEV DB</B>)
    * cpp, Cyanophage genomes part 2 (<B>To be updated: 8 records in PRO DB, 11 in DEV DB</B>)
    * cps, Cyanophage genomes part 3 (2 records in both DBs)
    * ma, physiology experiments (4 records in both DBs): Light Sensing, Nitrogen Availability, Phage Infection, and Phosphate Starvation
    * mt, expression experiment (1 record in both DBs): Microbial community gene expression in ocean surface waters
    * p, Prochlorococcus genomes (13 genomes in both DBs)
    * pb, Prochlorococcus Publications (1 record in both DBs)
    * s, Synechococcus genomes (11 genomes in both DBs)
 
The link for "tax_id" is defined in data_url table. 
    * type_id = 59919
    * source = tax
    * url = http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=59919
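The example row above suggests that the link is the NCBI Taxonomy Browser address with the type_id appended as the id parameter. A minimal sketch under that assumption; the helper name taxonomy_url is hypothetical and not part of Proportal.

```python
from urllib.parse import urlencode

# Base address copied from the data_url example row.
NCBI_TAX_BASE = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi"


def taxonomy_url(tax_id: int) -> str:
    """Build a Taxonomy Browser link for a tax_id (hypothetical helper)."""
    return f"{NCBI_TAX_BASE}?{urlencode({'id': tax_id})}"


print(taxonomy_url(59919))
```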
 
===Table: data_projectpub===
A list of publications from various projects.
 
    * 32 publications as of 07-21-2011
    * Foreign keys:
          o project_id: = data_project.id
          o pubmed_id: = data_publication.id
 
Notes
 
This table is used for mapping projects with related publications, which is displayed on the strain page in the "Genomes" section, for instance: [http://proportal.mit.edu/genome/id=1/ MED4].
 
===Table: data_genepub===
This table is empty. Consider using the data_publication table instead?
 
    * Foreign key: data_project
 
===Table: data_publication===
A list of publications related to Prochlorococcus, Cyanophage, and Synechococcus.
 
    * 2528 publications listed as of 08-1-2011 (both DEV and PRO DBs are updated)
    * Not referenced by any other table
    * pubmed_id can be used as a foreign key.
    * "year": last updated 2010
 
===Table: data_url_map===
This table is empty.
===Table: data_url===
The list of data links or data folders.
==Meta Data Module==
===Table: data_bats_ts===
Information about field investigation.
 
    * No foreign key
 
===Table: data_meta_data===
Information about field investigation for each project
 
    * 66 meta data sets, as of 07-22-2011
    * Foreign key: data_project.id, to be fixed.
<B>Error</B>
    * One project_id (26) is missing from the data_project table.
    * Six projects defined in the data_project table (all in year 2008) have no meta data defined in this table.
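Both kinds of error listed above can be found with an anti-join in each direction. The following is a minimal sketch using Python's sqlite3 module and toy rows; the database engine, the extra name column, and the sample ids other than 26 are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_project (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE data_meta_data (id INTEGER PRIMARY KEY, project_id INTEGER);
    -- Toy rows only: project 26 is referenced but never defined,
    -- and project 3 has no meta data.
    INSERT INTO data_project (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO data_meta_data (id, project_id) VALUES (10, 1), (11, 2), (12, 26);
""")

# Meta data rows whose project_id has no parent row in data_project.
orphans = conn.execute("""
    SELECT m.id, m.project_id
    FROM data_meta_data AS m
    LEFT JOIN data_project AS p ON p.id = m.project_id
    WHERE p.id IS NULL
""").fetchall()

# Projects that no meta data row refers to.
uncovered = conn.execute("""
    SELECT p.id
    FROM data_project AS p
    LEFT JOIN data_meta_data AS m ON m.project_id = p.id
    WHERE m.id IS NULL
""").fetchall()

print("orphan meta data rows:", orphans)
print("projects without meta data:", uncovered)
```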
 
==Genome Data Module==
===Table: data_scaffold===
A list of strains/genomes used in various projects.
 
    * Last updated: 12-10-2010
    * 213 strains, as of 07-22-2011
    * Foreign key: data_project.id
Questions
    * "refseg_id" not defined
    * "seq" field can be removed because its content is further defined in data_dna and data_protein tables.
 
===Table: data_position===
List of start and end positions of gene/DNA for each strain defined in Table data_scaffold.
 
    * 67516 pairs of positions, as of 07-22-2011
    * 9 types of sequences are defined: 16s, 23s, 5s, as, m, n, orf, ps, t
    * Foreign key: data_scaffold.id.
 
===Table: data_dna===
A list of DNA sequences corresponding to the sequence position information defined in the data_position table.
 
    * 67516 pieces of DNA sequences stored, as of 07-22-2011
    * Three foreign keys: data_position.id, data_scaffold.id and data_protein.id
Error
    * Foreign key pos_id has errors:
          o Two position ids in the data_position table (37163 and 46814) are missing from this table.
          o Two pos_id values (36978 and 37113) do not exist in the data_position table.
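Once the two id columns are loaded, a two-way mismatch of this kind can be checked with plain set differences. A minimal sketch; the function name fk_mismatches is hypothetical, and the short id lists are toy data arranged to reproduce the reported pattern, not an export from the database.

```python
def fk_mismatches(parent_ids, child_fk_values):
    """Return (parent ids never referenced, child values with no parent)."""
    parents = set(parent_ids)
    children = set(child_fk_values)
    return sorted(parents - children), sorted(children - parents)


# Toy stand-ins for data_position.id and data_dna.pos_id.
position_ids = [1, 2, 3, 37163, 46814]
dna_pos_ids = [1, 2, 3, 36978, 37113]

unreferenced, dangling = fk_mismatches(position_ids, dna_pos_ids)
print("positions without a DNA record:", unreferenced)
print("pos_id values without a position:", dangling)
```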
 
===Table: data_protein===
A list of protein sequences.
 
    * 65909 proteins defined, as of 07-22-2011 (1607 DNA sequences are not present in this table)
    * Two foreign keys: data_scaffold.id and data_protein.id
Notes
    * "cluster_id" should be removed from this table
 
===Table: data_ortholog===
Protein orthologs.
 
    * 830944 orthology pairs defined, as of 07-22-2011
    * Foreign keys: protein_id and ortholog_id
 
===Table: data_protein_xref===
Definition: ?
    * 36774 records stored, as of 07-22-2011
    * Foreign key: data_protein.id, to be fixed
Error
    * Two records refer to protein_id values (36950 and 45482) that are missing from the data_protein table.
 
==Affychip Expression Module==
===Table: data_affychip===
Information about each affychip used.
 
    * 1 chip defined, as of 07-22-2011
    * No foreign key
 
===Table: data_affyexp===
A list of affychip experiments.
 
    * 20 affychip experiments, as of 07-22-2011
    * Foreign key: project_id, only three projects involved affychip experiments.
 
===Table: data_affyprobeset===
A list of probe sets for various affychip experiments.
 
    * 9966 records, as of 07-22-2011
    * Three foreign keys:
          o chip_id:
          o scaffold_id: has missing keys
          o feature_id: not defined
Notes
    * feature_id not defined
    * Use "begin" and "end" to match DNA\gene\protein?
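If "begin" and "end" were used for matching, one option is a simple interval-overlap test against the start and end positions in data_position. This is only a sketch of that idea: it assumes closed intervals, a shared coordinate system, and a scaffold_id on both sides. The Interval type and the function names are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    id: int
    scaffold_id: int
    begin: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals lie on the same scaffold and share a base."""
    return a.scaffold_id == b.scaffold_id and a.begin <= b.end and b.begin <= a.end


def match_probesets(probesets, positions):
    """Map each probe set id to the ids of the positions it overlaps."""
    return {
        ps.id: [pos.id for pos in positions if overlaps(ps, pos)]
        for ps in probesets
    }


# Toy coordinates, for illustration only.
probesets = [Interval(1, 1, 100, 200), Interval(2, 1, 500, 600), Interval(3, 2, 100, 200)]
positions = [Interval(10, 1, 150, 400), Interval(11, 1, 201, 450), Interval(12, 2, 50, 99)]

matches = match_probesets(probesets, positions)
print(matches)
```

The nested scan is quadratic, which is acceptable for a sketch; sorting by scaffold and start position would be the usual refinement for tables of this size.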
 
===Table: data_affyprobe===
A list of probes for various affychip experiments.
    * 89749 records, as of 07-22-2011
    * Foreign key: probeset_id
 
===Table: data_affydata===
The expression results of Affychip experiments.
 
    * 110848 records, as of 07-22-2011
    * Foreign keys,
          o exp_id
          o probeset_id
Notes
    * No DNA\gene\protein info, use probeset_id?
 
===Table: data_diel===
The results of Affychip time course experiments.
 
    * 1695 records, as of 07-22-2011
    * Foreign keys,
          o probeset_id
          o protein_id
          o gene_id: not defined
Notes
    * gene_id not defined
 
===Table: data_dieltimepoint===
Time courses of Affychip experiments.
 
    * 42375 records, as of 07-22-2011
    * Foreign key: diel_id
 
==Cog Module==
===Table: data_cog_fun===
A list of Cog gene functions.
 
    * 24 function categories, as of 07-22-2011
    * No foreign key
 
===Table: data_cog===
A list of Cog genome annotations
 
    * 4874 records, as of 07-22-2011
    * Foreign key: data_cog_fun.funcode ?
Notes
    * data_cog_fun.funcode can't be regarded as a foreign key because some of the funcodes in this table are missing from the data_cog_fun table.
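To see which funcodes block the constraint and how many rows each one affects, a NOT EXISTS query grouped by funcode is one option. A minimal sketch with Python's sqlite3 module; the funcode column in data_cog, the name column, and every value below are toy assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_cog_fun (funcode TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE data_cog (id INTEGER PRIMARY KEY, funcode TEXT);
    -- Toy values only; 'ZZ' and 'YY' stand for codes absent from data_cog_fun.
    INSERT INTO data_cog_fun (funcode, name) VALUES ('A', 'toy A'), ('B', 'toy B');
    INSERT INTO data_cog (id, funcode) VALUES
        (1, 'A'), (2, 'B'), (3, 'ZZ'), (4, 'ZZ'), (5, 'YY');
""")

# Funcodes used in data_cog that have no matching row in data_cog_fun,
# with the number of data_cog rows that use each one.
missing = conn.execute("""
    SELECT c.funcode, COUNT(*) AS n
    FROM data_cog AS c
    WHERE NOT EXISTS (
        SELECT 1 FROM data_cog_fun AS f WHERE f.funcode = c.funcode
    )
    GROUP BY c.funcode
    ORDER BY n DESC, c.funcode
""").fetchall()

print(missing)
```

NOT EXISTS is used rather than NOT IN so that a NULL funcode in data_cog_fun would not silently empty the result.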
 
===Table: data_protein_cog===
The mapping between Cog genome and proteins.
 
    * 18498 records, as of 07-22-2011
    * Foreign keys: data_protein.id and data_cog.id
 
==Microarray Module==
===Table: data_gos_site===
A list of Gos field experiments, such as sites of experiments etc.
 
    * 78 records, as of 07-22-2011
    * No foreign key
 
===Table: data_gos_read===
A list of field reads for various Gos experiments.
 
    * 9893120 records, as of 07-22-2011
    * Foreign key: data_gos_site.id, no error
 
===Table: data_gos_to_protein===
The mapping between Gos genomes and proteins.
 
    * 926072 records, as of 07-22-2011
    * Foreign keys:
          o data_protein.id, has error, to be fixed
          o data_gos_read.id, has error, to be fixed
Error
 
    * The foreign key: read_id=0 is not defined in data_gos_read table for id=1 and id=705172 in this table
    * The foreign key: protein_id=0 is not defined in data_protein table for id=1 and id=705172 in this table
 
===Table: data_gos_blastn===
A list of sequences from Gos experiments.
 
    * 8666847 records, as of 07-22-2011
    * Foreign keys:
          o data_scaffold.id, has error, to be fixed
          o data_gos_read.id, has error, to be fixed
Error
 
    * The foreign key: scaffold_id=0 is not defined in data_scaffold table for 211 records in this table
    * The foreign key: read_id=0 is not defined in data_gos_read table for 56438 records in this table
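The errors above all involve the value 0, which looks like a placeholder for "no match" rather than a real id. One possible remediation, offered here as an assumption and not as the plan recorded on this page, is to count the zero keys and convert them to NULL so that a foreign key can then be declared. A minimal sketch with Python's sqlite3 module and toy rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_gos_blastn (
        id INTEGER PRIMARY KEY,
        scaffold_id INTEGER,
        read_id INTEGER
    );
    -- Toy rows only: 0 plays the role of the undefined key.
    INSERT INTO data_gos_blastn (id, scaffold_id, read_id) VALUES
        (1, 5, 7), (2, 0, 8), (3, 6, 0), (4, 0, 0);
""")


def count_zero_keys(connection):
    """Count rows whose scaffold_id or read_id is the placeholder 0."""
    row = connection.execute(
        "SELECT SUM(scaffold_id = 0), SUM(read_id = 0) FROM data_gos_blastn"
    ).fetchone()
    return {"scaffold_id": row[0], "read_id": row[1]}


before = count_zero_keys(conn)

# Replace the placeholder with NULL, which a foreign key constraint accepts.
conn.execute("UPDATE data_gos_blastn SET scaffold_id = NULL WHERE scaffold_id = 0")
conn.execute("UPDATE data_gos_blastn SET read_id = NULL WHERE read_id = 0")

after = count_zero_keys(conn)
print(before, after)
```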
 
==Cluster Module==
===Table: data_protein_cluster===
A list of protein clusters.
 
    * 5597 records, as of 07-22-2011
    * No foreign key
Notes
    * Two distinct "type": phCOG and CyCog
    * "gene_name" not in use
 
===Table: data_protein_cluster_synonym===
The table is empty.
===Table: data_protein_cluster_xref===


    * 1100 records, as of 07-22-2011
=Release Notes=
    * Foreign key: data_protein_cluster.id, has error, to be fixed
Some issues in the current version of Proportal have been resolved. Refer to the [[Proportal_ReleaseNotes |Release Notes]] for more details.
Notes
    * Only one "type": c
    * "xref": COG reference id, which may correspond to multiple cluster ids
Error
    * The foreign key: some cluster_ids are not defined in data_protein_cluster table for about 880 records.


===Table: data_protein_cluster_cog===
=Strain Discussion=
This table is empty.
This section is devoted to the discussion on issues related to [[Proportal_Strains | various strains]] stored in Proportal DB.
===Table: data_clusterlink===
A list of pairs of clusters.


    * 71 records, as of 07-22-2011
=Frequently Asked Questions=
    * Foreign key: data_protein_cluster.id, has error, to be fixed
This section provides answers to some [[Proportal_FAQs | frequently asked questions]].
Notes
    * "evidence" is not in use
Error
    * The foreign key: cluster_id=0 is not defined in data_protein_cluster table.

Latest revision as of 08:09, 9 July 2012


ProPortal (Prochlorococcus Portal) is a web-based analysis tool for Prochlorococcus, a model system for integrative systems biology.

The information provided here is intended for user contributions and, especially, collaboration. Although lab members may have access to their internal secure wikis, everyone benefits from sharing information and maintaining a collaborative environment for sustainable development.

References

Project Management

For the project management in previous years, refer to Year 2011.

{| class="wikitable" style="width:90%" border="1"
|+Schedule
! Timelines !! Plan !! Status !! Comments
|-
! August 31, 2012
| Paper for RNA-seq data analysis || Finalize the data analysis. || Add your comments
|-
! August 31, 2012
| Proportal || Single cell genomes, links for the RNA-seq and microarray data are to be added to Proportal || Add your comments
|-
! July 31, 2012
| COG pipeline || 96 partial genomes from BATS and 19 from HOT (115 in total) are to be processed. || Add your comments
|-
! June 31, 2012
| Paper for overlapped genes in both host and phage genomes || Review the manuscript and prepare the latest version of COG clusters with overlapped genes. || 33 new genomes have been processed using the modified COG pipeline. The paper will be based on the final version of COG clusters for shared phage/host genes.
|-
! June 31, 2012
| assemblies for HL III/IV single cell || Access biome.mit.edu and add them to the ProPortal. || GenBank files have been prepared for the submission of Rex's paper.
|-
! June 15, 2012
| COG cluster pipeline. || Modify the script for uploading genomes with multiple contigs, which was missing in the current pipeline. || The bug in the COG pipeline has been fixed for processing genomes with multiple contigs and undefined unique protein identifiers.
|-
! May 31, 2012
| RNA-Seq processing and analysis pipeline. || Verify RNA-Seq processing pipeline. Prepare genome annotations. Run downstream analysis on phage infection RNA-Seq data. || The RNA-Seq pipeline is working as expected using BWA/DESeq/GSEA. Genome functional annotations have been investigated. A better set of parameters in Cufflinks is to be determined for Prochlorococcus genome.
|-
! February 28, 2012
| Run COG cluster pipeline on new genomes. || The COG cluster pipeline is applied to both host and phage genomes. || The current version of the COG scripts does not work well for genomes that have multiple contigs or whose unique protein identifiers are not defined yet.
|}

To-do List

For details of the current issues that have been identified, refer to the To-do List.

Proportal DB Schema

The new Proportal DB Schema was created by adding all the missing foreign keys back into the database. Some orphan records were identified in the process and will be fixed or removed in the next release of Proportal.

Release Notes

Some issues in the current version of Proportal have been resolved. Refer to the Release Notes for more details.

Strain Discussion

This section is devoted to the discussion on issues related to various strains stored in Proportal DB.

Frequently Asked Questions

This section provides answers to some frequently asked questions.