Survey of metagenomic data
Survey of CAMERA Metagenomic Data
Joshua Ladau (with help from Samantha Riesenfeld)
A preliminary step of the iSEEM project is to use CAMERA to assemble information about existing metagenomic data. We compiled information on the types of metadata that are available for each project on CAMERA by consulting (i) individual metadata files for each project (on each project's webpage) and (ii) the listing of metadata on the CAMERA Project Samples webpage. We also assembled information on sequence data for each project by consulting files available for download on each project's webpage, published papers for each project, and the File Server Download page.
Our findings are summarized in the Metadata and Sequencing Tables below.
The metadata in the individual project files and those compiled on the Project Samples page generally agree, although in a few cases metadata are available in only one location. The available types of metadata fall into two broad categories: attributes of samples and attributes of the locations from which samples were collected. In the former category, most projects provide date, location, and sample size information. In some cases, the projects also include information on the sampling procedure used (e.g., filter size), the number of organisms sampled (e.g., bacterial abundance), and the volume of substrate sampled. With regard to the location attributes, almost all studies provide information on the temperature, habitat type, and depth of samples. Other attributes of the abiotic and biotic environment, such as major ion concentrations and biomass estimates, are also often included. See the Metadata Table above.
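The comparison between the two metadata sources can be illustrated with a minimal sketch. It is not the procedure we used (the survey was compiled by consulting the pages directly), and the project names and field names below are hypothetical placeholders rather than actual CAMERA entries. The function name `compare_metadata_sources` is likewise an assumption made for illustration. For each project, it separates the metadata fields found in both sources from those found in only one.

```python
from typing import Dict, Set


def compare_metadata_sources(
    project_files: Dict[str, Set[str]],
    samples_page: Dict[str, Set[str]],
) -> Dict[str, Dict[str, Set[str]]]:
    """For each project, split metadata fields by the source(s) listing them."""
    report = {}
    for project in sorted(set(project_files) | set(samples_page)):
        in_files = project_files.get(project, set())
        in_page = samples_page.get(project, set())
        report[project] = {
            "both": in_files & in_page,        # listed in both sources
            "files_only": in_files - in_page,  # only in the project's own file
            "page_only": in_page - in_files,   # only on the Project Samples page
        }
    return report


# Hypothetical field listings, invented for illustration only.
project_files = {
    "Project A": {"date", "location", "temperature", "filter_size"},
    "Project B": {"date", "location", "depth"},
}
samples_page = {
    "Project A": {"date", "location", "temperature"},
    "Project B": {"date", "location", "depth", "habitat_type"},
}

report = compare_metadata_sources(project_files, samples_page)
for project, fields in report.items():
    print(
        project,
        "| files only:", sorted(fields["files_only"]),
        "| page only:", sorted(fields["page_only"]),
    )
```

Using plain sets keeps the comparison independent of how the metadata were obtained. The same function would apply whether the field names were transcribed by hand or parsed from downloaded files.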
With respect to the sequence data, all projects, except the Ocean Viruses and Moore Microbial Sequencing projects, make available raw reads, amino acid sequences, open reading frames, and ribosomal sequences. Six projects also provide assemblies. See the Sequencing Table above.
Conclusions and Recommendations
Overall, our experience working with CAMERA was positive. In general, the website is clearly designed, reliable, and fast. However, we did identify some ways in which the website could be improved. First, the File Server Download page could be better organized, with data sets perhaps grouped by project or data type (e.g., nucleotide sequences or amino acid sequences). Given the size of many of the files, it would also be very helpful to have direct access to the database or command-line access to the directories that contain the flat files. Second, while many files are accompanied by helpful documentation, additional documentation would at times be useful. For instance, it is unclear whether the ORF files available for different projects are equivalent.
We believe that the metadata and sequence data available on CAMERA will be useful in the development and application of new approaches to metagenomic data analysis. Cataloguing the available data is an important initial step towards these goals.