GOS Range Map

From OpenWetWare

Skype Meeting Notes



Jospin Kembel Koeppel Ladau Sharpton

Progress on Tasks and Problems

How do we resolve discrepancies between the sample scores that Steve, Alex and Tom assigned? We will use the PubMed IDs that Guillaume picked up to back-check the records scored 2. To resolve, majority rules where appropriate, and we default to 2 where there is no majority. How do we turn all of the 2s into 1s or 3s? How much time do we want to sink into this? It is easy enough to look up the GenBank record for all 2s with PubMed IDs and use the reference manuscript to try to salvage them. For those that are majority 2s but with a single 3 vote and no PubMed ID, try a literature search to see if we can find a manuscript. We will divide all of the 2s between Alex, Steve and Tom and try to salvage as many as possible. Steve will randomly divvy up the load.
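The resolution rule above can be sketched in a few lines of Python. This is only an illustration of "majority rules, default to 2", plus a random split of the remaining 2s among the three scorers; the sample IDs and function names are hypothetical and not taken from our actual spreadsheet.

```python
import random
from collections import Counter


def consensus_score(scores, default=2):
    """Return the score held by a strict majority of scorers, else `default`."""
    score, votes = Counter(scores).most_common(1)[0]
    return score if votes > len(scores) / 2 else default


def divide_randomly(items, people, seed=0):
    """Shuffle the items and deal them out round-robin to the given people."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return {person: shuffled[i::len(people)] for i, person in enumerate(people)}


# Hypothetical sample IDs with the three independent scores.
votes = {"sample_A": (3, 3, 2), "sample_B": (1, 2, 3), "sample_C": (3, 2, 2)}
consensus = {sample: consensus_score(s) for sample, s in votes.items()}

# Records that end up as 2 are the ones to salvage by hand.
to_salvage = [sample for sample, score in consensus.items() if score == 2]
workload = divide_randomly(to_salvage, ["Alex", "Steve", "Tom"])
```

Here sample_B has no majority and so falls back to 2, which puts it in the salvage pile together with sample_C.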

Only 52 records where we all put 3, about another 10 where the scores were 3,3,2, and another 50 or so where they were 3,2,2. In total, about 100 records that we might safely say we can keep.

Additional Samples

Josh: Spent quite a bit of time figuring out the ICOMM database. Passed two links around via email (http://icomm.mbl.edu/microbis/overview/ and http://vamps.mbl.edu/). MICROBIS contains the reads that have come out of ICOMM and the metadata associated with the samples. About 640 samples, many from the same location. VAMPS is the more informative database and provides a classification tool. Josh took the ICOMM reads, identified the taxonomic group, and binned sequences by taxonomic Order. Compared the ICOMM locations to the MegX locations (built a map, hopefully Josh will upload). It looks like the datasets are fairly complementary. Will have to ensure that inland seas are removed from the ICOMM dataset. Josh also made an ordinal richness estimate from the MICROBIS data. He did niche modeling with just those. He finds the image rather suspicious: relatively little diversity around the equator relative to the mid-temperate regions of the globe. Adding the MegX/GOS data may help here.

Steve has used RDP to classify the MegX data, may need to standardize classification method.

Need to integrate the GOS data as well. Josh is still looking into the environmental layers.

Simulation idea from Josh

Use macroorganism data, predict, given the data we have, how well it predicts for organisms with known distributions (cross validation). Steve suggests sub sampling from various latitudinal bands to see how that is driving the pattern of diversity. Josh mentions that macroorganism data is more dense so there is more information about distribution. Might help test the effect of sampling limitation. Concerns about how meaningful this type of analysis would be, but difficult to find a better method at the moment.
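Steve's sub-sampling suggestion could look something like the sketch below: group samples into latitudinal bands and draw the same number from each band, so that densely sampled latitudes do not dominate. The band width, the per-band sample count and the sample IDs are assumptions made for illustration only.

```python
import math
import random
from collections import defaultdict


def latitude_band(lat, band_width):
    """Index of the latitudinal band (counted from the South Pole) containing `lat`."""
    n_bands = math.ceil(180.0 / band_width)
    # Clamp so that latitude +90 falls in the last band instead of a band of its own.
    return min(int((lat + 90.0) // band_width), n_bands - 1)


def subsample_by_band(samples, band_width=30.0, per_band=2, seed=0):
    """Draw up to `per_band` sample IDs at random from each latitudinal band.

    `samples` is a list of (sample_id, latitude) pairs.
    """
    rng = random.Random(seed)
    bands = defaultdict(list)
    for sample_id, lat in samples:
        bands[latitude_band(lat, band_width)].append(sample_id)
    return {band: rng.sample(ids, min(per_band, len(ids)))
            for band, ids in sorted(bands.items())}


# Hypothetical samples: three near Antarctica, two near the equator, one at the pole.
samples = [("s1", -75.0), ("s2", -70.0), ("s3", -65.0),
           ("s4", 5.0), ("s5", 10.0), ("s6", 90.0)]
picked = subsample_by_band(samples)
```

Repeating the draw with different seeds and re-running the diversity analysis each time would show how much the latitudinal pattern depends on where the samples happen to be.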

Josh met with Adam Smith (postdoc at Berkeley), who is trying to validate the predicted niche spaces from various models using museum specimens and field work. He advocates the use of MaxEnt in the fashion we've applied here. Steve mentioned that Ackerly would be a good person to discuss this with, esp. in regards to climate models.

Implications for Global Warming

We might be able to use our methods and maps to evaluate how marine microbial biodiversity will change as an effect of global warming. Various climate change prediction models are easily obtainable. Steve will ship a paper that looks at the effect of climate change on plants. Might be a problem if the remote data predictions are limited in the environmental parameters they estimate. If they don't model the environmental parameters that seem to control niche space distributions as per our analysis, then we may be limited.



Jospin Kembel Koeppel O'Dwyer Ladau Sharpton

Project Background Discussion

Background Idea: use ecological niche modeling to estimate spatial distribution on a global scale. If we have a bunch of samples scattered around, take a taxon of interest and estimate what its niche looks like. Take remote data, map all locations on the planet where we would expect to see that taxon given the niche data.

Niche modeling has been used frequently for macroorganisms. Very little, if any, has been done using microorganisms. Introducing niche modeling to microbes might be a novel contribution. A nice biological story from these maps might constitute a novel contribution. Additionally, we could map functions onto this map. Alternatively, we could develop a new method using distance-decay work, but that would be a lot of work. Could look to see if microbes follow similar environmental gradients as macroorganisms.

A number of methods have been proposed to do this: BIOCLIM, MaxEnt, GARP, etc. Elected to use MaxEnt, at least for the initial analysis, because it doesn't require absence records. Need to know where a taxon was observed, but not where it isn't observed. Given the nature of our data, we can't be certain a taxon is actually absent (limited sequencing depth could result in a false negative).

We don't necessarily know the full dimensionality of niche space. MaxEnt does a lot of cross validation based methods to help determine if the predictions made are accurate. It isn't foolproof - we are constrained by the data we have available.

Preliminary analysis includes the GOS OTUs. Have a variety of metadata parameters, want to expand the data set. MegX may not have the metadata parameters we require, so we might need to infer from global databases (NASA, NOAA, etc). Josh knows of a paper that catalogs data from several environmental layers.

How are we going to use the sequence data from MegX? For now, taxonomically classify sequences via RDP and evaluate how various taxonomic groups are distributed. It's probably going to be family or order.

Current Problems and Tasks

How do we identify relevant samples?

  • Only want marine, not sediment.
  • What about a depth cutoff?
  • Can we use the manuscript reference to retain certain ambiguous samples?
  • Have someone make a first pass, then someone else come in and make a second pass.
  • MaxEnt is good for sparse data.
  • Score samples: 1 = definitely not suitable, 2 = possibly suitable, 3 = definitely suitable.
  • Sediment is not suitable; anything lower than 150 is unsuitable (restrict to the photic zone).
  • Steve, Tom and Alex will do the scoring independently.
  • Looks like Guillaume can quickly retrieve accessions for Silva DB.

Need to get environmental layer data together. Josh is going to try to get the environmental layers together and see what we can get from ICOMM.

Meet again 1:30 Next Thursday to discuss progress.



Jospin Koeppel Ladau Sharpton

Environmental Layers

Josh shipped around a paper (Reddy et al.) that contains information about environmental layer data (e.g., "remote data" for MaxEnt). There are a lot of different layers that we can use. In the paper, they list 5 or 6 (ocean depth, surface temp, salinity, ice cover, productivity). In the dataset, there are many more available (such as bottom temp, wave height, coral proportion, etc.). We have access to over 60 different variables, but it is not clear whether they cover the entire ocean. The 5 or 6 in the paper are for the global ocean. We don't want to use too many variables in the model. If the model is overfit with data parameters, then we may run into prediction problems.

Josh also shipped out some new MaxEnt maps. MaxEnt provides a relative suitability output (range between 0 and 1, with 1 being a very suitable habitat for an individual organism). Josh converted every ocean cell to a yes or a no for each taxon: he took the minimum MaxEnt suitability at which the taxon was observed as the threshold, scored all ocean cells with suitability at least that large as a yes, and scored everything else as a no (method based on a paper he came across). This seems to make a big difference in the prediction. Suitability does not mean the same thing between species, and this helps normalize. The new maps are MICROBIS (taxonomic classifications, Orders) and GOS (OTUs) diversity maps. The general patterns of diversity are similar (a hole of low diversity in the Indian Ocean; high diversity between Australia and Africa). Two independent data sets confirm similar patterns.
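As a rough sketch of the thresholding step described above (not Josh's actual code), each taxon's suitability map can be cut at the lowest suitability observed at any of its presence sites, and the resulting yes/no maps summed into a richness map. Cell names, taxon names and suitability values below are made up for illustration.

```python
def presence_threshold(suitability, observed_cells):
    """Lowest suitability among the cells where the taxon was actually observed."""
    return min(suitability[cell] for cell in observed_cells)


def binarize(suitability, observed_cells):
    """Map each cell to 1 if its suitability reaches the threshold, else 0."""
    threshold = presence_threshold(suitability, observed_cells)
    return {cell: 1 if value >= threshold else 0
            for cell, value in suitability.items()}


def richness_map(suitability_by_taxon, observations_by_taxon):
    """Sum the binary maps over taxa to get predicted richness per cell."""
    richness = {}
    for taxon, suitability in suitability_by_taxon.items():
        binary = binarize(suitability, observations_by_taxon[taxon])
        for cell, present in binary.items():
            richness[cell] = richness.get(cell, 0) + present
    return richness


# Hypothetical MaxEnt suitability output for two taxa over four ocean cells.
suitability_by_taxon = {
    "taxon_1": {"a": 0.9, "b": 0.4, "c": 0.2, "d": 0.5},
    "taxon_2": {"a": 0.1, "b": 0.3, "c": 0.8, "d": 0.05},
}
observations_by_taxon = {"taxon_1": ["a", "b"], "taxon_2": ["c", "a"]}

richness = richness_map(suitability_by_taxon, observations_by_taxon)
```

Because the threshold is set per taxon, a suitability of 0.3 counts as a presence for taxon_2 (threshold 0.1) but would not for taxon_1 (threshold 0.4), which is the normalizing effect noted above.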

Will MegX corroborate?

MegX data selection update

Alex and Tom found that many of the previously ambiguous cases were recovered into the appropriate sample bin; some are also definitely not appropriate. Will merge our scores and pull out those that are appropriate. Hand off to Guillaume who will give us the appropriate sequences. Take sequences and process with RDP.

Next Steps

  • Analysis of MegX and GOS-MICROBIS-MegX merged data.
  • Validation of the map.
  • Protein families from GOS? Probably takes time.
  • Biological inferences from the taxonomy map.
  • Collaboration with an oceanographer.
  • Any new GOS samples available? Check CAMERA

We expect a completely validated map within 2-3 weeks. We want to take the iSEEM conference call three weeks from today.



Jospin Kembel Koeppel O'Dwyer Ladau Riesenfeld Sharpton

MegX sample partitioning

Need to go back and select those samples with 3,3,3, and two 3s and a 2 to be included in Guillaume's sequence extraction. Try to get done today.

Biological Stories from Literature

  • Rapoport's rule: The mean size of taxon ranges increases with latitude.
    • Josh sent around a map last night. This is a plot of mean range size using the MICROBIS data (taxonomic Orders). Calculated the mean range size of the taxa at each location. Green is low range size, white is high.
    • If diversity gets lower near the poles, then maybe competition decreases and range gets larger.
    • Climate may get more variable near the poles, so we may be seeing variation in latitudinal range size.
    • Could plot the mean niche size instead of mean range size to determine if tolerance to a greater range of conditions is driving the observed pattern. Here, take average of niche area across temp and salinity space.
    • How well does this apply to non-terrestrial systems (Rapoport's rule was developed for terrestrial organisms) and microbial systems?
  • Could do something similar but with phylogenetic diversity (PD)
    • Highly correlated with species richness, so different answer would be surprising.
  • Link potentially functional traits, esp. metabolism.
    • other things: GC ratios, nitrogen content, etc.
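A minimal sketch of the mean range size calculation mentioned under Rapoport's rule, assuming each taxon's range is the set of grid cells where it is predicted present and that all cells have equal area (a simplification; real grid cells shrink toward the poles). Taxon and cell names are hypothetical.

```python
def mean_range_size(presence_by_taxon):
    """For each cell, average the range sizes of the taxa predicted present there.

    `presence_by_taxon` maps a taxon name to the set of cells in its range;
    range size is taken as the number of cells (equal-area assumption).
    """
    totals, counts = {}, {}
    for cells in presence_by_taxon.values():
        size = len(cells)
        for cell in cells:
            totals[cell] = totals.get(cell, 0) + size
            counts[cell] = counts.get(cell, 0) + 1
    return {cell: totals[cell] / counts[cell] for cell in totals}


# Hypothetical predicted ranges for three Orders over five cells.
presence_by_taxon = {
    "order_1": {"a", "b", "c"},
    "order_2": {"a"},
    "order_3": {"b", "c", "d", "e"},
}
mean_sizes = mean_range_size(presence_by_taxon)
```

Plotting these per-cell means against latitude is one way to check whether range size increases toward the poles.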

Next Week

  • MegX map by next phone call
  • Will do MICROBIS and GOS separately for now.
  • Scouring literature for interesting stories



Green Jospin Ladau Kembel Sharpton

Sampling bias

Two big issues at the moment:

  • which env. layers to use.
  • biased sampling - env data available from samples may not reflect env data across all possible ranges

The Loiselle paper describes a method that Josh is playing with. Go through all the grid cells across the space of env data. Create a histogram of where the samples are located in that env data space. Then calculate a statistic (bias_d) that is high if a region is oversampled and low if it is undersampled. This can be done for each grid cell across env layers. Josh added up the biases across each layer for each cell and created a "bias map". Could try all four separately or average them all. It looks like bias is correlated with richness. Either the method we're using is overestimating richness because of biased sampling, or researchers have biased their sampling towards richer locations.
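To make the idea concrete, here is one possible sketch of a histogram-based bias map. The statistic used here (share of samples in an environmental bin divided by the share of grid cells in that bin) is an assumption chosen for illustration and is not necessarily the bias_d statistic from the Loiselle paper; the layer names, bin count and values are likewise hypothetical.

```python
def layer_bias(cell_values, sample_values, n_bins=4):
    """Per-cell sampling bias for one environmental layer.

    `cell_values` maps each grid cell to its value for this layer;
    `sample_values` lists the layer values at the sampled sites.
    Values above 1 mean the cell's environmental bin is oversampled,
    values below 1 mean it is undersampled (illustrative statistic).
    """
    lo, hi = min(cell_values.values()), max(cell_values.values())
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for a constant layer

    def bin_of(value):
        return min(max(int((value - lo) / width), 0), n_bins - 1)

    cell_hist = [0] * n_bins
    for value in cell_values.values():
        cell_hist[bin_of(value)] += 1
    sample_hist = [0] * n_bins
    for value in sample_values:
        sample_hist[bin_of(value)] += 1

    bias = {}
    for cell, value in cell_values.items():
        b = bin_of(value)
        expected = cell_hist[b] / len(cell_values)    # share of ocean in this bin
        observed = sample_hist[b] / len(sample_values)  # share of samples in this bin
        bias[cell] = observed / expected
    return bias


def bias_map(cells_by_layer, samples_by_layer, n_bins=4):
    """Add up the per-layer biases for each cell."""
    total = {}
    for layer, cell_values in cells_by_layer.items():
        for cell, b in layer_bias(cell_values, samples_by_layer[layer], n_bins).items():
            total[cell] = total.get(cell, 0.0) + b
    return total


# Hypothetical temperature layer: four cells, four samples mostly from cold water.
temperature = {"c1": 0.0, "c2": 10.0, "c3": 20.0, "c4": 30.0}
sampled_temperature = [5.0, 8.0, 12.0, 25.0]
temp_bias = layer_bias(temperature, sampled_temperature, n_bins=2)
```

Comparing a map like this against the richness map would be a first check on whether high predicted richness simply tracks oversampled environments.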

Jess will be giving a talk in two weeks and will be around people with ideas about these kinds of analyses. Josh can also query an expert next week at Stony Brook. There are methods on the market that deal with issues of nonuniform sampling. Other papers in the field may ignore sampling bias. There may not be a rigorous means of dealing with this issue. Would be interesting to see what the correlation looks like for individual variables.

Selection of environmental layers

People seem to choose predictor (env) vars based on those vars that they expect to affect the species under study. Not a lot of model selection in the literature. MaxEnt may do some sort of weighting. Jess points to Science 2003 Hanitamisto: subject env vars to backward elimination to obtain only those with statistically significant variation. This is for some sort of analysis of variation, not niche modeling. A similar idea might be applicable: goodness-of-fit predictions from MaxEnt? Likelihood or AIC values? Need to explore what MaxEnt produces and how this can be leveraged.

A two page literature review of the key approaches science has used to deal with sample bias in niche modeling would be useful. Knowing why it would be different compared to any other ecological analysis would be useful. Is there a list of papers that we already know of that would be useful to take a look at? Let's pass some references around. Divide up sample bias from env. layer issue.



Jospin Kembel Koeppel Ladau Sharpton Pollard


  • Josh wants to try Science (editor and previous papers are on related topics) or Nature (recent marine microbial ecology issue)
    • Pretty to look at, topic is generally interesting
    • Will know for sure when we see the results
  • PNAS is another good option
  • Open access?
    • Nature/ISME and PNAS have open access options.
    • Not sure about Science.
    • PLoS Biology has a "special feature" section on marine microbiology

Writing Strategy

  • Who did what? Person who did each thing is best suited to write it up.
  • Introduction, discussion need to be tied together. Harder to write as a team, but could be outlined by one person, written by another.
  • Could start on methods/results right now. These will indicate what the story is.


  • What to emphasize?
    • diversity (alpha, beta)
    • Rapoport's Rule
    • Maps for important groups
  • Action item (for Josh)
    • final set of maps
      • total alpha diversity at order level
      • total beta diversity at order level
      • same but for other taxonomic classifications
      • suitability maps for individual taxa (SAR11, Prochlorococcus)
      • Rapoport's Rule


  • Need to write up data gathering/RDP analyses
    • Guillaume, Alex, Steve and Tom worked on it
    • Guillaume will make first draft, Tom will edit
  • Steve will write description of how we picked studies


  • Josh wants to use BibTeX (EndNote, Google Scholar, CiteULike will generate BibTeX files)
  • Ask Barbie to track down references for papers already cited by Josh
  • Steve and Alex could look into any missing references and current state of thinking in field, Katie will summarize for intro

Next Call

  • Joint effort to finalize plots and bullet points for each
  • Tuesday 10AM

Open Access Policies for Journals of Interest


Science

  • No Open Access Policy [1]

License Authors retain copyright but agree to grant to Science an exclusive license to publish the paper in print and online. Any author whose university or institution has policies or other restrictions limiting their ability to assign exclusive publication rights (e.g., Harvard, MIT, Open University) must apply for a waiver or other exclusion from that policy or those restrictions.

Access policies After publication, authors may post the accepted version of the paper on the author's personal Web site. Science also provides an electronic reprint service in which one referrer link that can be posted on a personal or institutional Web page, through which users can freely access the published paper on Science's Web site. For research papers created under grants for which the authors are required by their funding agencies to make their research results publicly available, Science allows posting of the accepted version of the paper to the funding body's archive or designated repository (such as PubMed Central) six months after publication, provided that a link to the final version published in Science is included. (Details on this can be found in the license agreement for authors.) Original research papers are freely accessible with registration on Science's Web site 12 months after publication.


Nature

  • Uncertain Open Access Policy [2]. We can retain copyright, but Nature is granted an exclusive license to publish. Only for research articles. Uncertain if this holds for all research articles associated with any funding agency or only for those that require OA (e.g., NIH). Unclear if the paper is immediately available to the public or only after a 6 month delay. The Creative Commons license made available by Nature seems to apply only to genome sequence announcements. I have emailed to inquire.

NPG does not require authors of original (primary) research papers to assign copyright of their published contributions. Authors grant NPG an exclusive licence to publish, in return for which they can reuse their papers in their future printed work without first requiring permission from the publisher of the journal. For commissioned articles (for example, Reviews, News and Views), copyright is retained by NPG.

When a manuscript is accepted for publication in an NPG journal, authors are encouraged to submit the author's version of the accepted paper (the unedited manuscript) to PubMedCentral or other appropriate funding body's archive, for public release six months after publication. In addition, authors are encouraged to archive this version of the manuscript in their institution's repositories and, if they wish, on their personal websites, also six months after the original publication.


  • Restricted OA Policy [3]. They retain exclusive license to publish, but we retain copyright [4]. Immediate public access to the information (freely available in 6 months regardless).
  • Cost: $1,275 or $950 depending on Institutional Site License

PLoS Biology

  • Completely Open Access Policy [5]
  • Cost: $2900 to publish (includes CC license). Discount possible. [6]


  • Probably same as Nature, but found additional clarification regarding OA on their website:

Authors of original research articles are encouraged to submit the author's version of the accepted paper (the unedited manuscript) to their funding body's archive, for public release six months after publication. In addition, authors are encouraged to archive this version of the manuscript in their institution's repositories and on their personal websites, also six months after the original publication. This is in line with NPG's self-archiving policy.

Authors of research articles can also opt to pay an article processing charge of £2,000 / $3,000 / €2,400 (+VAT where applicable) for their accepted articles to be open access online immediately upon publication. By paying this charge authors are also permitted to post the final, published PDF of their article on a website, institutional repository or other free public server, immediately on publication.

Please see the FAQs for further details or click here to download the payment form and license to publish form.

NPG's publishing policies ensure that authors can fully comply with the public access requirements of the major funding bodies worldwide - please click on www.sherpa.ac.uk for more information. However, it is the author's responsibility to take the necessary actions to achieve compliance. These may include self archiving, opting into NPG's manuscript deposition service and / or choosing open access publication.

  • According to [7], OA options (the so-called hybrid model) remain restricted for Nature, whereas they are available at ISME.



Jospin Kembel Koeppel Ladau Sharpton Pollard

GUI for automating the workflow

  • Could distribute code/program with paper


  • Converting continuous MaxEnt output to binary presence-absence calls for maps
    • for these, used minimum suitability at which we have an observation
    • looked at two other methods, but still see that results depend on how you choose the threshold
    • we will need to acknowledge the bias, but may not need to totally solve this problem
  • Richness maps: genus, family, order
    • generally similar patterns
      • temperate bands of high richness
      • band at tropics, especially in order analysis
      • North Sea
    • could there be bias?
      • GOS samples
    • lots of singletons means that we're undersampling and should go to a higher level of taxonomy
    • color scale is log (base e), but need to show raw numbers on legend
    • bootstrap cutoff of 75% means something different for genus vs. order vs. phylum
      • could use a cutoff so that there are a similar amount of false positives
  • Range Breadth maps: genus, family, order
    • different patterns than richness maps
    • do not vary much
    • taxa at both poles get a big value, even if not found in between
    • big ranges for average taxon in Atlantic (which is not very rich)
    • low range breadth along coasts
    • small ranges in Gulf Stream
  • Which taxonomic level to use? Genus vs. order
    • check diagnostics
      • number of sequences
      • number of categories at each taxonomic level
    • phylum is too broad
    • with 200 or so genera, only cover about 10% of all genera

Biological questions to explore

  • Diversity and range sizes versus latitude, environmental variables
    • richness vs. latitude plots
      • supports decay of richness with latitude (at higher latitudes)
      • fairly flat at low latitudes, except genus plot
  • Dispersal and selection
    • beta diversity
    • individual taxa maps (and how many truncated by >10 samples rule)
  • Ecological coherence at higher taxonomic ranks
    • are there particular taxa with more environmentally defined niches?



Ladau Pollard Kembel Jospin Green


Notes are here.



Alex Guillaume James Jess Josh Katie Steve Tom


  • James has been working on 'species-area' analyses based on the predictions. i.e. not really at species level - at phylum or family level.
    • One issue is that there are many cells with no data - i.e. prediction of zero or NA. This is problematic for the species-area analyses. i.e. Anything near a coastal area will be underrepresented because right now the approach will throw out things near missing data points (i.e. land). How to deal with this?
    • Can we work around this - do something like take the mean over the non-land and non-NA cells and adjust area accordingly? Try to avoid throwing out data. This may affect the shape (since we're not using perfectly square cells anymore) but will avoid bias in terms of throwing out data.
    • Is it an issue that we are going to be binning across habitat types at different scales? i.e. is combining small-scale species-area counts from coastal and open ocean habitats going to mess things up? Try colour coding a scatter plot to see if different things are happening in different habitats. Rosenzweig's book talks about this issue.
    • We can try looking separately at different habitats in the data. How do we define habitats in the ocean? Is there a data set that classifies the oceans into biomes or habitats? How about just distance from land? Biome definitions for one group of organisms might not work for other groups of organisms.
    • Does taxonomic resolution matter? i.e. is it ok that we are looking at phylum-area or genus-area relationships rather than species-area? Jess suggests not a problem, there is a literature on how different taxonomic ranks vary in space etc. that we should consult. Expectation is a shallower slope at coarser taxonomic/phylogenetic resolution. Call it a taxa-area relationship instead and people won't flip out.
    • Banana pants
  • Josh circulated an email with new information/figures.
    • Some of the data had to be removed. Many of the Microbis samples were sediment samples. These were removed. Additionally there were samples with unusual environmental data - these turned out to be hot springs and were removed. This reduced the number of reads we had by about 20%, and eliminated several sampling locations. Also eliminated the 'dust' environmental layer because it didn't add anything/gave anomalous results.
    • Default parameter settings in MaxEnt do not seem to be appropriate for our data (figure maxent tuning.pdf). MaxEnt fits a number of different distributions to the data. Can evaluate the fit of the species distribution model to data with AUC (higher is better) or log loss (lower is better). The default setting in MaxEnt performs poorly with our data. Looks like auto setting is bad and we should be using linear, quadratic product features. Some features perform better at high vs. low number of occurrences. Could we use a mixture of feature sets - i.e. hinge, quadratic for < 20 occurrences and linear, quadratic for > 20 occurrences? But could this induce biases or overfitting, would it be simpler just to pick a single decent-performing feature set such as linear, quadratic which is >0.7AUC across all occurrences?
    • Josh re-ran analyses with the better performing feature sets. 'richness-map-genus.pdf' is a map based on the linear, quadratic feature set and omitting the newly deleted data. Now we don't see a latitudinal richness gradient - we now predict low diversity in the neotropical oceans around Central America and Malay archipelago, high diversity at intermediate latitudes.
    • The plot of Chao vs. MaxEnt richness estimates at the genus level doesn't show a strong linear relationship. What does this mean, and is it a problem? In locations that are sampled deeply the Chao and MaxEnt predictions are similar, but elsewhere they are not. This shouldn't be a problem, since we expect a difference given that we know we're undersampling at most locations. Circle size is sampling depth, and the well-sampled samples do match between Chao and MaxEnt. Can we evaluate with bias maps whether the pattern of low diversity at the equator is real or an artefact of lower sampling density near the equator? This pattern does contradict published studies such as Pommier et al. (16S) and Fuhrman et al. (ARISA), who found high diversity near the equator. That doesn't mean it's wrong, but we should check for sampling bias. There are other groups for which patterns of high diversity at midlatitudes are observed (airborne spores, etc.). The suggestion is to use a jackknife analysis to see how good a job we are doing at predicting (confidence vs. location).
    • The map of range breadth vs. latitude does support Rapoport's rule (increased range size at high latitudes).
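For reference, a small sketch of a Chao richness estimate of the kind compared against MaxEnt above. The notes do not say which Chao estimator was used; this assumes the bias-corrected Chao1 form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of genera seen exactly once and twice in a sample. The read counts are invented.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-taxon read counts.

    Assumes the bias-corrected form, which stays defined when there are
    no doubletons; taxa with zero reads are ignored.
    """
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                        # taxa actually seen
    f1 = sum(1 for c in observed if c == 1)      # singletons
    f2 = sum(1 for c in observed if c == 2)      # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical genus-level read counts at one sampling location.
genus_counts = [10, 4, 2, 1, 1, 1, 0]
estimate = chao1(genus_counts)
```

With three singletons and one doubleton the estimate rises from 6 observed genera to 7.5, which illustrates why shallowly sampled locations can disagree with the MaxEnt prediction.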