OpenSourceMalaria:Technical Operations

From OpenWetWare



This document is intended to provide an outline of the technical and development operations for the Open Source Malaria (OSM) project. It also includes some related information about social media accounts.

Main website

The main landing page for the project can be found here. The project activity is pulled directly from other sites.

A guide to getting started as a contributor can be found here. The various platforms used are also summarised below.

Molecules already entered into ChEMBL may be browsed on ChEMBL's page for the project.

The source for the landing page is also available if needed; the pulling activity uses Ruby/Sinatra.

How to add people to the "Meet the Team" section of the landing page, and the relevant code, is here.

Lab Notebook

Contributors use the open source lab notebook Labtrove (previously Lablog), a PHP web application developed by the University of Southampton. The primary malaria blogs currently run on a Debian server at the University of Southampton.

How To:

Molecule Visualisation

Series Dataset: Google Doc, visualised with ChemInfo or Vortex

The SD file on Github is currently out of date compared to the Google Doc above. There is an out-of-date repository of details for each OSM compound in the Experimental Procedures page.

A software solution that has not been explored well is DataWarrior.

How best to visualise the molecules in OSM has been discussed several times (Github issues 128, 112 and 99). It would be nice to develop a means to upload to ChEMBL automatically so the molecules appear on the open drug discovery page.

More than one cheminformatics string should be included for each molecule in OSM to ensure some redundancy in searches, e.g. to get over any issues arising from implicit vs explicit H and tautomers (GHI 230 and this post).
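The value of storing several strings per molecule can be illustrated with a small sketch. It assumes plain string matching with no cheminformatics toolkit, and the names MoleculeIds, build_index and lookup are hypothetical, not part of any OSM code. Aspirin stands in for a real compound and "EXAMPLE-1" is a made-up ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeIds:
    """Identifier strings stored for one molecule (field names are illustrative)."""
    osm_id: str
    smiles: str
    inchi: str
    inchikey: str


def build_index(records):
    """Map every stored identifier string to the molecule's ID."""
    index = {}
    for rec in records:
        for identifier in (rec.smiles, rec.inchi, rec.inchikey):
            index[identifier] = rec.osm_id
    return index


def lookup(index, *identifiers):
    """Return the ID matched by the first identifier that is found, else None."""
    for identifier in identifiers:
        if identifier in index:
            return index[identifier]
    return None


# Aspirin is used as a stand-in; "EXAMPLE-1" is a made-up ID, not an OSM compound.
ASPIRIN = MoleculeIds(
    osm_id="EXAMPLE-1",
    smiles="CC(=O)OC1=CC=CC=C1C(=O)O",
    inchi="InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
    inchikey="BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
)
INDEX = build_index([ASPIRIN])

# The same molecule written as an aromatic SMILES does not match the stored
# Kekule form by plain string comparison, but the InChIKey still finds it.
query_smiles = "CC(=O)Oc1ccccc1C(=O)O"
query_inchikey = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(lookup(INDEX, query_smiles))                  # None
print(lookup(INDEX, query_smiles, query_inchikey))  # EXAMPLE-1
```

A search that tries only one string type misses the molecule here, while a search that falls back to a second string type finds it.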

The larger question of how best to manage the data (to e.g. construct the SDF) is below.

Med Chem Relevant Tech Ops

Simple calculation of logP

Data Management

What is the best way to collect OSM's data together into a single place where we can browse all the molecules and their properties?

The best way to do this is probably to construct an SD file (SDF). This would allow other software to interrogate/display the data (we've been talking with ChEMBL about automatic imports to their database, but have no solution as yet). Earlier discussion: GHIs 128, 127 and 99.
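For orientation, one SD file record is a molblock followed by named data items and a "$$$$" terminator. The sketch below writes such a record under stated assumptions: the function sdf_record is hypothetical, the molblock is a hand-written single-carbon placeholder rather than an OSM structure, and the field names OSM_ID and SOURCE are invented for illustration.

```python
def sdf_record(molblock, properties):
    """Return one SD-file record: molblock, data items, then the '$$$$' terminator."""
    lines = molblock.rstrip("\n").split("\n")
    for name, value in properties.items():
        lines.append(f"> <{name}>")  # data header naming the field
        lines.append(str(value))     # the field's value
        lines.append("")             # a blank line ends each data item
    lines.append("$$$$")             # record terminator
    return "\n".join(lines) + "\n"


# Placeholder V2000 molblock with a single carbon atom (not a real OSM compound).
PLACEHOLDER_MOLBLOCK = "\n".join([
    "EXAMPLE-1",
    "",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

# Field names and values are invented for illustration.
record = sdf_record(
    PLACEHOLDER_MOLBLOCK,
    {"OSM_ID": "EXAMPLE-1", "SOURCE": "hypothetical ELN entry"},
)
print(record)
```

An SDF for the whole project would simply be such records concatenated, one per molecule, which is why the hard part is collecting the property values rather than writing the file.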

So: How do we make the SDF? How can we ensure the SDF remains up to date? Here are the three approaches.

The Three Strategies for Generating an Auto-updating SDF for Open Source Malaria

1) Manually Write the SDF

This is what we've been doing. It's not working.

At the moment people enter information into the ELN in human-readable form. That's nice for the experimentalists, but the data are not particularly machine-readable, and people do not reliably enter metadata. So a question like "Has anyone ever made molecule X, and if so how many attempts were made?" is impossible to answer well. Links between synthetic entries and biological entries are poor or non-existent.

To remedy this we started making an Improvised Compound Registration System - a collection of pages, each of which collates the information about each molecule made in the consortium. This involves manually linking every experiment with the relevant molecule page. The result is a fantastic resource (example for OSM-S-5). The problem is that it takes a huge amount of time to assemble such pages, to the extent that this system is probably unsustainable.

2) Automatically Write the SDF

We could write results/ELN/wiki more carefully, and then automatically scrape together the SDF. Probably won't work.

Some OSM contributors are pretty good about including cheminformatic strings (SMILES, InChI, InChIKey) in ELN entries to allow machines to understand which molecules are being discussed on a given page. But many people do not do this, partly because it's labour-intensive and partly because we simply forget. The strings themselves contain no other relational information (e.g. "this molecule has a potency of X"), so even with a way to build the SDF from the strings we would not necessarily build a good SDF without some pretty serious addition of metadata. The SDF, when made, can contain all the data for a given molecule. We could manually collect the data on a wiki page (e.g. here) and then write a script that creates the SDF. But the effort involved, and the risk of people doing this wrongly or badly, is significant.
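The scraping step itself is the easy part, as the following sketch suggests. It assumes ELN entries are available as plain text and uses a regular expression for the standard InChIKey layout (14 letters, 10 letters, 1 letter, separated by hyphens). The function find_inchikeys and the sample entry are hypothetical.

```python
import re

# Standard InChIKey layout: 14 uppercase letters, 10 uppercase letters, 1 letter.
INCHIKEY_RE = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def find_inchikeys(text):
    """Return the distinct InChIKeys found in the text, in order of first appearance."""
    found = []
    for key in INCHIKEY_RE.findall(text):
        if key not in found:
            found.append(key)
    return found


# Invented entry text; the key is aspirin's, used as a stand-in.
entry = (
    "Attempted acetylation, second run. "
    "Product InChIKey: BSYNRYMUTXBXSQ-UHFFFAOYSA-N. "
    "Compare with the first run (BSYNRYMUTXBXSQ-UHFFFAOYSA-N)."
)
print(find_inchikeys(entry))  # ['BSYNRYMUTXBXSQ-UHFFFAOYSA-N']
```

This tells a machine which molecule a page mentions, but nothing about yields, potencies or which experiment produced what, which is the limitation described above.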

3) Construct Something Else that Writes the SDF

We are not yet doing this, but perhaps we should.

Bill Mills from Mozilla Science Lab suggested an interesting solution. How about we (OSM experimental contributors) input data using a system like an online form? We input data on the molecules we're using, what we've done to make those molecules and any and all properties that pertain to those molecules. This creates a relational database that is perfectly machine-readable.

The system then both i) writes ELN entries on the fly, and ii) writes the SDF.

This is neat on multiple levels. We should build a prototype "form" and test it for the creation of a typical ELN entry, i.e. whether we can auto-generate a good ELN entry without writing it ourselves. This could be a very powerful approach.
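As a starting point for such a prototype, here is a minimal sketch of the idea using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the suggestion above does not specify a technology, and the table layout, column names and the functions submit_form, eln_entry and count_attempts are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE molecule (
    osm_id   TEXT PRIMARY KEY,
    inchikey TEXT NOT NULL
);
CREATE TABLE experiment (
    id          INTEGER PRIMARY KEY,
    osm_id      TEXT NOT NULL REFERENCES molecule(osm_id),
    chemist     TEXT NOT NULL,
    method_text TEXT NOT NULL,
    yield_pct   REAL
);
"""


def submit_form(conn, form):
    """Store one submitted form (a dict) as relational rows; return the experiment id."""
    conn.execute(
        "INSERT OR IGNORE INTO molecule (osm_id, inchikey) VALUES (?, ?)",
        (form["osm_id"], form["inchikey"]),
    )
    cur = conn.execute(
        "INSERT INTO experiment (osm_id, chemist, method_text, yield_pct) "
        "VALUES (?, ?, ?, ?)",
        (form["osm_id"], form["chemist"], form["method_text"], form["yield_pct"]),
    )
    conn.commit()
    return cur.lastrowid


def eln_entry(conn, experiment_id):
    """Generate human-readable ELN text for one experiment from the database."""
    osm_id, inchikey, chemist, method_text, yield_pct = conn.execute(
        "SELECT e.osm_id, m.inchikey, e.chemist, e.method_text, e.yield_pct "
        "FROM experiment e JOIN molecule m ON m.osm_id = e.osm_id WHERE e.id = ?",
        (experiment_id,),
    ).fetchone()
    return (
        f"Synthesis of {osm_id}\n"
        f"Chemist: {chemist}\n"
        f"InChIKey: {inchikey}\n"
        f"Procedure: {method_text}\n"
        f"Yield: {yield_pct}%\n"
    )


def count_attempts(conn, osm_id):
    """Answer 'how many attempts were made on molecule X?' with one query."""
    return conn.execute(
        "SELECT COUNT(*) FROM experiment WHERE osm_id = ?", (osm_id,)
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented form submissions; "EXAMPLE-1" and the chemists are placeholders.
base = {"osm_id": "EXAMPLE-1", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}
first = submit_form(conn, {**base, "chemist": "A. Chemist",
                           "method_text": "First attempt.", "yield_pct": 41.0})
second = submit_form(conn, {**base, "chemist": "B. Chemist",
                            "method_text": "Repeat at larger scale.", "yield_pct": 63.0})

print(eln_entry(conn, second))
print(count_attempts(conn, "EXAMPLE-1"))  # 2
```

The same rows that produce the ELN text could also feed the data items of an SDF record, so both outputs stay in sync because they are generated from one source.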

Note on downloading and using SDFs from Cheminfo Master Database.

Odd Jobs

For odd jobs where we need hosting or a bit of compute, the tendency is to use Nectar, a cloud-based provider for academic and research institutions in Australia. It provides two free instances to researchers, with specs reasonable enough for most jobs. Debian or Ubuntu is typically the flavour of choice, but Nectar provides a wide range of images and snapshots, including versions of Scientific Linux. For jobs that require significantly more processing we may rely instead upon EC2 instances.


There are several different means used for communication, with email being the least favoured (due to a lack of openness).

OSM has a Twitter account, a Google+ account and a Facebook page. In addition, a YouTube account is used to post recordings of meetings.

The primary means of communicating issues requiring action/input (administration, science or technical) is on Github.

For news, the project has used old-school PDF-based newsletters to reach people unfamiliar with the platforms above, though an email-based newsletter has been discussed.

Publicity is important for the project to attract new inputs. Google ranking was assessed and could be improved (GHI 231) and a meeting was held to address a number of website-related issues. Related: GHI 64


Github is used for project management - a place to keep the To Do list. Tasks are called "Issues" and may be assigned a person responsible, a deadline and some tags to allow active items to be grouped. When a task is complete, it can be closed.

Almost all code and data for the OpenSourceMalaria organisation account (and landing page website) is resident on one of the Github repositories. The main .sd file of all compounds, for example, is kept there. All other experimental data will be on the electronic lab notebook, or summarised on the wiki.

If you are still unable to find something, post an issue on the Github Issues (to do) list and tag it with "Administration" and "question".

Online Meetings

Online meetings use Adobe Connect, provided and hosted by the University of Sydney. As with everything else, these meetings are open to everyone; each meeting is recorded and subsequently uploaded to the OSM YouTube account.