OpenSourceMalaria:Technical Operations: Difference between revisions

From OpenWetWare
(6 intermediate revisions by the same user not shown)
Line 14: Line 14:


The [http://github.com/OpenSourceMalaria/OSM_Website_Code source] for the landing page is also available if needed; the pulling activity uses Ruby/Sinatra.
The [http://github.com/OpenSourceMalaria/OSM_Website_Code source] for the landing page is also available if needed; the pulling activity uses Ruby/Sinatra.
How to add people to the "Meet the Team" section of the landing page, and the relevant code, is [https://github.com/OpenSourceMalaria/OSM_Website_Data/issues/1 here].


== Lab Notebook ==
== Lab Notebook ==
Line 28: Line 30:
==Molecule Visualisation==
==Molecule Visualisation==


How best to visualise the molecules in OSM has been discussed several times (Github issues [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/128 128], [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/112 112] and [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/99 99]). Current protocol is to develop a means to upload to ChEMBL automatically so the molecules appear on the [https://www.ebi.ac.uk/chembl/malaria/source open drug discovery page]. The central repository of the molecules is the SD file, but the repository of details for each OSM compound may be found in the [http://malaria.ourexperiment.org/osm_procedures/group/Compound%20List Experimental Procedures page]. A possible solution is [http://www.openmolecules.org/datawarrior/ DataWarrior].
Series Dataset: [https://docs.google.com/spreadsheets/d/1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc/edit#gid=510297618 Google Doc], visualised with [http://www.cheminfo.org/Chemistry/Parsing%20data/Tab_delimited_Parallel_Coordinates.html?tsvURL=http%3A%2F%2Fgoogledocs.cheminfo.org%2Fspreadsheets%2Fd%2F1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc%2Fexport%3Fgid%3D0%26format%3Dtsv ChemInfo] or [http://macinchem.org/reviews/vortex/tut26/scripting_vortex26.php Vortex]<br>
 
The SD file on Github is currently out of date compared to the Google Doc above. There is an out-of-date repository of details for each OSM compound in the [http://malaria.ourexperiment.org/osm_procedures/group/Compound%20List Experimental Procedures page].
 
A software solution that has not been explored well is [http://www.openmolecules.org/datawarrior/ DataWarrior].
 
How best to visualise the molecules in OSM has been discussed several times (Github issues [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/128 128], [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/112 112] and [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/99 99]). It would be nice to develop a means to upload to ChEMBL automatically so the molecules appear on the [https://www.ebi.ac.uk/chembl/malaria/source open drug discovery page].  


More than one cheminformatics string for each molecule in OSM should be included to ensure some redundancy in searches, e.g. to get over any issues arising from implicit vs explicit H and tautomers (GHI [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/230 230] and [https://plus.google.com/u/0/+MatthewTodd/posts/Lb14nGpFmdb this post])
More than one cheminformatics string for each molecule in OSM should be included to ensure some redundancy in searches, e.g. to get over any issues arising from implicit vs explicit H and tautomers (GHI [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/230 230] and [https://plus.google.com/u/0/+MatthewTodd/posts/Lb14nGpFmdb this post])


Larger question of how to manage the data (to e.g. construct the SDF) is below.
The larger question of how best to manage the data (to e.g. construct the SDF) is below.


==Data Management==
==Data Management==


Related to molecule visualisation, above. The question is: how can we best collect OSM's data together into a single place where we can see all the molecules and their biological activities?
What is the best way to collect OSM's data together into a single place where we can browse all the molecules and their properties?
 
The best way to do this is probably to construct an SD file ([http://metamolecular.com/blog/2012/10/05/nine-things-every-organic-chemist-should-know-about-structure-data-files-sdfiles/ SDF]). This would allow other software to interrogate/display the data (we've been [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/225 talking with ChEMBL] about automatic imports to their database, but have no solution as yet. Earlier discussion: GHIs [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/128 128], [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/127 127] and [https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/99 99]
 
So: How do we make the SDF? How can we ensure the SDF remains up to date? Here are the three approaches.
 
[[Image:SDF_Strategies.png|thumb|center|500px|The Three Strategies for Generating an Auto-updating SDF for Open Source Malaria]]


The best way to do this is to construct an SD file. This would allow other software to interrogate the data.
'''1) Manually Write the SDF'''


There are three possible approaches
This is what we've been doing. It's not working.


'''1) Manually construct the SDF'''
At the moment people enter information into the ELN in human-readable form. That's nice for the experimentalists. The data are not particularly machine-readable, and people do not reliably enter metadata. So coming to OSM with a question like "Has anyone ever made molecule X, and if so how many attempts were made?" is impossible to answer well. Links between synthetic entries and biological entries are poor or non-existent.


This is what we're doing. It's not working.
To remedy this we started making an [http://malaria.ourexperiment.org/osm_procedures Improvised Compound Registration System] - a collection of pages, each of which collates the information about each molecule made in the consortium. This involves ''manually linking'' every experiment with the relevant molecule page. The result is a fantastic resource ([http://malaria.ourexperiment.org/osm_procedures/7054/Preparation_of_OSMS5.html example for OSM-S-5]). The problem is that it takes a huge amount of time to assemble such pages, to the extent that this system is probably unsustainable.


At the moment people enter information into the ELN in human-readable form. The data are not machine-readable, necessarily, and people do not enter metadata. There's no consistent system. So coming to OSM with a question like "Has anyone ever made molecule X, and if so how many attempts were made?" is impossible to answer well. Links between synthetic entries and biological entries are poor or non-existent.
'''2) Automatically Write the SDF'''


To remedy this we started making an Improvised Compound Registration System. This involves manually linking every experiment with the relevant page for the relevant molecule. Such pages can collect together all the data relevant to a given molecule. The result is a fantastic resource. The problem is that it takes a huge amount of time to assemble such pages, to the extent that this system is unsustainable.
We could write results/ELN/wiki more carefully, and then automatically scrape together the SDF. Probably won't work.


'''2) Automatically construct the SDF'''
OSM contributors are pretty good about including cheminformatic strings (SMILES, InChI, InChIKey) in ELN entries to allow machines to understand which molecules are being discussed on a given page. But many people do not do this, partly because it's labour-intensive. We often forget. The strings themselves contain no other relational information (e.g. "this molecule has a potency of X"), meaning even with a way to build the SDF from the strings we'd still not necessarily build a good SDF, without some pretty serious addition of metadata. The SDF, when made, can contain all the data for a given molecule. We could manually collect the data on a wiki page (e.g. [http://openwetware.org/wiki/OpenSourceMalaria:Triazolopyrazine_%28TP%29_Series#Strings_for_Google here]) and then write a script that creates the SDF. But the effort involved, and the risk of people doing this wrongly or badly, is significant.


OSM contributors are pretty good about including cheminformatic strings in ELN entries to allow machines to understand which molecules are being discussed. But many people do not do this, partly because it's labour-intensive. The strings contain no other relational information (e.g. this molecule has a potency of X). For that we need the SDF, which can contain all the data on a given molecule. We could manually create the SDF on a wiki page, for example here. But the effort involved, and the risk of people doing this wrongly or badly, is too great.
'''3) Construct Something Else that Writes the SDF'''


'''3) Construct something else that constructs the SDF'''
We are not yet doing this, but perhaps we should.


We are not yet doing this, but perhaps this is what we should build.
[http://www.mozillascience.org/u/billmills Bill Mills] from Mozilla Science Lab suggested an interesting solution. How about we (OSM experimental contributors) input data using a system like an online form. We input data on the molecules we're using, what we've done to make those molecules and any and all properties that pertain to those molecules. This creates a relational database that is perfectly machine readable.


Bill Mills from Mozilla Science Lab suggested an interesting solution. How about we (OSM experimental contributors) input data using a system that has something like an online form. We input data on the molecules, what we've done to make those molecules and any and all properties that pertain to those molecules. This creates a relational database that is perfectly machine readable. The system the both i) creates the ELN entries on the fly, and ii) creates the SDF.
The system then both i) writes ELN entries on the fly, and ii) writes the SDF.


This is neat on multiple levels. We should build a prototype. It could be very powerful.
This is neat on multiple levels. We should build a prototype "form" and test it for the creation of a typical ELN entry, i.e. whether we can auto-generate a good ELN entry without writing it ourselves. This could be a very powerful approach.


== Odd Jobs ==
== Odd Jobs ==

Revision as of 18:29, 30 July 2015



This document is intended to provide an outline of the technical and development operations for the Open Source Malaria (OSM) project. It also includes some related information about social media accounts.

Main website

The main landing page for the project can be found here. The project activity is pulled directly from other sites.

A guide to getting started as a contributor can be found here. The various platforms used are also summarised below.

Molecules already entered into ChEMBL may be browsed on ChEMBL's page for the project.

The source for the landing page is also available if needed; the pulling activity uses Ruby/Sinatra.

How to add people to the "Meet the Team" section of the landing page, and the relevant code, is here.

Lab Notebook

Contributors use the open source lab notebook Labtrove (previously Lablog), a PHP web application developed by the University of Southampton. Currently the primary malaria blogs run on malaria.ourexperiment.org, on a Debian server at the University of Southampton.

How To:

Molecule Visualisation

Series Dataset: Google Doc, visualised with ChemInfo or Vortex

The SD file on Github is currently out of date compared to the Google Doc above. There is an out-of-date repository of details for each OSM compound in the Experimental Procedures page.

A software solution that has not been explored well is DataWarrior.

How best to visualise the molecules in OSM has been discussed several times (Github issues 128, 112 and 99). It would be nice to develop a means to upload to ChEMBL automatically so the molecules appear on the open drug discovery page.

More than one cheminformatics string for each molecule in OSM should be included to ensure some redundancy in searches, e.g. to get over any issues arising from implicit vs explicit H and tautomers (GHI 230 and this post).
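
As a rough illustration of why storing several strings per molecule helps, here is a minimal sketch of a lookup index keyed by every identifier recorded for a compound. The function names (`build_index`, `find`) and the compound ID `EXAMPLE-ETHANOL` are hypothetical, and ethanol is used only as a stand-in because its strings are well known; it is not an OSM compound. The sketch does exact string matching only, so it does not canonicalise structures.

```python
def build_index(compounds):
    """Map every stored identifier string to the compound IDs that carry it.

    `compounds` is a dict of {compound_id: [identifier strings]}.
    """
    index = {}
    for compound_id, identifiers in compounds.items():
        for identifier in identifiers:
            index.setdefault(identifier.strip(), set()).add(compound_id)
    return index


def find(index, query):
    """Return the sorted compound IDs whose stored strings match the query exactly."""
    return sorted(index.get(query.strip(), set()))


# Placeholder record (ethanol, not an OSM compound): two SMILES spellings,
# one with implicit and one with explicit hydrogens, plus InChI and InChIKey.
compounds = {
    "EXAMPLE-ETHANOL": [
        "CCO",
        "[CH3][CH2][OH]",
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    ],
}

index = build_index(compounds)
print(find(index, "[CH3][CH2][OH]"))
print(find(index, "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

A search with a spelling that was never stored (for example `OCC`) would still miss, which is the argument for recording an InChIKey alongside any SMILES.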

The larger question of how best to manage the data (to e.g. construct the SDF) is below.

Data Management

What is the best way to collect OSM's data together into a single place where we can browse all the molecules and their properties?

The best way to do this is probably to construct an SD file (SDF). This would allow other software to interrogate/display the data (we've been talking with ChEMBL about automatic imports to their database, but have no solution as yet). Earlier discussion: GHIs 128, 127 and 99.

So: How do we make the SDF? How can we ensure the SDF remains up to date? Here are the three approaches.

The Three Strategies for Generating an Auto-updating SDF for Open Source Malaria

1) Manually Write the SDF

This is what we've been doing. It's not working.

At the moment people enter information into the ELN in human-readable form. That's nice for the experimentalists. The data are not particularly machine-readable, and people do not reliably enter metadata. So coming to OSM with a question like "Has anyone ever made molecule X, and if so how many attempts were made?" is impossible to answer well. Links between synthetic entries and biological entries are poor or non-existent.

To remedy this we started making an Improvised Compound Registration System - a collection of pages, each of which collates the information about each molecule made in the consortium. This involves manually linking every experiment with the relevant molecule page. The result is a fantastic resource (example for OSM-S-5). The problem is that it takes a huge amount of time to assemble such pages, to the extent that this system is probably unsustainable.

2) Automatically Write the SDF

We could write results/ELN/wiki more carefully, and then automatically scrape together the SDF. Probably won't work.

OSM contributors are pretty good about including cheminformatic strings (SMILES, InChI, InChIKey) in ELN entries to allow machines to understand which molecules are being discussed on a given page. But many people do not do this, partly because it's labour-intensive. We often forget. The strings themselves contain no other relational information (e.g. "this molecule has a potency of X"), meaning even with a way to build the SDF from the strings we'd still not necessarily build a good SDF, without some pretty serious addition of metadata. The SDF, when made, can contain all the data for a given molecule. We could manually collect the data on a wiki page (e.g. here) and then write a script that creates the SDF. But the effort involved, and the risk of people doing this wrongly or badly, is significant.
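
To make the "script that creates the SDF" idea concrete, here is a minimal sketch that turns tab-separated rows (such as a TSV export of a spreadsheet) into SDF records. The column names (`OSM_ID`, `SMILES`, `Potency_uM`) and the example rows are invented placeholders, not OSM data. The structure block is deliberately left empty (zero atoms), on the assumption that real coordinates would be generated from the SMILES by a cheminformatics toolkit in a later step; only the data fields are written here.

```python
import csv
import io


def row_to_sdf_record(row, id_field="OSM_ID"):
    """Return one SDF record: an empty structure block plus one data item per column."""
    lines = [
        row.get(id_field) or "",  # molfile header line 1: molecule name
        "  sketch",               # header line 2: program/timestamp (free text here)
        "",                       # header line 3: comment
        "  0  0  0  0  0  0  0  0  0  0999 V2000",  # counts line: no atoms, no bonds
        "M  END",
    ]
    for key, value in row.items():
        if key is None or not value:
            continue  # skip stray extra cells and empty cells
        lines.append(f">  <{key}>")
        lines.append(str(value))
        lines.append("")  # a blank line ends each data item
    lines.append("$$$$")  # record terminator
    return "\n".join(lines) + "\n"


def tsv_to_sdf(tsv_text):
    """Convert tab-separated text with a header row into SDF text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return "".join(row_to_sdf_record(row) for row in reader)


# Placeholder rows; the IDs and the potency value are made up for illustration.
example = "OSM_ID\tSMILES\tPotency_uM\nEXAMPLE-1\tCCO\t\nEXAMPLE-2\tc1ccccc1\t12.5\n"
print(tsv_to_sdf(example))
```

The sketch also shows the weakness described above: the output is only as good as the table it reads, so missing or inconsistent metadata in the source carries straight through to the SDF.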

3) Construct Something Else that Writes the SDF

We are not yet doing this, but perhaps we should.

Bill Mills from Mozilla Science Lab suggested an interesting solution. How about we (OSM experimental contributors) input data using a system like an online form? We input data on the molecules we're using, what we've done to make those molecules and any and all properties that pertain to those molecules. This creates a relational database that is perfectly machine readable.

The system then both i) writes ELN entries on the fly, and ii) writes the SDF.

This is neat on multiple levels. We should build a prototype "form" and test it for the creation of a typical ELN entry, i.e. whether we can auto-generate a good ELN entry without writing it ourselves. This could be a very powerful approach.
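
As a starting point for such a prototype, here is a minimal sketch of the form-to-database idea using Python's built-in `sqlite3` module. Everything in it is an assumption for illustration: the table layout, the function names (`submit_form`, `count_attempts`, `eln_entry`) and the placeholder values are hypothetical and not an agreed OSM design. It stores each form submission in relational tables, answers the "has anyone made molecule X, and how many attempts?" question with a query, and renders a plain-text ELN entry from the stored rows.

```python
import sqlite3

SCHEMA = """
CREATE TABLE molecule (
    osm_id TEXT PRIMARY KEY,
    smiles TEXT NOT NULL
);
CREATE TABLE experiment (
    id      INTEGER PRIMARY KEY,
    osm_id  TEXT NOT NULL REFERENCES molecule(osm_id),
    kind    TEXT NOT NULL,   -- e.g. 'synthesis' or 'assay'
    author  TEXT NOT NULL,
    summary TEXT NOT NULL
);
CREATE TABLE property (
    osm_id TEXT NOT NULL REFERENCES molecule(osm_id),
    name   TEXT NOT NULL,
    value  TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Create a database (in memory by default) with the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def submit_form(conn, osm_id, smiles, kind, author, summary, properties=None):
    """Store one form submission and return the new experiment's id."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO molecule (osm_id, smiles) VALUES (?, ?)",
            (osm_id, smiles),
        )
        cursor = conn.execute(
            "INSERT INTO experiment (osm_id, kind, author, summary) VALUES (?, ?, ?, ?)",
            (osm_id, kind, author, summary),
        )
        for name, value in (properties or {}).items():
            conn.execute(
                "INSERT INTO property (osm_id, name, value) VALUES (?, ?, ?)",
                (osm_id, name, str(value)),
            )
    return cursor.lastrowid


def count_attempts(conn, osm_id):
    """How many synthesis experiments have been recorded for this molecule?"""
    row = conn.execute(
        "SELECT COUNT(*) FROM experiment WHERE osm_id = ? AND kind = 'synthesis'",
        (osm_id,),
    ).fetchone()
    return row[0]


def eln_entry(conn, experiment_id):
    """Render a plain-text ELN entry from the stored rows for one experiment."""
    kind, author, summary, osm_id, smiles = conn.execute(
        "SELECT e.kind, e.author, e.summary, m.osm_id, m.smiles "
        "FROM experiment e JOIN molecule m ON m.osm_id = e.osm_id "
        "WHERE e.id = ?",
        (experiment_id,),
    ).fetchone()
    return f"{kind.title()} of {osm_id} ({author})\nSMILES: {smiles}\n\n{summary}\n"


# Placeholder submission; the ID, author and text are invented for illustration.
conn = open_db()
exp_id = submit_form(conn, "EXAMPLE-1", "CCO", "synthesis", "A. Chemist", "First attempt.")
print(eln_entry(conn, exp_id))
print(count_attempts(conn, "EXAMPLE-1"))
```

Because the ELN entry and any SDF would both be generated from the same tables, the links between synthetic and biological entries come for free rather than being added by hand afterwards.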

Odd Jobs

For odd jobs where we need hosting or a bit of compute, we tend to use Nectar, a cloud-based provider for academic and research institutions in Australia. It provides two free instances to researchers, with specs reasonable enough that they can be used for most jobs. Debian or Ubuntu is typically the flavour of choice, but Nectar provides a wide range of images and snapshots, including versions of Scientific Linux. For jobs that may require significantly more processing we may rely instead upon EC2 instances.

Communication

There are several different means used for communication, with email being the least favoured (due to a lack of openness).

OSM has a Twitter account, a Google+ account and a Facebook page. In addition, a YouTube account is used to post the recorded videos of meetings.

The primary means of communicating issues requiring action/input (administration, science or technical) is on Github.

For news, the project has used old-school PDF-based newsletters to reach people unfamiliar with the platforms above, though an email-based newsletter has been discussed.

Publicity is important for the project to attract new inputs. Google ranking was assessed and could be improved (GHI 231) and a meeting was held to address a number of website-related issues. Related: GHI 64

Github

Github is used for project management - a place to keep the To Do list. Tasks are called "Issues" and may be assigned a person responsible, a deadline and some tags to allow active items to be grouped. When a task is complete, it can be closed.

Almost all code and data for the OpenSourceMalaria organisation account (and landing page website) is resident on one of the Github repositories. The main .sd file of all compounds, for example, is kept there. All other experimental data will be on the electronic lab notebook, or summarised on the wiki.

If you are still unable to find something, post an issue on the Github Issues (to do) list and tag it with "Administration" and "question".

Online Meetings

Online meetings use Adobe Connect, provided and hosted by the University of Sydney. As with everything else, these meetings are open to everyone, and each meeting is recorded and subsequently uploaded to the OSM YouTube account.