Biogang:Discussion/Project Ideas

At this point this is another placeholder, but I thought we could dump ideas and thoughts about possible projects here.

See summary of the FriendFeed discussion on the Elsevier Grand Challenge.

We could categorize ideas into the following buckets:

Tools
(standalone and web-based pieces of useful code)
 * would it be useful to provide links to existing code written by our community?
 * are community members interested in forming "sub-groups" based on programming experience and/or common interests?

''Matt and I (Deepak) were talking about listing existing code and experience somewhere as a starting point. These points above fit that perfectly.''


 * Another "tool" idea is to revive a project of mine (Paulo) that never took flight: InFasta. I can try to add some more ideas to the blog and we can even migrate it to Google AppEngine. The initial code was developed in C++ and the intention is to convert everything to Python.
 * The mention of Google AppEngine rang a bell in my head - would it be worth creating something resembling the MPI Bioinformatics Toolkit, starting from the original InFasta project? Basic functionality (sequence manipulation) is there and many tools could be run via an API - the most time-consuming part would be writing a script to forward outputs to another tool (a pretty unique feature among such websites). The toolkit is not available for download (and it's in Ruby anyway) - I think it's worth replicating. -> I think this is a good idea, starting with some file manipulation, then moving to tool integration with other systems.
 * I like the idea as well Deepak Singh 01:53, 20 June 2008 (UTC)
 * Second this. I'm a big fan of the MPI toolkit, an open source version would be great Andrew Perry
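
To make the "forward outputs to another tool" idea concrete, here is a minimal Python sketch. It is not InFasta code and not how the MPI toolkit works internally; every name in it (`parse_fasta`, `to_fasta`, `reverse_complement`, `min_length`, `run_pipeline`) is hypothetical. The only assumption is that each "tool" takes a dictionary of FASTA records and returns one, so that tools can be chained in any order.

```python
from typing import Callable, Dict, List

Records = Dict[str, str]           # header -> sequence (hypothetical in-memory format)
Tool = Callable[[Records], Records]

def parse_fasta(text: str) -> Records:
    """Read FASTA text into a dict, joining wrapped sequence lines."""
    records: Records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records

def to_fasta(records: Records, width: int = 60) -> str:
    """Write records back out as FASTA, wrapping sequences at `width`."""
    lines: List[str] = []
    for name, seq in records.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(records: Records) -> Records:
    """Example tool: reverse-complement every DNA sequence."""
    return {name: seq.translate(_COMPLEMENT)[::-1] for name, seq in records.items()}

def min_length(n: int) -> Tool:
    """Example tool factory: keep only sequences of at least n residues."""
    def tool(records: Records) -> Records:
        return {name: seq for name, seq in records.items() if len(seq) >= n}
    return tool

def run_pipeline(fasta_text: str, tools: List[Tool]) -> str:
    """Feed the output of each tool into the next one."""
    records = parse_fasta(fasta_text)
    for tool in tools:
        records = tool(records)
    return to_fasta(records)

if __name__ == "__main__":
    example = ">a\nACGT\nTT\n>b\nAC\n"
    print(run_pipeline(example, [min_length(3), reverse_complement]))
```

Tools that live on remote servers could be wrapped in the same one-argument shape, so the chaining code would not need to change; that part is left out here because it depends on APIs we have not chosen yet.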


 * I work on the bug tracking repository (with some help from the people behind Lighthouseapp) - does it fit into this bucket as a community effort (given it works)? Pawel Szczesny
 * I'd say bug tracking goes under "Tools" - Neil

Analyses
(collaborative blog posts or similar, our own Journal of Biogang Research?)
 * could this be the new form of Bio::Blogs?
 * That would be great. It's an idea that's come up in different discussions in the past. Something to replace Bio::Blogs as a collection of some of the more popular topics of interest.
 * We could share ideas for blog posts and, if there are enough people interested, such an (obviously longer) piece of work could be submitted to Nature Precedings (NP). Together it would be much easier to get into NP than writing everything by oneself. How does the impact of NP articles compare to that of blog posts?
 * I thought about this idea of using Bio::Blogs instead for collaborative longer posts (reviews, for example) that could eventually be published in journals as well, giving a stronger reward to the authors. Eventually the "bold" goal could be to write a collective book by collecting the different reviews together. This could be given away for free online and offered print-on-demand for a small fee.
 * In addition to the inFasta idea, I like this one a lot -- Deepak


 * The Bio::Blogs-style curation of topics seems excellent and could easily lead to a 2-3 page NP paper. It would be better, though, for everyone to choose a topic of interest and collect / review information from blog posts. Small groups of people can work on each paper based on the topic that interests them... I started a page with the title "Web 2.0 and online project communities in bioscience" (feel free to suggest a new title). We can start by submitting links from relevant blog posts to that page, and as people have time they can follow the links and write up or edit other people's writing.

Pure science
(projects possibly ending with publication)
 * this would be the best demonstration to (academic) sceptics that the process can work
 * can we devise a distributed data project, in which a large task (e.g. a genomic analysis) is broken up and sent out to community members?
 * all we need is a proof of concept, one that essentially proves that this is possible (and that it can scale)
 * as far as I know, a few sequencing consortia (for example tomato) do so-called community annotation - the annotation process is split between a few groups that use their own tools (domain annotation is done by one group, gene prediction by another)
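
As a rough proof-of-concept sketch of "broken up and sent out", the Python below splits a list of record identifiers into fixed-size work units, hands them out round-robin, and merges whatever comes back. All function names and the member names in the example are made up for illustration; how units would actually be delivered and results collected (email, wiki page, web service) is deliberately left open.

```python
from itertools import islice
from typing import Dict, Iterable, List

def make_work_units(record_ids: Iterable[str], unit_size: int) -> List[List[str]]:
    """Split the full task into consecutive chunks of at most unit_size records."""
    if unit_size < 1:
        raise ValueError("unit_size must be at least 1")
    it = iter(record_ids)
    units = []
    while True:
        unit = list(islice(it, unit_size))
        if not unit:
            break
        units.append(unit)
    return units

def assign_units(units: List[List[str]], members: List[str]) -> Dict[str, List[List[str]]]:
    """Hand out work units to community members in round-robin order."""
    if not members:
        raise ValueError("need at least one member")
    assignment: Dict[str, List[List[str]]] = {m: [] for m in members}
    for i, unit in enumerate(units):
        assignment[members[i % len(members)]].append(unit)
    return assignment

def merge_results(partial_results: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Combine per-unit results (record id -> annotation) into one table."""
    merged: Dict[str, str] = {}
    for part in partial_results:
        merged.update(part)
    return merged

if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5"]          # placeholder identifiers
    units = make_work_units(genes, unit_size=2)
    print(assign_units(units, ["member_a", "member_b"]))
```

Round-robin is only the simplest choice; if members have very different amounts of time, handing out one unit at a time on request would probably scale better.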


 * Building a mechanistic error model for the Solexa Sequencer
 * Basically the idea is to take the intensity data for well resolved spots on each cycle of a sequencing run
 * Using a known sequence it is possible to tell what the 'true read' should be
 * Test the experimental intensities for each base against mechanistic models of failure and use the fits to optimise model parameters
 * Possible models are:
 * All base insertions fail at same rate, no sequence context effects, no effect on next cycle (except obviously the wrong base goes in)
 * All base insertions fail at same rate, no sequence context effects, does affect insertion rate at next cycle
 * Bases fail at different rates, no sequence context effects, no effect on next cycle
 * Bases fail at different rates, failure depends on previous base identity, no effect on next cycle
 * etc etc
 * The models are relatively simple and data are available for testing.
 * Small, focussed project which is publishable, has support from the Sanger Centre, and is useful for them
 * The central issues could also be extended to the ABI systems, Pacific Biosciences and most other sequencing-by-synthesis approaches
 * Blog post with some comments.
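
To show what comparing two of the candidate models above might look like, here is a heavily simplified Python sketch. Big assumption, stated loudly: it works on miscall counts from already base-called reads, not on the per-cycle intensity data the project actually proposes to fit, and it ignores sequence context and next-cycle effects. It contrasts "all bases fail at the same rate" with "bases fail at different rates" by maximum likelihood and scores them with AIC. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

def count_errors(true_seq: str, reads: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count, per true base, how often it was observed and how often miscalled."""
    totals, errors = Counter(), Counter()
    for read in reads:
        for true_base, called in zip(true_seq, read):
            totals[true_base] += 1
            if called != true_base:
                errors[true_base] += 1
    return totals, errors

def _loglik(n_err: int, n_total: int, p: float) -> float:
    """Bernoulli log-likelihood of n_err failures in n_total trials at rate p."""
    ll = 0.0
    if n_err:
        ll += n_err * math.log(p)
    if n_total - n_err:
        ll += (n_total - n_err) * math.log(1.0 - p)
    return ll

def fit_shared_rate(totals: Counter, errors: Counter) -> Tuple[float, float]:
    """Model: every base fails at the same rate (one parameter)."""
    n, e = sum(totals.values()), sum(errors.values())
    p = e / n
    return p, _loglik(e, n, p)

def fit_per_base_rates(totals: Counter, errors: Counter) -> Tuple[Dict[str, float], float]:
    """Model: each base has its own failure rate (one parameter per base)."""
    rates = {b: errors[b] / totals[b] for b in totals}
    ll = sum(_loglik(errors[b], totals[b], rates[b]) for b in totals)
    return rates, ll

def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * loglik

if __name__ == "__main__":
    true_seq = "ACGT"                               # toy data, not real reads
    reads = ["ACGT", "ACGA", "ACGT", "TCGT"]
    totals, errors = count_errors(true_seq, reads)
    p, ll_shared = fit_shared_rate(totals, errors)
    rates, ll_per_base = fit_per_base_rates(totals, errors)
    print("shared rate:", p, "AIC:", aic(ll_shared, 1))
    print("per-base rates:", rates, "AIC:", aic(ll_per_base, len(rates)))
```

The per-base model can never fit worse than the shared-rate one, which is why a penalised score such as AIC (or a likelihood-ratio test) is needed to decide whether the extra parameters are justified. The context-dependent and next-cycle models would need richer counts and are not covered here.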