BioSysBio:abstracts/2007/Derek Gatherer

Less is More: The Battle of Moore's Law against Bremermann's Limit on the Field of Systems Biology

Author(s): Derek Gatherer
Affiliations: MRC Virology Unit, Institute of Virology, Glasgow
Contact (email): d.gatherer at mrcvu dot gla dot ac dot uk
Keywords: finite class, computability, combinatorial explosion, network size

Background

I run my bioinformatics tasks on two machines. The first is a Tru64 DS20 AlphaServer, bought in 1999, with two processors running at 512 MHz and 2 GB of memory. The second is a custom-built Linux box, purchased in early 2005, with 8 processors running at 2.7 GHz and 12 GB of memory. Although performance does not scale exactly linearly with clock speed, this represents just over a 5-fold increase in computing power over a period of 6 years.

This kind of thing has been happening since the 1960s, when Intel co-founder Gordon Moore (1965) observed that available processing power doubles every 2 years or so. Indeed, my "fast" Linux box is already quite pedestrian compared to the 3.8 GHz processors that are now routinely available. Under these circumstances, it is easy to become complacent about handling awkward jobs. Large, viral-genome-scale ClustalW jobs that used to run for several days on my DS20 are now finished overnight. Within the horizon of a typical 3-year scientific project, I could conceivably buy a machine that shortens the time to a couple of hours or so.

But just because bioinformatics tasks are becoming increasingly trivial in terms of computer time, does that mean we can expect similar gains on systems biology problems? There is a whole industry of popular science books attempting to persuade us that "Moore's Law" will be the basis of a future world in which anything is computable almost instantly, and in which we will, even within our lifetimes, know "everything" (e.g. Kurzweil 2005). How seriously should we take such claims, and what relevance do they have to research in the here and now?

The "Ultimate Laptop"

Seth Lloyd (2000) contrasts a typical current processor, say one running at 3 GHz and corresponding to just over 10^10 operations per second (FLOPS; GHz and FLOPS are not directly interconvertible, since other aspects of the architecture are involved, and the figure here is for a Pentium 4), with what he calls the "ultimate laptop", a machine of approximately the size and weight of a modern laptop performing at the limits of what is physically possible. This machine would run at 10^51 FLOPS, a figure based on the maximum information-carrying capacity of 1 kg of matter. The figure originates in the work of Hans Bremermann (1982), who was the first to perform calculations of this kind based on the equations of quantum mechanics, and after whom "Bremermann's Limit" is named. Lloyd points out that such an ultimate machine, in order to utilise every single atom for information processing (that is why it is ultimate), would need to run at a temperature of 10^9 Kelvin. A more practical limit could therefore be around 10^40 FLOPS. If one assumes that Moore's Law continues to hold at its present rate, such machines may be available around the year 2105-2140 or so; for various technical reasons, Lloyd slackens this figure to around the year 2250. Of course, this is a highly speculative projection, because current computers are only likely to obey Moore's Law for a decade or two more, by which time integrated circuits will have reached atomic-level detail, and the ultimate computer requires quantum computing, the details of which are still very hazy. However, the question here is not how soon we will get an ultimate laptop, but rather what good it would do us.
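
Dates of this kind are easy to reproduce with a back-of-envelope calculation. The sketch below assumes a baseline of roughly 10^10 FLOPS around 2005; the projected year is highly sensitive to the assumed doubling time, which is one reason why published estimates differ so widely.

```python
import math

# Rough sketch of how such dates are obtained: count the Moore's Law doublings
# needed to get from a present-day processor (assumed here to be ~10^10 FLOPS
# in 2005) up to Lloyd's limits. The answer depends strongly on the assumed
# doubling time.
baseline_flops = 1e10   # assumed ~3 GHz Pentium-class machine
baseline_year = 2005    # assumed reference year

targets = [("practical limit, ~10^40 FLOPS", 1e40),
           ("ultimate laptop, ~10^51 FLOPS", 1e51)]

for name, target_flops in targets:
    doublings = math.log2(target_flops / baseline_flops)   # ~100 and ~136
    for doubling_time in (1.0, 1.5, 2.0):                  # years per doubling
        year = baseline_year + doublings * doubling_time
        print(f"{name}, doubling every {doubling_time:g} y: ~{year:.0f}")
```

With annual doubling the two targets fall around 2105 and 2141; with slower doubling they slip well into the 23rd century.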

The Ultimate Laptop Explores Exhaustively

Jedrzejas & Huang (2003) identified, by genomics, 193 genes involved in spore formation and degradation in Bacillus. Imagine that this network is represented as 193 binary nodes in the simplest possible star topology (i.e. each gene is either on or off, all genes have equal weight in determining the state of the pathway, and there are no interactions between genes). Exhaustive exploration of such a state space requires 2^193, or about 10^58, operations. My Linux box would need about 10^48 seconds for such an exploration, or about 4×10^40 years. The ultimate laptop does a little better, managing it in just under 4×10^10 years, or slightly under 3 times the age of the universe. Parallelisation (the ultimate grid of ultimate laptops) is one solution: if the entire surface area of the UK were covered in ultimate laptops, we could wire a grid of just under 10^12 nodes, which would allow the state space to be explored in just over 15 days. However, even though that solves the theoretical problem, it is worth considering that if the entire mass of the Earth were composed of Bacillus, there would be a maximum of only 7×10^39 experimental states (i.e. individual bugs) against which to validate our computer model. We would therefore need 2×10^18 Earth-like planets composed entirely of bacteria in order to validate the system experimentally. The experimental system constitutes a "finite class" (Elsasser 1958), meaning one in which the number of theoretically possible states vastly exceeds the number of instances of the system that can actually exist.
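
The order-of-magnitude arithmetic behind these figures can be checked in a few lines. The machine speeds, the size of the UK grid and the mass of a single Bacillus cell used below are rough, assumed round numbers, so the outputs are indicative only.

```python
# Back-of-envelope arithmetic for the 193-gene binary network. All of the
# quantities below are rough, assumed round numbers.
SECONDS_PER_YEAR = 3.15e7
AGE_OF_UNIVERSE_YEARS = 1.4e10

states = 2 ** 193                    # ~1.25e58 states in the binary star topology
linux_flops = 1e10                   # assumed throughput of a present-day box
ultimate_flops = 1e40                # Lloyd's "practical" ultimate laptop

linux_years = states / linux_flops / SECONDS_PER_YEAR        # ~4e40 years
ultimate_years = states / ultimate_flops / SECONDS_PER_YEAR  # ~4e10 years

grid_nodes = 1e12                    # ultimate laptops tiling the UK (assumed)
grid_days = states / (ultimate_flops * grid_nodes) / 86400   # roughly two weeks

earth_mass_kg = 6e24                 # approximate mass of the Earth
bacillus_mass_kg = 1e-15             # assumed mass of a single Bacillus cell
max_bugs = earth_mass_kg / bacillus_mass_kg                  # ~10^39-10^40 cells
planets_needed = states / max_bugs                           # ~2e18 planets

print(f"states to explore:      {states:.2e}")
print(f"Linux box:              {linux_years:.0e} years")
print(f"ultimate laptop:        {ultimate_years:.0e} years "
      f"(~{ultimate_years / AGE_OF_UNIVERSE_YEARS:.1f}x age of universe)")
print(f"UK-sized ultimate grid: {grid_days:.1f} days")
print(f"planets of bacteria needed to realise every state: {planets_needed:.0e}")
```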

How Do We Reduce the Search Space?

One alternative is simply to study only tractable problems. On my Linux box, I am prepared to wait a month for a result. This would allow me to explore a state space of about 2.7×10^16 possibilities; a binary star-topology network of 54 genes is thus as much as I ought to consider. If I wait for the arrival of the ultimate laptop, I might consider 154 genes, but after that I am going to have to start parallelising. It is a sobering thought that the transition from present-day computing capacity to ultimate computing capacity will only increase the tractable size of a binary star network by a mere 3-fold.
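
The tractable network size for a given time budget is simply the largest n with 2^n ≤ FLOPS × seconds. A minimal sketch, assuming (as above) roughly 10^10 FLOPS for the present-day machine and 10^40 FLOPS for the ultimate laptop:

```python
import math

# Largest binary star-topology network that can be explored exhaustively in a
# given time budget: the biggest n with 2^n <= FLOPS * seconds.
def max_tractable_genes(flops: float, seconds: float) -> int:
    return math.floor(math.log2(flops * seconds))

MONTH_SECONDS = 30 * 24 * 3600   # ~2.6e6 seconds in the one-month budget

print(max_tractable_genes(1e10, MONTH_SECONDS))   # present-day box: 54 genes
print(max_tractable_genes(1e40, MONTH_SECONDS))   # ultimate laptop: 154 genes
```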

Shortcuts therefore need to be found. Computer scientists have developed various heuristics for solving combinatorially explosive problems, such as chess or natural-language parsing. What these have in common is that they use rules to prune the number of possible states. The problem in biology is that, unlike in chess, we frequently have only the haziest idea of what the rules might be. Nevertheless, there are some approaches that may be useful. For instance, Hofmann et al. (2006) attempt to define components of networks that behave as modules. The fact that real biological networks do not behave as independent star topologies may be the key to the reduction of complexity. In a star topology every node has an independent effect on the outcome, but where cycles exist in the network topology it may be possible to treat whole segments of the network as fuzzy individuals. Certain parts of networks could then conveniently be "black-boxed", losing some information in the process but critically pruning the tree of possibilities. Another source of potential ideas is the field of quantitative genetics. Animal and plant breeders have long wrestled with systems in which quantitative traits such as fertility, body mass or milk production are controlled by large numbers of small-effect genes (e.g. Agrawal et al. 2001). Although 193 genes have some effect on sporulation in Bacillus, the subset that accounts for most of the variance may be rather small. One possible methodology, combining genomics with modelling at the functional level, is given by Tardieu (2003). Returning to the chess analogy, we need to find rules in small-scale networks and then apply them to prune large-scale ones.
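
As a purely numerical illustration of what black-boxing buys (the module size and the number of effective states per module below are hypothetical, chosen only to show the scale of the pruning):

```python
import math

# Hypothetical illustration of the pruning gained by "black-boxing" modules:
# partition the 193 sporulation genes into modules and treat each module as a
# single fuzzy unit with a handful of effective states. The module size and
# the number of states per module are arbitrary choices made for illustration.
N_GENES = 193
full_states = 2 ** N_GENES                       # ~1.25e58 states

module_size = 10                                 # assumed genes per module
states_per_module = 4                            # assumed states per black box
n_modules = math.ceil(N_GENES / module_size)     # 20 modules
reduced_states = states_per_module ** n_modules  # 4^20, ~1.1e12 states

print(f"full network: {full_states:.1e} states")
print(f"black-boxed:  {reduced_states:.1e} states "
      f"({full_states / reduced_states:.0e}-fold pruning)")
```

Even with these arbitrary figures, the reduced space of roughly 10^12 states could be swept in minutes on a present-day machine, rather than in multiples of the age of the universe.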

Conclusion

Because network modelling involves combinatorial explosion, even a hypothetical ultimate computer will only be able to explore by brute force networks around 3 times larger than those that can be handled by present-day machines. Systems biologists should therefore not be tempted to set complex problems aside in the hope that Moore's Law will produce future machines able to solve them by sheer power - they won't! Instead, the best way to ensure that we can deal with the complex is to concentrate on the small scale. By exhaustively understanding small networks, we can learn to abstract them as modules, feeding them as approximations into larger networks. As well as the computer science literature on complexity reduction, systems biologists should also explore the work of quantitative geneticists, who have several decades of experience in methods for reliable estimation of phenotypic output from complex and poorly understood genetic input. Moore is Less, but Less is More.

References

Agrawal AF, Brodie ED & Rieseberg LH (2001) Possible consequences of genes of major effect: transient changes in the G-matrix. Genetica 112-113, 33-43.
Bremermann HJ (1982) Minimum energy requirements of information transfer and computing. Int. J. Theor. Phys. 21, 203-217.
Elsasser WM (1958) The Physical Foundations of Biology: An Analytical Study. London: Pergamon Press.
Hofmann KP, Spahn CM, Heinrich R & Heinemann U (2006) Building functional modules from molecular interactions. Trends Biochem. Sci. 31, 497-508.
Jedrzejas MJ & Huang WJM (2003) Bacillus species proteins involved in spore formation and degradation: from identification in the genome, to sequence analysis, and determination of function and structure. Critical Reviews in Biochemistry and Molecular Biology 38, 173-198.
Kurzweil R (2005) The Singularity is Near: When Humans Transcend Biology. London: Viking/Penguin.
Lloyd S (2000) Ultimate physical limits to computation. Nature 406, 1047-1054.
Moore GE (1965) Cramming more components onto integrated circuits. Electronics Magazine, 19 April 1965.
Tardieu F (2003) Virtual plants: modelling as a tool for the genomics of tolerance to water deficit. Trends Plant Sci. 8, 9-14.