Open writing projects/Scientific Programming with Python and Subversion/Outline

From OpenWetWare

Outline

Preface

Why this book?

Current thinking in the computationally intensive sciences values algorithm development and results over methodology. There is plenty of information about what you can do with computers in biology, chemistry, physics, and informatics, but little training in how to do it in a scientifically rigorous way. As a result, scientists often lack the training in computational methodology that ensures data integrity and reproducibility.

By combining programming practices used by professional software engineers with modern programming tools, this book teaches scientists a computational workflow that can be used in any computational application. This structured workflow makes it easier to manage large, data-intensive investigations, and at its core promotes data integrity and reproducibility.

How to use this book

Scientific Programming with Python and Subversion takes the reader through a typical scientific investigation from start to finish: generating raw data, processing the data according to hypotheses, creating visualizations to evaluate those hypotheses, and modifying the processing code in light of new hypotheses. Along the way we introduce the pieces of our scientific workflow and the specific tools that carry them out.

In each chapter we introduce one new concept in the workflow, along with a tool (or an aspect of a tool) that helps carry it out. Appendixes provide more in-depth technical detail on each tool where appropriate.

Scientists just beginning their journey into computational investigation should read the book from start to finish. More experienced scientists might want to start with the first chapter to see which parts of the workflow they can incorporate into their current work, and then skip to relevant chapters.

At its heart, this book is about the concepts behind the scientific workflow that promote data integrity and reproducibility, independent of the particular tools used. As this work progresses, new tools will emerge that might be better suited to specific aspects of the workflow. In such cases, we include separate sections describing these alternative ways to carry out the task.

All our examples are written in Python, a flexible programming language well suited to the demands of computational science. We manage our workflow in Subversion, a powerful piece of version control software. Based on our experience (and we've tried lots of languages and workflow management tools), we feel that Python and Subversion are the best combination. If you have a favorite language or tool, then please read along to learn the concepts behind the workflow and use your preferred tools to carry them out. If you feel that your tools are better, send us email; or even better, consider contributing your tools or comments to this project. We'll be happy to incorporate your ideas.

Part I: A Typical Scientific Investigation

1 Hypothesis, Experimentation, Analysis: The Scientific Workflow

I would like to model the style of this section on the first part of 'Agile Web Development with Rails'. In it, David Heinemeier Hansson goes through the story of a typical programmer-client relationship as they develop an online store application. The project moves from sketching the visual design of the application on paper with the client, to the first steps in implementation, then client evaluation, and finally tweaking of the application (the lather-rinse-repeat cycle).

For this book, I am imagining a very similar section where we start out with a hypothesis, gather preliminary data, construct simulation code (or something equivalent) to generate more data based on our hypothesis, visualize the data to evaluate our hypothesis, then tweak the simulation code based on what we learned. This chapter will be all prose, with highlighted hints of the key aspects of the workflow introduced, and possibly showing the actual visualizations created in later chapters.

The specific aspects of the workflow that we will introduce (and cover in later chapters) are:

  • version control of everything (data, code, plots, write-ups) to maintain data integrity (ability to roll back changes, ease of maintaining an updated version of the code)
  • modular code writing to promote code reusability, code testing, and creation of workflow scripts
  • workflow scripts that capture an annotated record of the steps taken as executable code; coupled with versioning, this makes results reproducible and provides a written record of data provenance
  • code testing to ensure data integrity
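
As a rough illustration of what a workflow script could look like, here is a minimal sketch. The function names, the default parameters, and the shape of the provenance record are all hypothetical placeholders, not part of the planned example story. Each step is a small named function, and the script records the parameters it ran with.

```python
import random
import statistics


def generate_raw_data(n_points, seed):
    """Step 1: generate raw data (a stand-in for a real measurement or simulation)."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated exactly
    return [rng.gauss(10.0, 2.0) for _ in range(n_points)]


def process(raw):
    """Step 2: process the raw data; here, subtract the mean."""
    mean = statistics.mean(raw)
    return [x - mean for x in raw]


def summarize(processed):
    """Step 3: reduce the processed data to the numbers we would report."""
    return {"n": len(processed), "stdev": statistics.stdev(processed)}


def run_workflow(n_points=100, seed=42):
    """Run every step in order; return the summary and a provenance record."""
    raw = generate_raw_data(n_points, seed)
    processed = process(raw)
    summary = summarize(processed)
    provenance = {
        "n_points": n_points,
        "seed": seed,
        "steps": ["generate_raw_data", "process", "summarize"],
    }
    return summary, provenance


if __name__ == "__main__":
    summary, provenance = run_workflow()
    print(summary)
    print(provenance)
```

Because the seed and parameters live in the script, committing the script alongside its output ties each result to the exact steps that produced it.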

The tools we will use for these are:

  • subversion for version control - this will be introduced at the very beginning since everything in the project should be versioned
  • python for modular code writing, scientific code, data visualization and code testing
  • other computational tools where appropriate (something to address the speed of python for computationally intensive projects)

Discussion Point

The context of this chapter could be one or more of the following:

  • A bioinformatics investigation modeled on an NCBI coffee break
  • Evaluation of a physics theory against experimental data
  • Computational chemistry example
    • deconstruct the problem into manageable parts

If we focus on one of these, we run the risk of missing a large audience that would be interested in this book. If we do all of these, then we run the risk of being too confusing. Maybe we should start by focusing on one of them, and use the wiki to tell the other stories in a modular fashion?

Lorrie LeJeune: We could leave it as one long chapter explaining the workflow but with three example boxes sprinkled throughout or we could break it into four short chapters: Introduction + 3 example chapters. Part II would then start as chapter 2 or chapter 5.

Part II: The Computational Workflow

2 Source control management with subversion

  • What is version control?
    • Similar to Word 'track changes' or wiki 'history' but for all the files in a project.
    • A way to keep a history of every step in a process.
    • Not only for computer code, but for data, plots, paper manuscripts, etc. (files)
  • Why use version control?
    • maintain data integrity
    • ability to roll back changes
    • ease of maintaining an updated version of the files
    • multiple collaborators working on files
    • ability to reference specific versions of files in lab notebooks
  • An introduction to Subversion
    • What is a repository?
    • How to create a repository
    • What is a commit and how to make one
    • Seeing differences between versions
    • Retrieving past versions
    • Collaboration using subversion
  • Advanced Topics
    • Branching and Merging
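
To keep all examples in Python, the basic Subversion steps listed above can be sketched as a small driver script. This is only a sketch: the paths and the file name are hypothetical, the repository URL assumes an absolute POSIX path, and by default the script only prints the commands (a dry run). Running with dry_run=False assumes that svn and svnadmin are installed and that the file already exists in the working copy.

```python
import subprocess


def first_session(repo_dir, work_dir, filename):
    """Return (command, working_directory) pairs for a first Subversion session."""
    repo_url = "file://" + repo_dir  # assumption: repo_dir is an absolute POSIX path
    return [
        (["svnadmin", "create", repo_dir], None),            # create a repository
        (["svn", "checkout", repo_url, work_dir], None),     # get a working copy
        (["svn", "add", filename], work_dir),                # put a file under version control
        (["svn", "commit", "-m", "Add " + filename], work_dir),  # make a commit
        (["svn", "diff", filename], work_dir),               # show uncommitted changes
        (["svn", "log", filename], work_dir),                # show the file's history
        (["svn", "cat", "-r", "1", filename], work_dir),     # print the file as of revision 1
    ]


def run(session, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for command, cwd in session:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Hypothetical paths, used only for illustration.
    run(first_session("/tmp/demo_repo", "/tmp/demo_wc", "raw_data.csv"))
```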

3 A brief introduction to python

Why use python for scientific programming?

  • This section is mostly prose.
  • What is python?
    • computer language that offers easy access to high-level functions, and has a large and growing community of scientific users
  • Why build scientific applications in python?
    • python code looks clean - easy to understand your own or your collaborators' code a week later
    • everything from data generation to analysis to plots can be done in python, making every aspect of your project consistent. These together promote good scientific practices (data integrity, data reproducibility)

Introduction to python

  • What the scientist needs to know to get started
    • variable assignment
    • basic control structures
    • functions
    • package structure and import
    • objects (just like packages)
    • References to Programming Python for more detail, and A Byte of Python and Dive Into Python for more intro material
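
The topics above could be introduced with something as small as the following sketch. The data values and the Sample class are made up for illustration.

```python
import math  # import: pull in a module from the standard library

# Variable assignment: names are bound to values, with no type declarations.
temperatures_c = [21.5, 22.1, 19.8, 23.4]


# A function groups reusable steps under one name.
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# Control structures: a for loop with an if/else inside it.
labels = []
for t in temperatures_c:
    if t > mean(temperatures_c):
        labels.append("above")
    else:
        labels.append("below")


# An object bundles data with the functions that act on it.
class Sample:
    def __init__(self, name, values):
        self.name = name
        self.values = values

    def spread(self):
        """Return the sample standard deviation of the values."""
        m = mean(self.values)
        squared = sum((v - m) ** 2 for v in self.values)
        return math.sqrt(squared / (len(self.values) - 1))


sample = Sample("bench-1", temperatures_c)
print(labels)
print(sample.name, round(sample.spread(), 3))
```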

4 Data generation in python

  • This section will go over the python code to generate the preliminary data for the story in section I
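
Since the example story is not yet fixed, the following is only a placeholder sketch of what data generation code might look like. It assumes a noisy exponential decay as the "experiment"; the function names, the parameters, and the CSV column names are all hypothetical.

```python
import csv
import io
import math
import random


def generate_decay_data(n_points=50, rate=0.3, noise=0.05, seed=1):
    """Return (t, y) pairs for y = exp(-rate * t) plus Gaussian noise."""
    rng = random.Random(seed)  # seeded so the same data can be regenerated
    rows = []
    for i in range(n_points):
        t = 0.2 * i
        y = math.exp(-rate * t) + rng.gauss(0.0, noise)
        rows.append((t, y))
    return rows


def write_csv(rows, handle):
    """Write the rows to an open text file (or file-like object) with a header."""
    writer = csv.writer(handle)
    writer.writerow(["t", "y"])
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()  # stand-in for a real file opened with open(...)
    write_csv(generate_decay_data(), buffer)
    print(buffer.getvalue()[:60])
```

Writing to a handle rather than a hard-coded path keeps the generator easy to test and lets the workflow script decide where the file lives in the repository.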

5 Making scientific plots with python

  • This section will make the first visualizations of the data generated in section II.3
  • An introduction to matplotlib
    • basic functionality - simple line, bar, histogram plots
    • more sophisticated graphics - insets, labeling with text, drawing arrows
    • interactive graphics - adjusting parameters for real-time fitting
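
A first matplotlib example might combine several of the features listed above. The sketch below assumes matplotlib is installed and uses made-up decay data; it draws a line plot with a text label and arrow, a histogram, and an inset, then saves the figure to a file. Interactive graphics are left out because they need a display.

```python
import math
import os
import random
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no window needed
import matplotlib.pyplot as plt


def make_figure(path, rate=0.3):
    """Plot simulated decay data against the model and save the figure to path."""
    rng = random.Random(0)
    t = [0.2 * i for i in range(50)]
    model = [math.exp(-rate * ti) for ti in t]
    data = [m + rng.gauss(0.0, 0.05) for m in model]
    residuals = [d - m for d, m in zip(data, model)]

    fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(9, 4))

    # Basic line plot with axis labels and a legend.
    ax_line.plot(t, data, "o", markersize=3, label="simulated data")
    ax_line.plot(t, model, label="model")
    ax_line.set_xlabel("time")
    ax_line.set_ylabel("signal")
    ax_line.legend(loc="lower left")

    # Text label with an arrow pointing at the half-life on the model curve.
    half_life = math.log(2) / rate
    ax_line.annotate("half-life", xy=(half_life, 0.5), xytext=(5.0, 0.8),
                     arrowprops=dict(arrowstyle="->"))

    # Histogram of the residuals.
    ax_hist.hist(residuals, bins=10)
    ax_hist.set_xlabel("residual")
    ax_hist.set_ylabel("count")

    # Inset axes: [left, bottom, width, height] as fractions of the figure.
    inset = fig.add_axes([0.30, 0.55, 0.15, 0.30])
    inset.plot(t[:10], data[:10], "o", markersize=2)
    inset.set_title("first 10 points", fontsize=8)

    fig.savefig(path)
    plt.close(fig)
    return path


if __name__ == "__main__":
    print(make_figure(os.path.join(tempfile.mkdtemp(), "decay.png")))
```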

5 Crunching numbers with python

  • This section will go over the next generation of data generation code for the story in section I, and build off material in section II.3
  • Python community modules
    • using numpy for matrix manipulations
    • using the scipy project tools
    • interacting with the GNU Scientific Library
  • Also go over
    • separating your code into modules
    • using objects to encapsulate the code cleanly
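
As one possible shape for this chapter's code, the sketch below assumes numpy is installed and uses a hypothetical LineFit class. It shows a matrix-based least-squares fit wrapped in an object so that the data and the fitted parameters travel together.

```python
import numpy as np


class LineFit:
    """Least-squares fit of y = slope * x + intercept."""

    def __init__(self, x, y):
        self.x = np.asarray(x, dtype=float)
        self.y = np.asarray(y, dtype=float)
        # Design matrix: one column for x, one column of ones for the intercept.
        design = np.column_stack([self.x, np.ones_like(self.x)])
        coeffs, _, _, _ = np.linalg.lstsq(design, self.y, rcond=None)
        self.slope, self.intercept = coeffs

    def predict(self, x):
        """Evaluate the fitted line at x."""
        return self.slope * np.asarray(x, dtype=float) + self.intercept

    def residuals(self):
        """Return observed minus fitted values."""
        return self.y - self.predict(self.x)


if __name__ == "__main__":
    fit = LineFit([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8])
    print(fit.slope, fit.intercept)
    print(fit.residuals())
```

Placing a class like this in its own module lets the workflow script import it and lets unit tests check it in isolation.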

6 Lather, rinse, repeat: A general methodology for approaching scientific problems

  • a general methodology for approaching the scientific problems
    • keep everything in a repository
    • start with the simplest possible task and write a script for it
    • move this code into a module and write unit tests for it
    • objectify the code when appropriate
    • identify speed bottlenecks if needed, and speed up those parts
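
For the last step, a simple way to check whether a rewrite is worth keeping is to time the old and new versions and confirm they agree. The sketch below uses the standard library timeit module with two hypothetical implementations of the same calculation; no timing figures are shown because they depend on the machine.

```python
import timeit


def sum_of_squares_loop(n):
    """Straightforward version: add up i*i for i in 0..n-1 in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_formula(n):
    """Faster version: closed form (n-1) * n * (2n-1) / 6."""
    return (n - 1) * n * (2 * n - 1) // 6


def compare(n=100_000, repeats=20):
    """Check that both versions agree, then return their total run times in seconds."""
    assert sum_of_squares_loop(n) == sum_of_squares_formula(n)
    loop_time = timeit.timeit(lambda: sum_of_squares_loop(n), number=repeats)
    formula_time = timeit.timeit(lambda: sum_of_squares_formula(n), number=repeats)
    return loop_time, formula_time


if __name__ == "__main__":
    loop_time, formula_time = compare()
    print("loop:    %.4f s" % loop_time)
    print("formula: %.4f s" % formula_time)
```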

Part III: Advanced Topics

  • these are second priority, but we should make notes as we find ways to reference them in previous chapters

7 Unit testing for scientists

  • What is unit testing?
    • A way to generate automated tests of small units of code
  • Why do unit testing?
    • example: switching a sorting algorithm - how do you know the code still works the same way?
      • typically done by eye: running the code manually and looking at the output
      • with unit tests you can see whether the code failed and, if it did, exactly where
  • Using python and nose to write unit tests
    • example of test code, and how to run the tests following our example story line
  • How do I know which tests to write?
    • (This one is hard)
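
The sorting example could be sketched as follows. Both sort functions and the test cases are hypothetical stand-ins for the story's real code. The tests are plain functions whose names start with test_, which is the naming convention a test runner such as nose uses to find them.

```python
def insertion_sort(values):
    """The 'old' algorithm in this sketch: return a new sorted list."""
    result = []
    for v in values:
        i = len(result)
        while i > 0 and result[i - 1] > v:
            i -= 1
        result.insert(i, v)
    return result


def merge_sort(values):
    """The 'new' algorithm in this sketch: return a new sorted list."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left, right = merge_sort(values[:mid]), merge_sort(values[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


CASES = [[], [1], [3, 1, 2], [5, 5, 1, 0, -2], [2.5, -1.0, 2.5, 0.0]]


def test_new_sort_matches_old_sort():
    for case in CASES:
        assert merge_sort(case) == insertion_sort(case)


def test_new_sort_matches_builtin():
    for case in CASES:
        assert merge_sort(case) == sorted(case)


def test_input_is_not_modified():
    data = [3, 1, 2]
    merge_sort(data)
    assert data == [3, 1, 2]
```

If this were saved as test_sorting.py, running nosetests in that directory would collect and run the three tests and report any failure along with the assertion that triggered it.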

8 Using SWIG and psyco to speed up python code

  • What if python is not fast enough for my project?
    • Several options:
      • Use psyco to 'compile' the python code
      • Identify the slow parts and write them in C/C++ and bind them to python using SWIG
  • Using psyco
  • Using C with SWIG
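
The psyco option might look like the sketch below. The pairwise_energy function is a made-up example of loop-heavy code. Psyco was only released for Python 2, so the import is wrapped in a try block; where psyco is not available, the function simply runs unaccelerated. The SWIG route needs C source and an interface file, so it is not sketched here.

```python
def pairwise_energy(positions):
    """Sum 1/r over all pairs of 1-D positions: a deliberately loop-heavy function."""
    total = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            total += 1.0 / abs(positions[i] - positions[j])
    return total


try:
    import psyco  # assumption: only importable on an old Python 2 installation
    psyco.bind(pairwise_energy)  # ask psyco to compile just this function
except ImportError:
    pass  # psyco not available: fall back to plain Python


if __name__ == "__main__":
    print(pairwise_energy([0.0, 1.0, 3.0]))
```

Binding only the slow function, rather than the whole program, keeps the change small and easy to remove if it does not help.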

Appendixes

  • list of possible appendixes
    • svn cheat sheet
    • python language reference
    • useful python links to community modules