This document describes a group of programs I wrote to make life
easier when using the LSF scheduler we have in orchestra. I call the
set of programs PyPlatform.
Often I submit a job that takes 5 minutes to run in an interactive
session, yet in a 15m queue it dies with a timeout. Sometimes the
network filesystem in orchestra is slow, or the node my job was
running on is too busy. In either case, I wanted a way to detect
this situation and have my job rerun automatically.
Another annoying and recurring problem is that my jobs sometimes
die because of transient errors in the network, or because a
job did not have the home or group directories mounted.
Yet other times, my job runs for 16 minutes. My estimate of 15m was
reasonable, but not quite enough. In those cases, I wanted my job to
be rerun in a 2h queue. One could submit everything to the unlimited
queues, but those have a lower priority.
The set of programs I wrote has two main parts: a dispatcher and
command-line tools to interface with the dispatcher.
The dispatcher is a job that needs to run periodically on one of the
login nodes. We can achieve this using cron (see Installation). This
program checks the status of the submitted jobs, relaunches dead or
timed-out jobs, and performs general bookkeeping.
If a job dies, the dispatcher will notice and decide what to do.
If the job was killed because of some error, it will be rerun as
it was submitted, up to a number of times you can configure. If it
was killed because of a timeout, the dispatcher will decide whether
the timeout was legit (if the job used very little CPU time, the
timeout is not considered legit), and then either rerun the job in
the same queue or move it to a queue with more time available
(e.g. bump it from a 2h queue to a 12h queue).
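The decision described above can be sketched in Python as follows.
This is only an illustration, not PyPlatform's actual code: the
function and field names, the order of the queues, the retry limit
and the 10% CPU threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical queue ladder, shortest to longest. The real queue names
# and the order PyPlatform uses may differ.
QUEUES = ["15m", "2h", "12h", "unlimited"]

# Hypothetical tunables; the real settings live in ~/.PyPlatform/config.
MAX_RETRIES = 3
MIN_CPU_FRACTION = 0.10


@dataclass
class DeadJob:
    queue: str            # queue the job was running in
    timed_out: bool       # True if the job was killed for exceeding the limit
    cpu_seconds: float    # CPU time the job actually used
    limit_seconds: float  # run-time limit of that queue
    retries: int          # how many times the job has been rerun already


def decide(job):
    """Return (action, queue), where action is 'rerun' or 'give_up'."""
    if not job.timed_out:
        # Killed by some error: rerun as submitted, up to a limit.
        if job.retries < MAX_RETRIES:
            return ("rerun", job.queue)
        return ("give_up", job.queue)
    if job.cpu_seconds < MIN_CPU_FRACTION * job.limit_seconds:
        # The job barely used any CPU, so the timeout is not legit
        # (slow filesystem, busy node): try the same queue again.
        return ("rerun", job.queue)
    # Legit timeout: move one step up the ladder, staying on the last
    # queue if the job is already there.
    position = QUEUES.index(job.queue)
    next_queue = QUEUES[min(position + 1, len(QUEUES) - 1)]
    return ("rerun", next_queue)
```

For example, a job that timed out in the 2h queue after using nearly
all of its time would be moved to the 12h queue, while one that used
only a few seconds of CPU would be retried where it was.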
The command line tools are mysub, myjobs and mykill. mysub supports
a subset of the functionality of bsub. mykill is essentially like
bkill, and myjobs prints the LSF job id of the jobs that PyPlatform
is taking care of.
Follow these instructions, and hopefully you will have everything running.
1) Log in to orchestra (you will actually login to either mezzanine or
2) cd to your home directory and execute
svn checkout svn+ssh://orchestra.med.harvard.edu/home/et62/svnroot/PyPlatform
This will create a directory PyPlatform wherever you were standing.
3) cd to PyPlatform/trunk and install everything by executing
4) Tweak the file ~/.PyPlatform/config to suit your taste. If I were you, I
would leave everything as it is, but feel free to play.
5) Install PyPlatform in your crontab by executing
crontab -e
and adding the line
*/1 * * * * bash -login /home/et62/.PyPlatform/forcron
anywhere in the file. REPLACE et62 with your orchestra
username. crontab -e will launch a text editor. You will
probably have no trouble using it.
If you want it to run every 3 minutes, replace */1 with */3. I
feel that 1 minute is great if you are launching many short jobs,
and 5 works more than fine if you are running longer jobs. The
dispatcher checks whether another instance is already
running, so don't worry about overlapping runs.
6) Add ~/bin to your PATH, if it's not already there. You can check
this by editing the file
I think the default version in orchestra includes your home
bin directory, if it exists. If nothing of the sort is present in
the file, you can accomplish this effect by adding the line
7) Log out of orchestra, and log back in.
8) Learn how to use PyPlatform by reading the usage section.
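Step 5 mentions that the dispatcher checks whether another instance
is running. How PyPlatform does this is not described here; one
common way to get that behaviour is a non-blocking lock on a file,
sketched below in Python. The function name and the idea of a lock
file are assumptions for illustration only.

```python
import fcntl


def try_lock(path):
    """Return an open file holding an exclusive lock on path,
    or None if someone else already holds the lock."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle
```

A dispatcher built this way would call try_lock at startup, exit
quietly if it gets None, and keep the returned file open while it
works. The lock disappears when the file is closed or the process
exits, so a crashed run cannot block the next one.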
Usage: mysub [options]
  -h, --help
      show this help message and exit
  -n NAME, --name=NAME
      Assign a name to the job
  -e ERRORSFILE, --errorsfile=ERRORSFILE
      Redirect stderr to a file
  -o OUTPUTFILE, --outputfile=OUTPUTFILE
      Redirect stdout to a file
  -N
      Send an email even if the output/errors are redirected
  -q QUEUE, --queue=QUEUE
      Specify a queue to run the job
  -a AFTERACTION, --afterwards=AFTERACTION
      Specify an action to be performed upon successful completion
For now, ignore the -n and -a options. -e, -o, -q and -N work
exactly like they do for bsub.
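The help text above has the shape produced by Python's optparse
module. As a rough sketch, a parser accepting the same options could
be declared like this. It is a reconstruction from the help text, not
mysub's real source: the dest names are guesses, and -N is given
without a long form because none is shown here.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with the options listed in the mysub help."""
    parser = OptionParser(usage="usage: %prog [options]", prog="mysub")
    parser.add_option("-n", "--name", dest="name",
                      help="Assign a name to the job")
    parser.add_option("-e", "--errorsfile", dest="errorsfile",
                      help="Redirect stderr to a file")
    parser.add_option("-o", "--outputfile", dest="outputfile",
                      help="Redirect stdout to a file")
    # Hypothetical dest name; only the short flag is known.
    parser.add_option("-N", dest="notify", action="store_true",
                      default=False,
                      help="Send an email even if the output/errors "
                           "are redirected")
    parser.add_option("-q", "--queue", dest="queue",
                      help="Specify a queue to run the job")
    parser.add_option("-a", "--afterwards", dest="afteraction",
                      help="Specify an action to be performed upon "
                           "successful completion")
    return parser
```

Calling build_parser().parse_args(["-q", "15m", "-o", "out.txt"])
returns an options object with queue set to "15m" and outputfile set
to "out.txt"; options that were not given are None (or False for -N).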
If you want to know the LSF job ids of the jobs that PyPlatform is
controlling, you can type myjobs. For now, this only lists job ids,
but you can get the rest of the information from bjobs.
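If you do this often, the two commands can be glued together with a
few lines of Python. The sketch below assumes that myjobs prints
plain numeric job ids separated by whitespace, which is a guess
about its output format; the helper names are made up.

```python
import subprocess


def bjobs_command(myjobs_output):
    """Turn the text printed by myjobs into a bjobs command line.

    Returns an empty list when there are no job ids, because bjobs
    with no ids would list all of your jobs, not just PyPlatform's.
    """
    job_ids = [token for token in myjobs_output.split() if token.isdigit()]
    if not job_ids:
        return []
    # -l asks bjobs for the long, detailed report of each job.
    return ["bjobs", "-l"] + job_ids


def show_details():
    """Run myjobs, then run bjobs on whatever ids it printed."""
    result = subprocess.run(["myjobs"], capture_output=True,
                            text=True, check=True)
    command = bjobs_command(result.stdout)
    if command:
        subprocess.run(command, check=True)
```

show_details() only works on a machine where both myjobs and bjobs
are on your PATH.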
If you want to kill a job that is being controlled by PyPlatform,
you can use mykill. This program takes a list of LSF job ids, kills
them and removes them from PyPlatform. If you use bkill instead,
they may be rerun (because to LSF, they will have died with an
There is one difference from bsub, though: the output and the errors
are not stored unless you specify files with -e and -o (bsub sends
both of these things in the email report).
Try it by executing
If you start getting emails from cron, you can remove the PyPlatform
line from your crontab (by executing crontab -e and editing the
file), and then you can forward me your emails. For now, we'll leave
it at that.
incidents. They can certainly be polished, and I'm counting on your
help with that. Don't worry, I just want you to tell me what I could
change or add. I'll take care of the rest.