Kreiman:PyPlatform

{|cellspacing="5" cellpadding="10" style="background:#3674C2; width: 750px;"
|-valign="top"
|style="background:#ffffff"|

This document describes a group of programs I wrote to make life

easier when using the LSF scheduler we have in orchestra. I call the

set of programs PyPlatform.


 * Motivation

Oftentimes, I submit a job that in an interactive session takes 5

minutes to run. In a 15m queue, it dies with a timeout. Sometimes,

the network filesystem in orchestra is working slowly, or the node

my job was running on was too busy. In any case, I wanted a way to

detect this situation, and somehow have my job rerun.

Another annoying and recurring problem is that sometimes my jobs

would die because of transient errors in the network, or because a

certain job did not have the home, or group, directories mounted.

Yet other times, my job runs for 16 minutes. My estimate of 15m was

reasonable, but not useful enough. In those cases, I wanted my job to

be rerun on a 2h queue. One could submit everything to the unlimited

queues, but those have a lower priority.


 * Description of the system

The set of programs I wrote has two main parts: a dispatcher and

command-line tools to interface with the dispatcher.

The dispatcher is a job that needs to run periodically in one of the

login nodes. We can achieve this using cron (see Installation). This

program checks the status of the submitted jobs, relaunches dead or

timed-out jobs, and performs general bookkeeping.

If a job dies, the dispatcher will see this, and it will decide what

to do. If it was killed because of some error, it will be rerun as

it was submitted, up to a number of times you can configure. If it

was killed because of a timeout, it will decide whether the timeout

was legit (if it used very little CPU time, it's not considered

legit), and then decide either to rerun it in the same queue, or to

move it to a queue with more time available (e.g. bump it from a 2h

queue to a 12h queue).
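
The decision rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dispatcher's actual code: the queue names, the retry limit of 3, the 50% CPU-time threshold, and the names DeadJob, next_queue and decide are all hypothetical. The real settings live in ~/.PyPlatform/config.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values chosen for the example; the real ones may differ.
QUEUE_LADDER = ["15m", "2h", "12h"]   # shortest to longest queue
MAX_ERROR_RETRIES = 3                 # reruns allowed after an error
MIN_CPU_FRACTION = 0.5                # below this, a timeout is not "legit"


@dataclass
class DeadJob:
    queue: str            # queue the job was submitted to
    reason: str           # "error" or "timeout"
    error_retries: int    # reruns already spent on errors
    cpu_seconds: float    # CPU time the job actually used
    limit_seconds: float  # run-time limit of the queue


def next_queue(queue: str) -> str:
    """Return the next longer queue, or the same one if it is already the longest."""
    i = QUEUE_LADDER.index(queue)
    return QUEUE_LADDER[min(i + 1, len(QUEUE_LADDER) - 1)]


def decide(job: DeadJob) -> Optional[str]:
    """Return the queue to resubmit the job to, or None to give up."""
    if job.reason == "error":
        # Rerun as submitted, up to a configurable number of times.
        if job.error_retries < MAX_ERROR_RETRIES:
            return job.queue
        return None
    # Timeout with very little CPU time used: the job was probably stalled
    # (slow filesystem, busy node), so try again in the same queue.
    if job.cpu_seconds < MIN_CPU_FRACTION * job.limit_seconds:
        return job.queue
    # Otherwise the job really needed more time: move it to a longer queue.
    return next_queue(job.queue)
```

For example, a job that timed out in the 15m queue after using almost all of its limit as CPU time would be moved to the 2h queue, while one that used only a few seconds of CPU would be rerun where it was.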

The command line tools are mysub, myjobs and mykill. mysub supports

a subset of the functionality of bsub. mykill is essentially like

bkill, and myjobs prints the LSF job id of the jobs that PyPlatform

is taking care of.


 * Installation

Follow these instructions, and hopefully you will have everything running.

1) Log in to orchestra (you will actually log in to either mezzanine or

balcony).

2) cd to your home directory and execute

svn checkout svn+ssh://orchestra.med.harvard.edu/home/et62/svnroot/PyPlatform

This will create a directory PyPlatform wherever you were standing.

3) cd to PyPlatform/trunk and install everything by executing

make

4) Tweak the file ~/.PyPlatform/config to suit your taste. If I were you, I

would leave everything as it is, but feel free to play.

5) Install PyPlatform in your crontab by executing

crontab -e

and adding the line

*/1 * * * * bash -login /home/et62/.PyPlatform/forcron

anywhere in the file. REPLACE et62 by your orchestra

username. crontab -e will launch a text editor. You will

probably have no trouble using it.


The */1 means that the dispatcher is going to be run every minute. If

you want it to run every 3 minutes, replace that with */3. I

feel that 1 minute is great if you are launching many short jobs,

and 5 works more than fine if you are running longer jobs. The

dispatcher checks to see whether there is another instance

running, so don't worry about that.

6) Add ~/bin to your PATH, if it's not already there. You can check

this by editing the file

~/.bash_profile

I think the default version in orchestra includes your home

bin directory, if it exists. If nothing of the sort is present in

the file, you can accomplish this effect by adding the line

PATH=~/bin:"${PATH}"

7) Log out of orchestra, and log back in.

8) Learn how to use PyPlatform by reading the Usage section.
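
Step 5 mentions that the dispatcher checks whether another instance is already running. How it does so is not described here. One common way to get that effect on Unix is a non-blocking exclusive lock on a file, sketched below. This is an illustration of that technique, not PyPlatform's code; the function name try_lock and the lock file path are assumptions made for the example.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock on path,
    or None if some other open file already holds the lock."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    # Hypothetical lock file location, used only for this demonstration.
    lock_path = os.path.join(tempfile.gettempdir(), "dispatcher-example.lock")
    lock = try_lock(lock_path)
    if lock is None:
        print("another instance is running, exiting")
    else:
        print("lock acquired, doing the periodic work")
        lock.close()  # closing the file releases the lock
```

With this pattern, an instance started by cron while the previous one is still busy would simply exit, and the lock disappears on its own if the process holding it dies.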


 * Usage

Execute

mysub --help

and you will get

 Usage: mysub [options]

 Options:
   -h, --help            show this help message and exit
   -n NAME, --name=NAME  Assign a name to the job
   -e ERRORSFILE, --errorsfile=ERRORSFILE
                         Redirect stderr to a file
   -o OUTPUTFILE, --outputfile=OUTPUTFILE
                         Redirect stdout to a file
   -N                    Send an email even if the output/errors are redirected
   -q QUEUE, --queue=QUEUE
                         Specify a queue to run the job
   -a AFTERACTION, --afterwards=AFTERACTION
                         Specify an action to be performed upon successful completion

For now, ignore the -n and -a options. -e, -o, -q and -N work

exactly like they do for bsub.

If you want to know the LSF job ids of the jobs that PyPlatform is

controlling, you can type myjobs. For now, this only lists job ids,

but you can get the rest of the information from bjobs.
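
As a rough sketch of how the two commands could be combined, the following Python snippet feeds the ids printed by myjobs to bjobs. It assumes that myjobs prints the job ids separated by whitespace, which is not spelled out here, and the function names are made up for the example.

```python
import subprocess


def parse_job_ids(text):
    """Pick out the numeric job ids from whitespace-separated text."""
    return [token for token in text.split() if token.isdigit()]


def bjobs_command(job_ids):
    """Build the bjobs command line for the given job ids."""
    return ["bjobs", *job_ids]


def show_details():
    """Run myjobs, then ask bjobs about every job it listed."""
    listing = subprocess.run(
        ["myjobs"], capture_output=True, text=True, check=True
    ).stdout
    job_ids = parse_job_ids(listing)
    if job_ids:  # with no ids, bjobs would list all of your jobs instead
        subprocess.run(bjobs_command(job_ids), check=True)
```

Calling show_details() on a login node would then print the usual bjobs table, restricted to the jobs that PyPlatform is tracking.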

If you want to kill a job that is being controlled by PyPlatform,

you can use mykill. This program takes a list of LSF job ids, kills

them and removes them from PyPlatform. If you use bkill instead,

they may be rerun (because to LSF, they will have died with an

error).

There is one difference between mysub and bsub, though: the output and

the errors are not stored unless you specify files with -e and -o

(bsub sends both of these in its email report).

Try it by executing

mysub ls
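
If you submit jobs from a script rather than by hand, a small helper can assemble the mysub command line from the options listed in the help text. The sketch below is only an illustration: the helper names, the queue name "2h" and the file names are hypothetical, and it assumes that mysub, like bsub, expects its options before the command to run.

```python
import subprocess


def build_mysub_command(command, queue=None, outputfile=None,
                        errorsfile=None, notify=False):
    """Build the argument list for mysub, with options placed before the command."""
    argv = ["mysub"]
    if queue is not None:
        argv += ["-q", queue]
    if outputfile is not None:
        argv += ["-o", outputfile]   # keep stdout, which is otherwise not stored
    if errorsfile is not None:
        argv += ["-e", errorsfile]   # keep stderr, which is otherwise not stored
    if notify:
        argv.append("-N")            # email even though output is redirected
    return argv + list(command)


def submit(command, **options):
    """Submit the command through mysub and raise if mysub itself fails."""
    subprocess.run(build_mysub_command(command, **options), check=True)
```

For instance, submit(["ls"], queue="2h", outputfile="ls.out", errorsfile="ls.err") would run the equivalent of mysub -q 2h -o ls.out -e ls.err ls.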


 * Troubleshooting

If you start getting emails from cron, you can remove the PyPlatform

line from your crontab (by executing crontab -e and editing the

file), and then you can forward me your emails. For now, we'll leave

it at that.

I have been using the scripts for a few weeks without any

incidents. They can certainly be polished, and I'm counting on your

help with that. Don't worry, I just want you to tell me what I could

change or add. I'll take care of the rest.

Cheers,

Enrique