Kreiman:PyPlatform

From OpenWetWare



This document describes a group of programs I wrote to make life
easier when using the LSF scheduler we have on orchestra. I call the
set of programs PyPlatform.


  • Motivation

Oftentimes, I submit a job that takes 5 minutes to run in an
interactive session. In a 15m queue, it dies with a timeout. Sometimes
the network filesystem on orchestra is slow, or the node my job was
running on was too busy. In any case, I wanted a way to detect this
situation and have my job rerun.


Another annoying and recurring problem is that sometimes my jobs
would die because of transient errors in the network, or because the
home or group directories were not mounted where a job ran.


Yet other times, my job ran for 16 minutes. My estimate of 15m was
reasonable, but not quite enough. In those cases, I wanted my job to
be rerun in a 2h queue. One could submit everything to the unlimited
queues, but those have a lower priority.


  • Description of the system

The set of programs I wrote has two main parts: a dispatcher and
command-line tools to interface with the dispatcher.


The dispatcher is a job that needs to run periodically on one of the
login nodes. We can achieve this using cron (see Installation). This
program checks the status of the submitted jobs, relaunches dead or
timed-out jobs, and performs general bookkeeping.


If a job dies, the dispatcher will notice and decide what to do. If
the job was killed because of some error, it will be rerun as it was
submitted, up to a number of times you can configure. If it was
killed because of a timeout, the dispatcher will decide whether the
timeout was legit (if the job used very little CPU time, the timeout
is not considered legit), and then either rerun it in the same queue
or move it to a queue with more time available (e.g. bump it from a
2h queue to a 12h queue).
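
The decision described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not the actual PyPlatform code: the names DeadJob, decide, QUEUE_LADDER, MAX_RETRIES and MIN_CPU_FRACTION are all hypothetical, the queue names are placeholders for whatever your cluster calls them, and the CPU-time threshold and the choice to give up at the top of the ladder are my assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed queue ladder, shortest to longest; real queue names may differ.
QUEUE_LADDER = ["15m", "2h", "12h"]
MAX_RETRIES = 3          # hypothetical default; the real limit is configurable
MIN_CPU_FRACTION = 0.5   # hypothetical: below this, a timeout is not "legit"


@dataclass
class DeadJob:
    queue: str            # queue the job last ran in
    reason: str           # "error" or "timeout"
    retries: int          # how many times it has already been rerun
    cpu_seconds: float    # CPU time the job actually used
    limit_seconds: float  # run-time limit of its queue


def decide(job: DeadJob) -> Tuple[str, str]:
    """Return (action, queue), where action is 'rerun' or 'give_up'."""
    if job.reason == "error":
        # Errors: rerun as submitted, up to a configurable number of times.
        if job.retries >= MAX_RETRIES:
            return ("give_up", job.queue)
        return ("rerun", job.queue)

    # Timeouts: a job that used very little CPU time was probably starved
    # (slow filesystem, busy node), so the same queue should still fit it.
    legit = job.cpu_seconds >= MIN_CPU_FRACTION * job.limit_seconds
    if not legit:
        return ("rerun", job.queue)

    # Legit timeout: bump the job to the next longer queue, if there is one.
    if job.queue in QUEUE_LADDER:
        position = QUEUE_LADDER.index(job.queue)
        if position + 1 < len(QUEUE_LADDER):
            return ("rerun", QUEUE_LADDER[position + 1])
    return ("give_up", job.queue)  # assumption: nothing longer to try
```

For example, under these assumptions a job that times out in the 2h queue after using nearly all of its two hours of CPU time would be rerun in the 12h queue, while one that times out after a few seconds of CPU time would be rerun where it was.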


The command-line tools are mysub, myjobs and mykill. mysub supports
a subset of the functionality of bsub. mykill is essentially like
bkill, and myjobs prints the LSF job ids of the jobs that PyPlatform
is taking care of.


  • Installation

Follow these instructions, and hopefully you will have everything running.

1) Log in to orchestra (you will actually log in to either mezzanine or
balcony).

2) cd to your home directory and execute


   svn checkout svn+ssh://orchestra.med.harvard.edu/home/et62/svnroot/PyPlatform


This will create a directory PyPlatform in your current directory.


3) cd to PyPlatform/trunk and install everything by executing


   make


4) Tweak the file ~/.PyPlatform/config to suit your taste. If I were you, I
would leave everything as it is, but feel free to play.


5) Install PyPlatform in your crontab by executing


   crontab -e


and adding the line


   */1 * * * *  bash -login /home/et62/.PyPlatform/forcron


anywhere in the file. REPLACE et62 by your orchestra
username. crontab -e will launch a text editor. You will
probably have no trouble using it.


*/1 means that the dispatcher is going to be run every minute. If
you want it to run every 3 minutes, replace it with */3. I
feel that 1 minute is great if you are launching many short jobs,
and 5 works more than fine if you are running longer jobs. The
dispatcher checks whether another instance is already running, so
don't worry about overlapping runs.


6) Add ~/bin to your PATH, if it's not already there. You can check
this by looking at the file


   ~/.bash_profile


I think the default version on orchestra includes your home
bin directory, if it exists. If nothing of the sort is present in
the file, you can get the same effect by adding the line


   PATH=~/bin:"${PATH}"


7) Log out of orchestra, and log back in.


8) Learn how to use PyPlatform by reading the Usage section.
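
Step 5 mentions that the dispatcher checks whether another instance is already running, which is why a one-minute cron interval is safe. I don't describe here how that check is done, so the following is only a sketch of one common way to do it in Python, using an advisory file lock. The names try_lock, run_once and DEFAULT_LOCK, and the lock file location, are hypothetical and not part of PyPlatform.

```python
import fcntl
import os
import tempfile

# Hypothetical lock file location, chosen only for this sketch.
DEFAULT_LOCK = os.path.join(tempfile.gettempdir(), "pyplatform_sketch.lock")


def try_lock(path=DEFAULT_LOCK):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def run_once(work, path=DEFAULT_LOCK):
    """Run work() unless another instance holds the lock; report if it ran."""
    lock = try_lock(path)
    if lock is None:
        return False  # another instance is still busy; skip this round
    try:
        work()
        return True
    finally:
        lock.close()  # closing the file releases the lock
```

With something like this, a cron run that starts while the previous one is still working simply exits, and the next run picks up where things stand.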


  • Usage

Execute


 mysub --help


and you will get


 Usage: mysub [options]

 Options:
   -h, --help            show this help message and exit
   -n NAME, --name=NAME  Assign a name to the job
   -e ERRORSFILE, --errorsfile=ERRORSFILE
                         Redirect stderr to a file
   -o OUTPUTFILE, --outputfile=OUTPUTFILE
                         Redirect stdout to a file
   -N                    Send an email even if the output/errors are redirected
   -q QUEUE, --queue=QUEUE
                         Specify a queue to run the job
   -a AFTERACTION, --afterwards=AFTERACTION
                         Specify an action to be performed upon successful completion


For now, ignore the -n and -a options. -e, -o, -q and -N work
exactly like they do for bsub.


If you want to know the LSF job ids of the jobs that PyPlatform is
controlling, you can type myjobs. For now, this only lists job ids,
but you can get the rest of the information from bjobs.
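
If you do this often, the two commands can be glued together. The sketch below assumes that myjobs prints the job ids separated by whitespace (one per line or several per line), which is my guess at its output format, and relies on bjobs accepting a list of job ids. The function names parse_job_ids and bjobs_for_managed_jobs are hypothetical.

```python
import subprocess


def parse_job_ids(text):
    """Extract integer job ids from whitespace-separated output."""
    return [int(token) for token in text.split() if token.isdigit()]


def bjobs_for_managed_jobs():
    """Return the bjobs report for every job PyPlatform is controlling."""
    listing = subprocess.run(
        ["myjobs"], capture_output=True, text=True, check=True
    ).stdout
    job_ids = parse_job_ids(listing)
    if not job_ids:
        return ""  # nothing is being managed right now
    report = subprocess.run(
        ["bjobs"] + [str(job_id) for job_id in job_ids],
        capture_output=True, text=True, check=True,
    )
    return report.stdout
```

Calling print(bjobs_for_managed_jobs()) on a login node would then show the usual bjobs table, restricted to the jobs PyPlatform knows about.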


If you want to kill a job that is being controlled by PyPlatform,
you can use mykill. This program takes a list of LSF job ids, kills
them and removes them from PyPlatform. If you use bkill instead,
they may be rerun (because to LSF, they will have died with an
error).


There is one difference between mysub and bsub, though: with mysub,
the output and the errors are not stored unless you specify files
with -e and -o (bsub sends both of these in the email report).


Try it by executing


 mysub ls
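
If you submit jobs from a script, it can help to build the mysub command line in one place, so that you never forget -o and -e and lose the output. The helper below is a sketch that uses only the options listed in the help text above. It assumes that options go before the command, as with bsub; the function name mysub_command and the queue name in the example are hypothetical.

```python
import shlex


def mysub_command(command, queue=None, stdout=None, stderr=None, notify=False):
    """Build the argument list for a mysub call; command is a list of words."""
    argv = ["mysub"]
    if queue:
        argv += ["-q", queue]    # queue to run the job in
    if stdout:
        argv += ["-o", stdout]   # keep stdout in a file
    if stderr:
        argv += ["-e", stderr]   # keep stderr in a file
    if notify:
        argv.append("-N")        # email even though output is redirected
    return argv + list(command)


if __name__ == "__main__":
    # "2h" is a placeholder; use a queue name that exists on your cluster.
    argv = mysub_command(["ls", "-l"], queue="2h",
                         stdout="ls.out", stderr="ls.err")
    print(" ".join(shlex.quote(word) for word in argv))
```

The resulting list can be passed to subprocess.run as it is, or printed and pasted into a shell.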


  • Troubleshooting

If you start getting emails from cron, you can remove the PyPlatform
line from your crontab (by executing crontab -e and editing the
file), and then you can forward me your emails. For now, we'll leave
it at that.


I have been using the scripts for a few weeks without any
incidents. They can certainly be polished, and I'm counting on your
help with that. Don't worry, I just want you to tell me what I could
change or add. I'll take care of the rest.


Cheers,

Enrique
