Difference between revisions of "Vital-IT"

Revision as of 16:56, 26 February 2009

How to run jobs on Vital-IT, hints and good practice

When you have a large number of jobs to run, running them on the Vital-IT cluster is usually better than running them on shoshana or maya.
Running 300+ jobs on a 16-processor machine makes the jobs compete with each other (i.e. each job will not get 100% of a processor, but will share the resources with the others).
If your jobs take only a few minutes to complete, that might not be an issue though.

Any job that does not require a huge amount of memory (i.e. more than 2 GB) can easily be run on the Vital-IT machines. For huge-memory jobs there are a few machines available, although only one of them (rserv) can compete with shoshana or maya.

Prerequisites

Before working on (or crashing) Vital-IT, you will need an account.
You can ask for one here [1]

Ways to submit jobs

You can submit jobs through:

  • a web interface [2]
  • or you can use a python script (wsub.py), documentation available at wsub-python[3]
  • or you can log on to a front-end node (dev.vital-it.ch or prd.vital-it.ch) and submit jobs using the bsub command.[4]

Being nice

PLEASE DO NOT RUN ANY COMPUTATION ON THE FRONT-END NODES (dev, prd) !!!
These front-end nodes are only for submitting jobs and do not have the resources to let you run your jobs interactively.
For interactive and/or heavy computation, you can log on to rserv.vital-it.ch or noko01.vital-it.ch.
Jobs on these machines share the resources (RAM, CPU, I/O) with all other users' jobs.

Bsub in a nutshell

Submitting a simple job.

bsub "sh myscript.sh > mylog"
Job <903956> is submitted to default queue <normal>.


That will submit the job to the cluster and return its job id. Here the output is redirected to mylog. You can also separate the STDOUT and STDERR messages into distinct files with:

bsub -e myerrorfile -o myoutputfile "sh myscript.sh"
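
Since bsub reports the job id in its confirmation line, a script can capture it for later use with bjobs or bkill. The following is only a minimal sketch: it assumes the confirmation line has the form shown above and is printed on standard output. The function names and the BSUB override variable are hypothetical, introduced here so the functions can be tried without LSF.

```shell
#!/bin/sh
# Extract the numeric job id from a bsub confirmation line such as
#   Job <903956> is submitted to default queue <normal>.
extract_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Submit a job and print only its job id.
# BSUB is a hypothetical override (defaults to the real bsub command).
submit_and_get_id() {
    out=$(${BSUB:-bsub} "$@") || return 1
    extract_job_id "$out"
}
```

On a front-end node this could be used as jobid=$(submit_and_get_id -e myerrorfile -o myoutputfile "sh myscript.sh").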

Monitoring jobs

You can check its status with

bjobs 

You might also use

bjobs -r # list all running jobs
bjobs -d # list all finished jobs (either successfully completed or failed ones)
bjobs -u marcel # list all jobs for this user
bjobs -q normal # list my jobs on this queue

bkill is your best friend: when something goes wrong, you can kill your job(s) with:

bkill 007 # kill the job with id 007
bkill 0 # kill all my jobs
bkill -q normal 0 # kill all my jobs from the normal queue
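
With hundreds of jobs, the raw bjobs listing gets long. A rough sketch of a per-status summary is shown here; it assumes the default bjobs layout, with one header line and the job status (RUN, PEND, ...) in the third column. The function name count_by_status is hypothetical.

```shell
#!/bin/sh
# Read a bjobs listing on standard input and print one line per status
# with the number of jobs in that status.
# Assumes a header line followed by rows whose third column is STAT.
count_by_status() {
    awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}
```

It would be used as bjobs | count_by_status.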

Building nicer bsub

You can submit many jobs and ensure that some start only after others have completed. For example, if you want to run a, b and c, where b needs the output from a and c should run only if b failed, you can use the -w option of bsub:

bsub -J a "sh a.sh"
bsub -J b -w '(done "a")' "sh b.sh" # start b when a is successfully done
bsub -J c -w '(exit "b")' "sh c.sh" # start c if b has failed

And here we go, we have a mini-pipeline.
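
For longer chains, the bsub lines can be generated rather than typed. This is a sketch only: it covers the simple case where each script waits for the previous one to finish successfully, it reuses the -w syntax shown above, and it only prints the commands so they can be reviewed first. The function name print_chain and the rule of deriving job names from script names are assumptions.

```shell
#!/bin/sh
# Print the bsub commands for a linear chain of scripts: each script
# starts only after the previous one is successfully done.
# Job names are derived from the script names (a.sh -> a).
print_chain() {
    prev=""
    for script in "$@"; do
        name=$(basename "$script" .sh)
        if [ -z "$prev" ]; then
            # First job of the chain: no dependency.
            printf 'bsub -J %s "sh %s"\n' "$name" "$script"
        else
            # Later jobs wait for the previous job name.
            printf "bsub -J %s -w '(done \"%s\")' \"sh %s\"\n" "$name" "$prev" "$script"
        fi
        prev=$name
    done
}
```

Running print_chain a.sh b.sh c.sh prints three bsub lines; once they look right, the output can be piped to sh on a front-end node.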


FAQ

  • Can I submit LSF jobs from rserv or noko01?
    No, use dev or prd instead.
  • Why is my ls painfully slow?
    That is because of SFS. Use /bin/ls or ls --color=none instead.
  • How do I check the space left?

Please note that Vital-IT will crash if less than 1 TB of space is left !!! This is because some web services rely on this minimum of free space.

df -h .
Filesystem            Size  Used  Avail Use% Mounted on
client_o2ib           16T   13T   2.7T   83%  /sfs1  
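
The same check can be scripted, for example before launching a batch that writes a lot of data. This is a sketch under assumptions: the 1 TB limit is taken as 1024^3 blocks of 1 KB, df is called with the POSIX options -P and -k, and the names MIN_FREE_KB, avail_kb and enough_space are hypothetical.

```shell
#!/bin/sh
# Threshold: 1 TB expressed in 1 KB blocks (assuming 1 TB = 1024^3 KB).
MIN_FREE_KB=$((1024 * 1024 * 1024))

# Print the available space, in 1 KB blocks, of the file system
# holding the given path (default: current directory).
avail_kb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Succeed if the given number of free 1 KB blocks is above the threshold.
enough_space() {
    [ "$1" -gt "$MIN_FREE_KB" ]
}
```

A job script could then start with: enough_space "$(avail_kb .)" || { echo "less than 1 TB left, not submitting"; exit 1; }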
 

Known limitations

Vital-IT uses the SFS file system [5]; files are striped across many discs for backup reasons.
This means that any file stat operation (e.g. a simple ls) needs to query the various discs holding the data stripes, which can be painfully slow...
As a result, a job doing lots of I/O operations will be slower than on an NFS file system. Still, running 200+ jobs in parallel will be much faster than running them one by one or in small batches on maya/shoshana.