SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
As depicted in Figure 1, SLURM consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview. All of the commands can run anywhere in the cluster.
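Because the commands can run anywhere in the cluster, you can check from any node whether the controller is responding. scontrol ping is a standard subcommand that reports the status of the primary (and any backup) slurmctld; the output below is illustrative, and the exact wording and hostnames will differ on your cluster:

$ scontrol ping
Slurmctld(primary/backup) at mgmt/(NULL) are UP/DOWN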
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle carnot,diesel
sinfo shows the current status of the cluster: each partition's availability, the time limit it enforces, its number of nodes, their state, and which nodes they are. The STATE of a node can also be allocated, completing, down, draining, or unknown. If a node's state is suffixed with a '*', the node is not responding to the controller.
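For more detail, sinfo can give a node-oriented view or filter by state; both flags are standard sinfo options:

$ sinfo -N -l          # long, node-oriented listing (one line per node)
$ sinfo --states=idle  # list only idle nodes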
$ squeue
JOBID PARTITION      NAME  USER ST  TIME NODES NODELIST(REASON)
   37     debug myprogram james  R  0:04     1 carnot
squeue shows jobs currently in the queue. The state (ST) can be one of: PD (pending), R (running), S (suspended), CG (completing), CD (completed), CA (cancelled), F (failed), TO (timeout), or NF (node failure).
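To narrow the listing, squeue accepts standard filtering options such as -u (user) and -t (state):

$ squeue -u james      # show only james's jobs
$ squeue -t PENDING    # show only pending jobs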
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
37                sleep      debug                     8  COMPLETED      0:0
sacct shows jobs that you have run in the past, the resources they used, and their final state.
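sacct can also report specific fields for a single job; -j and --format are standard options, and the fields named below (Elapsed, MaxRSS, and so on) are common built-in accounting fields:

$ sacct -j 37 --format=JobID,JobName,Elapsed,MaxRSS,State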
srun submits a job for execution in real time. Here -n4 requests four tasks, --time=0:30 sets a time limit of 30 seconds (the format is minutes:seconds), and -o writes the job's output to myjob.out:

$ srun -n4 --time=0:30 -o myjob.out ./myprogram
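Two other commonly used srun flags: -N sets the number of nodes, and -l prefixes each line of output with the number of the task that produced it:

$ srun -N2 -l hostname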
Setting up all of your options every time you use srun can be repetitive. Use a batch file and submit that instead!
Example file 'mybatch':
#!/bin/sh
#SBATCH -n 4
#SBATCH --time=0:60
#SBATCH --output=mybatch.out
srun ./my_program
$ sbatch mybatch
Submitted batch job 47
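The #SBATCH lines accept the same options that sbatch takes on the command line. A slightly fuller sketch (the job name and output filename here are illustrative; %j is sbatch's placeholder for the JobID):

#!/bin/sh
#SBATCH --job-name=myjob        # name shown in squeue
#SBATCH -n 4                    # number of tasks
#SBATCH --time=10:00            # 10-minute limit (minutes:seconds)
#SBATCH --output=myjob-%j.out   # %j expands to the JobID
srun ./my_program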
$ squeue
JOBID PARTITION      NAME  USER ST  TIME NODES NODELIST(REASON)
   37     debug myprogram james  R  0:04     1 carnot
$ scancel 37
Cancel a job with scancel followed by the JobID of the job you want to cancel. You can find the JobID with squeue.
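scancel can also act on groups of jobs; -u and --state are standard options:

$ scancel -u james                    # cancel all of james's jobs
$ scancel -u james --state=PENDING    # cancel only his pending jobs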