Using SLURM

What is SLURM?

SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

SLURM consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger, and sview. All of these commands can be run from anywhere in the cluster.
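
A quick sanity check that the controller is up is scontrol ping, which reports the status of the primary (and any backup) slurmctld; scontrol show node then queries an individual node's state. The node name carnot here is just the example host used throughout this page.

$ scontrol ping
$ scontrol show node carnot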

Viewing Status of Runs

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle carnot,diesel

sinfo shows the current status of the cluster. It displays each partition's availability, the time limit being enforced, the number of nodes, their state, and the node names. The STATE of a node can also be allocated, completing, down, draining, or unknown. A '*' appended to a node's state means the node is not responding to the controller; a '*' after a partition name (as with debug* above) marks the default partition.
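
For a per-node view, sinfo's node-oriented and long-format flags can be combined; this prints one line per node with CPU, memory, and state details:

$ sinfo -N -l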

$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
37     debug    myprogram    james   R       0:04      1 carnot

squeue shows jobs currently in the queue. The ST (state) column can be one of the following (a filtering example follows the list):

  • R - Running
  • S - Suspended
  • PD - Pending
  • F - Failed
  • CG - Completing
  • TO - Timeout
  • CA - Canceled
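
On a busy cluster the queue can be long; squeue accepts standard filtering options, for example -u to show only one user's jobs or -j to show a single job:

$ squeue -u $USER
$ squeue -j 37
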
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
37                sleep      debug                     8  COMPLETED      0:0 

sacct shows jobs that you have run in the past, what resources they used, and their final state.
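
The columns shown are the defaults; the --format option selects other fields, such as elapsed time and peak memory use. For example, to inspect job 37 from above:

$ sacct -j 37 --format=JobID,JobName,Elapsed,MaxRSS,State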

Starting a Job with srun

$ srun -n4 --time=0:30 -o myjob.out ./myprogram

This runs ./myprogram as 4 parallel tasks (-n4) with a time limit of 30 seconds (--time=0:30) and writes the output to myjob.out (-o). srun blocks until the job finishes.
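
A simple smoke test is to run hostname across the tasks; the --label option prefixes each line of output with the number of the task that produced it:

$ srun -n4 --label hostname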

Starting a Job with sbatch

Typing out all of your options every time you use srun can be repetitive. Put them in a batch script and submit that instead!

Example file 'mybatch':

#!/bin/sh
# Request 4 tasks, a 60-second time limit, and write output to mybatch.out.
#SBATCH -n 4
#SBATCH --time=0:60
#SBATCH --output=mybatch.out
srun ./my_program

$ sbatch mybatch
Submitted batch job 47
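
Other commonly used #SBATCH directives name the job and select a partition explicitly. A sketch assuming the debug partition from the sinfo output above (%j in the output file name expands to the job ID):

#!/bin/sh
#SBATCH --job-name=myjob
#SBATCH --partition=debug
#SBATCH -n 4
#SBATCH --time=5:00
#SBATCH --output=myjob-%j.out
srun ./my_program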

Stopping or Canceling Jobs

$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
37     debug    myprogram    james   R       0:04      1 carnot

$ scancel 37

Cancel a job with scancel followed by the JOBID of the job you want to cancel. You can look up the JOBID with squeue.
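
scancel also accepts filters, so you can cancel by user or by job name instead of by job ID:

$ scancel -u $USER
$ scancel --name=myprogram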

More information

The official SLURM documentation at https://slurm.schedmd.com/ covers every command shown here in detail, and each command has its own man page (e.g. man sbatch).
