slurm_tutorial

This is an old revision of the document!


Using Slurm

As depicted in Figure 1, Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview. All of the commands can be run from anywhere in the cluster.

Viewing Status of Runs

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle carnot,diesel

sinfo shows the current status of the cluster: each partition's availability, the time limit being enforced, the number of nodes, and which nodes are in each state. Besides idle, the STATE of a node can be allocated, completing, down, draining, or unknown. An '*' after a partition name (as in debug* above) marks the default partition; an '*' appended to a node's STATE means the node is not responding to the controller.
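As a rough illustration of reading this output in a script, the sketch below filters the default sinfo table for rows whose STATE is exactly idle. The function name idle_nodes is hypothetical, and it assumes the six-column layout shown above; a state such as idle* (not responding) is deliberately not matched.

```shell
# idle_nodes (hypothetical helper): read default `sinfo` output on stdin
# and print the NODELIST of every row whose STATE column is exactly "idle".
# Assumes the default column order: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
idle_nodes() {
    # NR > 1 skips the header row; $5 is STATE, $6 is NODELIST.
    awk 'NR > 1 && $5 == "idle" { print $6 }'
}

# Typical use on a cluster:
#   sinfo | idle_nodes
```

sinfo can also do this filtering itself with its --states option (for example, sinfo --states=idle), which is usually the simpler choice when no further processing is needed.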

$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
37     debug    myprogram    james   R       0:04      1 carnot

squeue shows the jobs currently in the queue. The ST (state) column can be one of:

  • R - Running
  • S - Suspended
  • PD - Pending
  • F - Failed
  • CG - Completing
  • TO - Timeout
  • CA - Canceled
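For scripts that report on the queue, the state codes listed above can be expanded with a small lookup. This is only a sketch: state_name is a hypothetical helper, and it covers just the codes in this list, not every state Slurm can report.

```shell
# state_name (hypothetical helper): translate a squeue ST code into the
# description used in the list above. Unlisted codes are echoed back.
state_name() {
    case "$1" in
        R)  echo "Running" ;;
        S)  echo "Suspended" ;;
        PD) echo "Pending" ;;
        F)  echo "Failed" ;;
        CG) echo "Completing" ;;
        TO) echo "Timeout" ;;
        CA) echo "Canceled" ;;
        *)  echo "Unknown ($1)" ;;
    esac
}

# Typical use on a cluster (-h drops the header, %i is the job ID,
# %t is the compact state code):
#   squeue -h -o '%i %t' | while read -r id st; do
#       echo "$id $(state_name "$st")"
#   done
```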
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
37                sleep      debug                     8  COMPLETED      0:0 

sacct shows jobs that you have run in the past, the resources they used, and their final state.
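A script can use the same accounting data to check whether a job finished cleanly. The sketch below assumes sacct is invoked with its pipe-delimited output (sacct -n -P --format=JobID,State,ExitCode), which is easier to parse than the padded table above; the function name job_succeeded is hypothetical. The ExitCode column has the form exit_code:signal, so 0:0 means a zero exit status and no terminating signal.

```shell
# job_succeeded JOBID (hypothetical helper): read pipe-delimited
# "JobID|State|ExitCode" lines on stdin and return 0 only if JOBID is
# present with State COMPLETED and ExitCode 0:0.
job_succeeded() {
    awk -F'|' -v id="$1" '
        # Match the job itself, not its steps (e.g. "37.batch").
        $1 == id { found = 1; ok = ($2 == "COMPLETED" && $3 == "0:0") }
        END { exit !(found && ok) }
    '
}

# Typical use on a cluster:
#   sacct -n -P -j 37 --format=JobID,State,ExitCode | job_succeeded 37 \
#       && echo "job 37 finished cleanly"
```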

Starting a Job with srun

Starting a Job with sbatch

Setting up all of your options every time you use srun can be repetitive. Put them in a batch file and submit that instead!

Example file 'mybatch':

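#!/bin/sh
# Run 4 tasks.
#SBATCH -n 4
# Time limit in minutes:seconds (0:60 is 60 seconds).
#SBATCH --time=0:60
# Write the job's output to mybatch.out.
#SBATCH --output=mybatch.out
srun ./my_program

$ sbatch mybatch
Submitted batch job 47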
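When a script needs the job ID for later squeue, sacct, or scancel calls, it can be pulled out of the confirmation line shown above. This is a minimal sketch; job_id_from_sbatch is a hypothetical helper that assumes the "Submitted batch job N" message format.

```shell
# job_id_from_sbatch (hypothetical helper): read sbatch's confirmation
# message on stdin and print only the numeric job ID.
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Typical use on a cluster:
#   jobid=$(sbatch mybatch | job_id_from_sbatch)
#   squeue -j "$jobid"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which avoids the parsing step when it is available.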

Stopping or Canceling Jobs

$ squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
37     debug    myprogram    james   R       0:04      1 carnot

$ scancel 37

Cancel a job by running scancel followed by the JobID of the job you want to cancel. You can look up the JobID with squeue.
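To cancel several jobs at once, it can help to generate the scancel commands first and review them before running anything. The sketch below does only the generating step: scancel_commands is a hypothetical helper that assumes the default squeue column order shown above (USER in the fourth column) and job names without spaces.

```shell
# scancel_commands USER (hypothetical helper): read default `squeue`
# output on stdin and print one "scancel JOBID" line for each job owned
# by USER. Nothing is cancelled here; the output is for review.
scancel_commands() {
    # NR > 1 skips the header row; $1 is JOBID, $4 is USER.
    awk -v user="$1" 'NR > 1 && $4 == user { print "scancel", $1 }'
}

# Typical use on a cluster:
#   squeue | scancel_commands james        # review the list
#   squeue | scancel_commands james | sh   # then run it
```

scancel can also select jobs by owner directly with its --user option, which is simpler when every job belonging to that user should go.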

More information

slurm_tutorial.1410217505.txt.gz · Last modified: 2022/07/21 06:59 (external edit)