Submitting a Batch Job
The next step is to submit this batch job to the cluster.
Within the Class_Examples folder we unzipped, you will find a copy of the submit script we created in the Estimating job resources section.
Here is how it looks:
#!/bin/bash -l
# Request 1 CPU for the job
#SBATCH --cpus-per-task=1
# Request 8GB of memory for the job
#SBATCH --mem=8GB
# Walltime (job duration)
#SBATCH --time=00:05:00
# Finally, the command we want to execute.
python3 invert_matrix.py
Note
All of the lines that begin with #SBATCH are directives to Slurm. The meaning of each directive in the sample script is explained in the comment line that precedes it. The full list of available directives is explained in the man page for the sbatch command, which is available on discovery; a short sketch with a couple of additional directives follows this note.
sbatch will copy the current shell environment, and the scheduler will recreate that environment on the allocated compute node when the job starts. Note also the first line of the script,
#!/bin/bash -l
which explicitly starts bash as a login shell.
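For reference, here is a sketch of the same script with a couple of extra directives of the kind the man page describes; the job name is an illustrative choice, and the partition name is taken from the squeue output shown later on this page:
#!/bin/bash -l
# Optional: give the job a recognizable name (illustrative value)
#SBATCH --job-name=invert_matrix
# Optional: request a specific partition (standard is the partition shown in the examples below)
#SBATCH --partition=standard
# Same resource requests as the original script
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB
#SBATCH --time=00:05:00
python3 invert_matrix.py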
Now we can submit the job and check its status. We use sbatch to submit it:
[john@x01 Class_Examples]$ sbatch python_job.sh
Submitted batch job 2628650
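If you submit jobs from other scripts, the --parsable option of sbatch prints only the job ID, which is easy to capture; a minimal sketch:
# Capture the job ID at submission time
jobid=$(sbatch --parsable python_job.sh)
echo "Submitted job $jobid"
# Check on just that job
squeue -j "$jobid"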
To get a quick view of the job status, you can issue squeue -u followed by your username. The state (ST) field shows the current job state. Below we can see that my job is RUNNING (R):
[john@x01 Class_Examples]$ squeue -u john
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2628650 standard python_j john R 0:49 1 j03
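If the job runs for a while, it can be convenient to refresh this view automatically or to trim it to the fields you care about; a sketch, where the 10-second interval and the format string are just examples:
# Refresh the status every 10 seconds (Ctrl-C to stop)
watch -n 10 squeue -u john
# Print selected fields only: job ID, partition, name, state, and elapsed time
squeue -u john -o "%.10i %.9P %.12j %.2t %.10M"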
If you would like to see detailed information about the job, you can use the scontrol command. scontrol will show which node the job is running on, how much walltime the job has left, and other useful information such as the resources requested.
[john@x01 Class_Examples]$ scontrol show job 2628650
JobId=2628650 JobName=python_job.sh
UserId=john(48374) GroupId=rc-users(480987) MCS_label=rc
Priority=1 Nice=0 Account=rc QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:03:05 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2022-05-10T20:33:20 EligibleTime=2022-05-10T20:33:20
AccrueTime=2022-05-10T20:33:20
StartTime=2022-05-10T20:34:07 EndTime=2022-05-10T20:39:07 Deadline=N/A
PreemptEligibleTime=2022-05-10T20:34:07 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-10T20:34:07 Scheduler=Main
Partition=standard AllocNode:Sid=x01:36127
ReqNodeList=(null) ExcNodeList=(null)
NodeList=j03
BatchHost=j03
NumNodes=1 NumCPUs=3 NumTasks=1 CPUs/Task=3 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,mem=12G,node=1,billing=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=3 MinMemoryCPU=4G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/dartfs-hpc/rc/home/p/d18014p/Class_Examples/python_job.sh
WorkDir=/dartfs-hpc/rc/home/p/d18014p/Class_Examples
StdErr=/dartfs-hpc/rc/home/p/d18014p/Class_Examples/slurm-2628650.out
StdIn=/dev/null
StdOut=/dartfs-hpc/rc/home/p/d18014p/Class_Examples/slurm-2628650.out
Power=
JOBID is the unique ID of the job; in this case it is 2628650. In the example above, I am issuing scontrol to view information related to my job.
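Because scontrol prints so many fields, it can be handy to filter for just a few; one way, sketched here with grep, is:
# Show only the lines containing the job state, run time, and node assignment
scontrol show job 2628650 | grep -E 'JobState|RunTime|NodeList'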
The output file, slurm-2628650.out, consists of three sections:
A header section, Prologue, which gives information such as JOBID, user name and node list.
A body section, which includes the user output written to STDOUT.
A footer section, Epilogue, which is similar to the header.
If we use the cat command (short for concatenate), we can see that the output file is exactly as we would expect it to be:
[john@x01 Class_Examples]$ cat slurm-2628650.out
i,mean 50 -1.09670952875e-17
i,mean 100 3.04038500105e-16
i,mean 150 1.24104227901e-17
i,mean 200 -1.60139176225e-16
i,mean 250 -1.17287454488e-16
i,mean 300 2.94829036507e-16
i,mean 350 4.66888358553e-17
i,mean 400 3.752857595e-15
i,mean 450 2.60083792553e-18
i,mean 500 2.05032635526e-16
i,mean 550 -3.80521832845e-16
i,mean 600 -3.07765049942e-17
i,mean 650 -9.39259624383e-17
i,mean 700 2.81309843854e-16
i,mean 750 -4.91502224946e-17
i,mean 800 7.35744459606e-17
i,mean 850 -5.23231103131e-18
i,mean 900 -5.52926185394e-17
i,mean 950 -3.26360319077e-16
i,mean 1000 -1.39343172417e-17
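By default, Slurm writes STDOUT and STDERR together to slurm-<jobid>.out in the directory the job was submitted from, as the StdOut and StdErr paths in the scontrol output above show. If you would rather choose the filenames yourself, the --output and --error directives can be added to the submit script; a sketch, where the filename pattern is just an example:
# %x expands to the job name and %j to the job ID
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err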
Here are some other useful commands to remember when interfacing with the scheduler.
sbatch - submits a batch job to the queue
squeue - shows the status of Slurm batch jobs
srun - runs an interactive job (for example, srun --pty /bin/bash)
sinfo - shows information about partitions
scontrol show job - checks the status of a running or idle job
scancel - cancels a job
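As a small illustration of the last one, using the job ID and username from the examples above:
# Cancel a specific job by ID
scancel 2628650
# Cancel all of your own queued and running jobs
scancel -u john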