The KU Community Cluster uses SLURM (Simple Linux Utility for Resource Management) for managing job scheduling.

Connecting

Step-by-Step
Step-by-step instructions on how to connect

The cluster uses your KU Online ID and password.

  • SSH: Use an SSH2 client to connect to hpc.crc.ku.edu. For example:
    ssh username@hpc.crc.ku.edu
    

    Replace username with your KU Online ID, and then authenticate with your KU Online ID password. Alternatively, you can set up public-key authentication (a brief sketch follows this list). SSH connections to hpc.crc.ku.edu resolve to either of the following login nodes:

    submit1.hpc.crc.ku.edu
    submit2.hpc.crc.ku.edu
    
  • X2Go: X2Go is software that allows you to access the cluster through a graphical desktop window, so you can open GUI applications such as MATLAB on the cluster.
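
To set up the public-key authentication mentioned above, here is a minimal sketch using a standard OpenSSH client on your local machine (the key type shown is only an example):

ssh-keygen -t ed25519                  # generate a key pair if you do not already have one
ssh-copy-id username@hpc.crc.ku.edu    # copy your public key to the cluster; authenticate once with your KU Online ID password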

Campus

If you are connecting from any of the University of Kansas' campuses, you may connect as the instructions above show.

Off-Campus

If you wish to connect to the KU Community Cluster from off campus, you must first connect through KU Anywhere. If you have multiple VPN entitlements, any one of them will work. After connecting to the VPN, you may connect as instructed above.

Submitting Jobs

Maximum number of jobs
The maximum number of jobs a user can have submitted at one time is 5000.
  • Batch jobs: To run a job in batch mode, use your favorite text editor to create a file, called a submission script, that contains SLURM options and the instructions for running your job. All SLURM options are prefaced with #SBATCH. You must specify the partition you wish to run in. After your script is complete, submit the job to the cluster with the sbatch command.

    A submission script is simply a text file that contains your job parameters and the commands you wish to execute as part of your job. You can also load modules, set environmental variables, or other tasks inside your submission script.

    sbatch example.sh

    You may also submit simple jobs directly from the command line:

    srun --partition=sixhour echo Hello World!

    Command-line options
    Command-line options passed to sbatch will override the #SBATCH options in your job script. For example, sbatch --time=2:00:00 example.sh overrides any --time value set inside example.sh.
    
  • Interactive jobs: An interactive job allows you to open a shell on the compute node as if you had ssh'd into it. It is usually used for debugging purposes.

    To submit an interactive job, use the srun command. Again, you must specify which --partition you wish your job to run in.

    srun --time=4:00:00 --ntasks=1 --nodes=1 --partition=sixhour --pty /bin/bash -l

    In the example above, the job has requested:

    • --time=4:00:00 4 hours for the job run
    • --ntasks=1 1 task. By default, 1 core is given to each task.
    • --nodes=1 1 node
    • --partition=sixhour Job to run in sixhour partition
    • --pty /bin/bash Interactive terminal running /bin/bash shell.
    • The --time, --ntasks, and --nodes flags are called options.

    If you have ssh'd to the submit nodes with X11 forwarding enabled and wish to have X11 forwarding for an interactive job, supply the --x11 flag:

    srun --time=4:00:00 --ntasks=4 --nodes=1 --partition=sixhour --x11 --pty /bin/bash -l

Submission Script

To run a job in batch mode on a high-performance computing system using SLURM, first prepare a job script that specifies the application you want to run and the resources required to run it, and then submit the script to SLURM using the sbatch command.

A very basic job script might contain just a bash or tcsh shell script. However, SLURM job scripts most commonly contain at least one executable command preceded by a list of options that specify the resources and other attributes needed to execute the command (e.g., wall-clock time, the number of nodes and processors, and filenames for job output and errors). These options are prefaced with the #SBATCH directive, which should precede any executable lines in your job script.

Additionally, your SLURM job script (which will be executed under your preferred login shell) should begin with a line that specifies the command interpreter under which it should run.
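
For example, a minimal skeleton (the program name is a placeholder):

#!/bin/bash                      # command interpreter line
#SBATCH --partition=sixhour      # Partition Name (Required)
#SBATCH --ntasks=1               # 1 task
#SBATCH --time=0-01:00:00        # Time limit days-hrs:min:sec

./my_program                     # the command(s) you wish to run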

Default Options
If no SLURM options are given, default options are applied.

Tasks / Cores

Slurm is very explicit in how one requests cores and nodes. While extremely powerful, the three flags, --nodes, --ntasks, and --cpus-per-task, can be a bit confusing at first.

The term task in this context can be thought of as a process. Therefore, a multi-process program (e.g. MPI) consists of multiple tasks, while a multi-threaded program consists of a single task, which can in turn use multiple CPUs. In SLURM, tasks are requested with the --ntasks flag. CPUs, for multi-threaded programs, are requested with the --cpus-per-task flag.

Single Core Job

The --mem option can be used to request the appropriate amount of memory for your job. Please make sure to test your application and set this value to a reasonable number based on actual memory use. The %j in the --output line tells SLURM to substitute the job ID in the name of the output file. You can also add --error with an error file name to separate output and error logs.

#!/bin/bash
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --partition=sixhour           # Partition Name (Required)
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ku.edu      # Where to send mail	
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=0-00:05:00             # Time limit days-hrs:min:sec
#SBATCH --output=serial_test_%j.log   # Standard output and error log

pwd; hostname; date
 
module load python/3.6
 
echo "Running python script"
 
python /path/to/your/python/script/script.py
 
date

Threaded or multi-core job

This script can serve as a template for applications that are capable of using multiple processors on a single server or physical computer. These applications are commonly referred to as threaded, OpenMP, PTHREADS, or shared memory applications. While they can use multiple processors, they cannot make use of multiple servers and all the processors must be on the same node.

These applications require shared memory and can only run on one node; as such, it is important to remember the following:

  • You must set --ntasks=1, and then set --cpus-per-task to the number of threads you wish to use.
  • You must make the application aware of how many processors to use. How that is done depends on the application:
    • For some applications, set OMP_NUM_THREADS to a value less than or equal to the number of --cpus-per-task you set.
    • For some applications, use a command line option when calling that application.
#!/bin/bash
#SBATCH --job-name=parallel_job      # Job name
#SBATCH --partition=sixhour          # Partition Name (Required)
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ku.edu     # Where to send mail	
#SBATCH --ntasks=1                   # Run a single task	
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mem-per-cpu=2gb            # Job memory request
#SBATCH --time=0-00:05:00            # Time limit days-hrs:min:sec
#SBATCH --output=parallel_%j.log     # Standard output and error log

pwd; hostname; date
 
echo "Running on $SLURM_CPUS_PER_TASK cores"
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
 
module load compiler/gcc/6.3
 
/path/to/your/program

MPI job

These are applications that can use multiple processors that may, or may not, be on multiple compute nodes. In SLURM, the --ntasks flag specifies the number of MPI tasks created for your job. Note that, even within the same job, multiple tasks do not necessarily run on a single node. Therefore, requesting the same number of CPUs as above, but with the --ntasks flag, could result in those CPUs being allocated on several, distinct compute nodes.

For many users, differentiating between --ntasks and --cpus-per-task is sufficient. However, for more control over how SLURM lays out your job, you can add the --nodes and --ntasks-per-node flags. --nodes specifies how many nodes to allocate to your job. SLURM will allocate your requested number of cores to a minimal number of nodes on the cluster, so it is extremely likely if you request a small number of tasks that they will all be allocated on the same node. However, to ensure they are on the same node, set --nodes=1 (obviously this is contingent on the number of CPUs and requesting too many may result in a job that will never run). Conversely, if you would like to ensure a specific layout, such as one task per node for memory, I/O or other reasons, you can also set --ntasks-per-node=1. Note that the following must be true:

ntasks-per-node * nodes >= ntasks

The job below requests 16 tasks per node, with 2 nodes. By default, each task gets 1 core, so this job uses 32 cores. If the --ntasks=16 option were used instead, the job would only use 16 cores, which could be placed on any of the nodes in the partition, even split between multiple nodes.

#!/bin/bash

#SBATCH --partition=sixhour      # Partition Name (Required)
#SBATCH --ntasks-per-node=16     # 16 tasks per node with each task given 1 core
#SBATCH --nodes=2                # Run across 2 nodes
#SBATCH --constraint=ib          # Only nodes with Infiniband (ib)
#SBATCH --mem-per-cpu=4gb        # Job memory request
#SBATCH --time=0-06:00:00        # Time limit days-hrs:min:sec
#SBATCH --output=mpi_%j.log      # Standard output and error log
 
echo "Running on $SLURM_JOB_NODELIST nodes using $SLURM_CPUS_ON_NODE cores on each node"
 
mpirun /path/to/program

GPU or MIC jobs

GPU and MIC (Intel Xeon Phi) nodes can be requested using the generic consumable resource (GRES) option (--gres=gpu or --gres=mic). There are 3 different types of GPU cards in the KU Community Cluster, set up as constraints. To run on a V100 GPU:

--gres=gpu --constraint=v100
Multiple GPUs
You may request multiple GPUs by changing the --gres value to --gres=gpu:2. Note that this value is per node.
For example, --nodes=2 --gres=gpu:2 will request 2 nodes with 2 GPUs each, for a total of 4 GPUs.

The job below requests a single GPU in the sixhour partition:

#!/bin/bash
#SBATCH --partition=sixhour   # Partition Name (Required)
#SBATCH --ntasks=1            # 1 task
#SBATCH --time=0-06:00:00     # Time limit days-hrs:min:sec
#SBATCH --gres=gpu            # 1 GPU
#SBATCH --output=gpu_%j.log   # Standard output and error log
 
module load singularity
CONTAINERS=/panfs/pfs.local/software/install/singularity/containers
singularity exec --nv $CONTAINERS/tensorflow-gpu-1.9.0.img python ./models/tutorials/image/mnist/convolutional.py
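
To request a specific GPU type or multiple GPUs inside a submission script, combine the options shown above, for example:

#SBATCH --gres=gpu:2          # 2 GPUs per node
#SBATCH --constraint=v100     # only V100 nodes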

Common Commands

Submitting the Job

All Commands
List of all commands.

Submitting a SLURM job is done with the sbatch command. SLURM reads the submission script and schedules the job according to the description it contains.

To submit the job described above:

$ sbatch example.sh 
Submitted batch job 62

Checking Job Status

To check the status of your job, use the squeue command. It will provide information such as:

  • The State (ST) of the job:
    • R - Running
    • PD - Pending - Job is awaiting resource allocation.
    • Additional codes are available on the squeue page.
  • Job Name
  • Run Time
  • Nodes running the job

To check the status of jobs owned by a specific username, use the -u option:

$ squeue -u <username>
  JOBID PARTITION     NAME       USER  ST       TIME  NODES NODELIST(REASON)
     65   sixhour hello-wo <username>   R       0:56      1 g004

Additionally, if you want to see the status of a specific partition, for example one that you are a member of, you can use the -p option to squeue:

$ squeue -p sixhour
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  73435  sixhour  MyRandom  jayhawk   R   10:35:20      1 r10r29n1
  73436  sixhour  MyRandom  jayhawk   R   10:35:20      1 r10r29n1
  73735  sixhour  SW2_driv   bigjay   R   10:14:11      1 r31r29n1
  73736  sixhour  SW2_driv   bigjay   R   10:14:11      1 r31r29n1

Checking Job Start

You may view the expected start time of your pending jobs with the command squeue --start.

$ squeue --start --user jayhawk
  JOBID  PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)
   5822    sixhour  Jobname   bigjay  PD  2018-08-24T00:05:09      3 (Priority)
   5823    sixhour  Jobname   bigjay  PD  2018-08-24T00:07:39      3 (Priority)
   5824    sixhour  Jobname   bigjay  PD  2018-08-24T00:09:09      3 (Priority)
   5825    sixhour  Jobname   bigjay  PD  2018-08-24T00:12:09      3 (Priority)
   5826    sixhour  Jobname   bigjay  PD  2018-08-24T00:12:39      3 (Priority)
   5827    sixhour  Jobname   bigjay  PD  2018-08-24T00:12:39      3 (Priority)
   5828    sixhour  Jobname   bigjay  PD  2018-08-24T00:12:39      3 (Priority)
   5829    sixhour  Jobname   bigjay  PD  2018-08-24T00:13:09      3 (Priority)
   5830    sixhour  Jobname   bigjay  PD  2018-08-24T00:13:09      3 (Priority)
   5831    sixhour  Jobname   bigjay  PD  2018-08-24T00:14:09      3 (Priority)
   5832    sixhour  Jobname   bigjay  PD                  N/A      3 (Priority)

The output shows the expected start time of the jobs, as well as the reason that the jobs are currently idle (in this case, low priority of the user due to running numerous jobs already).

Removing the Job

Removing a job is done with the scancel command. The only argument to scancel is the job ID:

$ scancel 2234

Job History

sacct can be used to display the usage of currently running jobs as well as previous jobs. It can be customized with various options.

$ sacct -u <user>

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
170          parallel_+    sixhour        crc          4  COMPLETED      0:0 
170.batch         batch                   crc          4  COMPLETED      0:0 
171          parallel_+    sixhour        crc          4 CANCELLED+      0:0 
171.batch         batch                   crc          4  CANCELLED     0:15 

Show all job information starting from a specific date:

$ sacct --starttime 2014-07-01

Show job accounting information for a specific job:

$ sacct -j <jobid>
$ sacct -j <jobid> -l 

Node Features

Features are requested with the --constraint option. Because the cluster is a consortium of hardware, features allow the user to specify which type of node they wish to use (e.g. ib, edr_ib, intel).

#SBATCH --constraint "intel"
#SBATCH --constraint "intel&ib"
Feature   Description
intel     Intel CPUs
amd       AMD CPUs
ib        At least FDR Infiniband connections
edr_ib    EDR Infiniband connections
noib      Without Infiniband connections
k40       NVIDIA K40 GPUs. Must request --gres option to be assigned GPU.
k80       NVIDIA K80 GPUs. Must request --gres option to be assigned GPU.
v100      NVIDIA V100 GPUs. Must request --gres option to be assigned GPU.

Partitions

Each owner group has their own partition (e.g. bi, compbio, crmda). You can view the partitions you can submit to by running mystats.

  • Max walltime of owner partitions: 60-00:00:00 (60 days)
Job Partition
You must specify --partition for your job.
There is no default partition.

Six Hour

Other than the owner group partitions, there is a sixhour partition. This partition allows your jobs to run across all idle nodes in the cluster, but is limited to a walltime of 6 hours.

To run in the sixhour partition, specify #SBATCH --partition sixhour in your job script.

SLURM Options

All options below are prefixed with #SBATCH. For example:

#SBATCH --partition=sixhour
#SBATCH --job-name=Jobname

This is a brief list of the most commonly used SLURM options. All options can be found in the SLURM documentation.

Option Abbreviation
Almost all options have a single letter abbreviation.
Option Function
-a, --array=<indexes> Submit a job array, multiple jobs to be executed with identical parameters (see the example after this table).
-c, --cpus-per-task=<ncpus> Advise the Slurm controller that ensuing job steps will require ncpus number of processors per task. Without this option, the controller will just try to allocate one processor per task.
-C, --constraint=<list> Request which features the job requires.
-d, --dependency=<dependency_list> Defer the start of this job until the specified dependencies have been satisfied.
-D, --chdir=<directory> Set the working directory of the batch script to directory before it is executed.
-e, --error=<filename pattern> Instruct Slurm to connect the batch script's standard error directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to the same file.
--export=<environment variables [ALL] | NONE> Identify which environment variables are propagated to the launched application, by default all are propagated. Multiple environment variable names should be comma separated.
--gres=<list> Specifies a comma delimited list of generic consumable resources. The format of each entry on the list is "name[:count]". Example: "--gres=gpu:2"
-J, --job-name=<jobname> Specify a name for the job allocation.
--mail-type=<type> Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL
--mail-user=<user> User to receive email notification of state changes as defined by --mail-type.
--mem=<size[units]> Specify the real memory required per node. Default units are megabytes. See Memory Limits
--mem-per-cpu=<size[units]> Minimum memory required per allocated CPU. Default units are megabytes. See Memory Limits
-n, --ntasks=<number> sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.
--ntasks-per-node=<ntasks> Request that ntasks be invoked on each node.
-N, --nodes=<minnodes[-maxnodes]> Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count.
-o, --output=<filename pattern> Instruct Slurm to connect the batch script's standard output directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to the same file.
-p, --partition=<partition_names> Request a specific partition for the resource allocation. If the job can use more than one partition, specify their names in a comma-separated list. Required.
-t, --time=<time> Set a limit on the total run time of the job allocation. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
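
As an example of the --array option above, here is a minimal job array sketch (the script and input file names are placeholders). SLURM sets SLURM_ARRAY_TASK_ID for each array task, and %A/%a in --output expand to the job ID and array index:

#!/bin/bash
#SBATCH --partition=sixhour            # Partition Name (Required)
#SBATCH --job-name=array_test          # Job name
#SBATCH --array=1-10                   # Run 10 array tasks, indexes 1-10
#SBATCH --ntasks=1                     # 1 task per array element
#SBATCH --time=0-01:00:00              # Time limit days-hrs:min:sec
#SBATCH --output=array_%A_%a.log       # %A = job ID, %a = array index

python process.py input_${SLURM_ARRAY_TASK_ID}.dat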

Default Options

If some options are not specified in the submission, default values will be set

  • Defaults:

    • --nodes=1
    • --cpus-per-task=1
    • --mem-per-cpu=2gb
    • --time=8:00:00 (8 hours) for owner queues; the max is 60-00:00:00 (60 days).
    • --time=1:00:00 (1 hour) for the sixhour queue.

Memory Limits

We reserve a chunk of memory on each node for system services to prevent the node from crashing. The amount varies with the total memory reported by the server.

This also applies to --mem-per-cpu: take the number of cores requested per node, multiply it by your --mem-per-cpu value, and make sure the result does not exceed the allowed limit (see the worked example after the table below).

Total amount of memory on node   Amount allowed to request
32 GB                            30 GB
64 GB                            61 GB
128 GB                           125 GB
192 GB                           186 GB
256 GB                           250 GB
384 GB                           376 GB
512 GB                           503 GB
768 GB                           754 GB
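
For example, the MPI job above requests --ntasks-per-node=16 with --mem-per-cpu=4gb, which works out to 16 x 4 GB = 64 GB per node; that fits on a 128 GB node (125 GB allowed) but not on a 64 GB node (61 GB allowed).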

SLURM Commands

Below are some common, useful SLURM commands:

SLURM Command Function
sacct Used to report job or job step accounting information about active or completed jobs.
sinfo Reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
srun Used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
squeue Reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
squeue -u <username> Display the jobs submitted by the specified <username>
squeue -p <partition> Display the jobs in the specified <partition>. (Will not show jobs running in the sixhour partition that may be running on an owner partition)
scontrol show job <jobid> Check the status of a job (<jobid>).
squeue --start --job <jobid> Show an estimate of when your job (<jobid>) might start.
scontrol show nodes <node_name> Check the status of a node (<node_name>).
scancel <jobid> Cancel a job.

Measuring Memory and CPU Usage

Making sure your jobs use the right amount of RAM and the right number of CPUs helps you and others using the clusters use these resources more efficiently, and in turn get work done more quickly. Below are some examples of how to measure your CPU and RAM usage so you can make this happen. Be sure to check the example SLURM submission scripts to request the correct number of resources.

CPU Percentage Used
By default, this is percentage of a single CPU. On multi-core systems, you can have percentages
that are greater than 100%. For example, if 3 cores are at 60% use, top will show a CPU use of 180%.

Future Jobs

If you launch a program by putting /usr/bin/time -v in front of it, time will watch your program and provide statistics about the resources it used. Check "Percent of CPU this job got" for how much CPU was used, and "Maximum resident set size (kbytes)" for how much RAM the job used. For example:

/usr/bin/time -v stress -c 8 -t 10s
stress: info: [17958] dispatching hogs: 8 cpu, 0 io, 0 vm, 0 hdd
stress: info: [17958] successful run completed in 10s
	Command being timed: "stress -c 8 -t 10s"
	Percent of CPU this job got: 796%
        Maximum resident set size (kbytes): 2368

Running Jobs

If your job is already running, you can check on its usage, but you will have to wait until it has finished to find the maximum memory and CPU used. The easiest way to check the instantaneous memory and CPU usage of a job is to ssh to a compute node your job is running on. To find the node you should ssh to, run:

squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1654654   sixhour   abc123 r557e636  R       0:24      1 n259

Then use ssh to connect to a node your job is running on from the NODELIST column:

ssh n259
SSH to compute node
To access a compute node via ssh, you must have a job running on that compute node.
Your ssh session will be bound by the same cpu, memory, and time your job requested.

Once you are on the compute node, run either ps or top.

ps

ps will give you instantaneous usage every time you run it. Here is some sample ps output:

ps -u $USER -o %cpu,rss,args
%CPU   RSS COMMAND
 0.0   588 stress -c 5 -t 10000s
98.2   204 stress -c 5 -t 10000s
98.2   204 stress -c 5 -t 10000s
98.2   204 stress -c 5 -t 10000s
98.2   204 stress -c 5 -t 10000s
98.2   204 stress -c 5 -t 10000s

ps reports memory used in kilobytes, so each of the 5 stress processes is using 204KB of RAM. They are also using most of 5 cores, so future jobs like this should request 5 CPUs.

top

top runs interactively and shows you live usage statistics. You can press u, enter your KU Online ID, then enter to filter just your processes. For Memory usage, the number you are interested in is RES. In the case below, the igblastn and perl programs are each consuming from 46MB to 348MB of memory and each fully utilizing one CPU. You can press q to quit.

top - 23:29:16 up 112 days,  1:00,  1 user,  load average: 5.17, 5.16, 5.15
Tasks: 647 total,   6 running, 641 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.5%us,  1.1%sy,  0.0%ni, 73.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   125.989G total,  122.164G used, 3917.367M free,  388.625M buffers
Swap:    0.000k total,    0.000k used,    0.000k free,  118.752G cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16273 r557e636  20   0 96068  48m 5812 R 100.0  0.0 250:31.93 igblastn
16167 r557e636  20   0  316m 196m 1252 R 100.0  0.2   0:45.35 perl
16309 r557e636  20   0  468m 348m 1376 R 100.0  0.3  59:57.89 perl
16384 r557e636  20   0 94256  46m 5836 R 100.0  0.0 248:26.95 igblastn
16214 r557e636  20   0  194m  74m 1252 R 99.7  0.1   0:16.94 perl

Completed Jobs

Slurm records statistics for every job, including how much memory and CPU was used.

seff

After the job completes, you can run seff jobid to get some useful information about your job, including the memory used and what percent of your allocated memory that amounts to.

seff 1620511
Job ID: 1620511
Cluster: ku_community_cluster
User/Group: r557e636/r557e636_g
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 8-19:03:16
CPU Efficiency: 99.87% of 8-19:19:34 core-walltime
Job Wall-clock time: 8-19:19:34
Memory Utilized: 66.96 MB
Memory Efficiency: 0.82% of 8.00 GB

The job above requested 1 core and 8 GB of memory. It utilized the core at 99.87% efficiency, but used only 0.82% of the 8 GB of memory requested. Future jobs can probably be submitted with a smaller memory request if the input data is the same.

seff
If your job requests email to be sent for END or FAIL mail types, the seff information
about that job will be sent in the body of the email.
sacct

You can also use the more flexible sacct to get that info, along with other more advanced job queries.

sacct -j 1620511 -o "JobID%20,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32"
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     MaxRSS                        AllocTRES 
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- -------------------------------- 
             1620511 paper2tes+    r557e636  biostat            n146 8-19:19:34  COMPLETED      0:0               billing=1,cpu=1,mem=8G,node=1 
       1620511.batch      batch                                 n146 8-19:19:34  COMPLETED      0:0     68572K              cpu=1,mem=8G,node=1 
      1620511.extern     extern                                 n146 8-19:19:34  COMPLETED      0:0       616K    billing=1,cpu=1,mem=8G,node=1

Fairshare and Job Priority

In order to ensure that all owner groups get their fair share of the cluster, we utilize SLURM's built-in job accounting and fairshare system. Every owner group is given a quantity of shares based on the number of SCUs they have purchased in the KU Community Cluster. The fairshare score of an owner group is then calculated from their share versus the amount of the cluster they have actually used. This fairshare score is used to assign priority to their jobs relative to other users on the cluster. This keeps individual owner groups from monopolizing the resources in the sixhour partition at the expense of owner groups that have not used their fair share in quite some time.

Fairshare Score

To see your fairshare score, run the command sshare.

             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          1.000000  2321244304      1.000000   0.500000 
 ku                                 parent    1.000000  2321244304      1.000000   0.500000 
  crc                                    2    0.005556      103440      0.000045   0.994453 
   crc                 r557e636          2    0.003704      103365      0.000045   0.991695 

An Account is the owner group's name. The CRC owns 2 nodes in the cluster, and thus their RawShares value is 2. The NormShares value is simply the Account's RawShares divided by the total number of RawShares given to all Accounts on the cluster. There are 359 total RawShares for all Accounts, so 2 / 359 = 0.005556.

RawUsage is the number of CPU minutes the Account or User has used. RawUsage is also affected by the half-life that is set for the cluster, which is currently 7 days. Thus work done in the last 7 days counts at full cost, work done 14 days ago costs half, work done 21 days ago one-fourth, and so on.

The next column is EffectvUsage. EffectvUsage is the Account's RawUsage divided by the total RawUsage for the cluster. Thus EffectvUsage is the percentage of the cluster the Account has actually used. In this case, the user has used 0.0045% of the cluster.

Finally, we have the Fairshare score. The Fairshare score is calculated using the formula f = 2^(-EffectvUsage/NormShares). From this one can see that there are five basic regimes for this score, which are as follows (a worked example follows the list):

  • 1.0: Unused. The User has not run any jobs recently.
  • 1.0 > f > 0.5: Underutilization. The User is underutilizing their granted Share. For example, when f=0.75 a lab has recently underutilized their Share of the resources 1:2
  • 0.5: Average utilization. The User on average is using exactly as much as their granted Share.
  • 0.5 > f > 0: Over-utilization. The User has overused their granted Share. For example, when f=0.25 a lab has recently over utilized their Share of the resources 2:1
  • 0: No share left. The User has vastly overused their granted Share. If there is no contention for resources, the jobs will still start.
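
As a worked example, the crc Account in the sshare output above has NormShares = 0.005556 and EffectvUsage = 0.000045, so f = 2^(-0.000045/0.005556) ≈ 2^(-0.0081) ≈ 0.994, matching the FairShare column.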

Since the usage of the cluster varies, the scheduler does not stop Users from using more than the Share granted to their Account. Instead, the scheduler wants to fill idle cycles, so it will take whatever jobs it has available. A User is essentially borrowing computing resource time from the future to use now. This will continue to drive down the User's Fairshare score, but allow the User's jobs to still start. Eventually, another User with a higher Fairshare score will start submitting jobs, and that lab's jobs will have a higher priority because they have not used their granted Share.

Job Priority

Job priority is an integer that determines the position of a job in the pending queue relative to other jobs. There are 4 components, and each component is multiplied by a weighting factor to make it more or less prominent in the scheduling of jobs.

  • Partition: Jobs submitted to an owner group partition receive 20,000 priority versus 400 priority given to jobs in the sixhour partition. This ensures that any job submitted to an owner group partition will always be scheduled before a sixhour job, even if submitted after the sixhour job.
  • Fairshare: The fairshare priority is given based on the usage of the cluster of the individual user.
  • Age: All jobs once submitted start with a 0 priority for age. The age priority component increases as the job is in the PENDING state waiting for the available resources to become free.
    100 PENDING Jobs
    Only 100 jobs per user in the PENDING state will accrue age priority.
    This is to allow other jobs to cut in line in that partition if there are thousands of jobs pending from a single user.
    
  • Job Size: This priority has the least weight of all the priorities and only truly impacts the job priority when all jobs in the PENDING state are equal in the other 3 priorities. It allows jobs requesting more cores to have a slightly higher priority, so that single-core jobs will fill in the gaps that the larger jobs leave behind.

You can view all PENDING jobs and their respective priorities using the sprio command.
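
For example (sprio's -l and -u options):

sprio -l                # long format, showing each weighted priority component
sprio -u <username>     # limit the output to your own pending jobs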


Cluster Support

If you need any help with the cluster or have general questions related to the cluster, please contact crchelp@ku.edu.

In your email, please include your submission script, any relevant log files, and the steps you took to produce the problem.
