Job Scheduling Policy for Theta

General Policy

We ask that all users follow good etiquette and be excellent to one another.

Priority

As with all Argonne Leadership Computing Facility production systems, job priority in the queue is based on several criteria:

  • a positive allocation balance for your project
  • the size (in nodes) of the job; larger jobs receive higher priority
  • the type of project (e.g., INCITE, ALCC, or discretionary)
  • job duration: shorter jobs accumulate priority more quickly, so it is best to specify the job run time as accurately as possible (see the example below)
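
For example, a submission that gives the scheduler an accurate run-time estimate via Cobalt's -t flag might look like the following (the project name, node count, and script are placeholders):

qsub -A MyProject -q default -n 256 -t 02:30:00 ./my_job.sh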

Reservations and Scheduling Policy

Some work requires using Theta in ways that deviate from regular scheduling policy. On such occasions, the normal reservation policy applies: please submit the standard reservation request form no fewer than five (5) business days in advance.

Big Run Mondays

As part of our regular maintenance procedures on Mondays, we will promote to the highest priority any jobs in the queued state requesting 802 nodes or more. Promotion is subject to operational discretion.

We may also, at our discretion, take the opportunity to promote the priority of capability jobs if the system has been drained of jobs for any other reason.

Monday Maintenance

On Mondays when the ALCF is on a regular business schedule, the system should be expected to undergo maintenance from 9:00 am until 5:00 pm US Central Time. The showres command may be used to view pending and active maintenance reservations.
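
To check for an upcoming maintenance window, run showres on a login node:

showres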

INCITE/ALCC Overburn Policy

If an INCITE or ALCC project exhausts its allocation within the first 11 months of its allocation year, it becomes eligible for overburn running. From that point, capability jobs submitted by the project will run in the default queue (instead of backfill) for the remainder of those first 11 months, until 125% of the project allocation has been consumed.

INCITE and ALCC projects needing additional overburn hours should e-mail support@alcf.anl.gov with a short description of what they plan to do with the additional hours, highlighting specific goals or milestones and the time expected to accomplish them. This will be reviewed by the scheduling committee, allocations committee, and ALCF management. Requests should be submitted 15 days before the start of the next quarter of the allocation year for full consideration. Non-capability jobs from projects that have exhausted their allocation will continue to run in backfill. 

To be clear, this policy does not constitute a guarantee of extra time. We reserve the right to prioritize the scheduling of jobs submitted by projects that have not yet used 100% of their allocations, so the earlier an INCITE or ALCC project exhausts its allocation, the more likely it is to be able to take full advantage of this policy.

Queues

Information on KNL Queues
Information on ThetaGPU Queues

KNL Queues

Debugging Queues

  • There are two 16-node debugging queues:
    • debug-cache-quad
    • debug-flat-quad
  • Hardware is dedicated to each queue
  • Nodes are not rebootable into another mode
  • Minimum allocation of 1 node
  • Maximum allocation of 8 nodes
  • Job wall-clock time is limited to 1:00:00 (1 hour)
  • The maximum running job count is one (1) job per user (see the example submission below)
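
For example, a debug submission within the limits above might look like the following (the project name and script are placeholders):

qsub -A MyProject -q debug-cache-quad -n 4 -t 1:00:00 ./debug_job.sh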

Production Queues

  • There is a single submission queue for the entire system: default
  • Priority is given to jobs using at least 20% of Theta (>=802 nodes) (previously >=648 nodes)
  • There is a global limit of ten (10) jobs running per user
  • There is a global limit of twenty (20) jobs in queue per user
  • There is a minimum job time of thirty minutes (00:30:00) for the default queue
  • Beginning 21 May 2018, there is a minimum allocation of 128 nodes (previously 8 nodes)
  • Jobs smaller than the minimum allocation will be allocated and charged for the minimum allocation of nodes
  • While shorter jobs may accumulate priority faster, all requested wall-clock times (job durations) greater than or equal to 12 hours are treated equivalently.
  • Wall-clock limits are a step-wise function designed to encourage scaling:
    • node count >= 128 nodes (minimum allocation): maximum 3:00:00 hours
    • node count >= 256 nodes : maximum 6:00:00 hours
    • node count >= 384 nodes : maximum 9:00:00 hours
    • node count >= 640 nodes : maximum 12:00:00 hours
    • node count >= 802 nodes : maximum 24:00:00 hours
  • Nodes may not be assumed to already be booted into any particular mode. If you do not specify a mode, cache-quad will be assumed and nodes will be rebooted as necessary (see the example below).
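
As a concrete example, the submission below pairs a node count with a walltime permitted by the step-wise limits above (the project name and script are placeholders):

qsub -A MyProject -q default -n 256 -t 6:00:00 ./run.sh

A request such as -n 256 -t 9:00:00 would exceed the 6-hour cap that applies to jobs of fewer than 384 nodes.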

* Effective April 13, 2020, ALCF has reserved 260 nodes in the production queue for a research project. The maximum job size is 4,100 nodes.

Running With Less Than 128 Nodes in Default Queue

It is important to remember that jobs are charged for the number of nodes allocated, not the number of nodes used. For instance, a job requesting 14 nodes (below the minimum of 128) will be allocated 128 nodes and charged for all 128, even though only 14 nodes are used.

We strongly encourage you to bundle smaller jobs into ensembles, either using Cobalt ensemble scripting or using a workflow system such as Balsam, as in the sketch below.
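
As a rough sketch of the Cobalt ensemble-script approach (the application name, input files, and rank counts are placeholders), a single job at the 128-node minimum can launch several smaller runs side by side with aprun:

#!/bin/bash
# Hypothetical ensemble script; submitted with, e.g.:
#   qsub -A MyProject -q default -n 128 -t 3:00:00 ./ensemble.sh
# Launch four 32-node members (64 ranks per node) inside one 128-node allocation.
for i in 1 2 3 4; do
  aprun -n 2048 -N 64 ./my_app input_${i}.dat &
done
wait   # keep the job alive until every ensemble member finishes

Because the four members run inside a single 128-node job, none of them is individually charged for the 128-node minimum.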

KNL Mode Selection and Charging

For the first year of Theta production, beginning July 1, 2017, time spent booting or rebooting nodes to obtain requested modes will not be charged to projects, although it will count against the requested walltime. This policy may be revisited or revised after the first year.

Please allow up to thirty (30) minutes for rebooting of nodes when submitting jobs.

Failure to specify a mode will result in the selection of cache-quad, the equivalent of listing:

--attrs mcdram=cache:numa=quad

in your qsub or job script.
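
A mode may also be requested explicitly; for example, a flat-quad submission might look like the following (the project name and script are placeholders):

qsub -A MyProject -q default -n 128 -t 3:00:00 --attrs mcdram=flat:numa=quad ./run.sh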

You are charged for the number of nodes allocated, not the number you use. The minimum possible allocation, effective May 21, 2018, will be 128 nodes.

GPU Node Queue and Policy

Note: Users will need an allocation on ThetaGPU to utilize the GPU nodes. Request an allocation by filling out this form: Allocation request. ThetaGPU is listed under Theta on the form.

The GPU nodes are new, and we expect the workload to be significantly different from that on the KNL nodes. This document describes the current state of affairs, but we will monitor usage and adjust the policies as necessary.

Nodes vs Queue vs MIG mode

The GPU nodes are NVIDIA DGX A100 nodes, and each node contains eight (8) A100 GPUs. You may request either entire nodes or individual GPUs, based on your job needs. What you get is determined by the queue you submit to: if it has node in the name, you will get nodes; if it has GPU in the name, you will get GPUs. Note that the -n parameter in your qsub matches the resource type of the queue (-n 2 in a node queue will get you two full nodes; -n 2 in a GPU queue will get you two GPUs).

Additionally, the NVIDIA A100 GPUs support a feature called "Multi-Instance GPU" (MIG) mode, which allows a single GPU to be shared by up to seven (7) separate processes. We do not schedule at this level, but you may pass --attrs mig-mode=True with your qsub; we will set the node to MIG mode, and you may take advantage of it in your job script. Example submissions are sketched after the queue lists below.

Queues

There will be two primary queues:

  • full-node – This is the general production queue for jobs that require full nodes.
  • single-gpu – This is the general production queue for jobs that operate best on individual GPUs.

Debug queues

  • debug-node – Submit to this queue if you need an entire node for your testing (for instance, if you are utilizing NVLink).
  • debug-gpu – Submit to this queue if you only need individual GPUs.
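
For illustration, here are example submissions for each case (the project name, script name, and walltimes are placeholders):

# Two full DGX A100 nodes (16 GPUs total):
qsub -A MyProject -q full-node -n 2 -t 1:00:00 ./job.sh

# Two individual A100 GPUs:
qsub -A MyProject -q single-gpu -n 2 -t 1:00:00 ./job.sh

# One full node booted into MIG mode, as described above:
qsub -A MyProject -q full-node -n 1 -t 1:00:00 --attrs mig-mode=True ./job.sh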

Initially, we are relaxing our node restrictions to encourage early users.  Please be courteous to your fellow users and do not monopolize the machine.  We will tighten restrictions as required to manage the demonstrated workload. 

Here are the initial queue limits:

  • MinTime is 5 minutes
  • MaxTime is 24 hours
  • Max running will be 2 full nodes or 16 individual GPUs
  • Queue Restrictions
    • MaxQueued will be 100 jobs
    • You may have at most 1,152 node-hours (equivalently, 9,216 GPU-hours) in the queue at any time.
    • Neither limit may be violated: you could not submit 1,000 one-node-hour jobs, because that would violate the MaxQueued limit of 100 jobs, nor could you submit two 1,000-node-hour jobs, because that would violate the MaxNodeHours limit.

The initial queue policy will be simple first-in, first-out (FIFO) ordering by priority, with EASY backfill.