Possible reasons your jobs haven't run yet


"Fairshare" exhausted for the Slurm account (e.g. hallc, casa, jam) under which you submit jobs

Both the Farm and LQCD sides of the house use priority queuing, and within a given partition, priority is almost entirely based on Slurm "fairshare."  So the primary determinant of your jobs' position in the queue is the amount of recent usage (with a half-life of seven days) by the Slurm account under which you are submitting them, and how that compares to the account's administratively set "share," relative to other Slurm accounts (and, within an account, to other users submitting against the same account) that have jobs pending.  See the current status with sshare, and summarize the queue with squeue -t pd -p production -o "%.8Q %.10u/%10a" | uniq -c.
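
For example, to see where your account stands and which factors make up your pending jobs' priorities (the account name below is only a placeholder; sprio reports per-job priority factors when the multifactor priority plugin is in use, as the fairshare-based scheduling described above implies):

    sshare -a -A youraccount    # per-user fairshare detail for one account; substitute your account name
    sprio -l -u $USER           # priority factors, including the fairshare component, for your pending jobs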

Note, however, that it can be difficult to infer expected start times from the state of the queue at any given moment because the ordering is so dynamic.  For one, you do not know when another user may suddenly appear and submit jobs against an underutilized (or less overutilized) account, and those jobs will therefore cut in line.  Even if no one adds jobs, the jobs currently running are altering the usage statistics and may cause the pending queue to be repeatedly reordered on that basis.  But the overall effect should be that accounts get roughly their administratively set share when there is contention.

Usage is charged based on the resources requested from the batch system (Slurm) by the jobs, so the more resources your jobs request, the more quickly they will burden the Slurm account you are using, whether or not your application actually uses those resources or leaves them idle while running.  It therefore pays to monitor your jobs' efficiency (e.g. with the seff command) and to be judicious when specifying resource requests.
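
As a sketch (the job ID and resource values below are placeholders), you might check a finished job's efficiency and then trim the requests in your submission script to match what it actually used:

    seff 1234567                 # placeholder job ID; reports CPU and memory efficiency for a completed job
    sacct -j 1234567 -o JobID,ReqMem,MaxRSS,AllocCPUS,TotalCPU,Elapsed   # the same information from accounting

    # then, in the sbatch script, request only what the job needs (directives go at the start of a line), e.g.:
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=2G
    #SBATCH --time=04:00:00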

Wide and/or long jobs being more difficult to back-fill

Slurm for both the Farm and LQCD is also configured for back-fill scheduling.  Because jobs vary in the amounts of different resources they request, strict priority scheduling leaves holes in the schedule as resources are held idle in order to be able to start the next highest priority job.  With back-fill scheduling, jobs can run in those holes, early, if they will not impact the expected start time of any higher-priority job.
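
You can confirm how the scheduler is configured (the exact parameters are site- and time-specific):

    scontrol show config | grep -i -e SchedulerType -e SchedulerParameters
    # SchedulerType = sched/backfill indicates back-fill scheduling is enabled;
    # SchedulerParameters may list bf_* tuning options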

If your jobs are "wide," i.e. request many nodes or cores (on the Farm, where most jobs are single-core and the vast majority single-node, anything bigger counts as "wide") or large amounts of memory, or are "long," i.e. carry a long specified TimeLimit, then they will be less likely to fit in a scheduling hole and therefore less likely to benefit from back-filling.  You want a reasonable margin in your specified TimeLimits to avoid premature termination, but excessive TimeLimits will disadvantage your jobs for scheduling when the cluster is busy.
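
If a pending job's TimeLimit turns out to be much larger than needed, you can lower it in place (users can decrease, but not increase, a job's TimeLimit); the job ID and value below are placeholders:

    squeue --me -o "%.10i %.12l %.12L"          # each job's TimeLimit (%l) and, for running jobs, time left (%L)
    scontrol update JobId=1234567 TimeLimit=06:00:00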

Nodes reserved or draining for maintenance

Is it the third Tuesday of a month?  We will occasionally reserve all or most of a cluster for maintenance activities that require there to be no jobs running.  Frequently (though not always) we schedule such activities on a third Tuesday, the regular monthly CST Division maintenance day.  Reservations have a StartTime, and if your job's TimeLimit would have it still running at that StartTime, it will not start until after the reservation ends.  This is another reason to be judicious about your jobs' TimeLimits.  Current reservations can be listed with sinfo -T and scontrol show reservation.
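
If you suspect a reservation is the cause, you can compare your pending jobs' TimeLimits against the reservation's StartTime; for instance (%l is the TimeLimit, %S the scheduler's expected start time, which may read N/A if it has not computed one, and %r the pending Reason):

    scontrol show reservation                    # StartTime, EndTime, Nodes, and Flags for each reservation
    squeue --me -t pd -o "%.10i %.12l %.20S %r"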

When nodes need to be rebooted, but not all at the same time (e.g. for monthly software patching), we may set them individually to "drain," i.e. to finish their current jobs without accepting new ones.  Because a draining node's CPUs must sit idle while they wait for the rest of the node's CPUs to finish their jobs, drains temporarily reduce the available computing capacity and therefore increase wait times.  Drains (and other offline nodes) can be seen with sinfo -R.
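
To gauge how much capacity is currently draining or otherwise offline, something like the following works (the format string is just one choice):

    sinfo -R                                     # reason, who set it, when, and the affected nodes
    sinfo -t drain -N -o "%.15N %.12T %.35E"     # per-node state and drain reason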

Other (capital-R) Reasons

The last column of the output of squeue --me -t pd may show a reason other than (Priority), like (AssocMaxNodePerJobLimit) or (ReqNodeNotAvail).  You can find explanations of such Reasons in the squeue(1) man-page under the heading JOB REASON CODES, or on Slurm's website.
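
For example, to list your pending jobs with their Reasons alongside their priorities (%Q) and accounts (%a):

    squeue --me -t pd -o "%.10i %.9P %.12a %.8Q %r"
    man squeue    # see the JOB REASON CODES section for what each Reason means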