Jobs that successfully launch/spawn will have a log file created in /farm_out/$USER, which is accessible from the ifarm nodes and from within running Jupyter sessions. Common issues are outlined below and fall into two categories, Spawn Errors and Run-time Errors; in either case, check the log first.
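For example, to find and inspect your most recent log (ls -lt sorts by modification time; substitute your own log's filename in the tail command):
% ls -lt /farm_out/$USER | head
% tail /farm_out/$USER/jupyterhub-spawner-<jobid>-<node>.log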
Spawn Errors: errors that prevent the job from starting.
You may have ifarm access but not yet be allowed to submit jobs to the cluster. To verify, log in to an ifarm node and list the users known to Slurm with sacctmgr (omitting the username will list all users):
sacctmgr list users <username>
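If you are listed, you'll get a row back. The default output looks roughly like this; the user and account shown here are illustrative:
% sacctmgr list users $USER
      User   Def Acct     Admin
---------- ---------- ---------
    wmoore     clas12      None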
If you aren't listed, submit an incident requesting access. Please include the account (e.g. clas12) you should be added to. You can get a list of the current accounts by running:
sacctmgr list accounts
Jobs that request unavailable resources will usually be stuck at the spawning screen. For example, you may be requesting more GPUs than are currently free on the cluster.
To confirm, you can use Slurm on an ifarm node to list your current jobs; you'll see the state is pending (PD) and the reason is Resources, as in the sketch below.
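squeue is the standard Slurm job-listing command; the partition and job name in this sample output are illustrative:
% squeue -u $USER
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
53690974       gpu  jupyter   wmoore PD       0:00      1 (Resources)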
Reduce the resources requested for your job. If you truly need the specified resources, submit an incident so we can review your requirements.
Run-time Errors: errors that occur during the execution of the job.
A common cause is a full home-directory disk quota: the job is unable to launch or continue due to lack of space. You can confirm this by checking your job logs in /farm_out/$USER/.
Typically errors are shown as "No space left on device":
% cd /farm_out/$USER
% tail jupyterhub-spawner-53690974-sciml1902.log
...
[E 2021-10-06 14:50:17.498 SingleUserNotebookApp largefilemanager:36] Error while saving file: path/to/my/file.root [Errno 28] No space left on device
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/notebook/services/contents/largefilemanager.py", line
32, in save
self._save_large_file(os_path, model['content'], model.get('format'))
File "/opt/conda/lib/python3.8/site-packages/notebook/services/contents/largefilemanager.py", line
70, in _save_large_file
f.write(bcontent)
OSError: [Errno 28] No space left on device
There are a couple of solutions; the simplest is to free up space in your home directory. You can locate large files and directories with either of:
ncdu -rx ~/
du -hxd1 ~/ | sort -h
Disconnects and timeouts are most often due to out-of-memory (OOM) errors. There are two easy ways to verify this.
First, your log file will likely contain something like:
% cd /farm_out/$USER
% tail jupyterhub-spawner-53690974-sciml1902.log
...
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=53690974.batch cgroup.
Some of your processes may have been killed by the cgroup out-of-memory handler.
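If you aren't sure which log belongs to the failed session, you can search all of your logs for oom-kill events at once (a simple sketch using standard grep; -l prints only the names of matching files):
% grep -l oom-kill /farm_out/$USER/*.log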
Second, seff will show the reported job state and memory efficiency:
% seff 53690974
Job ID: 53690974
Cluster: scicomp
User/Group: wmoore/ccc
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 01:14:20 core-walltime
Job Wall-clock time: 00:18:35
Memory Utilized: 5.56 GB
Memory Efficiency: 142.28% of 3.91 GB
Here the job used 5.56 GB against a 3.91 GB request, confirming it was killed by the out-of-memory handler. Reduce your job's memory footprint or request more memory when starting the session.