Jobs that successfully launch/spawn will have a log file created in /farm_out/$USER, which is accessible from the ifarm nodes and from within running Jupyter sessions. Common issues are outlined below and fall into two categories, Spawn Errors and Run-time Errors; in either case, check the log first.
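For example, to find and inspect your most recent log (ls -lt sorts by modification time; substitute your own log's filename in the tail command):
% ls -lt /farm_out/$USER | head
% tail /farm_out/$USER/jupyterhub-spawner-<jobid>-<node>.log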
Spawn Errors: errors that prevent the job from starting.
You may have ifarm access but not yet be allowed to submit jobs to the cluster. To verify, log in to an ifarm node and list the users known to Slurm with sacctmgr (omitting the username will list all users):
sacctmgr list users <username>
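If you are listed, you'll get a row back. The default output looks roughly like this; the user and account shown here are illustrative:
% sacctmgr list users $USER
      User   Def Acct     Admin
---------- ---------- ---------
    wmoore     clas12      None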
If you aren't listed, submit an incident requesting access. Please include the account (e.g. clas12) you should be added to. You can get a list of the current accounts by running:
sacctmgr list accounts
Jobs that request unavailable resources will usually be stuck at the spawning screen. For example, you may be requesting more GPUs than are currently free on the cluster.
To confirm, you can use Slurm on an ifarm node to list your current jobs; you'll see the state is pending (PD) and the reason is Resources, as in the sketch below.
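squeue is the standard Slurm job-listing command; the partition and job name in this sample output are illustrative:
% squeue -u $USER
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
53690974       gpu  jupyter   wmoore PD       0:00      1 (Resources)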
Reduce the resources requested for your job. If you truly need the specified resources, submit an incident so we can review your requirements.
Run-time Errors: errors that occur during the execution of the job.
A common cause is a full home-directory disk quota: the job is unable to launch or continue due to lack of space. You can confirm this by checking your job logs in /farm_out/$USER/.
Typically errors are shown as "No space left on device":
% cd /farm_out/$USER
% tail jupyterhub-spawner-53690974-sciml1902.log
...
[E 2021-10-06 14:50:17.498 SingleUserNotebookApp largefilemanager:36] Error while saving file: path/to/my/file.root [Errno 28] No space left on device
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/notebook/services/contents/largefilemanager.py", line
32, in save
self._save_large_file(os_path, model['content'], model.get('format'))
File "/opt/conda/lib/python3.8/site-packages/notebook/services/contents/largefilemanager.py", line
70, in _save_large_file
f.write(bcontent)
OSError: [Errno 28] No space left on device
There are a couple of solutions; the simplest is to free up space in your home directory. You can locate large files and directories with either of:
ncdu -rx ~/
du -hxd1 ~/ | sort -h
Disconnects and timeouts are most often due to out-of-memory (OOM) errors. There are two easy ways to verify this.
First, your log file will likely contain something like:
% cd /farm_out/$USER
% tail jupyterhub-spawner-53690974-sciml1902.log
...
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=53690974.batch cgroup.
Some of your processes may have been killed by the cgroup out-of-memory handler.
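If you aren't sure which log belongs to the failed session, you can search all of your logs for oom-kill events at once (a simple sketch using standard grep; -l prints only the names of matching files):
% grep -l oom-kill /farm_out/$USER/*.log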
Second, seff will show the reported job state and memory efficiency:
% seff 53690974
Job ID: 53690974
Cluster: scicomp
User/Group: wmoore/ccc
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 01:14:20 core-walltime
Job Wall-clock time: 00:18:35
Memory Utilized: 5.56 GB
Memory Efficiency: 142.28% of 3.91 GB
Here the job used 5.56 GB against a 3.91 GB request, confirming it was killed by the out-of-memory handler. Reduce your job's memory footprint or request more memory when starting the session.