Closed
Description
After deploying Slurm with 'ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml', the validation playbook fails at the NCCL test step:
$ ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml -e '{num_gpus: 1}'
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
PLAY [slurm-master[0]] **********************************************************************************************************************************************************************************************
TASK [Get node count from sinfo] ************************************************************************************************************************************************************************************
changed: [aplcdhen01.datalake.jhuapl.edu]
TASK [Set num_nodes variable] ***************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu]
TASK [Set cmd variable] *********************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu]
TASK [Print node/gpu counts] ****************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu] =>
msg:
- Detected 1 nodes with 1 gpus each.
- 'Proceeding to run validation test, this may take several minutes: srun -N 1 -G 1 --ntasks-per-node=1 --mpi=pmix --exclusive --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest all_reduce_perf -b 1M -e 4G -f 2 -g 1.'
TASK [Execute NCCL test across all nodes and GPUs] ******************************************************************************************************************************************************************
fatal: [aplcdhen01.datalake.jhuapl.edu]: FAILED! => changed=true
cmd: |-
srun -N 1 -G 1 --ntasks-per-node=1 --mpi=pmix --exclusive --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest all_reduce_perf -b 1M -e 4G -f 2 -g 1
delta: '0:01:47.939670'
end: '2020-12-07 19:07:46.163125'
msg: non-zero return code
rc: 1
start: '2020-12-07 19:05:58.223455'
stderr: |-
pyxis: importing docker image ...
pyxis: creating container filesystem ...
pyxis: starting container ...
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: enroot-nsenter: failed to create user namespace: Invalid argument
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: apl-redd-ai02.datalake.jhuapl.edu: task 0: Exited with exit code 1
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
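The key line in the log above is "enroot-nsenter: failed to create user namespace: Invalid argument", reported for the compute node apl-redd-ai02. One thing that may be worth checking on that node is whether the kernel allows unprivileged user namespaces at all. The snippet below is only a diagnostic sketch, not a confirmed fix: it assumes the two sysctl files it reads are the relevant ones (user/max_user_namespaces exists on recent kernels, kernel/unprivileged_userns_clone only on some Debian/Ubuntu kernels), and check_userns is a hypothetical helper name, not part of DeepOps, pyxis, or enroot.

```shell
#!/bin/sh
# Diagnostic sketch (assumption: these two knobs are relevant to the failure).
# Prints the current value of each sysctl file, or notes that it is absent.
# check_userns is a hypothetical helper, not part of any existing tool.
check_userns() {
    # The root directory is overridable so the function can be tried
    # against a scratch directory instead of the live /proc/sys.
    root="${1:-/proc/sys}"
    for knob in user/max_user_namespaces kernel/unprivileged_userns_clone; do
        if [ -r "$root/$knob" ]; then
            printf '%s = %s\n' "$knob" "$(cat "$root/$knob")"
        else
            printf '%s = (not present)\n' "$knob"
        fi
    done
}

check_userns
```

A value of 0 for either knob would suggest that unprivileged user namespaces are disabled on that node, which would be consistent with the error above; a non-zero value would point the investigation elsewhere.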