slurm validation playbook fails - Pyxis/Enroot failing to run jobs on CentOS #784

@miketice22

After deploying Slurm with 'ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml', the validation playbook fails.

$ ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml -e '{num_gpus: 1}'
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [slurm-master[0]] **********************************************************************************************************************************************************************************************

TASK [Get node count from sinfo] ************************************************************************************************************************************************************************************
changed: [aplcdhen01.datalake.jhuapl.edu]

TASK [Set num_nodes variable] ***************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu]

TASK [Set cmd variable] *********************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu]

TASK [Print node/gpu counts] ****************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu] => 
  msg:
  - Detected 1 nodes with 1 gpus each.
  - 'Proceeding to run validation test, this may take several minutes: srun -N 1 -G 1 --ntasks-per-node=1 --mpi=pmix --exclusive  --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest all_reduce_perf -b 1M -e 4G -f 2 -g 1.'

TASK [Execute NCCL test across all nodes and GPUs] ******************************************************************************************************************************************************************
fatal: [aplcdhen01.datalake.jhuapl.edu]: FAILED! => changed=true 
  cmd: |-
    srun -N 1 -G 1 --ntasks-per-node=1 --mpi=pmix --exclusive  --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest all_reduce_perf -b 1M -e 4G -f 2 -g 1
  delta: '0:01:47.939670'
  end: '2020-12-07 19:07:46.163125'
  msg: non-zero return code
  rc: 1
  start: '2020-12-07 19:05:58.223455'
  stderr: |-
    pyxis: importing docker image ...
    pyxis: creating container filesystem ...
    pyxis: starting container ...
    slurmstepd: error: pyxis: container start failed with error code: 1
    slurmstepd: error: pyxis: printing contents of log file ...
    slurmstepd: error: pyxis:     enroot-nsenter: failed to create user namespace: Invalid argument
    slurmstepd: error: pyxis: couldn't start container
    slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
    slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
    slurmstepd: error: Failed to invoke spank plugin stack
    srun: error: apl-redd-ai02.datalake.jhuapl.edu: task 0: Exited with exit code 1
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
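
The line in the log that matters is `enroot-nsenter: failed to create user namespace: Invalid argument`, reported from the compute node (apl-redd-ai02). A possible cause, which the log alone does not confirm, is that unprivileged user namespaces are not enabled on the CentOS kernel. The sketch below checks two settings that commonly control this on CentOS 7. The function names `userns_limit` and `userns_boot_flag` are hypothetical helpers, and the assumption is that the node exposes `/proc/sys/user/max_user_namespaces` and `/proc/cmdline`.

```shell
#!/bin/sh
# Diagnostic sketch (assumption: run on the compute node that failed).
# Both helpers take an optional file path so they can be pointed at a
# copy of the file; by default they read the live kernel settings.

# Print the per-user namespace limit, or "unknown" if the file is not readable.
userns_limit() {
    f="${1:-/proc/sys/user/max_user_namespaces}"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

# Print "set" if the kernel command line contains user_namespace.enable=1,
# otherwise "not set".
userns_boot_flag() {
    f="${1:-/proc/cmdline}"
    if grep -qF 'user_namespace.enable=1' "$f" 2>/dev/null; then
        echo "set"
    else
        echo "not set"
    fi
}
```

Calling `userns_limit` and `userns_boot_flag` with no arguments on the compute node prints the live values. A limit of 0, or a missing boot flag on a CentOS 7 kernel, would be consistent with the error above, but neither result is proof of the root cause.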
