Skip to content

Conversation

@ajdecon
Copy link
Contributor

@ajdecon ajdecon commented Feb 15, 2022

Summary

We currently configure DeepOps Slurm clusters to use pam_slurm_adopt.so in the SSH authentication process. This module does two things:

  • When the user has a job on the node, it adopts the SSH process into the job so it can be killed when the job ends
  • When the user doesn't have a job on the node, it denies access

This behavior is exactly what you want on a production system, in order to prevent one user from interfering with another's workload. However, it often causes confusion for users new to Slurm, and doesn't always correspond to what you want on a development system which may be undergoing frequent change.

This PR adds two new role variables to the Slurm role:

  • slurm_restrict_node_access: When enabled, denies SSH to users without a job on the node
  • slurm_enable_pam_slurm_adopt: When enabled, sets up pam_slurm_adopt to adopt SSH processes into the job

By default, both of these are true. If slurm_restrict_node_access is set to false, SSH processes will still be adopted into the job for accounting and cleanup purposes, but otherwise allow access without a job on the node. If slurm_enable_pam_slurm_adopt is set to false, then pam_slurm_adopt.so will not be used at all.

Test plan

  • Set slurm_restrict_node_access: false and create a new user who is not included in slurm_allow_ssh_user. Confirm that the user can SSH from the login node to the compute node.
  • Set slurm_enable_pam_slurm_adopt: false. Run a job, SSH in from another terminal, and confirm the SSH process isn't adopted.
  • Confirm regular DeepOps CI tests still work as expected.

@ajdecon ajdecon requested review from avolkov1 and dholt February 15, 2022 19:08
avolkov1
avolkov1 previously approved these changes Feb 16, 2022
Copy link
Contributor

@avolkov1 avolkov1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make sense and are easy to follow. I did not test behavior varying every combination
of "slurm_enable_pam_slurm_adopt" and "slurm_restrict_node_access", but the Ansible logic looks straightforward.

If you would like me to test this more extensively let me know. Logically this does what is expected, i.e. optionally restrict/permit ssh access in general and make "pam_slurm_adopt" optional.

@ajdecon ajdecon merged commit ea9794e into NVIDIA:master Mar 7, 2022
@ajdecon ajdecon mentioned this pull request Apr 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants