Commit ea6ef9c

GPU docs
1 parent 4ab645b commit ea6ef9c

2 files changed

Lines changed: 103 additions & 0 deletions

doc/user_guides/advanced_guide.rst

Lines changed: 100 additions & 0 deletions
@@ -426,6 +426,106 @@ the resulting core affinity of the OpenMP threads are:
     Hello from rank 7, thread 0, on nid00026. (core affinity = 18)
     Hello from rank 7, thread 1, on nid00026. (core affinity = 19)

+
+Slurm with GPUs examples
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. note::
+
+    New in 0.8.0
+
+The :py:meth:`~ipsframework.services.ServicesProxy.launch_task` method
+has an option ``task_gpp`` that sets the number of GPUs per process,
+passed to ``srun`` as ``--gpus-per-task``.
+
+IPS validates that the number of GPUs requested per node does not
+exceed the value of the ``GPUS_PER_NODE`` parameter in the
+:ref:`plat-conf-sec`. Make sure that the number of GPUs per process
+times the number of processes per node does not exceed
+``GPUS_PER_NODE``.
+
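The constraint described above is a single multiplication and comparison. The following is a minimal sketch of that check, not the IPS implementation: the function names are hypothetical, and ``ValueError`` stands in for whatever exception IPS actually raises.

```python
def gpus_requested_per_node(processes_per_node: int, gpus_per_process: int) -> int:
    """Total number of GPUs a task would need on a single node."""
    return processes_per_node * gpus_per_process


def check_gpu_request(processes_per_node: int, gpus_per_process: int,
                      gpus_per_node: int) -> None:
    """Raise ValueError if the request exceeds the GPUs available on a node.

    ValueError is a stand-in here; IPS raises its own exception type.
    """
    requested = gpus_requested_per_node(processes_per_node, gpus_per_process)
    if requested > gpus_per_node:
        raise ValueError(
            f"requested {requested} GPUs per node, only {gpus_per_node} available"
        )


GPUS_PER_NODE = 4  # value quoted for Perlmutter in the text

check_gpu_request(4, 1, GPUS_PER_NODE)  # 4 processes x 1 GPU each: fits
check_gpu_request(1, 4, GPUS_PER_NODE)  # 1 process x 4 GPUs: fits
```

Under these assumptions, a request for 8 processes on one node with 1 GPU each would be rejected, because 8 exceeds the 4 GPUs available.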
+Using the `gpus_for_tasks
+<https://docs.nersc.gov/jobs/affinity/#gpus>`_ program provided for
+Perlmutter (which has 4 GPUs per node) to test the behavior, you will
+see the following:
+
+
+To launch a task with 1 process and 1 GPU per process (``task_gpp``), run:
+
+.. code-block:: python
+
+    self.services.launch_task(1, cwd, "gpu-per-task", task_gpp=1)
+
+This creates the command ``srun -N 1 -n 1 -c
+64 --threads-per-core=1 --cpu-bind=cores --gpus-per-task=1
+gpus_for_tasks`` and the output will be:
+
+.. code-block:: text
+
+    Rank 0 out of 1 processes: I see 1 GPU(s).
+    0 for rank 0: 0000:03:00.0
+
+To launch 8 processes on 2 nodes (so 4 processes per node) with 1 GPU per process, run:
+
+.. code-block:: python
+
+    self.services.launch_task(8, cwd, "gpu-per-task", task_ppn=4, task_gpp=1)
+
+This creates the command ``srun -N 2 -n 8 -c
+16 --threads-per-core=1 --cpu-bind=cores --gpus-per-task=1
+gpus_for_tasks`` and the output will be:
+
+.. code-block:: text
+
+    Rank 0 out of 8 processes: I see 1 GPU(s).
+    0 for rank 0: 0000:03:00.0
+    Rank 1 out of 8 processes: I see 1 GPU(s).
+    0 for rank 1: 0000:41:00.0
+    Rank 2 out of 8 processes: I see 1 GPU(s).
+    0 for rank 2: 0000:82:00.0
+    Rank 3 out of 8 processes: I see 1 GPU(s).
+    0 for rank 3: 0000:C1:00.0
+    Rank 4 out of 8 processes: I see 1 GPU(s).
+    0 for rank 4: 0000:03:00.0
+    Rank 5 out of 8 processes: I see 1 GPU(s).
+    0 for rank 5: 0000:41:00.0
+    Rank 6 out of 8 processes: I see 1 GPU(s).
+    0 for rank 6: 0000:82:00.0
+    Rank 7 out of 8 processes: I see 1 GPU(s).
+    0 for rank 7: 0000:C1:00.0
+
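Output in this format can be checked programmatically, for example in a test that confirms every rank sees the expected number of GPUs. The sketch below is an illustration only: ``gpus_seen_per_rank`` is a hypothetical helper, and the sample string is a two-rank excerpt of the output shown above.

```python
import re

# Matches lines such as "Rank 3 out of 8 processes: I see 1 GPU(s)."
RANK_LINE = re.compile(r"Rank (\d+) out of (\d+) processes: I see (\d+) GPU\(s\)\.")


def gpus_seen_per_rank(output: str) -> dict:
    """Map each rank number to the number of GPUs it reported."""
    return {int(m.group(1)): int(m.group(3)) for m in RANK_LINE.finditer(output)}


# Excerpt (ranks 0 and 1 only) of the 8-process output shown above.
sample = """\
Rank 0 out of 8 processes: I see 1 GPU(s).
0 for rank 0: 0000:03:00.0
Rank 1 out of 8 processes: I see 1 GPU(s).
0 for rank 1: 0000:41:00.0
"""

seen = gpus_seen_per_rank(sample)
# Every rank in the excerpt should report exactly task_gpp=1 GPU.
assert all(count == 1 for count in seen.values())
```

The PCI bus lines (``0 for rank 0: ...``) are ignored by the pattern, so only the per-rank summary lines contribute to the result.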
+To launch 2 processes on 2 nodes (so 1 process per node) with 4 GPUs per process, run:
+
+.. code-block:: python
+
+    self.services.launch_task(2, cwd, "gpu-per-task", task_ppn=1, task_gpp=4)
+
+This creates the command ``srun -N 2 -n 2 -c
+64 --threads-per-core=1 --cpu-bind=cores --gpus-per-task=4
+gpus_for_tasks`` and the output will be:
+
+.. code-block:: text
+
+    Rank 0 out of 2 processes: I see 4 GPU(s).
+    0 for rank 0: 0000:03:00.0
+    1 for rank 0: 0000:41:00.0
+    2 for rank 0: 0000:82:00.0
+    3 for rank 0: 0000:C1:00.0
+    Rank 1 out of 2 processes: I see 4 GPU(s).
+    0 for rank 1: 0000:03:00.0
+    1 for rank 1: 0000:41:00.0
+    2 for rank 1: 0000:82:00.0
+    3 for rank 1: 0000:C1:00.0
+
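The three ``srun`` commands above follow a regular pattern, which the sketch below reproduces. It is not the IPS implementation: ``build_srun_command`` is a hypothetical helper, and both the 64 cores per node and the rule that ``-c`` equals cores per node divided by processes per node are inferred from the example commands. The first example, which does not pass ``task_ppn``, is treated here as 1 process per node.

```python
import math


def build_srun_command(nproc: int, ppn: int, gpp: int, binary: str,
                       cores_per_node: int = 64) -> str:
    """Assemble an srun command line matching the examples in the text.

    cores_per_node=64 and the -c calculation are assumptions inferred
    from the example commands, not taken from the IPS source.
    """
    nodes = math.ceil(nproc / ppn)         # -N: nodes needed for nproc at ppn per node
    cpus_per_task = cores_per_node // ppn  # -c: cores available to each process
    return (
        f"srun -N {nodes} -n {nproc} -c {cpus_per_task} "
        f"--threads-per-core=1 --cpu-bind=cores "
        f"--gpus-per-task={gpp} {binary}"
    )


# The three launches from the text: (nproc, ppn, gpp)
for nproc, ppn, gpp in [(1, 1, 1), (8, 4, 1), (2, 1, 4)]:
    print(build_srun_command(nproc, ppn, gpp, "gpus_for_tasks"))
```

Only ``--gpus-per-task`` depends on ``task_gpp``; the node count and ``-c`` value are determined by the process count and processes per node.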
+If you try to launch a task that needs more GPUs per node than
+``GPUS_PER_NODE`` allows, *e.g.*:
+
+.. code-block:: python
+
+    self.services.launch_task(8, cwd, "gpu-per-task", task_gpp=1)
+
+then it will raise a :class:`~ipsframework.ipsExceptions.GPUResourceRequestMismatchException`.
+
 .. automethod:: ipsframework.services.ServicesProxy.launch_task
     :noindex:


doc/user_guides/platform.rst

Lines changed: 3 additions & 0 deletions
@@ -423,6 +423,9 @@ The platform configuration file contains platform specific information that the
     one task can share a node [#nochange]_. Simulations,
     components and tasks can set their node usage allocation
     policies in the configuration file and on task launch.
+**GPUS_PER_NODE**
+    number of GPUs per node, used to validate ``launch_task`` calls
+    that set ``task_gpp``; see :meth:`~ipsframework.services.ServicesProxy.launch_task`.


 .. [#nochange] This value should not change unless the machine is
