This project contains some scripts to ease the training of models at the DFKI GPU cluster.
- One-time setup of the execution environment.
- Persistent cache. This is useful, for instance, when working with Huggingface to cache models and dataset preprocessing steps.
- If an environment variables file `.env` is found in the current working directory, all contained variables are exported automatically and are available inside the Slurm job.
This approach requires some manual housekeeping. Since the cache is persisted (by default to `/netscratch/$USER/.cache_slurm`), it needs to be cleaned up from time to time. It is also recommended to remove conda environments when they are not needed anymore.
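For example, both cleanup tasks can be done with standard commands like these (`{env}` is a placeholder for your environment name, and the cache path assumes the default location):

```bash
# Clear the persisted cache (default location; adjust if you changed HOST_CACHEDIR)
rm -rf /netscratch/$USER/.cache_slurm/*
# Remove a conda environment that is no longer needed
conda env remove -n {env}
```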
To train models at the cluster, we first need to set up a suitable Python environment. Then, we can call a wrapper script that will start a Slurm job with a selected Enroot image and execute the command we passed to it within the job. The following describes this in detail.
- Install Miniconda
  - Download the Miniconda setup script using the following command:
    `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh`
  - Install Miniconda, accept the terms and conditions, and set the proper installation path.
    This can be done non-interactively, all in one line, by running `bash ./miniconda.sh -b -p /netscratch/$USER/miniconda3`, where `-b` accepts the terms and `-p <installation-path>` sets the installation path.
    Alternatively, use `bash ./miniconda.sh` for an interactive dialogue in which you get to read and agree to the terms and conditions, as well as set the installation path.
  - Initialise conda by running `<your_path_to_miniconda>/miniconda3/bin/conda init bash`, e.g. `/netscratch/$USER/miniconda3/bin/conda init bash` if you installed in the default location.
  - Switch to the `conda-forge` channel. Because of license restrictions, we have to use `conda-forge` and disable the default conda channel:
    - Add conda-forge as the highest priority channel (taken from here): `conda config --add channels conda-forge`
    - Disable the default conda channel: `conda config --remove channels defaults`
  - (OPTIONAL) Set `conda config --set auto_activate_base false` to stop conda from activating the base environment for each new shell session. This leads to a significant speedup in opening new shell sessions.
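  Putting the steps above together, a complete non-interactive install looks roughly like this (all commands are taken from the steps above):

  ```bash
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  bash ./miniconda.sh -b -p /netscratch/$USER/miniconda3    # -b accepts the terms, -p sets the path
  /netscratch/$USER/miniconda3/bin/conda init bash
  bash                                                      # start a fresh shell so conda is on the PATH
  conda config --add channels conda-forge                   # conda-forge as highest-priority channel
  conda config --remove channels defaults                   # disable the default channel
  conda config --set auto_activate_base false               # optional: speeds up new shell sessions
  ```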
- Set up a conda environment
  - Create a conda environment, e.g. using: `conda create -n {env} python=3.9` (replace `{env}` with a name of your choice)
  - Either start a screen / tmux session or make sure you are in bash (type `bash` in the terminal) to make the conda commands available.
  - Activate the environment: `conda activate {env}`
  - Install any required Python packages. We recommend that you use the PyPI cache installed at the cluster as described here, e.g. using:
    `pip install --no-cache --index-url http://pypi-cache/index --trusted-host pypi-cache <package>`
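  For orientation, the whole sequence in one shell session (`{env}` and `<package>` are placeholders as above):

  ```bash
  bash                                   # make sure the conda commands are available
  conda create -n {env} python=3.9       # replace {env} with a name of your choice
  conda activate {env}
  pip install --no-cache --index-url http://pypi-cache/index --trusted-host pypi-cache <package>
  ```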
- Get this code and cd into it: `git clone https://github.com/DFKI-NLP/pegasus-bridle.git && cd pegasus-bridle`
- Create the cache folder `/netscratch/$USER/.cache_slurm` if it doesn't exist, e.g. with the one-liner shown below.
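  ```bash
  mkdir -p /netscratch/$USER/.cache_slurm   # -p makes this a no-op if the folder already exists
  ```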
- Prepare the Slurm setup environment variable file:
  - Create a Slurm setup environment variable file by copying the example file:
    - Either `cp .env.example path/to/your/project/.pegasus-bridle.env` (recommended). The `.pegasus-bridle.env` file will be used by the wrapper script if it is found in the current working directory of your project. It is possible to create a `.pegasus-bridle.env` file in each of your projects. This way, you can have different configurations for each project. If the wrapper script detects the `.pegasus-bridle.env` file, it will use it instead of a default `.env` file in the pegasus-bridle directory (Option 2).
    - Or run `cp .env.example .env` in the `pegasus-bridle` directory. The `.env` file will be the default configuration and will be used by the wrapper script in case no `.pegasus-bridle.env` is detected in the current working directory (Option 1).
  - Adapt either the `.pegasus-bridle.env` or the `.env` to your needs and ensure that the respective paths exist at the host; create them if necessary (especially for `HOST_CACHEDIR`). A sketch follows this list.
  - Make sure the images you are using contain a conda installation.
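  A minimal sketch of an adapted file. Only `HOST_CACHEDIR` and `CONDA_ENV` are mentioned in this guide; the values below are illustrative, and `.env.example` lists the authoritative set of variables:

  ```bash
  # .pegasus-bridle.env (illustrative sketch; see .env.example for the real variable set)
  HOST_CACHEDIR=/netscratch/$USER/.cache_slurm   # must exist on the host, create it if necessary
  CONDA_ENV=my-env                               # optional: fix the conda environment name (see below)
  ```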
- Activate the conda environment at the host: `conda activate {env}`.
  Note: This is just required to pass the conda environment name to the wrapper script. You can also set a fixed name by directly overwriting the environment variable `CONDA_ENV` in the `.env` file (see above).
- Run `wrapper.sh` from anywhere, e.g. your project directory, and pass the python command to execute as its arguments: `bash path/to/pegasus-bridle/wrapper.sh command with arguments`

  Example usage (assuming you cloned the `pegasus-bridle` repository to `/home/$USER/projects`, want to run `src/train.py` in the current directory, and there is either a `.env` file in the pegasus-bridle directory or a `.pegasus-bridle.env` file in the current working directory):
  `bash /home/$USER/projects/pegasus-bridle/wrapper.sh python src/train.py +trainer.fast_dev_run=true`
Notes:
- If an environment variables file `.env` is found in the current working directory (this is not the `.env` file you have created for the Slurm setup), all contained variables are exported automatically and are available inside the Slurm job; see the sketch after this list.
- For more details about the Slurm cluster, please follow this link.
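For example, a project-local `.env` could inject credentials or cache settings into the job. The variable names below are hypothetical illustrations, not required by pegasus-bridle:

```bash
# .env in your project's working directory; all contained variables are exported into the Slurm job
HF_HOME=/netscratch/$USER/.cache_slurm/huggingface   # hypothetical: redirect the Huggingface cache
WANDB_API_KEY=xxxxxxxx                               # hypothetical: credential for experiment tracking
```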
- Run `bash path/to/interactive.sh` from your project directory
- [OPTIONAL] Activate the conda environment inside the slurm job:
  - Execute `source /opt/conda/bin/activate`
  - Activate the conda environment: `conda activate {env}`
Note: This uses the same environment variables as `wrapper.sh`. You may modify them before starting an interactive session, especially variables related to resource allocation.
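Putting the interactive steps together (paths and `{env}` are placeholders as in the previous sections):

```bash
cd path/to/your/project
bash path/to/pegasus-bridle/interactive.sh
# ...once inside the interactive Slurm job (optional):
source /opt/conda/bin/activate
conda activate {env}
```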
- Malte's Getting Started Guide
- Jan's How-to-Pegasus
- Connect via SSH to a Slurm compute job that runs as Enroot container (for GPU debugging with your IDE)
