Add more tests to cover containerd/driver container k8s deployments #1139

supertetelman · 2022-03-24T23:54:56Z

We now have the following configurations to test.

This change covers all of them in the nightly builds, which will probably take around 4 hours end to end.

I tried to mix and match the tests run across the configurations so that we have some confidence without filling out a dense test matrix. Kubeflow is tested once with GPU Operator and once with the device plugin.

I skipped the local-registry tests on containerd installs but kept them for docker installs. They may or may not work given the recent changes.

I made sure that the monitoring stack is tested with at least one each of the device plugin, docker, containerd, and driver-container configurations.
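As a rough illustration of the mix-and-match idea above, here is a minimal Python sketch that deals tests out across configurations round-robin, so every test runs somewhere without running every test everywhere. The configuration and test labels, `spread_tests`, and `uncovered` are hypothetical names for illustration only; they are not the job names or the actual assignment used in the Jenkinsfiles.

```python
from itertools import cycle

# Hypothetical labels for illustration; not the real Jenkins job or test names.
CONFIGS = [
    "docker + device-plugin",
    "containerd + device-plugin",
    "containerd + gpu-operator",
    "driver-container + gpu-operator",
]
TESTS = ["monitoring", "kubeflow", "gpu-job", "dashboard"]


def spread_tests(configs, tests, per_config=2):
    """Deal tests out round-robin so each configuration runs `per_config` distinct tests."""
    if per_config > len(tests):
        raise ValueError("per_config cannot exceed the number of tests")
    deck = cycle(tests)
    # Consecutive draws from the cycle are distinct while per_config <= len(tests).
    return {config: [next(deck) for _ in range(per_config)] for config in configs}


def uncovered(plan, tests):
    """Return the set of tests that no configuration in the plan runs."""
    covered = {test for assigned in plan.values() for test in assigned}
    return set(tests) - covered


if __name__ == "__main__":
    plan = spread_tests(CONFIGS, TESTS, per_config=2)
    for config, assigned in plan.items():
        print(f"{config}: {', '.join(assigned)}")
    print("uncovered tests:", sorted(uncovered(plan, TESTS)))
```

A check like `uncovered` is one way to confirm that a sparse plan still exercises every test at least once; specific exclusions, such as skipping a test on one runtime, would need to be layered on top of this sketch.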


ajdecon · 2022-03-29T21:50:54Z

workloads/jenkins/Jenkinsfile

            timeout 180 bash -x ./workloads/jenkins/scripts/test-dashboard.sh
          '''

+          echo "Start new virtual environment pre-Slurm checks"


Do we need to explicitly tear down the VMs before we do this?

Tearing them down is part of this script.


ajdecon

LGTM. Confirmed that tests passed in nightly builds.

dholt and others added 2 commits March 24, 2022 16:58

Disable container registry test

e6baed1

Add more tests to cover contaerind/driver container k8s deployments

d30b231

supertetelman force-pushed the more-tests branch from 9dfb377 to d30b231 on March 24, 2022 23:58

supertetelman added 4 commits March 29, 2022 14:25

Merge branch 'master' into more-tests

749f702

Merge branch 'master' into more-tests

7d60364

Check DCGM-Exporter when GPU Operator is not used in Jenkins

00dc540

Restart fresh VMs in PR test inbetween Slurm/K8s due to new Operator/…

322ee48

…Slurm overlap issues

ajdecon reviewed Mar 29, 2022


supertetelman force-pushed the more-tests branch from 3c4471d to efcf7aa on March 30, 2022 06:11

supertetelman added 2 commits March 29, 2022 23:12

small tweaks to Jenkinsfiles to add back/change timeout of dcgm-metri…

1f80c67

…cs tests

temporarily disable Kubeflow tests

dbf4c87

supertetelman force-pushed the more-tests branch from efcf7aa to dbf4c87 on March 30, 2022 06:12

ajdecon approved these changes Mar 30, 2022


ajdecon merged commit 704a097 into NVIDIA:master Mar 30, 2022

ajdecon mentioned this pull request Apr 26, 2022

DeepOps Release 22.04 #1164

Merged
