Conversation
Pull Request Overview
The purpose of this PR is to introduce a new CI pipeline for the H100 environment. Key changes include:
- Adding a new performance benchmark file (perf_ndmv5.jsonl) for latency tests.
- Updating multiple Azure Pipelines YAML configuration files to include new jobs, container image substitutions, and hardware-specific job definitions (e.g., H100 and A100 variants).
- Creating and adjusting pipeline templates (ut.yaml, ut-npkit.yaml, nccl-test.yaml, integration-test.yaml) to accommodate the new hardware environment.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/deploy/perf_ndmv5.jsonl | Added new performance benchmark data for latency tests. |
| .azure-pipelines/ut.yml | Updated job names and container image substitution to support new hardware targets. |
| .azure-pipelines/nccl-api-test.yaml | Introduced H100-specific parameters including adjusted nvccGencode value. |
| .azure-pipelines/integration-test.yml | Added an H100 integration job and updated baseline file references. |
| Other pipeline template files | Refactored templates to support the new CI pipeline structure for H100. |
Comments suppressed due to low confidence (3)
.azure-pipelines/ut.yml:13
- [nitpick] Verify that the use of 'UnitTestA100' accurately reflects the intended hardware target. If this job is meant for H100 tests, consider aligning the name consistently to avoid potential confusion.
+- job: UnitTestA100
.azure-pipelines/nccl-api-test.yaml:55
- [nitpick] Confirm that the 'nvccGencode' value '-gencode=arch=compute_90,code=sm_90' is correct for the H100 target and reflects the latest hardware specifications.
nvccGencode: "-gencode=arch=compute_90,code=sm_90"
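For context on this check: the `-gencode` flag pairs a virtual architecture (`compute_XX`) with a real one (`sm_XX`), and H100 (Hopper) is compute capability 9.0 while A100 (Ampere) is 8.0, so `compute_90,code=sm_90` is indeed the H100 value. A minimal sketch of how such a flag could be derived from a GPU name (the mapping table and helper name are illustrative, not part of this PR):

```python
# Map GPU product names to CUDA compute capabilities.
# The capability values are standard NVIDIA specs; the helper is illustrative.
GPU_COMPUTE_CAPABILITY = {
    "A100": "80",  # Ampere
    "H100": "90",  # Hopper
}

def nvcc_gencode(gpu: str) -> str:
    """Build the -gencode flag nvcc expects for a given GPU target."""
    cc = GPU_COMPUTE_CAPABILITY[gpu]
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"
```

For example, `nvcc_gencode("H100")` yields the exact string used in the pipeline snippet above.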
.azure-pipelines/integration-test.yml:25
- [nitpick] Ensure that the consistent variable substitution syntax '$(containerImage)' is used across all pipeline files, as mixed syntaxes can potentially lead to discrepancies.
image: $(containerImage)
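For reference on why mixed syntaxes are risky: Azure Pipelines resolves `$(var)` macro syntax at runtime, while `${{ variables.var }}` template syntax is expanded when the template is compiled, so the same variable can produce different results depending on which form a file uses. A hedged illustration of the consistent macro form (the variable value and job name are made up):

```yaml
variables:
  containerImage: 'example.azurecr.io/test-image:latest'  # illustrative value

jobs:
  - job: IntegrationTestH100
    container:
      image: $(containerImage)  # macro syntax, resolved at runtime
```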
@@ -0,0 +1,3 @@
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":7.42, "busBw":12.99, "size":49152, "time":6.62, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":10.67, "busBw":18.68, "size":73728, "time":6.91, "target":"latency"}
(No newline at end of file.)
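One way a baseline file like this can be consumed by the pipeline: parse each JSONL record and compare a measured latency against the recorded `time` within some tolerance. A sketch under the assumption that the CI check works roughly this way (the helper names and the 15% tolerance are hypothetical, not taken from this PR):

```python
import json

def load_baselines(path: str) -> list[dict]:
    """Parse a perf baseline .jsonl file into a list of records."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def check_latency(baseline: dict, measured_us: float, tolerance: float = 0.15) -> bool:
    """Pass if the measured time stays within `tolerance` of the baseline time."""
    if baseline.get("target") != "latency":
        return True  # only latency-targeted entries are checked here
    return measured_us <= baseline["time"] * (1 + tolerance)
```

With the first record above (`time: 6.18`), a measured 6.5 us would pass while 8.0 us would fail under the assumed 15% tolerance.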
A few questions about this file:
- I noticed that in perf_ndmv4.jsonl we have more test cases, with both 8 and 16 ranks and different kernels. Is there a reason we only test with 8 ranks and kernel 6 for H100 validation?
- Do we have performance tests for collectives other than allreduce in the CI pipeline?
Oh, this is because most of the algorithms are designed for ndv4; we haven't tuned them for ndv5. Only algo 6 (allreduce with small message sizes) is reasonable for ndv5.
In the future we will move to DSL-based algorithms and add perf regression tests based on that. For now, the perf file is added to pass the CI pipeline and test some functionality.
Set Up a CI Pipeline for H100