## Conversation

> !test

## Description

| Relevant files | |
|---|---|
| Bug fix | |
| Tests | |
## PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 **PR contains tests**

⚡ **Recommended focus areas for review**
**Logic Correctness**

Uses `oneVal()` instead of dividing by `num_devices` for device dimensions, which prevents incorrect sharding calculations. The conditional logic properly distinguishes device dimensions from other sharded dimensions.
## Test failures (partial, pipeline still running)

- (Medium, 3) Shape mismatch in thunderfx higher-order inplace alias update test (`thunder/tests/test_update_aliases.py`)

  | Test Name | A100 | GB200 | H100 | Source |
  |---|---|---|---|---|
  | `thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32` | ❌ | ❌ | ❌ | |

- (Medium, 1) thunder nanogpt scalar mismatch in `test_networks` (nvFuser CUDA)

  | Test Name | GB200 | Source |
  |---|---|---|
  | `thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32` | ❌ | |
## Greptile Overview

### Greptile Summary

Confidence Score: 2/5

### Important Files Changed
### Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as pytest (MPI)
    participant FD as FusionDefinition
    participant TV as TensorView scheduling
    participant Sched as vectorize_helper.cpp
    participant Exec as fd.execute
    Test->>FD: define inp_tv/out_tv (device-parallel layout)
    FD->>TV: set_device_mesh(mesh)
    Test->>TV: parallelize axes (mesh_x/mesh_y/mesh_z)
    Test->>Exec: execute(sharded inp)
    Exec->>Sched: analyze vectorization
    Sched->>Sched: getContigMergeOfInnerSize(of_tv)
    Sched-->>Exec: computed contiguous inner size
    Exec-->>Test: produce sharded output
    Test->>Test: assert_close(output, expected)
```
```cpp
Val* sharded_extent;
if (logical_id->isDeviceDim()) {
  sharded_extent = of_tv->container()->oneVal();
```
**Incorrect sharded extent**

In `getContigMergeOfInnerSize`, when `logical_id->isDeviceDim()` you set `sharded_extent = 1`. That makes the contiguity product ignore the local per-rank extent of a device-parallel logical dimension, which can make the computed contiguous-inner-merge size too large and enable vectorization when the local contiguous span is actually smaller. In this function, device dims still contribute their projected extent (which is already per-rank), so they shouldn't be dropped from the product.

The same issue also appears to affect the logic at `csrc/scheduler/vectorize_helper.cpp:835-838`, where the `num_devices` division is skipped entirely for device dims.
```cpp
      getProjectedExtent(logical_id), num_devices);
Val* sharded_extent;
if (logical_id->isDeviceDim()) {
  sharded_extent = of_tv->container()->oneVal();
```
IIUC, the patch is for the case where we don't have any expr between the logical and allocation domains. Looks straightforward to me.

Yet I'm surprised that this wasn't caught earlier.
Fixes #5920