Move torch.cond predicate non-persistent buffer to CPU #16378

larryliu0820 · 2025-12-23T20:28:16Z

Avoid device-to-host memory copies when evaluating torch.cond predicates.

When a GPU buffer (e.g., a KV cache initialized flag) is used as a predicate for torch.cond, the runtime must synchronize and copy the predicate value from GPU to CPU on every forward pass to evaluate the condition. This adds latency and synchronization overhead.

MoveCondPredicateToCpuPass moves non-persistent buffer predicates to CPU at export time, eliminating per-inference D2H transfers. The predicate is typically a small scalar (e.g., a boolean flag), so keeping it on CPU has negligible memory impact.

- Add `MoveCondPredicateToCpuPass` in `backends/cuda/passes/`
- Add unit tests covering:
  - GPU buffer predicates moved to CPU
  - CPU buffer predicates unchanged
  - Computed predicates unaffected
  - Multiple `torch.cond` calls
  - Cross-attention cache pattern
  - Persistent buffers (state_dict) not moved
- Add Python tests to `unittest-cuda` CI job in `cuda.yml`
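
To make the rule described above concrete, here is a minimal sketch of the decision the pass makes, written against a toy model rather than the real exported program. The `Buffer` and `CondNode` classes and the `move_cond_predicates_to_cpu` function are hypothetical names invented for illustration; they are not ExecuTorch or PyTorch APIs, and the actual pass in `backends/cuda/passes/` may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Buffer:
    """Toy stand-in for a module buffer (hypothetical, not an ExecuTorch type)."""
    device: str       # "cuda" or "cpu"
    persistent: bool  # True if the buffer is saved in the state_dict


@dataclass
class CondNode:
    """Toy stand-in for one torch.cond call site in an exported graph."""
    # Name of the buffer used directly as the predicate, or None when the
    # predicate is computed by other ops instead of read from a buffer.
    predicate_buffer: Optional[str]


def move_cond_predicates_to_cpu(
    buffers: Dict[str, Buffer], cond_nodes: List[CondNode]
) -> List[str]:
    """Move non-persistent buffers used as cond predicates to CPU.

    Returns the names of the buffers that were moved, in first-seen order.
    """
    moved = []
    for node in cond_nodes:
        name = node.predicate_buffer
        if name is None:
            continue  # computed predicate: unaffected
        buf = buffers[name]
        if buf.persistent:
            continue  # persistent (state_dict) buffers are not moved
        if buf.device == "cpu":
            continue  # already on CPU: unchanged
        buf.device = "cpu"
        moved.append(name)
    return moved


buffers = {
    "initialized": Buffer(device="cuda", persistent=False),
    "cross_initialized": Buffer(device="cuda", persistent=False),
    "host_flag": Buffer(device="cpu", persistent=False),
    "saved_flag": Buffer(device="cuda", persistent=True),
}
cond_nodes = [
    CondNode("initialized"),
    CondNode("cross_initialized"),
    CondNode("host_flag"),
    CondNode("saved_flag"),
    CondNode(None),
    CondNode("initialized"),  # the same flag may gate more than one cond
]
moved = move_cond_predicates_to_cpu(buffers, cond_nodes)
print(moved)
```

The sample data mirrors the cases in the test list: two GPU flags are moved, the CPU flag and the persistent buffer are left where they are, the computed predicate is skipped, and a second `torch.cond` that reuses an already-moved flag does not move it again.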


larryliu0820 · 2025-12-23T20:28:16Z

Stack from ghstack (oldest at bottom):

-> Move torch.cond predicate non-persistent buffer to CPU #16378

pytorch-bot · 2025-12-23T20:28:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16378

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit ab861b9 with merge base c5d66a5:

NEW FAILURES - The following jobs have failed:

pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
RuntimeError: Command docker exec -t e26fd998c7c5937ebf33bea3b5ecf405ea03c36897c1af115ffcdc513a50d7df /exec failed with exit code 139
Test CUDA Builds / unittest-cuda / linux-job (gh)
RuntimeError: Command docker exec -t 4b9f95085d462c768d9b002f2fc31322541e3654d8b0c9c008a7d62ae87276a6 /exec failed with exit code 134

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / android / run-emulator (gh) (#16137)
Timeout waiting for emulator to boot.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Avoid device-to-host memory copies when evaluating `torch.cond` predicates.

When a GPU buffer (e.g., a KV cache `initialized` flag) is used as a predicate for `torch.cond`, the runtime must synchronize and copy the predicate value from GPU to CPU on every forward pass to evaluate the condition. This adds latency and synchronization overhead.

`MoveCondPredicateToCpuPass` moves non-persistent buffer predicates to CPU at export time, eliminating per-inference D2H transfers. The predicate is typically a small scalar (e.g., a boolean flag), so keeping it on CPU has negligible memory impact.

- Add `MoveCondPredicateToCpuPass` in `backends/cuda/passes/`
- Add unit tests covering:
  - GPU buffer predicates moved to CPU
  - CPU buffer predicates unchanged
  - Computed predicates unaffected
  - Multiple `torch.cond` calls
  - Cross-attention cache pattern
  - Persistent buffers (state_dict) not moved
- Add Python tests to `unittest-cuda` CI job in `cuda.yml`

ghstack-source-id: ff22758
ghstack-comment-id: 3687889864
Pull-Request: #16378

Gasoonjia

gogogo!

.github/workflows/cuda.yml

backends/cuda/passes/move_cond_predicate_to_cpu.py


larryliu0820 added 21 commits December 19, 2025 11:21

- Update 63a2766
- Update f02dbe1
- Update 9a7aa91
- Update bc07a7b
- Update a97933b
- Update 99ca698
- Update e1bb6c2
- Update 395ab4f
- Update 2a7a9f0
- Update a86ab6e
- Update ca3ac6d
- Update 8b94087
- Update 5f755f9
- Update 690546b
- Update 73efe12
- Update d96dec8
- Update eb6a7e6
- Update d5c53ec
- Update 8b8580d
- Update ba6fdff
- Update b103b7f

larryliu0820 requested review from JacobSzwejbka, SS-JIA, cccclai, digantdesai, lucylq and mergennachin as code owners December 23, 2025 20:28

meta-cla bot added the CLA Signed label Dec 23, 2025

larryliu0820 mentioned this pull request Dec 23, 2025

Custom op to update cache for torch.cond #16366

Merged

Gasoonjia approved these changes Dec 23, 2025

.github/workflows/cuda.yml (outdated, resolved)

backends/cuda/passes/move_cond_predicate_to_cpu.py (resolved)

backends/cuda/passes/move_cond_predicate_to_cpu.py (resolved)

larryliu0820 added 2 commits December 23, 2025 23:41

- Update a8b20f5
- Update 016adb3

Base automatically changed from gh/larryliu0820/85/head to main December 24, 2025 00:41

- Update 3fc3117

larryliu0820 added the release notes: desktop label Dec 24, 2025

- Update e8349a7
- Update 5897ba4
- Update ab861b9


3 participants
