Fix activation offload storage dedupe reuse#6241
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 3ebd792. Configure here.
| # Track this storage for deduplication | ||
| self.storage_to_tensor_id[storage_key] = tensor_id | ||
| # Track this storage for deduplication while the GPU tensor is live. | ||
| track_storage_key(storage_key, tensor_id) |
There was a problem hiding this comment.
Single-stream dedupe tracking omitted
Medium Severity
After offloading an activation, track_storage_key runs only when use_streams is true. With use_streams=False, storage_to_tensor_id is never updated for offloaded tensors, so shared-storage views are not deduplicated and each view may be copied to CPU separately.
Reviewed by Cursor Bugbot for commit 3ebd792. Configure here.
There was a problem hiding this comment.
I think this issue may need being considered. What do you think, @winglian and @kashif ?
I think the observation is factually true: two views sharing a storage key:
use_streams=True: 1 offloaded + 1 deduped;- use_streams=False: 2 offloaded + 0 deduped.
Both paths ran backward() without error.
However, I think it is not a correctness issue and also its implied fix would reintroduce the very bug this PR fixes.
- Not a correctness issue: With
use_streams=False, shared-storage views are each copied to CPU separately instead of one being kept on GPU. Both paths produce correct gradients: it's a CPU-memory/bandwidth regression only, and only for saved tensors that genuinely share a storage key (same data_ptr+offset), which is a subset of "views." - It's essentially necessary, not an oversight: The whole fix rests on "track the key only while the GPU tensor is live, release it when freed." In streams mode, fwd_stash keeps the offloaded GPU tensor alive and reaping gives a clean release point. Single-stream mode has no fwd_stash: the offloaded GPU activation is dropped right after pack_tensor returns, so there is no bounded live-window and no release hook during the forward pass.
- If you simply moved
track_storage_keyback outside theif self.use_streamsguard (what the comment implies), you'd get exactly the pre-fix behavior: a stale key sitting in the map for the rest of the forward, ready to false-dedup a reused address. Indeed, the pre-fix code tracked unconditionally and had this bug in single-stream too. So scoping tracking to streams-only is a deliberate, correct trade-off given the current design.
- If you simply moved
Recommendation:
- Add a one-line comment at line 351/352 stating tracking is streams-only because single-stream lacks a live-window to bound the key (prevents a future contributor from "fixing" it back into the bug).
- If single-stream dedup is actually wanted, it needs its own machinery (e.g. a minimal single-stream stash). However I think this implies more code than the repo's simplicity ethos warrants for a fallback path, and I'd skip it.
There was a problem hiding this comment.
Pull request overview
Fixes a correctness bug in activation offloading’s storage-based deduplication: when GPU storages are freed and their addresses later reused by the allocator, the previous dedupe entry could incorrectly cause unrelated tensors to be treated as already-offloaded “views” and skipped.
Changes:
- Add reverse mapping (
_storage_key_by_tensor_id) and helpers to reliably remove stale storage dedupe entries when tensors are no longer live. - Release dedupe entries when forward stash entries are reaped and when tensors are unpacked / forward stash is drained during backward.
- Add a targeted regression test that simulates allocator address reuse by stubbing
_get_unique_tensor_keyand asserts no incorrect dedup occurs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
trl/models/activation_offloading.py |
Introduces storage-key tracking/release to prevent stale dedupe entries from causing skipped offloads after storage reuse. |
tests/test_activation_offloading.py |
Adds a regression test to ensure reused storage keys are not incorrectly deduplicated after stash reaping. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
albertvillanova
left a comment
There was a problem hiding this comment.
Thanks. I think, the fix is correct and well-tested.
The Bugbot comment describes a real but minor, non-correctness perf difference in the single-stream fallback that is an intended consequence of the fix; its implied remedy would undo the fix. Worth a clarifying code comment, nothing more. See my comment below.
Do you agree?
| # Track this storage for deduplication | ||
| self.storage_to_tensor_id[storage_key] = tensor_id | ||
| # Track this storage for deduplication while the GPU tensor is live. | ||
| track_storage_key(storage_key, tensor_id) |
There was a problem hiding this comment.
I think this issue may need being considered. What do you think, @winglian and @kashif ?
I think the observation is factually true: two views sharing a storage key:
use_streams=True: 1 offloaded + 1 deduped;- use_streams=False: 2 offloaded + 0 deduped.
Both paths ran backward() without error.
However, I think it is not a correctness issue and also its implied fix would reintroduce the very bug this PR fixes.
- Not a correctness issue: With
use_streams=False, shared-storage views are each copied to CPU separately instead of one being kept on GPU. Both paths produce correct gradients: it's a CPU-memory/bandwidth regression only, and only for saved tensors that genuinely share a storage key (same data_ptr+offset), which is a subset of "views." - It's essentially necessary, not an oversight: The whole fix rests on "track the key only while the GPU tensor is live, release it when freed." In streams mode, fwd_stash keeps the offloaded GPU tensor alive and reaping gives a clean release point. Single-stream mode has no fwd_stash: the offloaded GPU activation is dropped right after pack_tensor returns, so there is no bounded live-window and no release hook during the forward pass.
- If you simply moved
track_storage_keyback outside theif self.use_streamsguard (what the comment implies), you'd get exactly the pre-fix behavior: a stale key sitting in the map for the rest of the forward, ready to false-dedup a reused address. Indeed, the pre-fix code tracked unconditionally and had this bug in single-stream too. So scoping tracking to streams-only is a deliberate, correct trade-off given the current design.
- If you simply moved
Recommendation:
- Add a one-line comment at line 351/352 stating tracking is streams-only because single-stream lacks a live-window to bound the key (prevents a future contributor from "fixing" it back into the bug).
- If single-stream dedup is actually wanted, it needs its own machinery (e.g. a minimal single-stream stash). However I think this implies more code than the repo's simplicity ethos warrants for a fallback path, and I'd skip it.


What does this PR do?
in #4247, the intent was to track duplicate storage when the tensors are views. however, tensors are offloaded and released, the allocator will reuse storage pointers for new tensors, which end up not getting offloaded.
Fixes # (issue)
Before submitting
AI writing disclosure
We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Note
Medium Risk
Touches core pack/unpack and stream stash lifecycle for activation offloading; incorrect release timing could break view deduplication or reintroduce skipped offloads, but behavior is covered by a targeted regression test.
Overview
Fixes incorrect storage deduplication in
OffloadActivationswhere a storage key could stay instorage_to_tensor_idafter the original activation was offloaded and its GPU buffer freed. The allocator could then reuse the same address for an unrelated tensor, which was wrongly skipped for offload as if it were a live view.The change adds
_storage_key_by_tensor_idplustrack_storage_key/release_storage_key, and clears dedupe entries when tensors are no longer live: whenfwd_stashentries are reaped during forward, when tensors are unpacked in backward, and when remaining forward stash is drained in backward hooks.__enter__also clears the reverse map. Dedupe keys are only tracked while the corresponding GPU activation is still held.Adds
test_reused_storage_key_after_stash_reap_is_not_deduplicated, which stubs_get_unique_tensor_keyto simulate allocator reuse and asserts all three saved tensors are offloaded (none incorrectly deduplicated).Reviewed by Cursor Bugbot for commit 3ebd792. Bugbot is set up for automated code reviews on this repo. Configure here.