Fix activation offload storage dedupe reuse by winglian · Pull Request #6241 · huggingface/trl

winglian · 2026-07-02T08:33:50Z

What does this PR do?

in #4247, the intent was to track duplicate storage when the tensors are views. however, tensors are offloaded and released, the allocator will reuse storage pointers for new tensors, which end up not getting offloaded.

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Note

Medium Risk
Touches core pack/unpack and stream stash lifecycle for activation offloading; incorrect release timing could break view deduplication or reintroduce skipped offloads, but behavior is covered by a targeted regression test.

Overview
Fixes incorrect storage deduplication in OffloadActivations where a storage key could stay in storage_to_tensor_id after the original activation was offloaded and its GPU buffer freed. The allocator could then reuse the same address for an unrelated tensor, which was wrongly skipped for offload as if it were a live view.

The change adds _storage_key_by_tensor_id plus track_storage_key / release_storage_key, and clears dedupe entries when tensors are no longer live: when fwd_stash entries are reaped during forward, when tensors are unpacked in backward, and when remaining forward stash is drained in backward hooks. __enter__ also clears the reverse map. Dedupe keys are only tracked while the corresponding GPU activation is still held.

Adds test_reused_storage_key_after_stash_reap_is_not_deduplicated, which stubs _get_unique_tensor_key to simulate allocator reuse and asserts all three saved tensors are offloaded (none incorrectly deduplicated).

^{Reviewed by Cursor Bugbot for commit 3ebd792. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.

^{Reviewed by Cursor Bugbot for commit 3ebd792. Configure here.}

cursor · 2026-07-02T08:34:56Z

-            # Track this storage for deduplication
-            self.storage_to_tensor_id[storage_key] = tensor_id
+                # Track this storage for deduplication while the GPU tensor is live.
+                track_storage_key(storage_key, tensor_id)


Single-stream dedupe tracking omitted

Medium Severity

After offloading an activation, track_storage_key runs only when use_streams is true. With use_streams=False, storage_to_tensor_id is never updated for offloaded tensors, so shared-storage views are not deduplicated and each view may be copied to CPU separately.

^{Reviewed by Cursor Bugbot for commit 3ebd792. Configure here.}

I think this issue may need being considered. What do you think, @winglian and @kashif ?

I think the observation is factually true: two views sharing a storage key:

use_streams=True: 1 offloaded + 1 deduped;

use_streams=False: 2 offloaded + 0 deduped.

Both paths ran backward() without error.

However, I think it is not a correctness issue and also its implied fix would reintroduce the very bug this PR fixes.

Not a correctness issue: With use_streams=False, shared-storage views are each copied to CPU separately instead of one being kept on GPU. Both paths produce correct gradients: it's a CPU-memory/bandwidth regression only, and only for saved tensors that genuinely share a storage key (same data_ptr+offset), which is a subset of "views."

It's essentially necessary, not an oversight: The whole fix rests on "track the key only while the GPU tensor is live, release it when freed." In streams mode, fwd_stash keeps the offloaded GPU tensor alive and reaping gives a clean release point. Single-stream mode has no fwd_stash: the offloaded GPU activation is dropped right after pack_tensor returns, so there is no bounded live-window and no release hook during the forward pass.

If you simply moved track_storage_key back outside the if self.use_streams guard (what the comment implies), you'd get exactly the pre-fix behavior: a stale key sitting in the map for the rest of the forward, ready to false-dedup a reused address. Indeed, the pre-fix code tracked unconditionally and had this bug in single-stream too. So scoping tracking to streams-only is a deliberate, correct trade-off given the current design.

Recommendation:

Add a one-line comment at line 351/352 stating tracking is streams-only because single-stream lacks a live-window to bound the key (prevents a future contributor from "fixing" it back into the bug).

If single-stream dedup is actually wanted, it needs its own machinery (e.g. a minimal single-stream stash). However I think this implies more code than the repo's simplicity ethos warrants for a fallback path, and I'd skip it.

Copilot

Pull request overview

Fixes a correctness bug in activation offloading’s storage-based deduplication: when GPU storages are freed and their addresses later reused by the allocator, the previous dedupe entry could incorrectly cause unrelated tensors to be treated as already-offloaded “views” and skipped.

Changes:

Add reverse mapping (_storage_key_by_tensor_id) and helpers to reliably remove stale storage dedupe entries when tensors are no longer live.
Release dedupe entries when forward stash entries are reaped and when tensors are unpacked / forward stash is drained during backward.
Add a targeted regression test that simulates allocator address reuse by stubbing _get_unique_tensor_key and asserts no incorrect dedup occurs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`trl/models/activation_offloading.py`	Introduces storage-key tracking/release to prevent stale dedupe entries from causing skipped offloads after storage reuse.
`tests/test_activation_offloading.py`	Adds a regression test to ensure reused storage keys are not incorrectly deduplicated after stash reaping.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

bot-ci-comment · 2026-07-03T05:32:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

albertvillanova

Thanks. I think, the fix is correct and well-tested.

The Bugbot comment describes a real but minor, non-correctness perf difference in the single-stream fallback that is an intended consequence of the fix; its implied remedy would undo the fix. Worth a clarifying code comment, nothing more. See my comment below.

Do you agree?

albertvillanova · 2026-07-03T06:36:56Z

-            # Track this storage for deduplication
-            self.storage_to_tensor_id[storage_key] = tensor_id
+                # Track this storage for deduplication while the GPU tensor is live.
+                track_storage_key(storage_key, tensor_id)


I think this issue may need being considered. What do you think, @winglian and @kashif ?

I think the observation is factually true: two views sharing a storage key:

use_streams=True: 1 offloaded + 1 deduped;

use_streams=False: 2 offloaded + 0 deduped.

Both paths ran backward() without error.

However, I think it is not a correctness issue and also its implied fix would reintroduce the very bug this PR fixes.

Not a correctness issue: With use_streams=False, shared-storage views are each copied to CPU separately instead of one being kept on GPU. Both paths produce correct gradients: it's a CPU-memory/bandwidth regression only, and only for saved tensors that genuinely share a storage key (same data_ptr+offset), which is a subset of "views."

It's essentially necessary, not an oversight: The whole fix rests on "track the key only while the GPU tensor is live, release it when freed." In streams mode, fwd_stash keeps the offloaded GPU tensor alive and reaping gives a clean release point. Single-stream mode has no fwd_stash: the offloaded GPU activation is dropped right after pack_tensor returns, so there is no bounded live-window and no release hook during the forward pass.

If you simply moved track_storage_key back outside the if self.use_streams guard (what the comment implies), you'd get exactly the pre-fix behavior: a stale key sitting in the map for the rest of the forward, ready to false-dedup a reused address. Indeed, the pre-fix code tracked unconditionally and had this bug in single-stream too. So scoping tracking to streams-only is a deliberate, correct trade-off given the current design.

Recommendation:

Add a one-line comment at line 351/352 stating tracking is streams-only because single-stream lacks a live-window to bound the key (prevents a future contributor from "fixing" it back into the bug).

If single-stream dedup is actually wanted, it needs its own machinery (e.g. a minimal single-stream stash). However I think this implies more code than the repo's simplicity ethos warrants for a fallback path, and I'd skip it.

Fix activation offload storage dedupe reuse

3ebd792

cursor Bot reviewed Jul 2, 2026

View reviewed changes

kashif requested a review from Copilot July 2, 2026 14:15

kashif approved these changes Jul 2, 2026

View reviewed changes

Copilot started reviewing on behalf of kashif July 2, 2026 14:15 View session

Copilot AI reviewed Jul 2, 2026

View reviewed changes

qgallouedec assigned kashif Jul 2, 2026

albertvillanova reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix activation offload storage dedupe reuse#6241

Fix activation offload storage dedupe reuse#6241
winglian wants to merge 1 commit into
huggingface:mainfrom
winglian:fix/offload-storage-dedupe-reuse

winglian commented Jul 2, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jul 2, 2026

Uh oh!

albertvillanova Jul 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

bot-ci-comment Bot commented Jul 3, 2026

Uh oh!

albertvillanova left a comment

Uh oh!

albertvillanova Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

winglian commented Jul 2, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

AI writing disclosure

Who can review?

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jul 2, 2026

Choose a reason for hiding this comment

Single-stream dedupe tracking omitted

Uh oh!

albertvillanova Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

bot-ci-comment Bot commented Jul 3, 2026

Uh oh!

albertvillanova left a comment

Choose a reason for hiding this comment

Uh oh!

albertvillanova Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

winglian commented Jul 2, 2026 •

edited by cursor Bot

Loading