Skip to content

Fix activation offload storage dedupe reuse#6241

Open
winglian wants to merge 1 commit into
huggingface:mainfrom
winglian:fix/offload-storage-dedupe-reuse
Open

Fix activation offload storage dedupe reuse#6241
winglian wants to merge 1 commit into
huggingface:mainfrom
winglian:fix/offload-storage-dedupe-reuse

Conversation

@winglian

@winglian winglian commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

in #4247, the intent was to track duplicate storage when the tensors are views. however, tensors are offloaded and released, the allocator will reuse storage pointers for new tensors, which end up not getting offloaded.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Note

Medium Risk
Touches core pack/unpack and stream stash lifecycle for activation offloading; incorrect release timing could break view deduplication or reintroduce skipped offloads, but behavior is covered by a targeted regression test.

Overview
Fixes incorrect storage deduplication in OffloadActivations where a storage key could stay in storage_to_tensor_id after the original activation was offloaded and its GPU buffer freed. The allocator could then reuse the same address for an unrelated tensor, which was wrongly skipped for offload as if it were a live view.

The change adds _storage_key_by_tensor_id plus track_storage_key / release_storage_key, and clears dedupe entries when tensors are no longer live: when fwd_stash entries are reaped during forward, when tensors are unpacked in backward, and when remaining forward stash is drained in backward hooks. __enter__ also clears the reverse map. Dedupe keys are only tracked while the corresponding GPU activation is still held.

Adds test_reused_storage_key_after_stash_reap_is_not_deduplicated, which stubs _get_unique_tensor_key to simulate allocator reuse and asserts all three saved tensors are offloaded (none incorrectly deduplicated).

Reviewed by Cursor Bugbot for commit 3ebd792. Bugbot is set up for automated code reviews on this repo. Configure here.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 3ebd792. Configure here.

# Track this storage for deduplication
self.storage_to_tensor_id[storage_key] = tensor_id
# Track this storage for deduplication while the GPU tensor is live.
track_storage_key(storage_key, tensor_id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single-stream dedupe tracking omitted

Medium Severity

After offloading an activation, track_storage_key runs only when use_streams is true. With use_streams=False, storage_to_tensor_id is never updated for offloaded tensors, so shared-storage views are not deduplicated and each view may be copied to CPU separately.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3ebd792. Configure here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this issue may need being considered. What do you think, @winglian and @kashif ?

I think the observation is factually true: two views sharing a storage key:

  • use_streams=True: 1 offloaded + 1 deduped;
  • use_streams=False: 2 offloaded + 0 deduped.

Both paths ran backward() without error.

However, I think it is not a correctness issue and also its implied fix would reintroduce the very bug this PR fixes.

  • Not a correctness issue: With use_streams=False, shared-storage views are each copied to CPU separately instead of one being kept on GPU. Both paths produce correct gradients: it's a CPU-memory/bandwidth regression only, and only for saved tensors that genuinely share a storage key (same data_ptr+offset), which is a subset of "views."
  • It's essentially necessary, not an oversight: The whole fix rests on "track the key only while the GPU tensor is live, release it when freed." In streams mode, fwd_stash keeps the offloaded GPU tensor alive and reaping gives a clean release point. Single-stream mode has no fwd_stash: the offloaded GPU activation is dropped right after pack_tensor returns, so there is no bounded live-window and no release hook during the forward pass.
    • If you simply moved track_storage_key back outside the if self.use_streams guard (what the comment implies), you'd get exactly the pre-fix behavior: a stale key sitting in the map for the rest of the forward, ready to false-dedup a reused address. Indeed, the pre-fix code tracked unconditionally and had this bug in single-stream too. So scoping tracking to streams-only is a deliberate, correct trade-off given the current design.

Recommendation:

  • Add a one-line comment at line 351/352 stating tracking is streams-only because single-stream lacks a live-window to bound the key (prevents a future contributor from "fixing" it back into the bug).
  • If single-stream dedup is actually wanted, it needs its own machinery (e.g. a minimal single-stream stash). However I think this implies more code than the repo's simplicity ethos warrants for a fallback path, and I'd skip it.

@kashif kashif requested a review from Copilot July 2, 2026 14:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a correctness bug in activation offloading’s storage-based deduplication: when GPU storages are freed and their addresses later reused by the allocator, the previous dedupe entry could incorrectly cause unrelated tensors to be treated as already-offloaded “views” and skipped.

Changes:

  • Add reverse mapping (_storage_key_by_tensor_id) and helpers to reliably remove stale storage dedupe entries when tensors are no longer live.
  • Release dedupe entries when forward stash entries are reaped and when tensors are unpacked / forward stash is drained during backward.
  • Add a targeted regression test that simulates allocator address reuse by stubbing _get_unique_tensor_key and asserts no incorrect dedup occurs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
trl/models/activation_offloading.py Introduces storage-key tracking/release to prevent stale dedupe entries from causing skipped offloads after storage reuse.
tests/test_activation_offloading.py Adds a regression test to ensure reused storage keys are not incorrectly deduplicated after stash reaping.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@bot-ci-comment

bot-ci-comment Bot commented Jul 3, 2026

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova albertvillanova left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think, the fix is correct and well-tested.

The Bugbot comment describes a real but minor, non-correctness perf difference in the single-stream fallback that is an intended consequence of the fix; its implied remedy would undo the fix. Worth a clarifying code comment, nothing more. See my comment below.

Do you agree?

# Track this storage for deduplication
self.storage_to_tensor_id[storage_key] = tensor_id
# Track this storage for deduplication while the GPU tensor is live.
track_storage_key(storage_key, tensor_id)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this issue may need being considered. What do you think, @winglian and @kashif ?

I think the observation is factually true: two views sharing a storage key:

  • use_streams=True: 1 offloaded + 1 deduped;
  • use_streams=False: 2 offloaded + 0 deduped.

Both paths ran backward() without error.

However, I think it is not a correctness issue and also its implied fix would reintroduce the very bug this PR fixes.

  • Not a correctness issue: With use_streams=False, shared-storage views are each copied to CPU separately instead of one being kept on GPU. Both paths produce correct gradients: it's a CPU-memory/bandwidth regression only, and only for saved tensors that genuinely share a storage key (same data_ptr+offset), which is a subset of "views."
  • It's essentially necessary, not an oversight: The whole fix rests on "track the key only while the GPU tensor is live, release it when freed." In streams mode, fwd_stash keeps the offloaded GPU tensor alive and reaping gives a clean release point. Single-stream mode has no fwd_stash: the offloaded GPU activation is dropped right after pack_tensor returns, so there is no bounded live-window and no release hook during the forward pass.
    • If you simply moved track_storage_key back outside the if self.use_streams guard (what the comment implies), you'd get exactly the pre-fix behavior: a stale key sitting in the map for the rest of the forward, ready to false-dedup a reused address. Indeed, the pre-fix code tracked unconditionally and had this bug in single-stream too. So scoping tracking to streams-only is a deliberate, correct trade-off given the current design.

Recommendation:

  • Add a one-line comment at line 351/352 stating tracking is streams-only because single-stream lacks a live-window to bound the key (prevents a future contributor from "fixing" it back into the bug).
  • If single-stream dedup is actually wanted, it needs its own machinery (e.g. a minimal single-stream stash). However I think this implies more code than the repo's simplicity ethos warrants for a fallback path, and I'd skip it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants