[TPU Offload] Manage staging buffer in HBM #1069

juncgu-google · 2025-11-11T07:06:13Z

Description

Add a staging buffer manager to track the number of staging blocks (pages / chunks) in use for swap-in / swap-out. This should help mitigate the OOM issue when KV cache offloading handles a very large number of save / load blocks.

The staging buffer size is controlled by an env var: TPU_OFFLOAD_STAGING_BUFFER_TOKENS=8192. We will find a way to specify the buffer size in GB instead of in tokens.
When there are not enough staging buffers for the incoming save / load operations, we drop the extra save / load blocks.
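
A minimal sketch of the bookkeeping described above, for illustration only; it is not the code in this PR. The class `StagingBufferManager`, the helpers `manager_from_env` and `admit_blocks`, the block size of 16 tokens, and treating 8192 as the default when the env var is unset are all assumptions made for this example.

```python
import os


class StagingBufferManager:
    """Hypothetical sketch: counts how many staging blocks are in use."""

    def __init__(self, num_blocks: int):
        if num_blocks <= 0:
            raise ValueError("num_blocks must be positive")
        self.num_blocks = num_blocks
        self.used_blocks = 0

    @property
    def free_blocks(self) -> int:
        return self.num_blocks - self.used_blocks

    def allocate(self, requested: int) -> int:
        """Reserve up to `requested` blocks and return how many were granted."""
        granted = max(0, min(requested, self.free_blocks))
        self.used_blocks += granted
        return granted

    def release(self, count: int) -> None:
        """Return `count` blocks to the pool once a save / load completes."""
        if count < 0 or count > self.used_blocks:
            raise ValueError("cannot release more blocks than are in use")
        self.used_blocks -= count


def manager_from_env(block_size_tokens: int = 16) -> StagingBufferManager:
    # Assumption: the token budget is converted to whole blocks, and 8192
    # is used when the env var is not set.
    tokens = int(os.environ.get("TPU_OFFLOAD_STAGING_BUFFER_TOKENS", "8192"))
    return StagingBufferManager(tokens // block_size_tokens)


def admit_blocks(manager: StagingBufferManager, block_ids: list):
    """Split a save / load request into blocks that fit and blocks to drop."""
    granted = manager.allocate(len(block_ids))
    return block_ids[:granted], block_ids[granted:]
```

In this sketch a caller would run `admit_blocks` for each incoming save / load operation, process only the first returned list, and call `release` with the same count once the transfer finishes.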

Tests

pytest -sv tests/distributed/cpu_offloading_scheduler_test.py

Checklist

Before submitting this PR, please make sure:

I have performed a self-review of my code.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have made or will make corresponding changes to any relevant documentation.

tpu_inference/worker/tpu_worker_jax.py

Signed-off-by: Juncheng Gu <[email protected]>

tpu_inference/distributed/tpu_connector_local.py

saikat-royc · 2025-11-13T02:04:37Z

Aligned with the high-level approach on block tracking, reporting, and staging buffer budgeting. Let's also add unit tests for added confidence in the PR.

tpu_inference/distributed/tpu_connector_local.py

tests/distributed/cpu_offloading_scheduler_test.py

saikat-royc

nice work on the unit tests!

juncgu-google · 2025-11-15T06:22:05Z

Thanks for the reviews!

saikat-royc reviewed Nov 12, 2025

tpu_inference/worker/tpu_worker_jax.py (outdated)

staging buffer manager v1

c25e534

Signed-off-by: Juncheng Gu <[email protected]>

juncgu-google force-pushed the cpu-offloading/dev-1110 branch from 6d987d7 to c25e534 · November 12, 2025 23:42