Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance by ShaobinChen-AH · Pull Request #367 · NVIDIA/recsys-examples

ShaobinChen-AH · 2026-04-17T13:06:40Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Summary

This PR adds hash_roundrobin as a new DynamicEmb RW routing mode and makes it the default for DynamicEmb row-wise planning.

The goal is narrow: fix load imbalance caused by pathological raw-key patterns that can break plain modulo-based roundrobin. The new mode hashes the raw key first, then assigns the owner rank from the hashed key.

This PR does not claim to solve general hot-key or Zipf-skew load balancing.

Changes

add hash_roundrobin support in the DynamicEmb input-distribution path
map hash_roundrobin to dist_type = 2 for the CUDA extension path
“adds hash_roundrobin as an opt-in RW routing mode”, default remains roundrobin
add a regression test file covering:
- CPU reference owner mapping
- adversarial modulo-aliasing patterns
- same logical workload with different raw-key encodings
- CUDA kernel parity against CPU reference
- comparative real/Zipf-style robustness checks
add the new regression test to the regular test/unit_test.sh flow
document supported dist_type values and clarify the intended scope of hash_roundrobin

Why

Issue #350 points out that plain roundrobin can become imbalanced when raw keys follow special patterns. Hashing the raw key before RW rank assignment makes the routing much less sensitive to those patterns while preserving the existing overall bucketization flow.

Validation

Validated on a clean rebuild in the target Ubuntu Docker environment.

python3 -m pytest -svv test/unit_tests/test_hash_roundrobin_kuairand.py
- 16 passed
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 ./test/unit_tests/test_sequence_embedding_fw.py --print_sharding_plan --optimizer_type "adam" --use_index_dedup True
- passed on NVIDIA RTX A6000

Notes

The new benchmark is intentionally framed around pattern robustness, which is what issue [FEA] Shuffle keys using hash instead of roundrobin #350 asks for.
Real/Zipf-style checks are kept as comparative robustness validation, not as a claim of universal fairness.

greptile-apps · 2026-04-17T13:10:47Z

Greptile Summary

This PR adds hash_roundrobin as a new DynamicEmb row-wise routing mode (mapped to dist_type=2 in the CUDA kernel) using a MurmurHash3 finalizer to break pathological modulo-aliasing patterns, while keeping roundrobin as the default. It also introduces checkpoint-level dist_type persistence and mismatch detection, along with a comprehensive regression suite.

The core implementation is sound: the CPU and CUDA hash constants match, the new_idx = idx choice for hash_roundrobin (routing via hash, lookup via original key) is correct, and the torch.int32 tensor dtype aligns with the kernel's int32_t pointer. The two P2 items worth attention are (1) _validate_load_meta assumes all legacy checkpoints (no dist_type key) used \"roundrobin\", which could produce a false-positive ValueError for the small set of users who drove DynamicEmbParameterSharding directly with the old \"continuous\" default, and (2) DynamicEmbParameterSharding.dist_type's class-field default changed from \"continuous\" to \"roundrobin\" without a deprecation note.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/compatibility notes that do not affect the primary use path through the planner.

The hash function, kernel routing logic, checkpoint validation, and test coverage are all correct. The two flagged items are narrow edge cases (direct API users with pre-PR 'continuous' checkpoints) unlikely to affect production planner-driven workflows. No P0/P1 issues remain.

key_value_table.py (legacy checkpoint dist_type fallback assumption) and planner/planner.py (silent class-field default change).

Important Files Changed

Filename	Overview
corelib/dynamicemb/src/sparse_block_bucketize_features.cu	Adds MurmurHash3 finalizer device function and dist_type==2 branches in both CUDA kernels; logic is correct but mixed tabs/spaces (flagged in prior review) and one long reformatted line reduce readability.
corelib/dynamicemb/dynamicemb/key_value_table.py	Persists dist_type in checkpoint metadata and enforces equality on load; the legacy-checkpoint fallback of 'roundrobin' could silently reject valid 'continuous' checkpoints from direct API users.
corelib/dynamicemb/dynamicemb/planner/planner.py	Propagates opts.dist_type through the planner instead of hardcoding 'roundrobin'; also silently changes DynamicEmbParameterSharding class-field default from 'continuous' to 'roundrobin' without deprecation notice.
corelib/dynamicemb/dynamicemb/dynamicemb_config.py	Adds dist_type field to DynamicEmbTableOptions with validation against SUPPORTED_DIST_TYPES and includes it in the grouping key; clean and correct.
corelib/dynamicemb/dynamicemb/input_dist.py	Maps 'hash_roundrobin' to integer 2 and creates the dist_type_per_feature tensor with dtype=torch.int32, correctly matching the CUDA kernel's int32_t pointer.
corelib/dynamicemb/test/unit_tests/test_hash_roundrobin_kuairand.py	New regression suite covering CPU reference parity, adversarial modulo-aliasing, CUDA kernel agreement, and Zipf-style robustness; CPU hash constants match the CUDA MurmurHash3 finalizer exactly.
corelib/dynamicemb/test/test_batched_dynamic_embedding_tables_v2.py	Adds three dump/load tests (round-trip, mismatch rejection, legacy-metadata fallback) and propagates dist_type into PyDictStorage metadata; well-structured and thorough.
corelib/dynamicemb/test/unit_tests/test_embedding_dump_load.py	Threads dist_type through apply_dmp/create_model and adds assert_dist_type_path to verify the dist_type is wired end-to-end through sharding and input_dist objects.
corelib/dynamicemb/test/unit_test.sh	Adds test_hash_roundrobin_kuairand.py to FWD_BWD_TEST_FILES; the script already handles .py files via pytest so integration is correct.
corelib/dynamicemb/test/unit_tests/test_embedding_dump_load.sh	Adds a two-pass (dump then load) smoke test for hash_roundrobin using torchrun; straightforward and consistent with the existing test pattern.
corelib/dynamicemb/DynamicEmb_APIs.md	Documents all three supported dist_type values and clearly scopes hash_roundrobin as a routing robustness improvement rather than a general hot-key solution.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DynamicEmbTableOptions\ndist_type: str] -->|validated in __post_init__| B{dist_type?}
    B -->|continuous| C[dist_type_int = 0]
    B -->|roundrobin| D[dist_type_int = 1]
    B -->|hash_roundrobin| E[dist_type_int = 2]

    C & D & E --> F[dist_type_per_feature tensor\ndtype=torch.int32]
    F --> G[block_bucketize_sparse_features CUDA kernel]

    G --> H{dist_type value}
    H -->|0| I[p = idx / blk_size\nnew_idx = idx % blk_size]
    H -->|1| J[p = idx % my_size\nnew_idx = idx]
    H -->|2| K[p = hash_key_idx % my_size\nnew_idx = idx]

    I & J & K --> L[Bucketized KJT\nrouted to correct rank]

    M[DynamicEmbeddingShardingPlanner] -->|opts.dist_type| N[DynamicEmbParameterSharding\ndist_type field]
    N --> F

    O[Checkpoint dump] -->|stores dist_type in metadata| P[meta.json]
    P -->|_validate_load_meta| Q{ckpt == runtime?}
    Q -->|yes| R[Load succeeds]
    Q -->|no| S[ValueError raised]
    Q -->|key absent defaults to roundrobin| T[compared against runtime]

_{Reviews (5): Last reviewed commit: "fix" | Re-trigger Greptile}

shijieliu · 2026-04-20T02:17:17Z

hi @ShaobinChen-AH thanks for your contribution!

I think you also need to modify load to accommodate the changes in input dist
how did you trigger the code path for hash_roundrobin?

jiashuy

We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choice:

expose dist_type in DynamicEmbTableOptions, and make roundrobin as default value
use hash_roundrobin in defalut here, and adjust our tests who view dist_type as roundrobin

ShaobinChen-AH · 2026-04-21T13:02:30Z

hi @ShaobinChen-AH thanks for your contribution!

I think you also need to modify load to accommodate the changes in input dist

how did you trigger the code path for hash_roundrobin?

dist_type is now exposed via DynamicEmbTableOptions, with roundrobin kept as the default for compatibility. The planner now reads opts.dist_type instead of hardcoding the routing mode, so hash_roundrobin is an explicit opt-in path.

ShaobinChen-AH · 2026-04-21T13:05:08Z

We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choice:

expose dist_type in DynamicEmbTableOptions, and make roundrobin as default value

use hash_roundrobin in defalut here, and adjust our tests who view dist_type as roundrobin

I updated dump/load to persist dist_type in checkpoint metadata and validate it at load time, so mismatched input-distribution settings now fail loudly instead of silently loading. I also added end-to-end validation through the user-facing path (DynamicEmbTableOptions(dist_type="hash_roundrobin") -> planner/sharding/input-dist -> dump/load smoke), in addition to the existing kernel/parity benchmark.

shijieliu · 2026-04-23T00:49:48Z

thanks! @ShaobinChen-AH could you also help update the example to demonstrate how to use different input_dist type?

shijieliu · 2026-04-23T00:49:57Z

/build

greptile-apps Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread corelib/dynamicemb/dynamicemb/planner/planner.py Outdated

Comment thread corelib/dynamicemb/src/sparse_block_bucketize_features.cu

ShaobinChen-AH changed the title ~~fix-issue-350~~ Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance Apr 17, 2026

shijieliu requested a review from jiashuy April 21, 2026 01:20

jiashuy reviewed Apr 21, 2026

View reviewed changes

ShaobinChen-AH added 4 commits April 23, 2026 08:48

fix-issue-350

b35dc09

fix

dead94c

fix

e2fee25

fix

009c5e3

shijieliu force-pushed the fix-issue-350 branch from 5e9efdc to 009c5e3 Compare April 23, 2026 00:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance#367

Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance#367
ShaobinChen-AH wants to merge 4 commits intoNVIDIA:mainfrom
ShaobinChen-AH:fix-issue-350

ShaobinChen-AH commented Apr 17, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

shijieliu commented Apr 20, 2026 •

edited

Loading

Uh oh!

jiashuy left a comment

Uh oh!

ShaobinChen-AH commented Apr 21, 2026

Uh oh!

ShaobinChen-AH commented Apr 21, 2026

Uh oh!

shijieliu commented Apr 23, 2026

Uh oh!

shijieliu commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ShaobinChen-AH commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Summary

Changes

Why

Validation

Notes

Uh oh!

greptile-apps Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

shijieliu commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jiashuy left a comment

Choose a reason for hiding this comment

Uh oh!

ShaobinChen-AH commented Apr 21, 2026

Uh oh!

ShaobinChen-AH commented Apr 21, 2026

Uh oh!

shijieliu commented Apr 23, 2026

Uh oh!

shijieliu commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ShaobinChen-AH commented Apr 17, 2026 •

edited

Loading

greptile-apps Bot commented Apr 17, 2026 •

edited

Loading

shijieliu commented Apr 20, 2026 •

edited

Loading