Skip to content

Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance#367

Open
ShaobinChen-AH wants to merge 4 commits intoNVIDIA:mainfrom
ShaobinChen-AH:fix-issue-350
Open

Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance#367
ShaobinChen-AH wants to merge 4 commits intoNVIDIA:mainfrom
ShaobinChen-AH:fix-issue-350

Conversation

@ShaobinChen-AH
Copy link
Copy Markdown
Contributor

@ShaobinChen-AH ShaobinChen-AH commented Apr 17, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary

This PR adds hash_roundrobin as a new DynamicEmb RW routing mode and makes it the default for DynamicEmb row-wise planning.

The goal is narrow: fix load imbalance caused by pathological raw-key patterns that can break plain modulo-based roundrobin. The new mode hashes the raw key first, then assigns the owner rank from the hashed key.

This PR does not claim to solve general hot-key or Zipf-skew load balancing.

Changes

  • add hash_roundrobin support in the DynamicEmb input-distribution path
  • map hash_roundrobin to dist_type = 2 for the CUDA extension path
  • “adds hash_roundrobin as an opt-in RW routing mode”, default remains roundrobin
  • add a regression test file covering:
    • CPU reference owner mapping
    • adversarial modulo-aliasing patterns
    • same logical workload with different raw-key encodings
    • CUDA kernel parity against CPU reference
    • comparative real/Zipf-style robustness checks
  • add the new regression test to the regular test/unit_test.sh flow
  • document supported dist_type values and clarify the intended scope of hash_roundrobin

Why

Issue #350 points out that plain roundrobin can become imbalanced when raw keys follow special patterns. Hashing the raw key before RW rank assignment makes the routing much less sensitive to those patterns while preserving the existing overall bucketization flow.

Validation

Validated on a clean rebuild in the target Ubuntu Docker environment.

  • python3 -m pytest -svv test/unit_tests/test_hash_roundrobin_kuairand.py
    • 16 passed
  • CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 ./test/unit_tests/test_sequence_embedding_fw.py --print_sharding_plan --optimizer_type "adam" --use_index_dedup True
    • passed on NVIDIA RTX A6000

Notes

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR adds hash_roundrobin as a new DynamicEmb row-wise routing mode (mapped to dist_type=2 in the CUDA kernel) using a MurmurHash3 finalizer to break pathological modulo-aliasing patterns, while keeping roundrobin as the default. It also introduces checkpoint-level dist_type persistence and mismatch detection, along with a comprehensive regression suite.

The core implementation is sound: the CPU and CUDA hash constants match, the new_idx = idx choice for hash_roundrobin (routing via hash, lookup via original key) is correct, and the torch.int32 tensor dtype aligns with the kernel's int32_t pointer. The two P2 items worth attention are (1) _validate_load_meta assumes all legacy checkpoints (no dist_type key) used \"roundrobin\", which could produce a false-positive ValueError for the small set of users who drove DynamicEmbParameterSharding directly with the old \"continuous\" default, and (2) DynamicEmbParameterSharding.dist_type's class-field default changed from \"continuous\" to \"roundrobin\" without a deprecation note.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/compatibility notes that do not affect the primary use path through the planner.

The hash function, kernel routing logic, checkpoint validation, and test coverage are all correct. The two flagged items are narrow edge cases (direct API users with pre-PR 'continuous' checkpoints) unlikely to affect production planner-driven workflows. No P0/P1 issues remain.

key_value_table.py (legacy checkpoint dist_type fallback assumption) and planner/planner.py (silent class-field default change).

Important Files Changed

Filename Overview
corelib/dynamicemb/src/sparse_block_bucketize_features.cu Adds MurmurHash3 finalizer device function and dist_type==2 branches in both CUDA kernels; logic is correct but mixed tabs/spaces (flagged in prior review) and one long reformatted line reduce readability.
corelib/dynamicemb/dynamicemb/key_value_table.py Persists dist_type in checkpoint metadata and enforces equality on load; the legacy-checkpoint fallback of 'roundrobin' could silently reject valid 'continuous' checkpoints from direct API users.
corelib/dynamicemb/dynamicemb/planner/planner.py Propagates opts.dist_type through the planner instead of hardcoding 'roundrobin'; also silently changes DynamicEmbParameterSharding class-field default from 'continuous' to 'roundrobin' without deprecation notice.
corelib/dynamicemb/dynamicemb/dynamicemb_config.py Adds dist_type field to DynamicEmbTableOptions with validation against SUPPORTED_DIST_TYPES and includes it in the grouping key; clean and correct.
corelib/dynamicemb/dynamicemb/input_dist.py Maps 'hash_roundrobin' to integer 2 and creates the dist_type_per_feature tensor with dtype=torch.int32, correctly matching the CUDA kernel's int32_t pointer.
corelib/dynamicemb/test/unit_tests/test_hash_roundrobin_kuairand.py New regression suite covering CPU reference parity, adversarial modulo-aliasing, CUDA kernel agreement, and Zipf-style robustness; CPU hash constants match the CUDA MurmurHash3 finalizer exactly.
corelib/dynamicemb/test/test_batched_dynamic_embedding_tables_v2.py Adds three dump/load tests (round-trip, mismatch rejection, legacy-metadata fallback) and propagates dist_type into PyDictStorage metadata; well-structured and thorough.
corelib/dynamicemb/test/unit_tests/test_embedding_dump_load.py Threads dist_type through apply_dmp/create_model and adds assert_dist_type_path to verify the dist_type is wired end-to-end through sharding and input_dist objects.
corelib/dynamicemb/test/unit_test.sh Adds test_hash_roundrobin_kuairand.py to FWD_BWD_TEST_FILES; the script already handles .py files via pytest so integration is correct.
corelib/dynamicemb/test/unit_tests/test_embedding_dump_load.sh Adds a two-pass (dump then load) smoke test for hash_roundrobin using torchrun; straightforward and consistent with the existing test pattern.
corelib/dynamicemb/DynamicEmb_APIs.md Documents all three supported dist_type values and clearly scopes hash_roundrobin as a routing robustness improvement rather than a general hot-key solution.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DynamicEmbTableOptions\ndist_type: str] -->|validated in __post_init__| B{dist_type?}
    B -->|continuous| C[dist_type_int = 0]
    B -->|roundrobin| D[dist_type_int = 1]
    B -->|hash_roundrobin| E[dist_type_int = 2]

    C & D & E --> F[dist_type_per_feature tensor\ndtype=torch.int32]
    F --> G[block_bucketize_sparse_features CUDA kernel]

    G --> H{dist_type value}
    H -->|0| I[p = idx / blk_size\nnew_idx = idx % blk_size]
    H -->|1| J[p = idx % my_size\nnew_idx = idx]
    H -->|2| K[p = hash_key_idx % my_size\nnew_idx = idx]

    I & J & K --> L[Bucketized KJT\nrouted to correct rank]

    M[DynamicEmbeddingShardingPlanner] -->|opts.dist_type| N[DynamicEmbParameterSharding\ndist_type field]
    N --> F

    O[Checkpoint dump] -->|stores dist_type in metadata| P[meta.json]
    P -->|_validate_load_meta| Q{ckpt == runtime?}
    Q -->|yes| R[Load succeeds]
    Q -->|no| S[ValueError raised]
    Q -->|key absent defaults to roundrobin| T[compared against runtime]
Loading

Reviews (5): Last reviewed commit: "fix" | Re-trigger Greptile

Comment thread corelib/dynamicemb/dynamicemb/planner/planner.py Outdated
Comment thread corelib/dynamicemb/src/sparse_block_bucketize_features.cu
@ShaobinChen-AH ShaobinChen-AH changed the title fix-issue-350 Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance Apr 17, 2026
@shijieliu
Copy link
Copy Markdown
Collaborator

shijieliu commented Apr 20, 2026

hi @ShaobinChen-AH thanks for your contribution!

  1. I think you also need to modify load to accommodate the changes in input dist
  2. how did you trigger the code path for hash_roundrobin?

@shijieliu shijieliu requested a review from jiashuy April 21, 2026 01:20
Copy link
Copy Markdown
Collaborator

@jiashuy jiashuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choice:

  1. expose dist_type in DynamicEmbTableOptions, and make roundrobin as default value
  2. use hash_roundrobin in defalut here, and adjust our tests who view dist_type as roundrobin

@ShaobinChen-AH
Copy link
Copy Markdown
Contributor Author

hi @ShaobinChen-AH thanks for your contribution!

  1. I think you also need to modify load to accommodate the changes in input dist
  2. how did you trigger the code path for hash_roundrobin?

dist_type is now exposed via DynamicEmbTableOptions, with roundrobin kept as the default for compatibility. The planner now reads opts.dist_type instead of hardcoding the routing mode, so hash_roundrobin is an explicit opt-in path.

@ShaobinChen-AH
Copy link
Copy Markdown
Contributor Author

We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choice:

  1. expose dist_type in DynamicEmbTableOptions, and make roundrobin as default value
  2. use hash_roundrobin in defalut here, and adjust our tests who view dist_type as roundrobin

We have dist_type in DynamicEmbParameterSharding, which is not exposed to users.

So if you want to use hash_roundrobin, we have two choice:

  1. expose dist_type in DynamicEmbTableOptions, and make roundrobin as default value
  2. use hash_roundrobin in defalut here, and adjust our tests who view dist_type as roundrobin

I updated dump/load to persist dist_type in checkpoint metadata and validate it at load time, so mismatched input-distribution settings now fail loudly instead of silently loading. I also added end-to-end validation through the user-facing path (DynamicEmbTableOptions(dist_type="hash_roundrobin") -> planner/sharding/input-dist -> dump/load smoke), in addition to the existing kernel/parity benchmark.

@shijieliu
Copy link
Copy Markdown
Collaborator

thanks! @ShaobinChen-AH could you also help update the example to demonstrate how to use different input_dist type?

@shijieliu
Copy link
Copy Markdown
Collaborator

/build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants