rocm: sync branch with main by 24601 · Pull Request #219 · antirez/ds4

24601 · 2026-05-21T18:38:48Z

Summary

This brings the rocm branch forward to current main (8d576642) while keeping the ROCm patch scoped to the existing community ROCm backend work.

Current state before this change:

rocm was based at 7a751eb
rocm was 85 commits behind main and 1 commit ahead

What changed:

merge current main into rocm with conflicts resolved
keep the ROCm backend support buildable against the current DS4 GPU code
add a make rocm target that builds the current main binaries through hipcc
keep HIP out of CUDA WMMA kernels and use the generic indexer-score fallback on ROCm
explicitly reject known wave64 GCN/CDNA ROCm targets for now, because DS4 GPU kernels still assume CUDA-style 32-lane warp math
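To illustrate why the 32-lane assumption matters, here is a minimal Python simulation of a shuffle-XOR butterfly reduction. It is not DS4 kernel code: `butterfly_sum` and the offset tuples are hypothetical names chosen for this sketch, and the only point is to show what a reduction written for 32 lanes computes when the hardware wavefront is 64 lanes wide.

```python
# Simulate a warp-level butterfly (shuffle-XOR) sum.
# Assumption: each lane holds one value and, at every step, lane i adds
# the value held by lane (i ^ offset), as a __shfl_xor-style exchange does.

WARP32_OFFSETS = (16, 8, 4, 2, 1)      # what a 32-lane kernel hardcodes
WAVE64_OFFSETS = (32, 16, 8, 4, 2, 1)  # what a 64-lane wavefront would need


def butterfly_sum(values, offsets):
    lanes = list(values)
    for off in offsets:
        # All lanes exchange simultaneously, so build the new list from the old one.
        lanes = [lanes[i] + lanes[i ^ off] for i in range(len(lanes))]
    return lanes


warp32 = butterfly_sum([1] * 32, WARP32_OFFSETS)
wave64_wrong = butterfly_sum([1] * 64, WARP32_OFFSETS)
wave64_right = butterfly_sum([1] * 64, WAVE64_OFFSETS)

print("32 lanes, 32-lane offsets:", warp32[0])
print("64 lanes, 32-lane offsets:", wave64_wrong[0])
print("64 lanes, 64-lane offsets:", wave64_right[0])
```

With the 32-lane offsets, a 64-lane wavefront only ever sums within each half, so every lane ends up with a partial result. Rejecting wave64 targets at compile time avoids producing that kind of silently wrong output.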

This intentionally does not include the user-facing --rocm naming/MTP mapping work from #156 or the ROCm WMMA/indexer optimization work from #180.

Relation to #16

This is related to #16, but it should not close #16.

The goal here is to make the existing upstream rocm branch current and testable again on current main. It is not the later ROCm optimization work discussed in #16, and the speed numbers below show why this should stay draft until the remaining correctness/performance gaps are understood.

Scope notes

I checked prior ROCm-related PRs before keeping this narrowly scoped:

#118 (Add AMD ROCm/HIP Support and Strix Halo Optimizations) was broader than a branch refresh and overlapped with the existing rocm branch
#133 (rocm: rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu) attempted a prior ROCm branch update and surfaced remaining long-context risk
#156 (Fix ROCm build/runtime naming and MTP model mapping) is a separate CLI/runtime naming and MTP mapping fix
#180 (rocm: add wmma indexer support) is a separate ROCm WMMA/indexer optimization effort

This PR is therefore only a branch sync plus the minimum build fixes needed for current main.

Validation environment

AMD Ryzen AI Max+ 395 / Radeon 8060S
ROCm target: gfx1151
HIP/ROCm toolchain: HIP 7.2.53211-364a905
Backend selected by DS4: cuda API path over ROCm/HIP
Model: official project q2-imatrix GGUF, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
Model size reported by DS4: 80.76 GiB, 284.33B logical parameters
PR head tested: e13de731787799310e53bffe98b1f077f7c96b50

Build and unit validation

Commands run:

make clean
make rocm ROCM_ARCH=gfx1151 -j$(nproc)
make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx1151 -j$(nproc)
./ds4_test --server
make clean
make cpu -j$(nproc)

Results:

ROCm build: PASS
ROCm-linked ds4_test --server: PASS (server: OK, ds4 tests: ok)
CPU build: PASS

Negative target guard check:

make clean
make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx90a -j$(nproc)

Result: expected compile failure with the new wave64 GCN/CDNA unsupported-target error.

Existing warnings observed during the builds:

ds4_server.c: const-discard warning in stop_list_find_from
ds4_agent.c: existing snprintf truncation warnings

Model-backed validation

Model load / inspect:

./ds4 --inspect --cuda -m ds4flash.gguf

Result: PASS. DS4 loaded the official q2-imatrix GGUF and initialized the ROCm/HIP backend.

Short deterministic generation:

./ds4 --cuda -m ds4flash.gguf --ctx 4096 --nothink --temp 0 -n 32 -p 'Reply exactly: ROCm OK'

Result: PASS. Output was ROCm OK.

Reported speed for this tiny prompt:

prefill: 0.46 t/s, generation: 8.96 t/s

Tool-call quality:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --tool-call-quality

Result: PASS in both fast and exact paths.

Long-context story recall:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --long-context

Result: PASS (long-context: OK, ds4 tests: ok).

Runtime note: this run took about one hour wall time on the validation machine. Progress markers reached 8192/30474, 16384/30474, 24576/30474, and 30474/30474; correctness passed, but throughput is not yet competitive with the best ROCm numbers discussed in #16.

Official logprob vectors:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --logprob-vectors

Result: FAIL with one mismatch:

ds4-test: vector short_code_completion step 1 selected token mismatch
logprob-vectors: ERR
ds4 tests: 1 failure(s)

Manual dump for the failing prompt showed ROCm selecting uppercase C after the opening code fence while the fixture expects lowercase c:

step 1 selected 'C'
top candidates: 'C' logprob -0.226568833, 'c' logprob -1.59836829

The exact/quality ROCm path also selected uppercase C for the same prompt:

step 1 selected 'C'
top candidates: 'C' logprob -0.495503277, 'c' logprob -0.941311657
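As a reading aid for the two dumps above, the following Python sketch converts the reported log-probabilities into probabilities and computes the gap between the two candidates. The numbers are copied from the dumps; the `DUMPS` table and `summarize` helper are hypothetical names for this sketch and are not part of ds4_test.

```python
import math

# (path label, logprob of 'C', logprob of 'c'), copied from the dumps above.
DUMPS = [
    ("fast", -0.226568833, -1.59836829),
    ("exact", -0.495503277, -0.941311657),
]


def summarize(lp_upper, lp_lower):
    """Return (p('C'), p('c'), logprob gap) for one dump."""
    p_upper = math.exp(lp_upper)
    p_lower = math.exp(lp_lower)
    return p_upper, p_lower, lp_upper - lp_lower


for label, lp_upper, lp_lower in DUMPS:
    p_upper, p_lower, gap = summarize(lp_upper, lp_lower)
    print(f"{label}: p('C')={p_upper:.3f} p('c')={p_lower:.3f} gap={gap:.3f}")
```

In both paths 'C' is ahead of 'c', and the two candidates together carry almost all of the probability mass. The gap is smaller in the exact path than in the fast path, but the ordering is the same.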

Speed smoke

Command:

./ds4-bench \
  -m ds4flash.gguf \
  --cuda \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 8192 \
  --step-incr 2048 \
  --gen-tokens 128

Result:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,57.76,128,7.63,52184460
4096,2048,46.47,128,7.59,80373132
6144,2048,39.49,128,7.37,108561804
8192,2048,34.26,128,7.41,136750476
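The CSV above can be post-processed in a few lines. This Python sketch embeds the reported rows verbatim and derives two figures from them: KV-cache growth per context token, and how much of the 2048-token prefill rate is retained at larger contexts. The helper names (`kv_bytes_per_token`, `prefill_retention`) are hypothetical and exist only in this sketch.

```python
import csv
import io

# Rows copied verbatim from the ds4-bench output above.
RESULT = """ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,57.76,128,7.63,52184460
4096,2048,46.47,128,7.59,80373132
6144,2048,39.49,128,7.37,108561804
8192,2048,34.26,128,7.41,136750476
"""

rows = list(csv.DictReader(io.StringIO(RESULT)))


def kv_bytes_per_token(rows):
    """KV-cache growth between consecutive rows, divided by the context growth."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        d_bytes = int(cur["kvcache_bytes"]) - int(prev["kvcache_bytes"])
        d_tokens = int(cur["ctx_tokens"]) - int(prev["ctx_tokens"])
        out.append(d_bytes / d_tokens)
    return out


def prefill_retention(rows):
    """Prefill throughput at each context size relative to the first row."""
    base = float(rows[0]["prefill_tps"])
    return [float(r["prefill_tps"]) / base for r in rows]


print("KV bytes per token:", kv_bytes_per_token(rows))
print("prefill retention:", [round(x, 3) for x in prefill_retention(rows)])
```

For this table the KV-cache growth is a constant 13764 bytes per context token across all three steps, and prefill at 8192 tokens retains roughly 59% of the 2048-token rate, while generation speed stays within a narrow band.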

These are usable but below the better Strix Halo ROCm numbers reported in #16, which is expected for this branch-sync PR because it deliberately does not include the later ROCm optimization work.

Additional review

Read-only adversarial reviews were run with DeepSeek V4, Cursor Composer 2.5, and Gemini 3.5 Flash. They did not identify a branch-sync merge blocker. Their actionable feedback is reflected here:

added the explicit wave64 GCN/CDNA compile guard
documented the HIP shared-memory opt-in semantic difference in the shim
made the model-dependent test evidence and remaining gaps explicit in this PR body

Draft blockers / remaining gaps

./ds4_test --logprob-vectors does not fully pass on ROCm because of the short_code_completion C vs c mismatch.
ROCm performance is below the stronger benchmark reports in issue #16 (Support for AMD GPU (ROCm/HIP) backend).
CUDA regression testing on NVIDIA hardware was not run.
MTP/speculative decoding was not validated here; that is intentionally separate from this branch sync and overlaps with #156 (Fix ROCm build/runtime naming and MTP model mapping).
Full 65K benchmark sweep was not run; the speed smoke covers 2K-8K plus the passing 30K-token long-context correctness test.

Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):

- Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/reasoning_text); rejects non-text modalities (input_image/file/audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):

- Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed.
- Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
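The terminal-event branching described in the Responses API commit message above can be sketched as a small Python function. This is an illustration only: the finish-reason strings ("error", "length", "stop") and the flat dictionary layout are assumptions made for the sketch, not the server's internal values or the actual wire shape.

```python
# Hypothetical mapping from an internal finish reason to the terminal
# Responses SSE event type named in the commit message.

def terminal_event(finish_reason):
    if finish_reason == "error":
        return {"type": "response.failed"}
    if finish_reason == "length":
        # The commit message reports reason "max_tokens" for length stops.
        return {"type": "response.incomplete", "reason": "max_tokens"}
    # Any other finish reason is treated as a normal completion here.
    return {"type": "response.completed"}


for reason in ("stop", "length", "error"):
    print(reason, "->", terminal_event(reason))
```

Keeping the branch in one place makes it easy to check that every stream ends with exactly one of the three terminal events.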

Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.

Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.

Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.

Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
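A check of the kind described above, validating a generated Name=number list against known assignments, could look like the following Python sketch. The real test lives in ds4_test and its fixture format is not shown here, so the regular expression, the helper names, and the example names are all assumptions.

```python
import re

# Assumption: the model answers with comma- or newline-separated pairs
# such as "Alice=12, Bruno=7". Names and numbers here are made up.
PAIR = re.compile(r"([A-Za-z]+)\s*=\s*(\d+)")


def parse_assignments(text):
    """Extract Name=number pairs from generated text into a dict."""
    return {name: int(num) for name, num in PAIR.findall(text)}


def recall_ok(generated, expected):
    """True only if every expected pair is present and nothing differs."""
    return parse_assignments(generated) == expected


expected = {"Alice": 12, "Bruno": 7}
print(recall_ok("Alice=12, Bruno=7", expected))
print(recall_ok("Alice=12, Bruno=70", expected))
```

Comparing parsed pairs instead of raw text is what makes the check independent of whitespace, ordering, and surrounding prose.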

Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation. The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:

- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)

Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first. Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations. Fixes antirez#127.

Return a 400 error with error type "context_exceeded" when prompt tokens exceed context size. The response includes both n_prompt_tokens and n_ctx fields so clients can determine exactly why the request failed and how far over the limit they went.

Error response format:

{
  "error": {
    "message": "Prompt tokens (N) exceeds context size (M)",
    "type": "context_exceeded",
    "n_prompt_tokens": N,
    "n_ctx": M
  }
}
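On the client side, the context_exceeded error described above could be handled along these lines. This is a minimal Python sketch that relies only on the fields named in the commit message; `context_overrun` is a hypothetical helper, and the status code and body are passed in directly so the sketch does not depend on any particular HTTP library.

```python
import json


def context_overrun(status, body):
    """Return how many tokens over the limit the prompt was, or None
    if the response is not a context_exceeded error."""
    if status != 400:
        return None
    err = json.loads(body).get("error", {})
    if err.get("type") != "context_exceeded":
        return None
    return err["n_prompt_tokens"] - err["n_ctx"]


# Example body following the documented format; the numbers are made up.
sample = json.dumps({
    "error": {
        "message": "Prompt tokens (5000) exceeds context size (4096)",
        "type": "context_exceeded",
        "n_prompt_tokens": 5000,
        "n_ctx": 4096,
    }
})

print(context_overrun(400, sample))
```

A caller can use the returned overrun to decide how much history to trim before retrying, instead of parsing the human-readable message.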

dwarfstar is typoed to drawfstar

fix typo in readme

Add ds4-agent as a native terminal coding agent with session KV storage, DSML tool execution, edit/read/search/write/list/bash tools, async bash jobs, compaction support, history replay, and a linenoise-based interactive UI. The agent defaults to CRC-guarded edits, streams tool visualization, ignores tool calls emitted inside thinking, improves bash output observations for long-running commands, and styles prompts, spacing, and prefill progress for interactive use.

Use an ANSI scroll region for streamed agent output, resize it as linenoise input wraps or shrinks, and anchor the prompt/status block at the terminal bottom. This also fixes cursor placement at exact column boundaries and removes the prompt/output overwrite cases seen during multiline editing.

Long prefills (large prompts, no cache hit) can take minutes on local hardware. ds4-server was silent on the socket the whole time, so HTTP and TCP idle timeouts on the client side would close the connection before the first response byte was written -- see the "sse headers failed" log line that appeared at the very end of a multi-minute prefill in real agent runs.

Stream the SSE response headers from the prefill_chunk progress callback, then emit a ":" comment line (ignored by SSE clients per the spec) at most every five seconds while prefill is still running. The keepalive is best-effort: a closed socket simply fails the writes and the outer code discovers the dead connection when it tries to stream a real event, matching the existing error path.

The tool-checkpoint rebuild path pre-arms headers_sent because it only runs after the response stream is already in flight, so it never tries to re-send the SSE header line.

Verified on macOS Metal, q2-imatrix GGUF, ctx=200000:

- ./ds4_test --server passes
- 35s fresh prefill: client receives ": prefill" lines at +6/+12/+18/+25/+31s, then SSE content events at +35s, no client disconnect
- 1s cached prompt: unchanged (sse_headers is still emitted from the request handler when prefill never fires)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
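The reason a ":" line is safe as a keepalive is that SSE clients skip comment lines while parsing. The Python sketch below is a deliberately small parser showing that behavior; it handles only `data:` fields and blank-line event boundaries, and `parse_sse` is a hypothetical helper, not code from ds4-server or any client.

```python
def parse_sse(lines):
    """Yield the data payload of each event from an iterable of SSE lines.

    Lines starting with ':' are comments and are ignored, which is what
    lets a server send them as keepalives during a long prefill.
    """
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield "\n".join(data)
                data = []
            continue
        if line.startswith(":"):
            continue  # comment / keepalive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "data":
            data.append(value)


stream = [
    ": prefill\n",
    ": prefill\n",
    "data: first event\n",
    "\n",
    ": prefill\n",
    "data: second event\n",
    "\n",
]
print(list(parse_sse(stream)))
```

The keepalive lines keep bytes flowing on the socket without ever surfacing as events to the application.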

WebReflection · 2026-05-22T13:51:44Z

On average I get ~9 t/s, with prefill up to double that. On DGX Spark I can see the whole RAM being used, so I wonder whether I need to force the UMA buffer up to 96GB or whether leaving it on auto is good enough, and whether this is simply the performance this 395+ Max can produce. Thanks in advance for any sort of clarification. (ArchLinux btw)

mitsuhiko and others added 30 commits May 11, 2026 12:30

feat(server): report KV cache usage

0ca2e28

feat(server): report Anthropic cache usage

38800bf

README: separate motivations.

c5ef7ac

Merge branch 'pr-91-responses' into responses-api

2174611

Tighten Responses tool_search replay

6396966

Fix Responses tool checkpoint cache reuse

a01bf1d

Fix Responses API live continuation

acb40bf

metal: cover q4 expert tensors in model views

2a7a5f3

Skip tool checkpoint canonicalization for exact DSML replay

b4c5f7c

Merge responses-api

e88a71e

Use visible live checkpoints for toolless thinking

5453ad0

Clarify server progress logs

646798f

Add Anthropic live tool continuation

43535e1

Revert "metal: cover q4 expert tensors in model views"

67e6146

Tag Responses API server logs

0083475

Recover Responses replays without hidden reasoning

0610591

Stream Anthropic tool calls live

94c1f38

fix typo in readme

741d0cc

Merge pull request antirez#155 from kernelzeroday/main

98593ec

Fix typos in README.md

f6fa52b

Merge branch 'pr-150-context-error' into merge-pr-150-standard-context

157873b

antirez and others added 28 commits May 18, 2026 15:23

Add CLI perplexity scoring

d630ca4

Skip mismatched long API vector

4efd501

Fix agent tool and Ctrl-C prompt rendering

b12e5f7

Refine agent CRC edit contract

37511f1

Refine agent tool result protocol

1607afd

Simplify agent edit and bash feedback

e81d70d

ds4-agent: likely fix a prompt rendering bug.

560662d

Simplify agent line edits

d991f87

Fix bash command tool rendering

e65bce0

Refine agent tool prompt and edit tracking

f89efb1

Improve ds4-agent TUI and history replay

23cf510

Show post-edit context to the agent

8daa088

Make linenoise history folding less aggressive

d3b69be

Keep queued prompt handling append-only

1e3c11f

Handle agent save and list commands while busy

1dc8bdb

Exit ds4-agent immediately when discarding session

f740b95

Keep ds4-agent prompt visible during streamed output

799dff4

Refresh ds4-agent status while preserving prompt

2606543

Use robust ds4-agent terminal colors

8ba0c45

Use glyphs in ds4-agent prefill progress

9ff77a1

Tune resumed prefill threshold

8d32a52

Fix agent status bar glyph width

a365e44

Add mixed GGUF splicing tool

93d9d96

Handle prefill errors after SSE keepalive

8d57664

Merge remote-tracking branch 'origin/main' into codex/rocm-sync-main-…

e13de73

…20260521-mergeable # Conflicts: # Makefile

24601 force-pushed the codex/rocm-sync-main-20260521 branch from e4617b5 to e13de73 on May 21, 2026 18:54

24601 wants to merge 86 commits into antirez:rocm from 24601:codex/rocm-sync-main-20260521

10 participants
