Skip to content

rocm: sync branch with main#219

Draft
24601 wants to merge 86 commits into
antirez:rocmfrom
24601:codex/rocm-sync-main-20260521
Draft

rocm: sync branch with main#219
24601 wants to merge 86 commits into
antirez:rocmfrom
24601:codex/rocm-sync-main-20260521

Conversation

@24601
Copy link
Copy Markdown

@24601 24601 commented May 21, 2026

Summary

This brings the rocm branch forward to current main (8d576642) while keeping the ROCm patch scoped to the existing community ROCm backend work.

Current state before this change:

  • rocm was based at 7a751eb
  • rocm was 85 commits behind main and 1 commit ahead

What changed:

  • merge current main into rocm with conflicts resolved
  • keep the ROCm backend support buildable against the current DS4 GPU code
  • add a make rocm target that builds the current main binaries through hipcc
  • keep HIP out of CUDA WMMA kernels and use the generic indexer-score fallback on ROCm
  • explicitly reject known wave64 GCN/CDNA ROCm targets for now, because DS4 GPU kernels still assume CUDA-style 32-lane warp math

This intentionally does not include the user-facing --rocm naming/MTP mapping work from #156 or the ROCm WMMA/indexer optimization work from #180.

Relation to #16

This is related to #16, but it should not close #16.

The goal here is to make the existing upstream rocm branch current and testable again on current main. It is not the later ROCm optimization work discussed in #16, and the speed numbers below show why this should stay draft until the remaining correctness/performance gaps are understood.

Scope notes

I checked prior ROCm-related PRs before keeping this narrowly scoped:

This PR is therefore only a branch sync plus the minimum build fixes needed for current main.

Validation environment

  • AMD Ryzen AI Max+ 395 / Radeon 8060S
  • ROCm target: gfx1151
  • HIP/ROCm toolchain: HIP 7.2.53211-364a905
  • Backend selected by DS4: cuda API path over ROCm/HIP
  • Model: official project q2-imatrix GGUF, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
  • Model size reported by DS4: 80.76 GiB, 284.33B logical parameters
  • PR head tested: e13de731787799310e53bffe98b1f077f7c96b50

Build and unit validation

Commands run:

make clean
make rocm ROCM_ARCH=gfx1151 -j$(nproc)
make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx1151 -j$(nproc)
./ds4_test --server
make clean
make cpu -j$(nproc)

Results:

  • ROCm build: PASS
  • ROCm-linked ds4_test --server: PASS (server: OK, ds4 tests: ok)
  • CPU build: PASS

Negative target guard check:

make clean
make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx90a -j$(nproc)

Result: expected compile failure with the new wave64 GCN/CDNA unsupported-target error.

Existing warnings observed during the builds:

  • ds4_server.c: const-discard warning in stop_list_find_from
  • ds4_agent.c: existing snprintf truncation warnings

Model-backed validation

Model load / inspect:

./ds4 --inspect --cuda -m ds4flash.gguf

Result: PASS. DS4 loaded the official q2-imatrix GGUF and initialized the ROCm/HIP backend.

Short deterministic generation:

./ds4 --cuda -m ds4flash.gguf --ctx 4096 --nothink --temp 0 -n 32 -p 'Reply exactly: ROCm OK'

Result: PASS. Output was ROCm OK.

Reported speed for this tiny prompt:

prefill: 0.46 t/s, generation: 8.96 t/s

Tool-call quality:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --tool-call-quality

Result: PASS in both fast and exact paths.

Long-context story recall:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --long-context

Result: PASS (long-context: OK, ds4 tests: ok).

Runtime note: this run took about one hour wall time on the validation machine. Progress markers reached 8192/30474, 16384/30474, 24576/30474, and 30474/30474; correctness passed, but throughput is not yet competitive with the best ROCm numbers discussed in #16.

Official logprob vectors:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --logprob-vectors

Result: FAIL with one mismatch:

ds4-test: vector short_code_completion step 1 selected token mismatch
logprob-vectors: ERR
ds4 tests: 1 failure(s)

Manual dump for the failing prompt showed ROCm selecting uppercase C after the opening code fence while the fixture expects lowercase c:

step 1 selected 'C'
top candidates: 'C' logprob -0.226568833, 'c' logprob -1.59836829

The exact/quality ROCm path also selected uppercase C for the same prompt:

step 1 selected 'C'
top candidates: 'C' logprob -0.495503277, 'c' logprob -0.941311657

Speed smoke

Command:

./ds4-bench \
  -m ds4flash.gguf \
  --cuda \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 8192 \
  --step-incr 2048 \
  --gen-tokens 128

Result:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,57.76,128,7.63,52184460
4096,2048,46.47,128,7.59,80373132
6144,2048,39.49,128,7.37,108561804
8192,2048,34.26,128,7.41,136750476

These are usable but below the better Strix Halo ROCm numbers reported in #16, which is expected for this branch-sync PR because it deliberately does not include the later ROCm optimization work.

Additional review

Read-only adversarial reviews were run with DeepSeek V4, Cursor Composer 2.5, and Gemini 3.5 Flash. They did not identify a branch-sync merge blocker. Their actionable feedback is reflected here:

  • added the explicit wave64 GCN/CDNA compile guard
  • documented the HIP shared-memory opt-in semantic difference in the shim
  • made the model-dependent test evidence and remaining gaps explicit in this PR body

Draft blockers / remaining gaps

  • ./ds4_test --logprob-vectors does not fully pass on ROCm because of the short_code_completion C vs c mismatch.
  • ROCm performance is below the stronger issue Support for AMD GPU (ROCm/HIP) backend #16 benchmark reports.
  • CUDA regression testing on NVIDIA hardware was not run.
  • MTP/speculative decoding was not validated here; that is intentionally separate from this branch sync and overlaps with Fix ROCm build/runtime naming and MTP model mapping #156.
  • Full 65K benchmark sweep was not run; the speed smoke covers 2K-8K plus the passing 30K-token long-context correctness test.

mitsuhiko and others added 30 commits May 11, 2026 12:30
Implements the Responses API endpoint that Codex CLI (and other modern
OpenAI tooling) speaks instead of /v1/chat/completions. The wire format
is documented in OpenAI's Responses API; this implementation has been
iterated against the Codex CLI binary's SSE parser shape until no
remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call,
  function_call_output, reasoning, custom_tool_call(_output),
  local_shell_call(_output), web_search_call(_output),
  tool_search_call(_output), image_generation_call(_output),
  compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so
  prior actions survive across turns; rejects unknown item types and
  non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized
  text blocks (input_text/output_text/text/summary_text/
  reasoning_text); rejects non-text modalities (input_image/file/
  audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant
  message so text + tool-call turns render as a single assistant
  block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates
  reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice
  explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created,
  output_item.added/.done, reasoning_summary_part.added/.done,
  reasoning_summary_text.delta/.done, content_part.added/.done,
  output_text.delta/.done, function_call_arguments.delta/.done,
  response.completed.
- Branches the terminal event by finish reason: response.failed for
  errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries
  annotations:[]; function_call output_item.added ships with an empty
  arguments string (full args arrive via function_call_arguments.done
  and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream
  marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model
  was suppressed at the tool marker: the suppressed tail is flushed
  as additional output_text deltas so the streamed message matches
  output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same
client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks.

Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.
Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.\n\nMeasured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.\n\nCorrectness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.\n\nCorrectness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.
Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.
This reverts commit 2a7a5f3.

There was no ack from the user. Don't want to take a fix
that is astronautically produced from an unclear error
trace.
Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth.

Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.
Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation.

The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:
- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)
Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first.

Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations.

Fixes antirez#127.
Return a 400 error with error type "context_exceeded" when prompt tokens exceed
context size. The response includes both n_prompt_tokens and n_ctx fields so
clients can determine exactly why the request failed and how far over the limit
they went.

Error response format:
  {
    "error": {
      "message": "Prompt tokens (N) exceeds context size (M)",
      "type": "context_exceeded",
      "n_prompt_tokens": N,
      "n_ctx": M
    }
  }
dwarfstar is typoed to drawfstar
antirez and others added 28 commits May 18, 2026 15:23
Add ds4-agent as a native terminal coding agent with session KV storage, DSML tool execution, edit/read/search/write/list/bash tools, async bash jobs, compaction support, history replay, and a linenoise-based interactive UI.

The agent defaults to CRC-guarded edits, streams tool visualization, ignores tool calls emitted inside thinking, improves bash output observations for long-running commands, and styles prompts, spacing, and prefill progress for interactive use.
Use an ANSI scroll region for streamed agent output, resize it as linenoise input wraps or shrinks, and anchor the prompt/status block at the terminal bottom. This also fixes cursor placement at exact column boundaries and removes the prompt/output overwrite cases seen during multiline editing.
Long prefills (large prompts, no cache hit) can take minutes on local
hardware. ds4-server was silent on the socket the whole time, so HTTP
and TCP idle timeouts on the client side would close the connection
before the first response byte was written -- see the "sse headers
failed" log line that appeared at the very end of a multi-minute
prefill in real agent runs.

Stream the SSE response headers from the prefill_chunk progress
callback, then emit a ":" comment line (ignored by SSE clients per the
spec) at most every five seconds while prefill is still running. The
keepalive is best-effort: a closed socket simply fails the writes and
the outer code discovers the dead connection when it tries to stream
a real event, matching the existing error path. The tool-checkpoint
rebuild path pre-arms headers_sent because it only runs after the
response stream is already in flight, so it never tries to re-send
the SSE header line.

Verified on macOS Metal, q2-imatrix GGUF, ctx=200000:

- ./ds4_test --server passes
- 35s fresh prefill: client receives ": prefill" lines at +6/+12/
  +18/+25/+31s, then SSE content events at +35s, no client disconnect
- 1s cached prompt: unchanged (sse_headers is still emitted from the
  request handler when prefill never fires)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@24601 24601 force-pushed the codex/rocm-sync-main-20260521 branch from e4617b5 to e13de73 Compare May 21, 2026 18:54
@WebReflection
Copy link
Copy Markdown

WebReflection commented May 22, 2026

On average I have ~9 t/s and prefill up to double that, but while on DGX Spark I can see the whole RAM being used, I wonder if I need to force UMA buffer up to 96GB or if leaving it auto is good enough, and these are just the performance this 395+ Max can produce, thanks in advance for any sort of clarification. (ArchLinux btw)

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants