[codex] harden Metal failure recovery #184

Closed
Chedrian07 wants to merge 1 commit into antirez:main from Chedrian07:codex/metal-failure-recovery
@Chedrian07

Summary

Harden Metal backend failure recovery after command-buffer errors and make decode command batches easier to diagnose.

What changed

  • Fully invalidate failed sessions instead of only clearing checkpoint_valid, so stale token lengths and checkpoint tokens are not exposed as live KV state after a backend error.
  • Reset Metal prefill frontier state before rebuilding a nonmatching prompt from token zero, including both attention and indexer compressed row counters.
  • Add labeled GPU command-buffer flush/end helpers and use them for decode batches so Metal hang logs identify the failing decode segment.
  • Add DS4_METAL_GRAPH_TOKEN_FLUSH_EVERY to split decode command buffers every N layers while preserving the existing default split behavior.
  • Guard server token rendering against invalidated sessions.
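
The layer-based splitting and labeling described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from this PR: the helper names (flush_every_from_string, flush_after_layer, segment_label), the label format, and the upper bound on N are all hypothetical, and the real default split behavior is not shown here, so the fallback is passed in by the caller.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse a "flush every N layers" setting.
 * Returns `fallback` when the value is NULL, empty, non-numeric,
 * or outside 1..1000000 (the bound is an arbitrary assumption). */
static int flush_every_from_string(const char *value, int fallback) {
    if (value == NULL || *value == '\0') return fallback;
    char *end = NULL;
    errno = 0;
    long n = strtol(value, &end, 10);
    if (errno != 0 || *end != '\0' || n < 1 || n > 1000000) return fallback;
    return (int)n;
}

/* Returns 1 when the command buffer should be flushed after `layer`
 * (0-based). The last layer always closes the batch; an invalid
 * `every` degrades to a single flush at the end. */
static int flush_after_layer(int layer, int n_layers, int every) {
    if (layer == n_layers - 1) return 1;
    if (every < 1) return 0;
    return ((layer + 1) % every) == 0;
}

/* Hypothetical label format for one decode segment, so that a hang
 * log can name the layer range that was in flight. */
static void segment_label(char *buf, size_t len, int first, int last) {
    snprintf(buf, len, "decode layers %d-%d", first, last);
}
```

A caller would presumably read the setting once with flush_every_from_string(getenv("DS4_METAL_GRAPH_TOKEN_FLUSH_EVERY"), default_every) and attach the label from segment_label to each command buffer before committing it.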

Root cause

After a Metal command-buffer failure, the session only set checkpoint_valid=false; checkpoint.len and the token history remained visible through session accessors. The server could then observe a nonzero live position even though common-prefix matching correctly returned zero, so recovery and cache decisions were made against stale state. Cold rebuilds also relied on prefill overwriting prior graph state instead of resetting the Metal compressor/indexer frontiers first.
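
To make the difference concrete, here is a minimal sketch of full invalidation with guarded accessors. The struct layout, field names, and function names are assumptions for illustration only and do not mirror the actual ds4 session types; only the idea (clear length and tokens, not just the valid flag, and reset both frontier counters before a cold rebuild) comes from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 8 /* tiny on purpose, illustration only */

/* Hypothetical checkpoint: a valid flag plus the token history. */
typedef struct {
    bool valid;
    size_t len;
    int tokens[MAX_TOKENS];
} checkpoint_t;

/* Hypothetical session: checkpoint plus the two compressed-row
 * frontier counters (attention and indexer). */
typedef struct {
    checkpoint_t checkpoint;
    size_t attn_comp_rows;
    size_t index_comp_rows;
} session_t;

/* Full invalidation: drop the length and token history as well,
 * so no accessor can report stale KV state after a backend error. */
static void session_invalidate(session_t *s) {
    s->checkpoint.valid = false;
    s->checkpoint.len = 0;
    memset(s->checkpoint.tokens, 0, sizeof s->checkpoint.tokens);
}

/* Reset both frontiers before rebuilding a prompt from token zero,
 * instead of relying on prefill to overwrite the old state. */
static void prefill_reset_frontiers(session_t *s) {
    s->attn_comp_rows = 0;
    s->index_comp_rows = 0;
}

/* Guarded accessors: an invalidated session has no live position
 * and exposes no tokens to the caller. */
static size_t session_pos(const session_t *s) {
    return s->checkpoint.valid ? s->checkpoint.len : 0;
}

static const int *session_tokens(const session_t *s) {
    return s->checkpoint.valid ? s->checkpoint.tokens : NULL;
}
```

With accessors shaped like this, a server-side renderer can check for a NULL token pointer or a zero position and skip rendering, which is one plausible way to guard against invalidated sessions.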

Validation

  • make ds4-server
  • make all
  • make ds4_cpu.o ds4_server_cpu.o
  • ./ds4_test --server --metal-kernels
  • make test was also run. It compiled and passed long-context, tool-call-quality, metal-kernels, and server tests, but failed logprob-vectors on long_memory_archive. The same ./ds4_test --logprob-vectors failure reproduces on clean origin/main with the same local model/fixtures, so it is not introduced by this PR.

@Chedrian07 Chedrian07 closed this by deleting the head repository May 19, 2026
