fix(semantic): memory queue stall + permanent fs-error classification #1531

Open
ZaynJarvis wants to merge 8 commits into main from fix-memory-semantic-queue-stall

@ZaynJarvis
Collaborator

Summary

Rebased-on-main cherry-pick of #951 (by @deepakdevp) with minor adjustments to land on current main. Original commits preserved via git cherry-pick to keep the contributor's authorship intact.

  • 46a6961: _process_memory_directory() raises on ls/write failures instead of swallowing them, so on_dequeue() can always route to report_success() / report_error() and the in-progress counter no longer gets stuck. Fixes #864 (Memory semantic queue stalls on context_type=memory jobs; pending backlog grows while processed stays at 0).
  • c252078: filesystem errors (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError) are now classified as permanent so they hit report_error() instead of being re-enqueued forever.
  • 8581232: test harness fixup. DequeueHandlerBase.set_callbacks now takes (on_success, on_requeue, on_error); the original PR's tests called it with two args.
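For readers skimming the thread, the invariant behind the first bullet can be sketched roughly as follows. This is an illustration only, not the code in this PR: `QueueStats`, `SketchHandler`, and `run_demo` are hypothetical names, the handler is synchronous for brevity, and only the three-argument `set_callbacks` order is taken from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QueueStats:
    """Hypothetical counters; the real queue keeps its own bookkeeping."""
    in_progress: int = 0
    processed: int = 0
    requeued: int = 0
    errors: int = 0


class SketchHandler:
    """Illustrative stand-in for a dequeue handler, not DequeueHandlerBase itself."""

    def __init__(self, worker: Callable[[dict], None]) -> None:
        self._worker = worker
        self._on_success: Optional[Callable[[], None]] = None
        self._on_requeue: Optional[Callable[[], None]] = None
        self._on_error: Optional[Callable[[Exception], None]] = None

    def set_callbacks(self, on_success, on_requeue, on_error) -> None:
        # Three callbacks, in the order named in the PR description.
        self._on_success = on_success
        self._on_requeue = on_requeue
        self._on_error = on_error

    def on_dequeue(self, msg: dict) -> None:
        try:
            self._worker(msg)  # raises on failure instead of returning quietly
        except Exception as exc:
            self._on_error(exc)  # failure path: exactly one completion report
            return
        self._on_success()  # success path: exactly one completion report


def run_demo() -> QueueStats:
    stats = QueueStats()

    def on_success() -> None:
        stats.in_progress -= 1
        stats.processed += 1

    def on_requeue() -> None:
        stats.requeued += 1

    def on_error(exc: Exception) -> None:
        stats.in_progress -= 1
        stats.errors += 1

    def worker(msg: dict) -> None:
        if msg.get("broken"):
            raise PermissionError("ls failed")  # simulated ls/write failure

    handler = SketchHandler(worker)
    handler.set_callbacks(on_success, on_requeue, on_error)
    for msg in ({"uri": "a"}, {"uri": "b", "broken": True}):
        stats.in_progress += 1  # the queue marks the message as in flight
        handler.on_dequeue(msg)
    return stats
```

Because every exit from `on_dequeue` reports exactly once, the in-flight count in this sketch returns to zero whether the worker succeeds or raises.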

Conflict resolution notes

  • classify_api_error moved from openviking/utils/circuit_breaker.py to openviking/utils/model_retry.py after PR #951 (fix(semantic): ensure memory processing always reports completion status) was opened. The _PERMANENT_IO_ERRORS isinstance check was applied to model_retry.py accordingly.
  • Error paths in _process_memory_directory() now release lifecycle_lock_handle_id before re-raising, matching the new lock-ownership model on main.
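The second note, releasing the lock handle before re-raising, might look something like the sketch below. `SketchLockManager` and `process_directory_sketch` are made-up names for illustration, and the assumption that the success path leaves the handle for its owner to release is mine, not something stated in this PR.

```python
from typing import Callable, List, Set


class SketchLockManager:
    """Hypothetical lock registry keyed by handle id."""

    def __init__(self) -> None:
        self.held: Set[str] = set()

    def acquire(self, handle_id: str) -> None:
        self.held.add(handle_id)

    def release(self, handle_id: str) -> None:
        self.held.discard(handle_id)


def process_directory_sketch(
    ls: Callable[[], List[str]],
    locks: SketchLockManager,
    lifecycle_lock_handle_id: str,
) -> List[str]:
    try:
        return ls()
    except Exception:
        # Release first, then re-raise so the caller can still report the error.
        locks.release(lifecycle_lock_handle_id)
        raise


def failing_ls() -> List[str]:
    raise FileNotFoundError("viking://memory/missing")  # placeholder path


locks = SketchLockManager()
locks.acquire("handle-1")
try:
    process_directory_sketch(failing_ls, locks, "handle-1")
    raised = False
except FileNotFoundError:
    raised = True
```

The point of the ordering is that the exception still reaches the dequeue handler, but no lock is left behind for a message that will never be retried.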

Test plan

  • uv run pytest tests/utils/test_circuit_breaker.py tests/storage/test_memory_semantic_stall.py — 21 passed
  • Soak test under real semantic-queue load to confirm stall no longer reproduces

🤖 Generated with Claude Code

deepakdevp and others added 3 commits April 17, 2026 14:00
_process_memory_directory() had early return paths that could bypass
report_success()/report_error() in on_dequeue(), leaving the queue's
in_progress counter permanently stuck. This caused the semantic queue
to appear stalled with pending items never being processed.

All code paths now properly propagate to the completion callbacks.

Fixes #864.
…inite retry

Address review feedback: filesystem errors (FileNotFoundError,
PermissionError, IsADirectoryError, NotADirectoryError) are now
classified as permanent by classify_api_error(), so they hit
report_error() instead of being infinitely re-enqueued.

Tests updated to exercise real classifier behavior without mocking.
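A minimal sketch of the isinstance check described in this commit message, for orientation only. The tuple name `_PERMANENT_IO_ERRORS` and the four exception types come from this PR; `classify_error_sketch`, its string return values, and the "transient" default are assumptions, since the real `classify_api_error()` has its own rules and return type.

```python
# The four builtin exception types named in the commit message.
_PERMANENT_IO_ERRORS = (
    FileNotFoundError,
    PermissionError,
    IsADirectoryError,
    NotADirectoryError,
)


def classify_error_sketch(exc: BaseException) -> str:
    """Hypothetical classifier: 'permanent' errors are never retried."""
    if isinstance(exc, _PERMANENT_IO_ERRORS):
        return "permanent"
    # Assumed default for this sketch; the real classifier decides this itself.
    return "transient"
```

All four types are subclasses of `OSError`, but the check lists them explicitly, so in this sketch other `OSError` subclasses such as `TimeoutError` are not treated as permanent.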
DequeueHandlerBase.set_callbacks now takes (on_success, on_requeue, on_error);
the original PR #951 test harness called it with only (on_success, on_error).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

Failed to generate code suggestions for PR

@ZaynJarvis
Collaborator Author

pending build fix.

ZaynJarvis and others added 4 commits April 17, 2026 18:03
_mark_failed's two call sites were removed when _process_memory_directory
started raising on error. The closure itself was left behind. Delete it —
telemetry failure is now reported by on_dequeue's exception handler via
get_request_wait_tracker().mark_semantic_failed().

Add test_memory_ls_transient_error_requeues to cover the transient branch
of the memory path: a 500-class error from ls() must route through
_reenqueue_semantic_msg() and fire report_requeue() + report_success(),
not report_error(). The previous tests only exercised permanent errors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
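The routing that test_memory_ls_transient_error_requeues is described as covering could be sketched along these lines. Everything here is hypothetical scaffolding (`Recorder`, `FakeServerError`, `handle_sketch`), it is synchronous where the real handler may not be, and the requeue-then-success order is an assumption read off the commit message.

```python
from typing import Callable, List


class Recorder:
    """Records which completion callbacks fire, in order."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def report_success(self) -> None:
        self.calls.append("success")

    def report_requeue(self) -> None:
        self.calls.append("requeue")

    def report_error(self, exc: Exception) -> None:
        self.calls.append("error")


class FakeServerError(Exception):
    """Hypothetical 500-class error used only in this sketch."""
    status_code = 500


def handle_sketch(
    ls: Callable[[], None],
    recorder: Recorder,
    reenqueue: Callable[[], None],
    is_permanent: Callable[[Exception], bool],
) -> None:
    try:
        ls()
    except Exception as exc:
        if is_permanent(exc):
            recorder.report_error(exc)  # permanent: give up, report once
            return
        reenqueue()  # transient: put the message back on the queue
        recorder.report_requeue()
        recorder.report_success()  # this attempt is finished, so the slot is freed
        return
    recorder.report_success()


def failing_ls() -> None:
    raise FakeServerError("upstream 500")


requeued: List[str] = []
recorder = Recorder()
handle_sketch(
    failing_ls,
    recorder,
    reenqueue=lambda: requeued.append("msg"),
    is_permanent=lambda exc: isinstance(exc, (FileNotFoundError, PermissionError)),
)
```

In this sketch the transient attempt still ends with a success report, which is what keeps the in-progress counter from drifting while the message waits for its retry.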
Development

Successfully merging this pull request may close these issues.

Memory semantic queue stalls on context_type=memory jobs; pending backlog grows while processed stays at 0