Skip to content

feat(gadget): support user recovery handover#51

Merged
samcday merged 1 commit into
mainfrom
ur-adopted-lol
May 14, 2026
Merged

feat(gadget): support user recovery handover#51
samcday merged 1 commit into
mainfrom
ur-adopted-lol

Conversation

@samcday
Copy link
Copy Markdown
Owner

@samcday samcday commented May 13, 2026

Quiesce and release existing ublk devices on SIGHUP so a successor daemon can adopt them without completing in-flight I/O as failures. Add VM coverage for preserving the device identity across handover.

@samcday
Copy link
Copy Markdown
Owner Author

samcday commented May 13, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 13, 2026

Claude finished @samcday's task in 2m 31s —— View job


Code Review

  • Read CLAUDE.md for repo guidelines
  • Review changed files
  • Analyze logic and correctness
  • Post final review

The overall approach is sound: quiesce in-flight I/O on SIGHUP using a background thread (so the async event loop keeps servicing I/O completions during the quiesce wait), then drop device handles to release them for adoption. The two-phase split (begin_user_recovery_handoverrelease_devices_for_user_recovery) is a real improvement over the old blocking begin_user_recovery.

Issues

1. wait_for_user_recovery_start retries already-recovered devices on each loop iteration (medium)

crates/smoo-gadget-app/src/lib.rs:1803–1835

loop {
    let mut busy = false;
    for dev_id in dev_ids.iter().copied() {      // iterates ALL ids every pass
        match ublk.start_user_recovery(dev_id).await {
            Ok(()) => { ... }
            Err(err) if error_is_errno(&err, libc::EBUSY) => { busy = true; ... }
            Err(err) => return Err(err)...;      // any other error is fatal
        }
    }
    if !busy { return Ok(()); }
    // sleep 100ms, loop back
}

If device 0 enters recovery (Ok(())) but device 1 is still EBUSY, busy = true and we sleep and loop. On the second pass, we call start_user_recovery(0) again. The kernel's ublk_ctrl_start_recovery checks device state; a device already in UBLK_S_DEV_QUIESCED will likely return EINVAL (not EBUSY), which hits the fatal arm and aborts the whole adopt.

Fix: track which ids have been successfully started and skip them on retry. Fix this →


2. preserve_devices_on_drop() called twice

crates/smoo-gadget-app/src/lib.rs:2135 and :2147

preserve_devices_on_drop() is called in both begin_user_recovery_handover and again at the top of release_devices_for_user_recovery. The second call is harmless (presumably idempotent), but it's confusing — the guarantee should be set once when the decision is made.


3. Hardcoded raw kernel constants without version context

crates/smoo-gadget-ublk/src/lib.rs:477–478

const UBLK_CMD_QUIESCE_DEV_RAW: u32 = 0x16;
const UBLK_F_QUIESCE_RAW: u64 = 1 << 12;

The _RAW suffix communicates "bindgen doesn't have these yet" but not why (which kernel version added them, or when this workaround can be removed). A single-line comment with the kernel version would let a future reader know when to replace these with the bindgen-generated names.


4. UBLK_F_QUIESCE_RAW set unconditionally on all new devices

crates/smoo-gadget-ublk/src/lib.rs:543–545

info.flags |=
    (UBLK_F_USER_RECOVERY | UBLK_F_USER_RECOVERY_REISSUE) as u64 | UBLK_F_QUIESCE_RAW;

build.rs has #ifdef UBLK_U_CMD_QUIESCE_DEV around the MARK_FIX_753 call, but the runtime flag is set unconditionally. On a kernel that doesn't support the quiesce flag, UBLK_CMD_ADD_DEV will likely return EINVAL, making smoo-gadget completely non-functional on older kernels. If that's an intentional hard dependency on the new kernel, it's worth a comment (or a startup check) making that explicit. If it should degrade gracefully, the flag needs a capability check.


Minor observations

  • stop_accepting_new_io is also called from the graceful-shutdown path (SIGINT/SIGTERM): If SIGINT arrives after SIGHUP starts the handover, the shutdown path and release_devices_for_user_recovery could both call stop_accepting_new_io. The handover.is_none() guard on the SIGHUP arm means SIGHUP is idempotent, but there's no guard preventing SIGINT from racing the handover path. In practice the loop would break from the shutdown arm before the handover completes, so the quiesce thread would be abandoned — which should be fine since the process is exiting anyway, but the quiesce thread may outlive the process briefly.

  • parse_duration accepts "1 s" (space before suffix): The .trim() on the suffix string means a space between number and unit is silently accepted. Whether that's desirable is a matter of taste, but it may surprise users who typo "1 s" and get a successful parse rather than an error.

  • Test result.assert(true, false) at line 90: The false skips data-integrity comparison in the standard harness, which is intentional here since the test does its own byte-level verification above. A brief comment would help readers unfamiliar with the harness API.

  • Test state-file cleanup (let _ = remove_file(&state_path).await) silently ignores I/O errors. This is fine for test cleanup, but if the state file wasn't created (e.g., early failure), the _ idiom means no noise — which is the right call.


The new spawn_gadget_child extraction and adopt_restart_gadget helper are clean refactors. The test covers the right invariant (dev_id stability + data integrity across handover). The quiesce-on-background-thread pattern is the correct way to keep the async event loop responsive during quiesce.

Quiesce and release existing ublk devices on SIGHUP so a successor daemon can adopt them without completing in-flight I/O as failures. Add VM coverage for preserving the device identity across handover.

Signed-off-by: Sam Day <me@samcday.com>
@samcday
Copy link
Copy Markdown
Owner Author

samcday commented May 14, 2026

@claude did a bunch of rework, previous feedback should be addressed but also the way we quiesce and deal with the uring changed a fair bit, please re-review.

@claude
Copy link
Copy Markdown

claude Bot commented May 14, 2026

Claude finished @samcday's task in 8m 32s —— View job


Code Review (re-review)

  • Read CLAUDE.md for repo guidelines
  • Read changed files and diff
  • Check previous issues are addressed
  • Analyze new quiesce/uring logic
  • Post final review

Previous feedback — all addressed

1. wait_for_user_recovery_start retry-all bug — fixed. The remaining: HashSet<u32> pattern correctly skips already-recovered devices on each loop iteration.

2. preserve_devices_on_drop() called twice — fixed. It's now called exactly once, in begin_user_recovery_handover.

3. Raw _RAW kernel constants — gone. UBLK_F_QUIESCE and UBLK_U_CMD_QUIESCE_DEV are now proper bindgen-generated symbols from the updated bundled header.

4. UBLK_F_QUIESCE set unconditionally — still true, but this is now the right answer: the bundled ublk_cmd.h owns the definition, so it's a deliberate hard dependency on kernel support rather than a silent incompatibility. The behavior (EINVAL from ADD_DEV on old kernels) is unchanged but the intent is clear.


New quiesce/uring logic

The rewrite is a genuine improvement. The key design points and their correctness:

Two-phase drain: quiesce only starts after inflight_io == 0 (line 1100). Since the queue_rx.recv() arm is gated by handover.is_none(), no new requests can inflate inflight_io once SIGHUP is received. inflight_io is decremented via DataPlaneEvent::RequestCompleted/RequestFailed through data_plane_rx, which remains ungated — so in-flight app-level I/O can drain correctly.

UBLK_IO_RES_ABORT handling in queue_worker_main: the new tag_active + tag_open state machine is correct. Idle FETCH tags mark themselves closed and decrement open_tags; active tags (result already sent to completion_txs) continue and get a FETCH abort on the subsequent COMMIT_AND_FETCH, which closes them. The early-return on open_tags == 0 && !tag_active.iter().any(...) is consistent with the while-loop condition.

Union access fixed: io_cmd_with_addr and io_desc_nr_sectors properly access the __bindgen_anon_1 unions instead of the flat #if 0 workarounds. This unblocks any zone/zone-append functionality too.

next_io maps channel-closed to ENODEV (line 1399) so the queue_task_loop reports a QueueEvent::QueueError with ENODEV, which is then silently accepted when expected_queue_abort is true. Clean.


Remaining issues

1. stop_accepting_new_io still called twice in the handover path (medium, pre-existing)

crates/smoo-gadget-app/src/lib.rs:1191 (SIGHUP arm) and crates/smoo-gadget-app/src/lib.rs:2184 (inside release_devices_for_user_recovery).

The first call is correct and necessary — it prevents inflight_io from growing while the drain proceeds. The second call inside release_devices_for_user_recovery is a no-op for the handover path (queue_tasks is already empty, queue_tx is already None) but is needed for any future caller of release_devices_for_user_recovery who hasn't already stopped I/O. The asymmetry is confusing — the function's own docstring or a comment would help explain why it's idempotent-by-design.


2. Items already in queue_rx at handover start have abandoned tag_active entries (low)

crates/smoo-gadget-ublk/src/lib.rs:1559–1560

When SIGHUP fires, the queue_rx.recv() arm becomes gated (handover.is_none()). Any requests already in queue_rx (put there by queue_task_loop just before it was stopped) will never be dispatched to the app layer. Their tags have tag_active = true in queue_worker_main. After quiesce, those tags receive UBLK_IO_RES_ABORT; the code sees them as active and waits for completions that will never come — the workers stay alive until stop = true fires from device drop.

This is actually correct behavior: UBLK_F_USER_RECOVERY_REISSUE causes the kernel to reissue those requests to the successor. But the comment at line 1559–1560 says:

// Quiesce aborts idle FETCH commands, but already-fetched requests still
// need their completions committed before the old owner can exit cleanly.

This is only partially true. Requests dispatched to the app are committed before device drop (because inflight_io == 0 implies all completions have been sent). But requests stranded in queue_rx or request_rxs are abandoned, not committed — they rely on REISSUE. The comment is misleading for this case.


3. #ifdef UBLK_U_CMD_QUIESCE_DEV in build.rs is always true (low)

crates/smoo-gadget-ublk/build.rs:67–69

#ifdef UBLK_U_CMD_QUIESCE_DEV
MARK_FIX_753(UBLK_U_CMD_QUIESCE_DEV);
#endif

Since ublk_cmd.h now unconditionally defines UBLK_U_CMD_QUIESCE_DEV, this guard is dead. It likely existed to handle the case where the system kernel headers lacked this symbol, but with a bundled header that's no longer relevant. A comment (// Always true; symbol is provided by the bundled header) would prevent future confusion, or the guard can simply be dropped. Fix this →


4. temp_state_path duplicated (minor)

crates/smoo-test-harness/tests/user_recovery_handover.rs:137 and crates/smoo-test-harness/tests/link_replay.rs:538

The helper is identical in both files. Moving it to the shared common module would remove the duplication.


Minor observations

  • State-file existence check → hard error (crates/smoo-gadget-app/src/lib.rs:175–185): the new behavior that rejects an existing state file unless --adopt is passed is a better UX than the old silent load. The error message is clear.

  • --adopt-deadline defaults to infinite wait: with no deadline, the successor will wait forever for the predecessor to quiesce. This is correct for automated upgrades where the predecessor is expected to drain cleanly, but an operator running --adopt manually on a hung predecessor will get no feedback. The help text could note that omitting --adopt-deadline means indefinite wait.

  • Quiesce timeout Duration::ZERO = "wait forever" (crates/smoo-gadget-ublk/src/lib.rs:329–331): the doc comment on spawn_quiesce_devices makes this explicit. Good — the mapping is non-obvious.

  • ublk_cmd.h additions (UBLK_F_UPDATE_SIZE, UBLK_F_USER_RECOVERY_FAIL_IO, UBLK_F_PER_IO_DAEMON, etc.): these are upstream additions carried along. Not used yet; no concerns.


The quiesce architecture is solid. The test covering drain-before-quiesce is the right invariant to pin down. Ready to merge once the comment at line 1559–1560 is updated (or once you're satisfied the behavior is documented elsewhere).

@samcday samcday merged commit 85b7770 into main May 14, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant