Improve runtime stability, observability, and target-aware controls#8

Open
PK3NZO wants to merge 3 commits into NullLatency:master from PK3NZO:pr/runtime-stability-diagnostics

@PK3NZO PK3NZO commented Apr 25, 2026

Summary

This PR hardens FlowDriver's runtime transport path and adds the operational controls needed to run it more safely under real network pressure.

The main focus is stability, observability, and runtime control: bounded retries/timeouts, backpressure, payload limits, session metrics, health endpoints, diagnostics tooling, multi-lane Drive transport support, and target-aware client behavior.

What Changed

Runtime transport stability

  • Added runtime tuning profiles for transport behavior.
  • Added configurable operation timeouts, retry limits, payload caps, and backpressure controls.
  • Improved session lifecycle tracking and runtime metrics.
  • Hardened transport cleanup paths to reduce stale or stuck runtime state.
  • Added safer handling around Drive transport file polling, deletion, pagination, and retry behavior.

Google Drive backend robustness

  • Improved Google Drive transport reliability with bounded retries and operation deadlines.
  • Added async cleanup behavior so stale transport files are removed without blocking the hot path.
  • Added multi-lane backend support to spread runtime traffic across multiple backend lanes.
  • Added purge tooling for stale Drive transport files.
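
One common way to spread traffic across several backend lanes is to hash a session identifier onto a lane index, so a given session always uses the same lane. The sketch below assumes that approach; `pickLane` is a hypothetical name, and the PR may distribute traffic differently (for example round-robin or by load).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickLane (hypothetical helper) maps a session ID to a lane in [0, laneCount).
// The same session ID always lands on the same lane.
func pickLane(sessionID string, laneCount int) int {
	if laneCount <= 1 {
		return 0 // single-lane or misconfigured setups fall back to lane 0
	}
	h := fnv.New32a()
	h.Write([]byte(sessionID)) // hash.Hash writes never return an error
	return int(h.Sum32() % uint32(laneCount))
}

func main() {
	for _, id := range []string{"session-a", "session-b", "session-c"} {
		fmt.Printf("%s -> lane %d\n", id, pickLane(id, 4))
	}
}
```

Hashing keeps each session's files on one lane, which simplifies polling and cleanup, at the cost of not balancing by actual load.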

Target-aware client controls

  • Added target policy support so client behavior can be tuned per destination.
  • Added tests for target policy selection.
  • Added client-side runtime flags/config plumbing for these controls.
  • Added server dial tuning knobs to improve connection behavior under slow or unstable upstream targets.
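
To make "tuned per destination" concrete, the following sketch selects a policy by longest matching host suffix and falls back to a default. The `TargetPolicy` type, its fields, and `selectPolicy` are assumptions for illustration only; the PR's real policy format and matching rules are not shown in this description.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// TargetPolicy is a hypothetical per-destination policy.
type TargetPolicy struct {
	Name        string
	HostSuffix  string        // e.g. "example.com" matches the host and its subdomains
	DialTimeout time.Duration // illustrative tunable
}

// selectPolicy returns the policy with the longest matching suffix,
// or fallback when no policy matches.
func selectPolicy(host string, policies []TargetPolicy, fallback TargetPolicy) TargetPolicy {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	best, bestLen := fallback, -1
	for _, p := range policies {
		suffix := strings.ToLower(p.HostSuffix)
		if suffix == "" {
			continue
		}
		// Match whole labels only, so "badexample.com" does not match "example.com".
		if host == suffix || strings.HasSuffix(host, "."+suffix) {
			if len(suffix) > bestLen {
				best, bestLen = p, len(suffix)
			}
		}
	}
	return best
}

func main() {
	policies := []TargetPolicy{
		{Name: "general", HostSuffix: "example.com", DialTimeout: 5 * time.Second},
		{Name: "api", HostSuffix: "api.example.com", DialTimeout: 2 * time.Second},
	}
	fallback := TargetPolicy{Name: "default", DialTimeout: 10 * time.Second}
	for _, h := range []string{"api.example.com", "www.example.com", "other.org"} {
		fmt.Println(h, "->", selectPolicy(h, policies, fallback).Name)
	}
}
```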

Observability and operations

  • Added health and metrics endpoints.
  • Added a client diagnostics collection script.
  • Added a systemd service template.
  • Expanded README setup and runtime documentation.
  • Updated example client/server configs to document the new runtime options.

Config Additions / Changes

This PR adds or documents runtime configuration for:

  • transport tuning profile
  • storage timeout / retry behavior
  • payload and backpressure limits
  • multi-lane backend settings
  • metrics / health server behavior
  • target-aware client policy
  • server dial timeout behavior
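
For readers unfamiliar with how payload and backpressure limits interact, here is a minimal sketch: a bounded queue that rejects oversized payloads outright and refuses to block when full, so the caller can slow down or drop. The type and function names are hypothetical and the behavior is an assumption about the general technique, not a description of this PR's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errPayloadTooLarge = errors.New("payload exceeds cap")
	errQueueFull       = errors.New("send queue full")
)

// boundedQueue (hypothetical) caps both the size of each payload and the
// number of payloads waiting to be sent.
type boundedQueue struct {
	maxPayload int
	ch         chan []byte
}

func newBoundedQueue(depth, maxPayload int) *boundedQueue {
	return &boundedQueue{maxPayload: maxPayload, ch: make(chan []byte, depth)}
}

// tryEnqueue never blocks: it returns an error the caller can treat as a
// backpressure signal instead of letting memory grow without bound.
func (q *boundedQueue) tryEnqueue(p []byte) error {
	if len(p) > q.maxPayload {
		return errPayloadTooLarge
	}
	select {
	case q.ch <- p:
		return nil
	default:
		return errQueueFull
	}
}

func main() {
	q := newBoundedQueue(1, 4)
	fmt.Println(q.tryEnqueue([]byte("abcd")))  // fits: queue has room
	fmt.Println(q.tryEnqueue([]byte("ab")))    // queue already holds one item
	fmt.Println(q.tryEnqueue([]byte("abcde"))) // larger than the payload cap
}
```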

Bugs / Risks Addressed

  • Unbounded or overly long Drive operations making sessions feel stuck.
  • Weak visibility into runtime/session state during failures.
  • Stale Drive transport files accumulating after interrupted sessions.
  • Single-lane backend pressure limiting resilience under load.
  • Runtime behavior being too global when different targets need different handling.
  • Server dial behavior being too rigid for unstable networks.

Validation

  • go test ./...
  • git diff --check

Result / Impact

This should make FlowDriver more practical to run outside a happy-path demo environment:

  • runtime failures degrade more predictably
  • operators get health, metrics, and diagnostics visibility
  • Drive transport cleanup is less fragile
  • backend pressure can be distributed across lanes
  • clients can tune behavior for specific targets
  • the server has better control over dial behavior and timeouts

Notes

This PR intentionally focuses on stability, observability, and operational tooling. Cold-start latency optimization should be handled in a follow-up PR after measuring the new runtime metrics.

@PK3NZO PK3NZO changed the title from "Improve runtime stability and diagnostics" to "Improve runtime stability, diagnostics, and target controls" Apr 25, 2026
@PK3NZO PK3NZO changed the title from "Improve runtime stability, diagnostics, and target controls" to "Improve runtime stability, observability, and target-aware controls" Apr 25, 2026