Add a docker-compose stack (OpenObserve + OpenTelemetry Collector) and collector config that scrapes the relayer Prometheus /metrics, tails structured JSON container logs, and accepts OTLP for future traces, exporting all signals to OpenObserve. Includes a README with run instructions, an API-health dashboard query set, alert definitions, rollout notes, and the known gaps (no trace instrumentation yet, no job-queue metric).
…ilure Enoki sponsored dry-run aborts in 0x2::balance::split with ENotEnough when a pool wallet's SUI gas coins are fragmented or too small to cover the budget. It was classified Transient, so it burned all 5 wallet retries rotating through equally-starved pool wallets and raised a misleading retries-exhausted alert. Add a distinct GasPoolExhausted classification that aborts retries (like the object-lock case) and fires a dedicated alert pointing ops at SUI gas coin consolidation/top-up. Add a gas-pool maintenance runbook.
Move title to frontmatter, remove em dashes, replace 'i.e.' and prose 'via' to satisfy the Sui documentation style guide audit.
Address review on #231:
- P1: a balance::split ENotEnough now stays Transient so Apalis rotates onto another pool wallet, and only escalates to GasPoolExhausted once every candidate wallet (min(pool_size, max_attempts)) has hit the same gas-budget failure. A single starved wallet no longer fails an upload a healthy wallet could serve.
- P2: the metadata-transfer recovery path applies the same escalation and dispatches the gas-pool ops alert (previously only the upload arm did).
Tests: single bad wallet stays retriable, full-pool exhaustion escalates and aborts, threshold computation, non-gas-budget passthrough.
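The escalation rule in this commit can be sketched as follows. This is a minimal illustration in Python, not the relayer's actual code: the `GasFailureTracker` class, its method names, the string classifications, and the choice to count distinct wallet IDs are all assumptions. Only the threshold formula min(pool_size, max_attempts) comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


def exhaustion_threshold(pool_size: int, max_attempts: int) -> int:
    """Number of candidate wallets a job can try before retries run out."""
    return min(pool_size, max_attempts)


@dataclass
class GasFailureTracker:
    """Hypothetical per-job tracker of wallets that hit the gas-budget failure."""

    pool_size: int
    max_attempts: int
    starved_wallets: Set[str] = field(default_factory=set)

    def classify(self, wallet_id: str, is_gas_budget_failure: bool) -> Optional[str]:
        # Passthrough: unrelated errors keep whatever classification they had.
        if not is_gas_budget_failure:
            return None
        # Assumption: a wallet that fails twice is still counted once.
        self.starved_wallets.add(wallet_id)
        threshold = exhaustion_threshold(self.pool_size, self.max_attempts)
        if len(self.starved_wallets) >= threshold:
            return "GasPoolExhausted"  # abort retries, fire the ops alert
        return "Transient"  # let the queue rotate onto another wallet
```

With a pool of 3 wallets and 5 attempts, the first two starved wallets stay retriable and the third triggers the escalation.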
Beyond the regex audit: remove quotation marks, replace the 'and/or' slash, add body text between stacked headings, and write out word abbreviations (tx, min, max) and the (s) plural per the Sui documentation style guide.
Satisfy the Sui style-guide audit: remove quotation marks from the frontmatter title and add the required description and keywords fields.
Add an opt-in background task (ZO_REMOTE_WRITE_URL) that gathers the relayer's Prometheus registry and pushes it to OpenObserve's /prometheus/api/v1/write endpoint as snappy-compressed protobuf (counters, gauges, histograms expanded to _bucket/_sum/_count, summaries). No-op when the env var is unset, so a single OpenObserve service can ingest the existing memwal_* metrics without a collector and production is unchanged until an environment opts in.
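The histogram expansion mentioned above follows the usual Prometheus convention: one `_bucket` series per upper bound with an `le` label (plus a `+Inf` bucket), then `_sum` and `_count`. The sketch below shows that flattening step only, in Python for illustration. The function name, the tuple layout, and the metric name in the usage note are hypothetical, and the protobuf encoding and snappy compression are left out.

```python
from typing import Dict, List, Tuple

# (series name, labels, value): a hypothetical in-memory sample shape.
Sample = Tuple[str, Dict[str, str], float]


def expand_histogram(
    name: str,
    labels: Dict[str, str],
    buckets: List[Tuple[float, int]],
    sample_count: int,
    sample_sum: float,
) -> List[Sample]:
    """Flatten one histogram into _bucket/_sum/_count series.

    `buckets` holds (finite upper bound, cumulative count) pairs in
    ascending order; the +Inf bucket is derived from `sample_count`.
    """
    samples: List[Sample] = []
    for upper_bound, cumulative in buckets:
        bucket_labels = {**labels, "le": str(upper_bound)}
        samples.append((f"{name}_bucket", bucket_labels, float(cumulative)))
    # The +Inf bucket always equals the total observation count.
    samples.append((f"{name}_bucket", {**labels, "le": "+Inf"}, float(sample_count)))
    samples.append((f"{name}_sum", dict(labels), float(sample_sum)))
    samples.append((f"{name}_count", dict(labels), float(sample_count)))
    return samples
```

For example, `expand_histogram("memwal_demo_seconds", {"route": "/x"}, [(0.1, 2), (1.0, 5)], 7, 3.5)` yields five series: three buckets, one sum, and one count.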
…-retry-dev fix(server): retry invalidated Enoki wallet txs
…bservability-poc feat(observability): OpenObserve self-hosted PoC (WALM-81)
…l-classification fix(relayer): classify Enoki balance::split ENotEnough as gas-pool failure (WALM-88)
Add full backend + frontend incident management for Statuspage parity.
Backend:
- incidents + incident_updates tables with indexes
- Admin API endpoints (POST/PATCH/DELETE /api/incidents, POST /api/incidents/:id/updates)
- API-key auth via STATUS_ADMIN_API_KEY header
- /api/status returns incidents: { active, recent }
- Atom/RSS feeds include real incident entries
Frontend:
- New /admin route with incident admin panel
- Create incident form with title, status, severity, component, message
- Existing incidents list with inline update, resolve, delete controls
- IncidentHistory renders real incidents; falls back to synthesized text when empty
- Updated footer navigation across all routes
Also fixes:
- 204 responses now send empty body (no JSON null)
- .env.example documents STATUS_ADMIN_API_KEY
- listIncidents now returns updates array (dead UI branch fixed)
- deleteIncident checks rowCount, returns false when nothing deleted
- Validate status/severity enums before DB insert → 400 (not 500)
- Validate startedAt/resolvedAt/createdAt dates → 400 (not 500)
- PATCH to resolved auto-inserts 'Incident resolved.' timeline entry
- createIncident uses atomic transaction for identifier + updates
- timingSafeEqual for API key comparison
- Remove unreachable method guard before static file serving
- Client: AdminPanel calls onMutate (main page refresh) after every mutation so the snapshot is fresh when user navigates back to /
- Server: addIncidentUpdate now sets resolved_at when status transitions to resolved, matching updateIncident behavior
- Server: readActiveAndRecentIncidents now fetches and attaches updates so /api/status includes complete incident history
- Client: IncidentDay changed from single message to messages[] array
- Client: buildIncidentDays renders all updates with timestamps
- Client: IncidentHistory renders each update with left-border indentation
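Two of the hardening items above, the constant-time API key comparison and validating enums and dates before the insert, can be illustrated with a short sketch. It is written in Python for illustration and is not the status page's code: `hmac.compare_digest` stands in for Node's `timingSafeEqual`, the function names are hypothetical, and the allowed status and severity values (other than `resolved`) are assumptions.

```python
import hashlib
import hmac
from datetime import datetime
from typing import Any, Dict, Tuple

# Assumed enum values; only "resolved" appears in the commit messages.
ALLOWED_STATUSES = {"investigating", "identified", "monitoring", "resolved"}
ALLOWED_SEVERITIES = {"minor", "major", "critical"}
DATE_FIELDS = ("startedAt", "resolvedAt", "createdAt")


def api_key_matches(presented: str, expected: str) -> bool:
    """Compare keys in constant time.

    Hashing first gives both sides the same length, so the comparison
    does not leak the length of the expected key.
    """
    a = hashlib.sha256(presented.encode("utf-8")).digest()
    b = hashlib.sha256(expected.encode("utf-8")).digest()
    return hmac.compare_digest(a, b)


def validate_incident(payload: Dict[str, Any]) -> Tuple[int, str]:
    """Return (http_status, message); reject bad input with 400 before any DB work."""
    if payload.get("status") not in ALLOWED_STATUSES:
        return 400, "invalid status"
    if payload.get("severity") not in ALLOWED_SEVERITIES:
        return 400, "invalid severity"
    for name in DATE_FIELDS:
        value = payload.get(name)
        if value is None:
            continue  # optional field
        try:
            datetime.fromisoformat(value)
        except (TypeError, ValueError):
            return 400, f"invalid {name}"
    return 200, "ok"
```

Rejecting malformed input up front is what turns a database constraint error (surfaced as a 500) into a client error (400).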
…hdog No stable 0.0.5/0.0.6 has shipped to main/npm yet, so merge the two unreleased changelog sections into a single 0.0.5 and consume the sse-heartbeat-watchdog changeset into it. Bumps package.json 0.0.6 -> 0.0.5.
fix(mcp): SSE heartbeat watchdog — recover from silently dead relayer sessions
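A heartbeat watchdog of this kind usually records when the last event arrived and treats the session as dead once a timeout passes with no traffic. The sketch below shows that idea in Python. It is an assumption about the general technique, not the MCP package's implementation: the class name, the timeout value, and the injectable clock are all hypothetical.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Hypothetical staleness detector for a long-lived SSE session."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._last_seen = clock()

    def record_heartbeat(self) -> None:
        """Call on every received event, including keep-alive comments."""
        self._last_seen = self._clock()

    def is_stale(self) -> bool:
        """True once no event has arrived within the timeout window."""
        return self._clock() - self._last_seen > self._timeout
```

A caller would poll `is_stale()` on a timer and tear down and reopen the session when it returns true.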
Add a safety callout to the SDK and getting-started quickstarts: use your own account, load credentials from env, and note that recall is scoped per account + namespace so a copied ID lands memories in a shared space. Closes #255
Cover the nodejs_compat flag, expected bundle size, which entry point bundles cleanest on edge, and the dynamic-import / graceful-degradation pattern for crash isolation. Register the page in the SDK nav. Closes #256
The default MemWal client still requires @mysten/seal + @mysten/sui (it builds a SEAL session key client-side); it is lighter than /manual only because /manual additionally pulls @mysten/walrus + client-side upload. Verified against packages/sdk peerDependencies and memwal.ts.
Verified by bundling the default MemWal client with wrangler 4.96 (deploy --dry-run):
- Without nodejs_compat the build fails: 'Could not resolve "crypto"' (the SDK calls await import("crypto")), so the flag is genuinely required.
- With nodejs_compat: ~1.2 MB raw / ~225 KB gzip, not ~3 MB (that figure likely counted the sourcemap or the /manual entry).
Add a memory_limiter processor (first in every pipeline) so the collector sheds load instead of OOMing when OpenObserve is slow or unreachable, and cap batches with send_batch_max_size to avoid oversized ingest payloads. Expose a health_check liveness endpoint (:13133) for orchestrator probes, and add mem_limit/cpus ceilings to both services so a runaway ingest can't starve the host.
Add a Dockerfile (bakes the config in, since Railway can't bind-mount it) and railway.json so the OTel collector can run as its own Railway service and scrape the relayer /metrics over the private network — closing the metrics gap on deployments where only direct OTLP logs/traces reach OpenObserve. Parameterize the exporter's OpenObserve host (OPENOBSERVE_HOST) so the same config targets the compose service locally and openobserve.railway.internal on Railway. Document the Railway deploy steps and required variables.
0.115.0 does not exist on Docker Hub; the Railway build failed to resolve it. 0.154.0 is a current stable contrib release.
Address the 15 violations flagged by the style-guide audit on the three changed files: remove em dashes from added prose and code comments, unquote the Cloudflare Workers frontmatter title/description and add keywords, capitalize Mainnet/Testnet, use sentence-case headings, and add an intro sentence before the Next steps list.
…private network The relayer bound 0.0.0.0 (IPv4 only), so service-to-service traffic over Railway's IPv6-only private network (e.g. the observability collector scraping relayer.railway.internal:PORT/metrics) could not connect. Binding the IPv6 unspecified address is dual-stack and still serves IPv4, so public access is unchanged.
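The dual-stack behavior described here can be demonstrated with a plain socket. This is a Python illustration rather than the relayer's bind code, and the function name is hypothetical. It assumes a host with IPv6 enabled; clearing `IPV6_V6ONLY` makes the dual-stack behavior explicit instead of relying on the operating system default.

```python
import socket


def make_dual_stack_listener(port: int = 0) -> socket.socket:
    """Listen on the IPv6 unspecified address while still accepting IPv4."""
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # 0 means IPv4 clients are accepted too, as IPv4-mapped IPv6 addresses.
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    sock.bind(("::", port))  # port 0 lets the OS pick a free port
    sock.listen()
    return sock
```

A socket bound to `0.0.0.0` uses `AF_INET` and never sees IPv6 connections, which is why the collector could not reach it over an IPv6-only network.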
The relayer's private domain is a generated name (lucky-strength.railway.internal),
not relayer.railway.internal — scraping the display name silently failed.
Document the ${{relayer.RAILWAY_PRIVATE_DOMAIN}} reference and the IPv6 [::]
bind requirement so the same trap isn't hit again.
- Server: probe both STATUS_RELAYER_PRODUCTION_URL and STATUS_RELAYER_STAGING_URL
- Server: store checks under separate targets (relayer-production / relayer-staging)
- Server: /api/status returns components[] and histories{} keyed by component id
- Server: overall service status aggregates across components
- Client: StatusSnapshot uses components[] and histories{}
- Client: buildRows renders one row per monitored component
- Client: uptime calendar, incident history, and admin component select use production history/component list
- Docs/Dockerfile/.env.example updated for new env vars
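One common way to aggregate an overall status across components is to report the worst component state. The sketch below shows that approach in Python. The commit does not state the actual rule, so the worst-of policy, the status names, and the function name are assumptions; only the component IDs come from the list above.

```python
from typing import Dict, List

# Assumed status vocabulary, ordered from best to worst.
SEVERITY_ORDER = ["operational", "degraded", "down"]


def overall_status(components: List[Dict[str, str]]) -> str:
    """Return the worst status among components, or "unknown" if there are none."""
    if not components:
        return "unknown"
    return max((c["status"] for c in components), key=SEVERITY_ORDER.index)


components = [
    {"id": "relayer-production", "status": "operational"},
    {"id": "relayer-staging", "status": "degraded"},
]
print(overall_status(components))
```

A status outside the assumed vocabulary raises `ValueError` in this sketch; a real implementation would need to decide how to rank unknown states.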
chore(observability): harden otel collector config and compose limits
…t-safety docs(sdk): Cloudflare Workers guide + quickstart accountId safety callout
…-route-desktop-users-to-remote-mcp-onboarding WALM-113: setup skill polish + grounded MCP tool results
feat(status): monitoring status page (WALM-99)
staging <- dev
hungtranphamminh approved these changes Jun 12, 2026
UyLeQuoc approved these changes Jun 12, 2026
No description provided.