main <- staging by ducnmm · Pull Request #276 · MystenLabs/MemWal

ducnmm · 2026-06-12T06:26:38Z

No description provided.

Add a docker-compose stack (OpenObserve + OpenTelemetry Collector) and collector config that scrapes the relayer Prometheus /metrics, tails structured JSON container logs, and accepts OTLP for future traces, exporting all signals to OpenObserve. Includes a README with run instructions, an API-health dashboard query set, alert definitions, rollout notes, and the known gaps (no trace instrumentation yet, no job-queue metric).

…ilure Enoki sponsored dry-run aborts in 0x2::balance::split with ENotEnough when a pool wallet's SUI gas coins are fragmented or too small to cover the budget. It was classified Transient, so it burned all 5 wallet retries rotating through equally-starved pool wallets and raised a misleading retries-exhausted alert. Add a distinct GasPoolExhausted classification that aborts retries (like the object-lock case) and fires a dedicated alert pointing ops at SUI gas coin consolidation/top-up. Add a gas-pool maintenance runbook.

Move title to frontmatter, remove em dashes, replace 'i.e.' and prose 'via' to satisfy the Sui documentation style guide audit.

Address review on #231:
- P1: a balance::split ENotEnough now stays Transient so Apalis rotates onto another pool wallet, and only escalates to GasPoolExhausted once every candidate wallet (min(pool_size, max_attempts)) has hit the same gas-budget failure. A single starved wallet no longer fails an upload a healthy wallet could serve.
- P2: the metadata-transfer recovery path applies the same escalation and dispatches the gas-pool ops alert (previously only the upload arm did).

Tests: single bad wallet stays retriable, full-pool exhaustion escalates and aborts, threshold computation, non-gas-budget passthrough.
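For reviewers who want the escalation rule in one place, here is a minimal sketch of the logic described for #231. It is Python for illustration only (the relayer itself is not Python), and every name in it (`Classification`, `escalation_threshold`, `classify_gas_budget_failure`, the wallet IDs) is hypothetical. The only detail taken from the commit message is the threshold `min(pool_size, max_attempts)`.

```python
from enum import Enum
from typing import Set


class Classification(Enum):
    TRANSIENT = "transient"                    # retry, rotating onto another pool wallet
    GAS_POOL_EXHAUSTED = "gas_pool_exhausted"  # abort retries and alert ops


def escalation_threshold(pool_size: int, max_attempts: int) -> int:
    """Number of distinct wallets that must fail before escalating."""
    return min(pool_size, max_attempts)


def classify_gas_budget_failure(
    failed_wallets: Set[str], pool_size: int, max_attempts: int
) -> Classification:
    """Classify a balance::split ENotEnough failure.

    failed_wallets holds the (hypothetical) IDs of every pool wallet that has
    already hit the same gas-budget failure for this job, including the
    current one.
    """
    if len(failed_wallets) >= escalation_threshold(pool_size, max_attempts):
        return Classification.GAS_POOL_EXHAUSTED
    return Classification.TRANSIENT
```

With a pool of 3 wallets and 5 attempts, one starved wallet stays retriable, and the failure escalates only after all 3 wallets have failed the same way.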

Beyond the regex audit: remove quotation marks, replace the 'and/or' slash, add body text between stacked headings, and write out word abbreviations (tx, min, max) and the (s) plural per the Sui documentation style guide.

Satisfy the Sui style-guide audit: remove quotation marks from the frontmatter title and add the required description and keywords fields.

Add an opt-in background task (ZO_REMOTE_WRITE_URL) that gathers the relayer's Prometheus registry and pushes it to OpenObserve's /prometheus/api/v1/write endpoint as snappy-compressed protobuf (counters, gauges, histograms expanded to _bucket/_sum/_count, summaries). No-op when the env var is unset, so a single OpenObserve service can ingest the existing memwal_* metrics without a collector and production is unchanged until an environment opts in.
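To make "histograms expanded to _bucket/_sum/_count" concrete, the sketch below flattens one histogram into the individual series a remote-write payload would carry. It is an illustration in Python, not the relayer's code: the function name, the tuple layout, and the metric name `memwal_demo_seconds` are assumptions, and snappy/protobuf encoding is left out.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]


def expand_histogram(
    name: str,
    labels: Dict[str, str],
    upper_bounds: List[float],
    cumulative_counts: List[int],
    sample_sum: float,
    sample_count: int,
) -> List[Sample]:
    """Flatten one histogram into _bucket, _sum and _count samples.

    cumulative_counts[i] is the number of observations <= upper_bounds[i],
    which is how Prometheus histograms report their buckets.
    """
    samples: List[Sample] = []
    for bound, count in zip(upper_bounds, cumulative_counts):
        samples.append((f"{name}_bucket", {**labels, "le": str(bound)}, float(count)))
    # The +Inf bucket always equals the total observation count.
    samples.append((f"{name}_bucket", {**labels, "le": "+Inf"}, float(sample_count)))
    samples.append((f"{name}_sum", dict(labels), float(sample_sum)))
    samples.append((f"{name}_count", dict(labels), float(sample_count)))
    return samples
```

A histogram with two finite buckets therefore becomes five series: three `_bucket` samples (including `le="+Inf"`), one `_sum`, and one `_count`.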

Add full backend + frontend incident management for Statuspage parity.

Backend:
- incidents + incident_updates tables with indexes
- Admin API endpoints (POST/PATCH/DELETE /api/incidents, POST /api/incidents/:id/updates)
- API-key auth via STATUS_ADMIN_API_KEY header
- /api/status returns incidents: { active, recent }
- Atom/RSS feeds include real incident entries

Frontend:
- New /admin route with incident admin panel
- Create incident form with title, status, severity, component, message
- Existing incidents list with inline update, resolve, delete controls
- IncidentHistory renders real incidents; falls back to synthesized text when empty
- Updated footer navigation across all routes

Also fixes:
- 204 responses now send empty body (no JSON null)
- .env.example documents STATUS_ADMIN_API_KEY

- listIncidents now returns updates array (dead UI branch fixed)
- deleteIncident checks rowCount, returns false when nothing deleted
- Validate status/severity enums before DB insert → 400 (not 500)
- Validate startedAt/resolvedAt/createdAt dates → 400 (not 500)
- PATCH to resolved auto-inserts 'Incident resolved.' timeline entry
- createIncident uses atomic transaction for identifier + updates
- timingSafeEqual for API key comparison
- Remove unreachable method guard before static file serving
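Two of the hardening items above, constant-time API-key comparison and rejecting bad enum values with a 400 before they reach the database, can be sketched as follows. This is a Python analogue, not the status service's code: `hmac.compare_digest` stands in for `timingSafeEqual`, the allowed status and severity values are assumed Statuspage-style sets that the PR does not list, and the function names are hypothetical.

```python
import hmac
from typing import List, Optional

# Assumed value sets; the real enums are not shown in this PR.
ALLOWED_STATUSES = {"investigating", "identified", "monitoring", "resolved"}
ALLOWED_SEVERITIES = {"minor", "major", "critical"}


def is_authorized(presented_key: Optional[str], expected_key: Optional[str]) -> bool:
    """Constant-time API-key check; rejects when either side is missing."""
    if not presented_key or not expected_key:
        return False
    return hmac.compare_digest(
        presented_key.encode("utf-8"), expected_key.encode("utf-8")
    )


def validate_incident(status: str, severity: str) -> List[str]:
    """Return validation errors; an empty list means the payload may be stored."""
    errors: List[str] = []
    if status not in ALLOWED_STATUSES:
        errors.append(f"invalid status: {status!r}")
    if severity not in ALLOWED_SEVERITIES:
        errors.append(f"invalid severity: {severity!r}")
    return errors


def response_code(errors: List[str]) -> int:
    """Map validation errors to 400 so bad input never surfaces as a 500."""
    return 400 if errors else 201
```

Validating before the insert keeps client mistakes in the 4xx range, and refusing an empty expected key avoids accidentally authorizing requests when the admin key is not configured.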

- Client: AdminPanel calls onMutate (main page refresh) after every mutation so the snapshot is fresh when user navigates back to /
- Server: addIncidentUpdate now sets resolved_at when status transitions to resolved, matching updateIncident behavior

- Server: readActiveAndRecentIncidents now fetches and attaches updates so /api/status includes complete incident history
- Client: IncidentDay changed from single message to messages[] array
- Client: buildIncidentDays renders all updates with timestamps
- Client: IncidentHistory renders each update with left-border indentation

…hdog No stable 0.0.5/0.0.6 has shipped to main/npm yet, so merge the two unreleased changelog sections into a single 0.0.5 and consume the sse-heartbeat-watchdog changeset into it. Bumps package.json 0.0.6 -> 0.0.5.

Add a safety callout to the SDK and getting-started quickstarts: use your own account, load credentials from env, and note that recall is scoped per account + namespace so a copied ID lands memories in a shared space. Closes #255

Cover the nodejs_compat flag, expected bundle size, which entry point bundles cleanest on edge, and the dynamic-import / graceful-degradation pattern for crash isolation. Register the page in the SDK nav. Closes #256

The default MemWal client still requires @mysten/seal + @mysten/sui (it builds a SEAL session key client-side); it is lighter than /manual only because /manual additionally pulls @mysten/walrus + client-side upload. Verified against packages/sdk peerDependencies and memwal.ts.

Verified by bundling the default MemWal client with wrangler 4.96 (deploy --dry-run):
- Without nodejs_compat the build fails: 'Could not resolve "crypto"' (the SDK calls await import("crypto")), so the flag is genuinely required.
- With nodejs_compat: ~1.2 MB raw / ~225 KB gzip, not ~3 MB (that figure likely counted the sourcemap or the /manual entry).

Add a memory_limiter processor (first in every pipeline) so the collector sheds load instead of OOMing when OpenObserve is slow or unreachable, and cap batches with send_batch_max_size to avoid oversized ingest payloads. Expose a health_check liveness endpoint (:13133) for orchestrator probes, and add mem_limit/cpus ceilings to both services so a runaway ingest can't starve the host.

Add a Dockerfile (bakes the config in, since Railway can't bind-mount it) and railway.json so the OTel collector can run as its own Railway service and scrape the relayer /metrics over the private network — closing the metrics gap on deployments where only direct OTLP logs/traces reach OpenObserve. Parameterize the exporter's OpenObserve host (OPENOBSERVE_HOST) so the same config targets the compose service locally and openobserve.railway.internal on Railway. Document the Railway deploy steps and required variables.

0.115.0 does not exist on Docker Hub; the Railway build failed to resolve it. 0.154.0 is a current stable contrib release.

Address the 15 violations flagged by the style-guide audit on the three changed files: remove em dashes from added prose and code comments, unquote the Cloudflare Workers frontmatter title/description and add keywords, capitalize Mainnet/Testnet, use sentence-case headings, and add an intro sentence before the Next steps list.

…private network The relayer bound 0.0.0.0 (IPv4 only), so service-to-service traffic over Railway's IPv6-only private network (e.g. the observability collector scraping relayer.railway.internal:PORT/metrics) could not connect. Binding the IPv6 unspecified address is dual-stack and still serves IPv4, so public access is unchanged.
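The bind change can be illustrated with a standard-library listener. This is a hedged Python analogue of the idea, not the relayer's code: `open_listener` is a hypothetical helper, and the IPv4 fallback is only there so the sketch also runs on hosts without usable IPv6.

```python
import socket


def open_listener(port: int = 0) -> socket.socket:
    """Bind a TCP listener on the IPv6 unspecified address, dual-stack if possible.

    A dual-stack socket bound to "::" accepts both IPv6 and IPv4 clients,
    which is why public IPv4 access keeps working after the change.
    """
    if socket.has_dualstack_ipv6():
        try:
            return socket.create_server(
                ("::", port), family=socket.AF_INET6, dualstack_ipv6=True
            )
        except OSError:
            pass  # IPv6 exists in the kernel but is not usable on this host
    # Fallback for the sketch only: IPv4-only, the behavior the commit replaces.
    return socket.create_server(("0.0.0.0", port))
```

Passing port 0 lets the operating system pick a free port, which is convenient for trying the helper locally; a real service would bind its configured port.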

The relayer's private domain is a generated name (lucky-strength.railway.internal), not relayer.railway.internal — scraping the display name silently failed. Document the ${{relayer.RAILWAY_PRIVATE_DOMAIN}} reference and the IPv6 [::] bind requirement so the same trap isn't hit again.

- Server: probe both STATUS_RELAYER_PRODUCTION_URL and STATUS_RELAYER_STAGING_URL
- Server: store checks under separate targets (relayer-production / relayer-staging)
- Server: /api/status returns components[] and histories{} keyed by component id
- Server: overall service status aggregates across components
- Client: StatusSnapshot uses components[] and histories{}
- Client: buildRows renders one row per monitored component
- Client: uptime calendar, incident history, and admin component select use production history/component list
- Docs/Dockerfile/.env.example updated for new env vars
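As a rough picture of the new response shape, the sketch below builds a snapshot with components[] and histories{} keyed by component id and derives an overall status from them. It is Python for illustration only. The status vocabulary, the "worst status wins" aggregation rule, and the field names inside each check are all assumptions, because the PR states only that the overall status aggregates across components. Only the component ids come from the commit message.

```python
from typing import Dict, List

# Assumed status vocabulary, ranked from best to worst.
STATUS_RANK = {"operational": 0, "degraded": 1, "down": 2}


def overall_status(statuses: List[str]) -> str:
    """Worst component status wins (an assumed aggregation rule)."""
    return max(statuses, key=lambda status: STATUS_RANK[status])


def build_snapshot(histories: Dict[str, List[dict]]) -> dict:
    """Build a /api/status-like payload from per-component check histories.

    Each history is assumed to be non-empty and ordered oldest to newest.
    """
    components = [
        {"id": component_id, "status": checks[-1]["status"]}
        for component_id, checks in histories.items()
    ]
    return {
        "status": overall_status([c["status"] for c in components]),
        "components": components,
        "histories": histories,
    }
```

Keying histories by component id lets the client render one row per monitored component while still reading a single top-level status.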

jasong-03 and others added 30 commits June 4, 2026 11:48

docs(relayer): fix gas-pool runbook for Sui style guide

1d7c5d0

Move title to frontmatter, remove em dashes, replace 'i.e.' and prose 'via' to satisfy the Sui documentation style guide audit.

docs(relayer): add frontmatter description/keywords, unquote title

3ee1277

Satisfy the Sui style-guide audit: remove quotation marks from the frontmatter title and add the required description and keywords fields.

fix(server): retry invalidated Enoki wallet txs

34aa396

feat(observability): export relayer telemetry via OTLP

b64e696

fix(server): use rust 1.88 builder image

2f11752

fix(observability): use blocking OTLP http exporter

637c03f

fix(server): bound apalis startup migrations

9f222ba

fix(observability): use OTLP JSON over HTTP

517d525

fix(observability): annotate HTTP spans with OTEL semantics

45f6338

Merge pull request #251 from MystenLabs/codex/fix-enoki-walrus-upload…

0cb5e36

…-retry-dev fix(server): retry invalidated Enoki wallet txs

Merge pull request #230 from MystenLabs/feature/walm-81-openobserve-o…

ddc1b49

…bservability-poc feat(observability): OpenObserve self-hosted PoC (WALM-81)

fix(relayer): harden gas-pool retry classification

841818f

Add standalone status service

86c528d

Merge pull request #231 from MystenLabs/feature/walm-88-enoki-gas-poo…

b8b3c19

…l-classification fix(relayer): classify Enoki balance::split ENotEnough as gas-pool failure (WALM-88)

Polish status page layout

a67314b

Add status history storage

44648f3

Merge remote-tracking branch 'origin/dev' into codex/walm-99-status-page

3418c58

Add status history tabs

d3268d6

Match Statuspage routes and feeds

4d91f25

Match Statuspage incident typography

d440f03

Lower WAL balance alert threshold to 2 WAL

9e43e05

Fix set-metadata result pattern for usize return

cd48587

Fix namespace reuse in set-metadata low WAL alert

ce094e2

DalenMax and others added 25 commits June 11, 2026 15:01

fix(status): align Past Incidents bucket logic with threshold-based s…

0f8af38

…erver rule

docs(skills): client routing, terminal-first setup, remote MCP self-s…

1958eec

…erve (WALM-113)

fix(relayer): ground MCP tool results with walruscan links

b3f758b

chore(mcp): revert version to 0.0.5, fold 0.0.6 changelog into 0.0.5

9d5f2f2

docs(mcp): trim 0.0.5 changelog to concise, user-facing entries

c80ded8

Merge pull request #261 from xogdg/fix/sse-heartbeat-watchdog

71d0cc8

fix(mcp): SSE heartbeat watchdog — recover from silently dead relayer sessions

docs(sdk): add Cloudflare Workers guide

f87a942

Cover the nodejs_compat flag, expected bundle size, which entry point bundles cleanest on edge, and the dynamic-import / graceful-degradation pattern for crash isolation. Register the page in the SDK nav. Closes #256

fix(observability): pin collector to an existing contrib tag (0.154.0)

81f926d

0.115.0 does not exist on Docker Hub; the Railway build failed to resolve it. 0.154.0 is a current stable contrib release.

Merge pull request #274 from MystenLabs/chore/otel-collector-hardening

764e7f1

chore(observability): harden otel collector config and compose limits

Merge pull request #270 from MystenLabs/docs/cf-workers-and-quickstar…

c8cc0b8

…t-safety docs(sdk): Cloudflare Workers guide + quickstart accountId safety callout

Merge pull request #269 from MystenLabs/uyle/walm-113-mcp-setup-skill…

ae70f43

…-route-desktop-users-to-remote-mcp-onboarding WALM-113: setup skill polish + grounded MCP tool results

Merge pull request #268 from MystenLabs/codex/walm-99-status-page

cf2e109

feat(status): monitoring status page (WALM-99)

Merge pull request #275 from MystenLabs/dev

6434dab

staging <- dev

hungtranphamminh self-requested a review June 12, 2026 06:29

hungtranphamminh approved these changes Jun 12, 2026

UyLeQuoc approved these changes Jun 12, 2026

ducnmm merged commit 6c7a008 into main Jun 12, 2026
27 checks passed

railway-app Bot temporarily deployed to Walrus Memory / staging1 June 15, 2026 06:48 Inactive

ducnmm merged 98 commits into main from staging

6 participants