main <- staging #276

Merged
ducnmm merged 98 commits into main from staging on Jun 12, 2026

Conversation

ducnmm (Collaborator) commented Jun 12, 2026

No description provided.

jasong-03 and others added 30 commits June 4, 2026 11:48
Add a docker-compose stack (OpenObserve + OpenTelemetry Collector) and
collector config that scrapes the relayer Prometheus /metrics, tails
structured JSON container logs, and accepts OTLP for future traces, exporting
all signals to OpenObserve. Includes a README with run instructions, an
API-health dashboard query set, alert definitions, rollout notes, and the
known gaps (no trace instrumentation yet, no job-queue metric).
…ilure

Enoki sponsored dry-run aborts in 0x2::balance::split with ENotEnough when a
pool wallet's SUI gas coins are fragmented or too small to cover the budget.
It was classified Transient, so it burned all 5 wallet retries rotating
through equally-starved pool wallets and raised a misleading retries-exhausted
alert.

Add a distinct GasPoolExhausted classification that aborts retries (like the
object-lock case) and fires a dedicated alert pointing ops at SUI gas coin
consolidation/top-up. Add a gas-pool maintenance runbook.

Move title to frontmatter, remove em dashes, replace 'i.e.' and prose 'via'
to satisfy the Sui documentation style guide audit.

Address review on #231:
- P1: a balance::split ENotEnough now stays Transient so Apalis rotates onto
  another pool wallet, and only escalates to GasPoolExhausted once every
  candidate wallet (min(pool_size, max_attempts)) has hit the same gas-budget
  failure. A single starved wallet no longer fails an upload a healthy wallet
  could serve.
- P2: the metadata-transfer recovery path applies the same escalation and
  dispatches the gas-pool ops alert (previously only the upload arm did).

Tests: single bad wallet stays retriable, full-pool exhaustion escalates and
aborts, threshold computation, non-gas-budget passthrough.
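The escalation rule described in this commit can be sketched as follows. This is a minimal illustration in Python, not the relayer's actual code: the class and function names, and the string match used to detect the failure, are hypothetical. Only the threshold `min(pool_size, max_attempts)` and the Transient / GasPoolExhausted outcomes come from the commit message.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class Classification(Enum):
    TRANSIENT = "transient"                    # retry, rotating onto another pool wallet
    GAS_POOL_EXHAUSTED = "gas_pool_exhausted"  # abort retries and alert ops


def escalation_threshold(pool_size: int, max_attempts: int) -> int:
    """Number of distinct wallets that must hit the gas-budget failure."""
    return min(pool_size, max_attempts)


def is_gas_budget_failure(error: str) -> bool:
    # Hypothetical matcher: the real classifier's matching logic is not shown in the PR.
    return "balance::split" in error and "ENotEnough" in error


@dataclass
class GasFailureTracker:
    pool_size: int
    max_attempts: int
    starved_wallets: Set[str] = field(default_factory=set)

    def record(self, wallet_id: str, error: str) -> Optional[Classification]:
        """Classify one failed attempt; None means 'not a gas-budget failure'."""
        if not is_gas_budget_failure(error):
            return None  # passthrough: leave it to the existing classification
        self.starved_wallets.add(wallet_id)
        threshold = escalation_threshold(self.pool_size, self.max_attempts)
        if len(self.starved_wallets) >= threshold:
            return Classification.GAS_POOL_EXHAUSTED
        return Classification.TRANSIENT
```

With a pool of three wallets and five attempts, the first two starved wallets stay retriable and the third distinct starved wallet escalates. Repeated failures on the same wallet do not count twice because the tracker stores wallet IDs in a set.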
Beyond the regex audit: remove quotation marks, replace the 'and/or' slash,
add body text between stacked headings, and write out word abbreviations
(tx, min, max) and the (s) plural per the Sui documentation style guide.

Satisfy the Sui style-guide audit: remove quotation marks from the frontmatter
title and add the required description and keywords fields.

Add an opt-in background task (ZO_REMOTE_WRITE_URL) that gathers the relayer's
Prometheus registry and pushes it to OpenObserve's /prometheus/api/v1/write
endpoint as snappy-compressed protobuf (counters, gauges, histograms expanded
to _bucket/_sum/_count, summaries). No-op when the env var is unset, so a single
OpenObserve service can ingest the existing memwal_* metrics without a collector
and production is unchanged until an environment opts in.
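To make the histogram expansion concrete, here is a rough Python sketch of how one histogram could be flattened into `_bucket`, `_sum`, and `_count` series before being encoded for remote write. The function name, the tuple layout, and the example metric name are assumptions for illustration, and the protobuf encoding, snappy compression, and HTTP push are deliberately left out.

```python
from typing import Dict, List, Sequence, Tuple

# One flattened sample: (series name, labels, value).
Series = Tuple[str, Dict[str, str], float]


def expand_histogram(
    name: str,
    labels: Dict[str, str],
    upper_bounds: Sequence[float],
    cumulative_counts: Sequence[int],
    sample_sum: float,
    sample_count: int,
) -> List[Series]:
    """Flatten one histogram into the series a remote-write payload would carry."""
    series: List[Series] = []
    for bound, count in zip(upper_bounds, cumulative_counts):
        # Each bucket becomes its own series, distinguished by the "le" label.
        series.append((f"{name}_bucket", {**labels, "le": str(bound)}, float(count)))
    # The implicit +Inf bucket always equals the total observation count.
    series.append((f"{name}_bucket", {**labels, "le": "+Inf"}, float(sample_count)))
    series.append((f"{name}_sum", dict(labels), float(sample_sum)))
    series.append((f"{name}_count", dict(labels), float(sample_count)))
    return series


# Hypothetical metric name and values, for illustration only.
example = expand_histogram(
    "memwal_example_duration_seconds", {"route": "/upload"},
    upper_bounds=[0.1, 1.0], cumulative_counts=[2, 5],
    sample_sum=3.4, sample_count=6,
)
```

A histogram with two explicit buckets yields five series: two bucket series, the `+Inf` bucket, the sum, and the count.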
…-retry-dev

fix(server): retry invalidated Enoki wallet txs
…bservability-poc

feat(observability): OpenObserve self-hosted PoC (WALM-81)
…l-classification

fix(relayer): classify Enoki balance::split ENotEnough as gas-pool failure (WALM-88)

Add full backend + frontend incident management for Statuspage parity.

Backend:
- incidents + incident_updates tables with indexes
- Admin API endpoints (POST/PATCH/DELETE /api/incidents, POST /api/incidents/:id/updates)
- API-key auth via STATUS_ADMIN_API_KEY header
- /api/status returns incidents: { active, recent }
- Atom/RSS feeds include real incident entries

Frontend:
- New /admin route with incident admin panel
- Create incident form with title, status, severity, component, message
- Existing incidents list with inline update, resolve, delete controls
- IncidentHistory renders real incidents; falls back to synthesized text when empty
- Updated footer navigation across all routes

Also fixes:
- 204 responses now send empty body (no JSON null)
- .env.example documents STATUS_ADMIN_API_KEY
- listIncidents now returns updates array (dead UI branch fixed)
- deleteIncident checks rowCount, returns false when nothing deleted
- Validate status/severity enums before DB insert → 400 (not 500)
- Validate startedAt/resolvedAt/createdAt dates → 400 (not 500)
- PATCH to resolved auto-inserts 'Incident resolved.' timeline entry
- createIncident uses atomic transaction for identifier + updates
- timingSafeEqual for API key comparison
- Remove unreachable method guard before static file serving
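Two of the fixes above, the constant-time API key comparison and the validate-before-insert checks, can be sketched together. The commit uses Node's `timingSafeEqual`; this Python illustration swaps in `hmac.compare_digest`, which serves the same purpose. The enum values and function names are assumptions, since the commit lists neither (only the `resolved` status is named).

```python
import hmac
from datetime import datetime
from typing import Optional, Tuple

# Assumed enum values, for illustration only.
ALLOWED_STATUSES = {"investigating", "identified", "monitoring", "resolved"}
ALLOWED_SEVERITIES = {"minor", "major", "critical"}


def is_authorized(presented_key: Optional[str], expected_key: Optional[str]) -> bool:
    """Compare API keys in constant time so timing does not leak a prefix match."""
    if not presented_key or not expected_key:
        return False
    return hmac.compare_digest(presented_key.encode(), expected_key.encode())


def validate_incident(payload: dict) -> Tuple[int, Optional[str]]:
    """Return (http_status, error); reject bad input with 400 before touching the DB."""
    if payload.get("status") not in ALLOWED_STATUSES:
        return 400, "invalid status"
    if payload.get("severity") not in ALLOWED_SEVERITIES:
        return 400, "invalid severity"
    for name in ("startedAt", "resolvedAt", "createdAt"):
        value = payload.get(name)
        if value is None:
            continue  # optional field
        if not isinstance(value, str):
            return 400, f"invalid {name}"
        try:
            # Accept a trailing "Z" on Python versions that do not parse it natively.
            datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return 400, f"invalid {name}"
    return 200, None
```

Validating up front means a malformed enum or date surfaces as a client error rather than as a database exception that would otherwise be reported as a 500.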
DalenMax and others added 25 commits June 11, 2026 15:01
- Client: AdminPanel calls onMutate (main page refresh) after every mutation
  so the snapshot is fresh when user navigates back to /
- Server: addIncidentUpdate now sets resolved_at when status transitions
  to resolved, matching updateIncident behavior
- Server: readActiveAndRecentIncidents now fetches and attaches updates
  so /api/status includes complete incident history
- Client: IncidentDay changed from single message to messages[] array
- Client: buildIncidentDays renders all updates with timestamps
- Client: IncidentHistory renders each update with left-border indentation
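The change from a single message to a `messages[]` array per day could look roughly like the grouping below. This is a Python sketch of the idea only; the real client is not shown in this PR page, and the field names (`created_at`, `message`, `time`, `text`) and the newest-first ordering are assumptions.

```python
from datetime import datetime
from typing import Dict, List


def build_incident_days(updates: List[dict]) -> List[dict]:
    """Group incident updates by calendar day, newest first, keeping every update."""
    days: Dict[str, List[dict]] = {}
    newest_first = sorted(
        updates,
        key=lambda u: datetime.fromisoformat(u["created_at"]),
        reverse=True,
    )
    for update in newest_first:
        stamp = datetime.fromisoformat(update["created_at"])
        day = stamp.date().isoformat()
        # Each day carries a list of timestamped messages instead of one message.
        days.setdefault(day, []).append(
            {"time": stamp.strftime("%H:%M"), "text": update["message"]}
        )
    # Dicts preserve insertion order, so days come out newest first as well.
    return [{"date": day, "messages": messages} for day, messages in days.items()]
```

Keeping every update, rather than only the latest one, is what lets the history view render a full timeline for each incident day.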
…hdog

No stable 0.0.5/0.0.6 has shipped to main/npm yet, so merge the two
unreleased changelog sections into a single 0.0.5 and consume the
sse-heartbeat-watchdog changeset into it. Bumps package.json 0.0.6 -> 0.0.5.
fix(mcp): SSE heartbeat watchdog — recover from silently dead relayer sessions
Add a safety callout to the SDK and getting-started quickstarts: use your
own account, load credentials from env, and note that recall is scoped per
account + namespace so a copied ID lands memories in a shared space.

Closes #255

Cover the nodejs_compat flag, expected bundle size, which entry point
bundles cleanest on edge, and the dynamic-import / graceful-degradation
pattern for crash isolation. Register the page in the SDK nav.

Closes #256

The default MemWal client still requires @mysten/seal + @mysten/sui (it
builds a SEAL session key client-side); it is lighter than /manual only
because /manual additionally pulls @mysten/walrus + client-side upload.
Verified against packages/sdk peerDependencies and memwal.ts.

Verified by bundling the default MemWal client with wrangler 4.96
(deploy --dry-run):
- Without nodejs_compat the build fails: 'Could not resolve "crypto"'
  (the SDK calls await import("crypto")) — flag is genuinely required.
- With nodejs_compat: ~1.2 MB raw / ~225 KB gzip, not ~3 MB (that
  figure likely counted the sourcemap or the /manual entry).

Add a memory_limiter processor (first in every pipeline) so the collector
sheds load instead of OOMing when OpenObserve is slow or unreachable, and
cap batches with send_batch_max_size to avoid oversized ingest payloads.

Expose a health_check liveness endpoint (:13133) for orchestrator probes,
and add mem_limit/cpus ceilings to both services so a runaway ingest can't
starve the host.
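The hardening rules in this commit are easy to regress when a pipeline is added later, so one option is a small lint over the parsed collector config. The sketch below is a hypothetical helper, not part of the PR. It assumes the YAML has already been loaded into a dict and uses the collector's standard key layout (`service.pipelines`, `processors.batch`, `service.extensions`); the limit values in the example are placeholders.

```python
from typing import List


def check_collector_config(config: dict) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []
    service = config.get("service", {})
    for name, pipeline in service.get("pipelines", {}).items():
        processors = pipeline.get("processors") or []
        # Processor IDs may carry a "/suffix", so compare the type part only.
        if not processors or processors[0].split("/")[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter must be the first processor")
    batch = config.get("processors", {}).get("batch") or {}
    if "send_batch_max_size" not in batch:
        problems.append("batch: send_batch_max_size is not set")
    extensions = [e.split("/")[0] for e in service.get("extensions", [])]
    if "health_check" not in extensions:
        problems.append("service: health_check extension is not enabled")
    return problems


# Placeholder config shaped like the one the commit describes.
example_config = {
    "extensions": {"health_check": {"endpoint": "0.0.0.0:13133"}},
    "processors": {
        "memory_limiter": {"check_interval": "1s", "limit_mib": 400},
        "batch": {"send_batch_max_size": 2048},
    },
    "service": {
        "extensions": ["health_check"],
        "pipelines": {
            "metrics": {"processors": ["memory_limiter", "batch"]},
            "logs": {"processors": ["memory_limiter", "batch"]},
        },
    },
}
```

Running `check_collector_config(example_config)` returns an empty list. Moving `batch` ahead of `memory_limiter` in any pipeline would produce one problem entry for that pipeline.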
Add a Dockerfile (bakes the config in, since Railway can't bind-mount it)
and railway.json so the OTel collector can run as its own Railway service
and scrape the relayer /metrics over the private network — closing the
metrics gap on deployments where only direct OTLP logs/traces reach
OpenObserve.

Parameterize the exporter's OpenObserve host (OPENOBSERVE_HOST) so the same
config targets the compose service locally and openobserve.railway.internal
on Railway. Document the Railway deploy steps and required variables.

0.115.0 does not exist on Docker Hub; the Railway build failed to resolve
it. 0.154.0 is a current stable contrib release.

Address the 15 violations flagged by the style-guide audit on the three
changed files: remove em dashes from added prose and code comments,
unquote the Cloudflare Workers frontmatter title/description and add
keywords, capitalize Mainnet/Testnet, use sentence-case headings, and add
an intro sentence before the Next steps list.
…private network

The relayer bound 0.0.0.0 (IPv4 only), so service-to-service traffic over
Railway's IPv6-only private network (e.g. the observability collector
scraping relayer.railway.internal:PORT/metrics) could not connect. Binding
the IPv6 unspecified address is dual-stack and still serves IPv4, so public
access is unchanged.
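The dual-stack bind can be demonstrated with Python's standard library. This sketch only illustrates the socket behavior and is not the relayer's code. The fallback to an IPv4-only bind when dual-stack is unavailable is an addition for portability, not something the commit describes.

```python
import socket


def open_listener(port: int) -> socket.socket:
    """Listen on the IPv6 unspecified address with IPv4 still accepted."""
    if socket.has_dualstack_ipv6():
        try:
            # Binding "::" with IPV6_V6ONLY disabled serves both IPv6 and IPv4 clients.
            return socket.create_server(
                ("::", port), family=socket.AF_INET6, dualstack_ipv6=True
            )
        except OSError:
            pass  # IPv6 is compiled in but unusable on this host
    # Assumed fallback: IPv4 only, which is the behavior the commit moved away from.
    return socket.create_server(("0.0.0.0", port))


if __name__ == "__main__":
    server = open_listener(0)  # port 0 lets the OS pick a free port
    print(server.family, server.getsockname()[:2])
    server.close()
```

On a dual-stack host, a client connecting to `127.0.0.1` reaches the same listener as one connecting to `::1`, which is why public IPv4 access is unaffected by the change.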
The relayer's private domain is a generated name (lucky-strength.railway.internal),
not relayer.railway.internal — scraping the display name silently failed.
Document the ${{relayer.RAILWAY_PRIVATE_DOMAIN}} reference and the IPv6 [::]
bind requirement so the same trap isn't hit again.

- Server: probe both STATUS_RELAYER_PRODUCTION_URL and STATUS_RELAYER_STAGING_URL
- Server: store checks under separate targets (relayer-production / relayer-staging)
- Server: /api/status returns components[] and histories{} keyed by component id
- Server: overall service status aggregates across components
- Client: StatusSnapshot uses components[] and histories{}
- Client: buildRows renders one row per monitored component
- Client: uptime calendar, incident history, and admin component select use production history/component list
- Docs/Dockerfile/.env.example updated for new env vars
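One plausible reading of "overall service status aggregates across components" is that the worst component status wins. The Python sketch below follows that reading. The status levels, their ordering, and the snapshot field names other than `components` and `histories` are assumptions; the component IDs are the two targets named in the commit.

```python
from typing import Dict, List

# Assumed status levels, ordered from best to worst.
STATUS_ORDER = ["operational", "degraded", "down"]


def build_snapshot(histories: Dict[str, List[str]]) -> dict:
    """Build a status payload from per-component check history (oldest to newest)."""
    components = [
        {"id": component_id, "status": history[-1]}
        for component_id, history in histories.items()
        if history  # skip components that have no checks yet
    ]
    # Worst-of aggregation: a single down component marks the whole service down.
    overall = max(
        (c["status"] for c in components),
        key=STATUS_ORDER.index,
        default="operational",
    )
    return {"status": overall, "components": components, "histories": histories}


snapshot = build_snapshot({
    "relayer-production": ["operational", "operational"],
    "relayer-staging": ["operational", "down"],
})
```

Here `snapshot["status"]` is `"down"` because staging's latest check failed, while each row in `components` still reports its own status so the client can render one row per component.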
chore(observability): harden otel collector config and compose limits
…t-safety

docs(sdk): Cloudflare Workers guide + quickstart accountId safety callout
…-route-desktop-users-to-remote-mcp-onboarding

WALM-113: setup skill polish + grounded MCP tool results
feat(status): monitoring status page (WALM-99)
hungtranphamminh self-requested a review June 12, 2026 06:29
ducnmm merged commit 6c7a008 into main Jun 12, 2026
27 checks passed
railway-app (bot) temporarily deployed to Walrus Memory / staging1 June 15, 2026 06:48 (inactive)

6 participants