harden redis cache against stalled connections #1929

Open
efstajas wants to merge 1 commit into main from harden-redis-cache-against-stalled-connections

@efstajas

Contributor

Follow-up to today's Filecoin app incident.

The Filecoin app instance's Redis connection went half-open: Railway drops idle internal TCP connections, and a low-traffic deployment like Filecoin lets the socket sit idle long enough to get dropped. node-redis kept sending commands into the dead socket with no reply, so every cache read — explore page and project pages alike — hung for tens of seconds to minutes, blocking SSR and tripping the health check into 500s. Mainnet was unaffected because its constant traffic keeps the socket warm. Evicting the cache key didn't help (the value was never the problem); a restart fixed it by re-establishing the connection.

Two changes so a bad connection can't take a deployment down again:

  • redis.ts: add pingInterval: 10000 (10 s) so the client PINGs on idle and detects/reconnects a dead socket instead of queueing commands into the void. This is the actual root-cause fix.
  • cached.ts: bound the cache read with a 1s timeout and fall through to the fetcher on timeout/error, so a degraded cache makes pages a bit slower rather than hanging them. Also stops silently swallowing write failures.
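
For illustration, the bounded read described for cached.ts could look roughly like the sketch below. It is not the code from this PR: `CacheClient`, `withTimeout`, and `cachedSketch` are hypothetical names, the client interface is a minimal stand-in rather than the node-redis API, and JSON serialization of cached values is an assumption.

```typescript
// Hypothetical minimal cache interface; a stand-in, not the node-redis client type.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<unknown>;
}

// Reject if `promise` has not settled within `ms` milliseconds.
// The timer is cleared as soon as the wrapped promise settles.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Read-through cache: a slow or failing read falls through to the fetcher
// instead of blocking the request.
async function cachedSketch<T>(
  client: CacheClient,
  key: string,
  fetcher: () => Promise<T>,
  readTimeoutMs = 1000, // the 1s bound mentioned above
): Promise<T> {
  try {
    const hit = await withTimeout(client.get(key), readTimeoutMs);
    if (hit !== null) return JSON.parse(hit) as T; // assumes JSON-encoded values
  } catch (error) {
    console.warn(`cache read failed for "${key}", using fetcher`, error);
  }

  const fresh = await fetcher();

  // Fire-and-forget write: never awaited, but failures are logged, not swallowed.
  client
    .set(key, JSON.stringify(fresh))
    .catch((error) => console.error(`cache write failed for "${key}"`, error));

  return fresh;
}
```

With a stalled client the call costs at most `readTimeoutMs` plus the fetcher's own latency, which is the "slower, not hung" behaviour the change is aiming for.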

Net effect: a stalled cache now means slightly slower uncached pages, not a downed app.

One thing left deliberately out of scope: a few endpoints (api/tlv, api/projects, fiat price, embed) still do direct redis.get reads outside cached(). pingInterval protects them from the indefinite-wedge failure mode too, but wrapping them in the same timeout helper would be a reasonable follow-up.
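
As a rough idea of what that follow-up might involve, a direct read could be wrapped so that a timeout or error is reported as a cache miss. This is only a sketch under assumptions: `ReadableCache` and `boundedGet` are hypothetical names, and whether treating a stalled read as a miss is right for each of those endpoints would need checking case by case.

```typescript
// Hypothetical read-only view of the cache client (not the node-redis type).
interface ReadableCache {
  get(key: string): Promise<string | null>;
}

// Resolve with the cached value, or with null if the read fails or takes
// longer than `ms` milliseconds. Never rejects.
function boundedGet(
  client: ReadableCache,
  key: string,
  ms = 1000,
): Promise<string | null> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(null), ms);
    client.get(key).then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(null); // treat a read error like a miss
      },
    );
  });
}
```

A call site would then replace `await redis.get(key)` with `await boundedGet(redis, key)` and keep its existing miss handling.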

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens the server-side Redis cache against half-open/stalled connections that can otherwise hang SSR and trigger health-check failures (as seen in the Filecoin deployment), ensuring cache degradation falls back to fresh fetches instead of wedging requests.

Changes:

  • Configure the Redis client to proactively PING on an idle interval (pingInterval: 10000) to detect and reconnect dead sockets.
  • Bound Redis cache reads with a short timeout (1s) and fall back to the fetcher on timeout/error.
  • Stop silently swallowing Redis write failures by logging async set() errors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/routes/api/redis.ts | Adds Redis pingInterval configuration to detect/recover from idle-dropped TCP connections. |
| src/lib/utils/cache/remote/cached.ts | Adds a read timeout + error fallback for cache reads and logs cache write failures to avoid SSR hangs. |
