harden redis cache against stalled connections #1929
Open
efstajas wants to merge 1 commit into
Pull request overview
This PR hardens the server-side Redis cache against half-open/stalled connections that can otherwise hang SSR and trigger health-check failures (as seen in the Filecoin deployment), ensuring cache degradation falls back to fresh fetches instead of wedging requests.
Changes:
- Configure the Redis client to proactively `PING` on an idle interval (`pingInterval: 10000`) to detect and reconnect dead sockets.
- Bound Redis cache reads with a short timeout (1s) and fall back to the fetcher on timeout/error.
- Stop silently swallowing Redis write failures by logging async `set()` errors.
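
As a rough illustration of the first change, the client options might look like the sketch below. It assumes the client is built with node-redis's `createClient`, which accepts a `pingInterval` option in milliseconds; the URL and the `redisOptions` name are placeholders and are not taken from the PR.

```typescript
// Options intended for node-redis's createClient().
// The URL is a placeholder; the real deployment's value is not shown in the PR.
const redisOptions = {
  url: 'redis://localhost:6379',
  // Send a PING every 10 seconds. Per the PR description, this lets the
  // client notice a dropped socket and reconnect instead of queueing commands.
  pingInterval: 10000,
};

// In the application this object would be handed to the client factory:
//   import { createClient } from 'redis';
//   const client = createClient(redisOptions);
//   client.on('error', (err) => console.error('redis error', err));
//   await client.connect();
```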
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/routes/api/redis.ts | Adds Redis pingInterval configuration to detect/recover from idle-dropped TCP connections. |
| src/lib/utils/cache/remote/cached.ts | Adds a read timeout + error fallback for cache reads and logs cache write failures to avoid SSR hangs. |
Follow-up to today's Filecoin app incident.
The Filecoin app instance's Redis connection went half-open: Railway drops idle internal TCP connections, and a low-traffic deployment like Filecoin lets the socket sit idle long enough to get dropped. node-redis kept sending commands into the dead socket with no reply, so every cache read — explore page and project pages alike — hung for tens of seconds to minutes, blocking SSR and tripping the health check into 500s. Mainnet was unaffected because its constant traffic keeps the socket warm. Evicting the cache key didn't help (the value was never the problem); a restart fixed it by re-establishing the connection.
Two changes so a bad connection can't take a deployment down again:
- `redis.ts`: add `pingInterval: 10000` so the client PINGs on idle and detects/reconnects a dead socket instead of queueing commands into the void. This is the actual root-cause fix.
- `cached.ts`: bound the cache read with a 1s timeout and fall through to the fetcher on timeout/error, so a degraded cache makes pages a bit slower rather than hanging them. Also stops silently swallowing write failures.

Net effect: a stalled cache now means slightly slower uncached pages, not a downed app.
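
The `cached.ts` behaviour described above can be sketched roughly as follows. This is not the code from the PR: `CacheLike`, `withTimeout`, and `cachedSketch` are hypothetical names, and values are assumed to be stored as JSON strings.

```typescript
// Minimal shape of the cache client this sketch needs (hypothetical).
interface CacheLike {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<unknown>;
}

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Read-through cache: a slow or failing read degrades to a fresh fetch.
async function cachedSketch<T>(
  cache: CacheLike,
  key: string,
  fetcher: () => Promise<T>,
  readTimeoutMs = 1000,
): Promise<T> {
  try {
    const hit = await withTimeout(cache.get(key), readTimeoutMs);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch (error) {
    console.warn(`cache read failed for ${key}, falling back to fetcher`, error);
  }

  const fresh = await fetcher();

  // Write without awaiting so a stalled connection cannot delay the response,
  // but log failures instead of swallowing them.
  cache.set(key, JSON.stringify(fresh)).catch((error) => {
    console.error(`cache write failed for ${key}`, error);
  });

  return fresh;
}
```

With a client whose `get` never settles, `cachedSketch` returns the fetcher's result after roughly `readTimeoutMs` instead of hanging the request.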
One thing left deliberately out of scope: a few endpoints (`api/tlv`, `api/projects`, fiat price, embed) still do direct `redis.get` reads outside `cached()`. `pingInterval` protects them from the indefinite-wedge failure mode too, but wrapping them in the same timeout helper would be a reasonable follow-up.
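
One possible shape for that follow-up is a small wrapper that treats a slow or failed direct read as a cache miss. The sketch below is an assumption about how it could look, not part of this PR; `Getter` and `getOrNull` are hypothetical names.

```typescript
// Hypothetical minimal client shape; only `get` is needed here.
interface Getter {
  get(key: string): Promise<string | null>;
}

// Resolve to the cached string, or to null when the read errors or takes
// longer than `timeoutMs`, so callers can treat either case as a miss.
async function getOrNull(
  client: Getter,
  key: string,
  timeoutMs = 1000,
): Promise<string | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    return await Promise.race([client.get(key), timeout]);
  } catch {
    return null; // read error: treat as a miss
  } finally {
    clearTimeout(timer);
  }
}
```

An endpoint would then call `getOrNull(client, key)` in place of the direct read and recompute the value whenever the result is `null`.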