Skip to content

perf(redis): resilience, Redis-backed rate limiter, and health metrics (#278)#296

Open
Abhushan187 wants to merge 1 commit into
geturbackend:mainfrom
Abhushan187:perf/redis-resilience-rate-limiter
Open

perf(redis): resilience, Redis-backed rate limiter, and health metrics (#278)#296
Abhushan187 wants to merge 1 commit into
geturbackend:mainfrom
Abhushan187:perf/redis-resilience-rate-limiter

Conversation

@Abhushan187

@Abhushan187 Abhushan187 commented Jun 9, 2026

Copy link
Copy Markdown

Closes #278

Changes by file

File Change
packages/common/src/config/redis.js Exponential+jitter retry strategy, connection timeouts, keepalive, offline queue
apps/public-api/src/middlewares/api_usage.js Replace in-memory projectRateLimiter with rate-limit-redis store; keeps global IP limiter
apps/public-api/src/controllers/health.controller.js Add getRedisHealth with INFO stats/memory/clients metrics
apps/public-api/src/routes/health.js Mount GET /api/health/redis
apps/public-api/package.json Add rate-limit-redis dependency, autocannon devDependency
docker-compose.yml Redis maxclients 10000, tcp-keepalive 60, sysctls tuning
apps/public-api/tests/load/rate-limit.load.test.js Autocannon load test verifying 429 behavior under concurrency
apps/public-api/tests/resilience/redis-failure.resilience.test.js Jest tests for graceful degradation on Redis disconnect

Technical Notes

  • rate-limit-redis uses redis.call(...) for pipelining compatibility with ioredis 5.x.
  • The existing global IP-based limiter is untouched; only the project-scoped rate limiter is migrated to Redis.
  • Retry strategy caps at 30s base delay + 100ms jitter to avoid thundering herd on Redis restart.
  • maxRetriesPerRequest: 3 prevents hung requests during transient outages.
  • Health endpoint parses INFO output with regex to avoid extra dependencies.

Summary by CodeRabbit

  • New Features

    • Redis health monitoring endpoint for system status checks
    • Project-specific rate limiting with configurable per-project limits
  • Improvements

    • Enhanced Redis connection resilience with exponential backoff retry strategy and connection pooling
    • Updated infrastructure configuration for improved stability and client capacity

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR implements Redis connection pooling, health monitoring, and project-scoped rate limiting for the public API. It upgrades Redis client resilience with exponential backoff retry logic, adds a dedicated /redis health endpoint, introduces Redis-backed rate limiting per project, and includes comprehensive load and failure resilience tests.

Changes

Redis Connection Resilience and Rate Limiting

Layer / File(s) Summary
Redis client resilience configuration and Docker setup
packages/common/src/config/redis.js, docker-compose.yml
Redis client now retries with exponential backoff capped at 30s plus jitter, enables offline queuing and ready checks, sets connection/command/disconnect timeouts and keep-alive. Redis service pinned to 7-alpine with explicit redis-server options for --maxclients 10000 and --tcp-keepalive 60; kernel sysctls set to net.core.somaxconn=65535.
Redis health check endpoint
apps/public-api/src/controllers/health.controller.js, apps/public-api/src/routes/health.js
New getRedisHealth controller checks Redis readiness and retrieves stats/memory/clients INFO fields, returning HTTP 200 with structured health object or HTTP 503 on failure. Endpoint registered at GET /redis in health routes.
Project-specific Redis rate limiter
apps/public-api/package.json, apps/public-api/src/middlewares/api_usage.js
rate-limit-redis dependency added; projectRateLimiter middleware uses Redis store with rl:project: prefix, 15-minute window, dynamic per-project max from req.project.rateLimit (default 500), project-ID key generation, and custom JSON responses on limit exceed.
Load and resilience testing suite
apps/public-api/package.json, apps/public-api/tests/load/rate-limit.load.test.js, apps/public-api/tests/resilience/redis-failure.resilience.test.js
autocannon dev dependency added. Load test clears/verifies Redis rate-limit keys and benchmarks 100 concurrent requests for 30s, reporting 2xx/429 counts and latency metrics. Resilience test simulates Redis disconnection, verifies health endpoint returns 503, and confirms API degrades gracefully without crashing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

nsoc26, level-3, quality:exceptional

Suggested reviewers

  • yash-pouranik

Poem

🐰 A pool of connections, now Redis won't strain,
With health checks and limits, no more connection pain!
Exponential backoff keeps retries so neat,
Rate limiting by project makes our API complete!
Tests prove resilience when Redis goes dark—
Our burrow's more stable, right down to the bark! 🌲

🚥 Pre-merge checks | ✅ 3 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Linked Issues check ❓ Inconclusive The PR partially addresses issue #278's objectives. It implements a Redis-backed rate limiter, health endpoint, resilience tests, and Docker tuning, but does not implement connection pooling with configurable pool size (max/min connections) as specified. Clarify whether the shared Redis client approach meets pooling requirements, or if an explicit pool implementation with min/max connections should be added per issue #278.
Description check ❓ Inconclusive The PR summary documents all changes clearly but lacks detailed inline explanations of the rate-limit-redis implementation and how it differs from expected pooling behavior. Add explanation of whether the shared Redis client satisfies the pooling requirement or if further pooling implementation is needed.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding Redis resilience, a Redis-backed rate limiter, and health metrics.
Out of Scope Changes check ✅ Passed All changes are within scope of implementing Redis resilience and rate limiting. Package.json adds rate-limit-redis and autocannon (necessary dependencies), and docker-compose tuning aligns with issue requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/public-api/src/controllers/health.controller.js`:
- Around line 73-78: The catch block in the health controller currently returns
raw err.message to clients; update the handler in the health check function (the
try/catch that references res and err) to wrap or normalize the error using the
project's AppError (or an equivalent sanitizer) and return a safe public message
(e.g., "Service unavailable") instead of err.message; ensure the response still
includes redis.connected: false and an optional non-sensitive error code or
AppError.safeMessage but never expose MongoDB or raw backend messages directly.
- Around line 35-39: The getRedisHealth controller currently returns a raw
object ({ status, timestamp, redis }) — update all return responses in the
getRedisHealth function (and the other similar response blocks at the other two
places referenced) to conform to the required response envelope: return
res.status(...).json({ success: boolean, data: { timestamp: string, redis: {
connected: boolean, error?: string } }, message: string }); ensure success is
false for 503/500 errors and true for 200, place timestamp and redis under data,
and provide a concise message string (e.g., "Redis not ready" or "Redis
healthy") so clients receive { success, data, message }.

In `@apps/public-api/src/middlewares/api_usage.js`:
- Around line 22-48: The projectRateLimiter constant is never exported so
downstream code that imports { limiter, logger } can't use it; update the
module.exports object to include projectRateLimiter (alongside limiter and
logger) so the project-scoped Redis rate limiter is available to middleware
wiring; locate projectRateLimiter in this file and add its identifier to the
exported properties (ensure export naming matches how other modules import it).

In `@apps/public-api/tests/resilience/redis-failure.resilience.test.js`:
- Around line 10-20: The test titled "health endpoint returns 503 when Redis is
unreachable" disconnects Redis with redis.disconnect() but only reconnects on
the happy path; update the test so the disconnect/reconnect is enclosed in a
try/finally (or use finally semantics) around the request and assertions,
calling redis.connect() in the finally block to guarantee reconnection even if
assertions fail; locate the test block (the async test function that calls
redis.disconnect(), request(app).get('/api/health/redis'), and redis.connect())
and modify it to always call redis.connect() in finally.
- Around line 6-8: The test teardown in afterEach currently calls
redis.flushdb(), which is too destructive for shared Redis; change it to only
remove keys created by this suite by tracking and deleting keys by prefix or
pattern instead. Modify the afterEach handler that references redis.flushdb() to
use a safe deletion strategy (e.g., SCAN/KEYS with the test-specific prefix or
maintain a list/set of created keys during tests and call DEL on those entries)
and ensure any helper that creates keys (tests that call redis.set/redis.hset)
uses the agreed prefix so keys can be reliably targeted for cleanup.
- Around line 29-31: The test currently accepts any of [200, 401, 403, 500]
which masks regressions; update the assertion on the response (the expect([...])
that checks res.status) to assert the narrow set required by the middleware
contract—preferably a single expected status (e.g., 401 or 500) or at most two
statuses that are explicitly allowed during Redis degradation—so change the
expect call on res.status in the redis-failure.resilience.test to that narrow
set and add a brief comment linking it to the expected degraded behavior from
the auth/middleware code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7d9aae05-9154-48eb-be94-62a99a028370

📥 Commits

Reviewing files that changed from the base of the PR and between b6e59ab and 21e346c.

📒 Files selected for processing (8)
  • apps/public-api/package.json
  • apps/public-api/src/controllers/health.controller.js
  • apps/public-api/src/middlewares/api_usage.js
  • apps/public-api/src/routes/health.js
  • apps/public-api/tests/load/rate-limit.load.test.js
  • apps/public-api/tests/resilience/redis-failure.resilience.test.js
  • docker-compose.yml
  • packages/common/src/config/redis.js

Comment on lines +35 to +39
return res.status(503).json({
status: 'error',
timestamp: new Date().toISOString(),
redis: { connected: false, error: 'Redis client not ready' },
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the required controller response envelope for getRedisHealth.

This endpoint currently returns { status, timestamp, redis }, which breaks the controller response contract and can break clients expecting { success, data, message }.

As per coding guidelines, "All API endpoints must return response format: { success: bool, data: {}, message: "" }."

Also applies to: 61-63, 74-78

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/public-api/src/controllers/health.controller.js` around lines 35 - 39,
The getRedisHealth controller currently returns a raw object ({ status,
timestamp, redis }) — update all return responses in the getRedisHealth function
(and the other similar response blocks at the other two places referenced) to
conform to the required response envelope: return res.status(...).json({
success: boolean, data: { timestamp: string, redis: { connected: boolean,
error?: string } }, message: string }); ensure success is false for 503/500
errors and true for 200, place timestamp and redis under data, and provide a
concise message string (e.g., "Redis not ready" or "Redis healthy") so clients
receive { success, data, message }.

Source: Coding guidelines

Comment on lines +73 to +78
} catch (err) {
return res.status(503).json({
status: 'error',
timestamp: new Date().toISOString(),
redis: { connected: false, error: err.message },
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not return raw backend error messages from the health controller.

Line 77 exposes err.message to clients. Please route this through AppError (or equivalent sanitized handling) and return a safe public message.

As per coding guidelines, "Use AppError class for errors. Never use raw throw statements or expose MongoDB errors to clients."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/public-api/src/controllers/health.controller.js` around lines 73 - 78,
The catch block in the health controller currently returns raw err.message to
clients; update the handler in the health check function (the try/catch that
references res and err) to wrap or normalize the error using the project's
AppError (or an equivalent sanitizer) and return a safe public message (e.g.,
"Service unavailable") instead of err.message; ensure the response still
includes redis.connected: false and an optional non-sensitive error code or
AppError.safeMessage but never expose MongoDB or raw backend messages directly.

Source: Coding guidelines

Comment on lines +22 to +48
const projectRateLimiter = rateLimit({
store: new RedisStore({
sendCommand: (...args) => redis.call(...args),
prefix: 'rl:project:',
}),
windowMs: 15 * 60 * 1000,
max: async (req, res) => {
if (req.project && req.project.rateLimit) {
return req.project.rateLimit;
}
return 500;
},
keyGenerator: (req, res) => {
if (!req.project || !req.project._id) {
return 'unauthorized';
}
return req.project._id.toString();
},
handler: (req, res, next, options) => {
res.status(options.statusCode).json({
error: "Too Many Requests",
message: "Project rate limit exceeded. Please try again later."
});
},
standardHeaders: true,
legacyHeaders: false,
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

projectRateLimiter is defined but not exported, so it is never applied.

The new limiter is created but module.exports omits it, and downstream middleware wiring currently imports only { limiter, logger }. This leaves project-scoped Redis rate limiting inactive.

Suggested minimal fix
-module.exports = { limiter, logger };
+module.exports = { limiter, logger, projectRateLimiter };

Also applies to: 150-150

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/public-api/src/middlewares/api_usage.js` around lines 22 - 48, The
projectRateLimiter constant is never exported so downstream code that imports {
limiter, logger } can't use it; update the module.exports object to include
projectRateLimiter (alongside limiter and logger) so the project-scoped Redis
rate limiter is available to middleware wiring; locate projectRateLimiter in
this file and add its identifier to the exported properties (ensure export
naming matches how other modules import it).

Comment on lines +6 to +8
afterEach(async () => {
if (redis.status === 'ready') await redis.flushdb();
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

flushdb() is too destructive for shared test Redis instances.

At Line 7, redis.flushdb() clears the full DB. This can erase unrelated fixtures/tests when Redis DB isolation is not guaranteed. Prefer deleting only keys created by this suite/prefix.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/public-api/tests/resilience/redis-failure.resilience.test.js` around
lines 6 - 8, The test teardown in afterEach currently calls redis.flushdb(),
which is too destructive for shared Redis; change it to only remove keys created
by this suite by tracking and deleting keys by prefix or pattern instead. Modify
the afterEach handler that references redis.flushdb() to use a safe deletion
strategy (e.g., SCAN/KEYS with the test-specific prefix or maintain a list/set
of created keys during tests and call DEL on those entries) and ensure any
helper that creates keys (tests that call redis.set/redis.hset) uses the agreed
prefix so keys can be reliably targeted for cleanup.

Comment on lines +10 to +20
test('health endpoint returns 503 when Redis is unreachable', async () => {
// Simulate Redis failure by disconnecting temporarily
await redis.disconnect();

const res = await request(app).get('/api/health/redis');
expect(res.status).toBe(503);
expect(res.body.redis.connected).toBe(false);

// Reconnect for subsequent tests
await redis.connect();
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always reconnect Redis in a finally block.

At Line 19 and Line 32, reconnect only happens on the happy path. If an assertion fails earlier, the suite can leave Redis disconnected and create cascading failures.

Suggested fix
     test('health endpoint returns 503 when Redis is unreachable', async () => {
-        // Simulate Redis failure by disconnecting temporarily
-        await redis.disconnect();
-
-        const res = await request(app).get('/api/health/redis');
-        expect(res.status).toBe(503);
-        expect(res.body.redis.connected).toBe(false);
-
-        // Reconnect for subsequent tests
-        await redis.connect();
+        await redis.disconnect();
+        try {
+            const res = await request(app).get('/api/health/redis');
+            expect(res.status).toBe(503);
+            expect(res.body.redis.connected).toBe(false);
+        } finally {
+            await redis.connect();
+        }
     });

     test('public API endpoints degrade gracefully when Redis is down', async () => {
-        await redis.disconnect();
-
-        const res = await request(app)
-            .get('/api/data')
-            .set('x-api-key', 'test-key');
-
-        // Should not crash; may return 401 or 500 depending on auth flow
-        expect([200, 401, 403, 500]).toContain(res.status);
-
-        await redis.connect();
+        await redis.disconnect();
+        try {
+            const res = await request(app)
+                .get('/api/data')
+                .set('x-api-key', 'test-key');
+            expect([200, 401, 403, 500]).toContain(res.status);
+        } finally {
+            await redis.connect();
+        }
     });

Also applies to: 22-33

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/public-api/tests/resilience/redis-failure.resilience.test.js` around
lines 10 - 20, The test titled "health endpoint returns 503 when Redis is
unreachable" disconnects Redis with redis.disconnect() but only reconnects on
the happy path; update the test so the disconnect/reconnect is enclosed in a
try/finally (or use finally semantics) around the request and assertions,
calling redis.connect() in the finally block to guarantee reconnection even if
assertions fail; locate the test block (the async test function that calls
redis.disconnect(), request(app).get('/api/health/redis'), and redis.connect())
and modify it to always call redis.connect() in finally.

Comment on lines +29 to +31
// Should not crash; may return 401 or 500 depending on auth flow
expect([200, 401, 403, 500]).toContain(res.status);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Status assertion is too broad to catch regressions.

At Line 30, allowing 200/401/403/500 makes this test pass for many unrelated failures. Assert the specific expected degraded behavior (or a narrow set tied to current middleware contract) so this test actually protects behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/public-api/tests/resilience/redis-failure.resilience.test.js` around
lines 29 - 31, The test currently accepts any of [200, 401, 403, 500] which
masks regressions; update the assertion on the response (the expect([...]) that
checks res.status) to assert the narrow set required by the middleware
contract—preferably a single expected status (e.g., 401 or 500) or at most two
statuses that are explicitly allowed during Redis degradation—so change the
expect call on res.status in the redis-failure.resilience.test to that narrow
set and add a brief comment linking it to the expected degraded behavior from
the auth/middleware code.

@yash-pouranik

Copy link
Copy Markdown
Collaborator

@Abhushan187
please fix the issues pointed by coderabbit.

@yash-pouranik

Copy link
Copy Markdown
Collaborator

and 4 CI checks are also failing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[PERF] Add Redis connection pooling for high-throughput public API endpoints to reduce connection overhead

2 participants