Skip to content

Conversation

@miparnisari
Copy link
Contributor

@miparnisari miparnisari commented Nov 11, 2025

Pre-reqs

Description

This PR introduces a new memory protection middleware to help prevent out-of-memory conditions in SpiceDB by implementing admission control based on current memory usage.

This is not a perfect solution (doesn't prevent non-traffic-related sources of OOM) and is meant to support other future improvements to resource sharing in a single SpiceDB node.

The middleware is installed both in the main API and in Dispatch APIs, and it uses this lib: https://github.com/odigos-io/go-rtml

⚠️ It does NOT reject health check requests based on memory usage,
⚠️ It needs a special build flag.

References

Alternative implementation: #2646.

Upon testing both implementations, it was found that:

  1. this PR rejects less requests.
  2. this PR didn't OOM the docker container when mem_limit was hit.
image

@miparnisari miparnisari changed the title feat: add Memory Protection Middleware feat: add Memory Protection Middleware (alternative implementation) Nov 11, 2025
@github-actions github-actions bot added area/cli Affects the command line area/dependencies Affects dependencies area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) area/dispatch Affects dispatching of requests labels Nov 11, 2025
@codecov
Copy link

codecov bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 93.06931% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.01%. Comparing base (ec7032a) to head (967b1b9).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
pkg/cmd/server/server.go 88.24% 2 Missing and 2 partials ⚠️
pkg/cmd/testserver/testserver.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2691      +/-   ##
==========================================
- Coverage   77.01%   77.01%   -0.00%     
==========================================
  Files         462      464       +2     
  Lines       49134    49184      +50     
==========================================
+ Hits        37837    37872      +35     
- Misses       8509     8520      +11     
- Partials     2788     2792       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@miparnisari miparnisari changed the title feat: add Memory Protection Middleware (alternative implementation) feat: add Memory Protection Middleware (preferred implementation) Nov 11, 2025
if c.MemoryProtectionEnabled {
return memoryprotection.NewRealTimeMemoryUsageProvider()
}
return &memoryprotection.HarcodedMemoryLimitProvider{AcceptAllRequests: true}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would've expected that if the feature was disabled we would just not install the middleware at all, instead of installing it with a no-op limiter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering now why do we have the MemoryProtectionEnabled flag at all 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of having it be always-on, and then we just don't install the middleware in tests (if that's possible).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at our existing middlewares to see if we have an example of one that is conditionally installed, and it doesn't exist... if a middleware isn't enabled, it still runs, but it just no-ops. Example:

func (r *reporter) ServerReporter(ctx context.Context, callMeta interceptors.CallMeta) (interceptors.Reporter, context.Context) {
if !r.isEnabled {
return &interceptors.NoopReporter{}, ctx
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Captured in #2723

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think I'm going to split the difference and not expose it as a user-facing flag, but leave the ability to configure it. My thought is that if (e.g.) our steelthread tests start hitting OOM conditions that trigger this middleware, it'll be good to disable the middleware so that we get failures instead of ResourceExhausteds

type GoRealTimeMemoryLimiter struct{}

func (g GoRealTimeMemoryLimiter) IsMemLimitReached() bool {
return rtml.IsMemLimitReached()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we had a way to set a threshold here. Since this will get called both for dispatch and primary api calls, it will start rejecting them both equally, when really we want to reject primary api before we start rejecting dispatch.

I don't think this is worth blocking on, but I do wonder if we should contribute a threshold % to the upstream library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Opened odigos-io/go-rtml#3

"-failfast",
"-count=1",
"-race",
"-timeout=20m",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why increasing from 10m to 20m?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, that was a leftover from a different PR. The answer is https://github.com/authzed/spicedb/pull/2674/files#r2505759338

Tl;dr instead of putting timeouts in the test code itself, i think it's better to put it in go test -timeout

WithInterceptor(grpcauth.UnaryServerInterceptor(opts.AuthFunc)).
EnsureAlreadyExecuted(DefaultMiddlewareGRPCProm). // so that prom middleware reports auth failures
EnsureAlreadyExecuted(DefaultMiddlewareGRPCProm). // so that prom middleware reports auth failures
EnsureAlreadyExecuted(DefaultMiddlewareMemoryProtection). // so that memory protection happens before auth
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it have to run before auth?

The rationale for EnsureAlreadyExecuted is to declare a dependency between middlewares explicitly: "I need this to run before me because I depend on its side effects". So why does the auth middleware depend on memory protection?

If not, you can remove it.

If anything, I'd argue that memory protection depends on all the previous middlewares, because we want observability to take place before we kick memory protection, otherwise we would be running blind when ResourceExhausted is returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this was a leftover from @ecordell's original PR. I don't have a strong opinion around this, I didn't know that EnsureAlreadyExecuted meant an explicit dependency.

@ecordell do you have any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion here, I just thought that the earlier the memory protection could run the better.

Copy link
Contributor

@tstirrat15 tstirrat15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments, otherwise looks good.

Comment on lines +87 to +89
acceptedStr := strconv.FormatBool(accepted)

RequestsProcessed.WithLabelValues(endpointType, acceptedStr).Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we're doing it with a label? Hmm...

@miparnisari miparnisari force-pushed the oomprotect-with-rtml branch 13 times, most recently from 112d59b to 491398d Compare November 14, 2025 01:33
@miparnisari miparnisari marked this pull request as ready for review November 14, 2025 20:56
@miparnisari miparnisari requested a review from a team as a code owner November 14, 2025 20:56
Copy link
Contributor

@tstirrat15 tstirrat15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be keen to see what @ecordell has to say, and I'm down with the idea of making this memory middleware non-optional.

if c.MemoryProtectionEnabled {
return memoryprotection.NewRealTimeMemoryUsageProvider()
}
return &memoryprotection.HarcodedMemoryLimitProvider{AcceptAllRequests: true}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of having it be always-on, and then we just don't install the middleware in tests (if that's possible).

@miparnisari miparnisari changed the title feat: add Memory Protection Middleware (preferred implementation) feat: add Memory Protection Middleware Nov 14, 2025
@github-actions github-actions bot removed the area/dispatch Affects dispatching of requests label Nov 14, 2025
@tstirrat15 tstirrat15 force-pushed the oomprotect-with-rtml branch 3 times, most recently from 173b7ce to 27e46d0 Compare November 21, 2025 21:28
tstirrat15
tstirrat15 previously approved these changes Nov 21, 2025
Copy link
Contributor

@tstirrat15 tstirrat15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should be good.

Comment on lines -16 to -17
return sh.RunWithV(
map[string]string{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No vars, so don't need With

Copy link
Contributor

@tstirrat15 tstirrat15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tstirrat15 tstirrat15 added this pull request to the merge queue Nov 25, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 25, 2025
@tstirrat15 tstirrat15 added this pull request to the merge queue Nov 25, 2025
Merged via the queue into main with commit e7390fe Nov 25, 2025
45 checks passed
@tstirrat15 tstirrat15 deleted the oomprotect-with-rtml branch November 25, 2025 17:41
@github-actions github-actions bot locked and limited conversation to collaborators Nov 25, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area/cli Affects the command line area/dependencies Affects dependencies area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants