docs: Adds tenant-based configuration overrides proposal #8550
Conversation
delete_delay: 48h # --delete-delay=48h

# tenant-specific configs
tenant_overrides:
It'd be great if this supported regexes for matching tenants from day 1 (I'm aware that was noted under future work, but I don't think it'd be much more work to implement it at that point). For example, say you want to override your default for non-prod tenants: ".+-(dev|qa|uat)$".
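A rough sketch of what that could look like under the proposed tenant_overrides section (the matcher key and the override fields here are purely illustrative, not something the proposal defines yet):

tenant_overrides:
  - tenant_regex: ".+-(dev|qa|uat)$"   # hypothetical regex matcher for non-prod tenants
    retention:
      resolution_raw: 3d               # e.g. keep non-prod data for a shorter period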
## Goals

- Enable **tenant-based configuration overrides** for Thanos components, starting with the Compactor.
Thanks for the proposal!
Just trying to understand: what is currently blocking you from doing this already with multiple sharded Compactor instances?
You can already use --selector.relabel-config to select blocks for a particular tenant, thanks to external labels, and then split further by time or anything else.
This makes the process really scalable.
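For reference, a minimal sketch of such a selector, assuming blocks are uploaded with a tenant external label (the label name and value here are illustrative), passed to a per-tenant Compactor via --selector.relabel-config-file:

# Keep only blocks whose external labels identify them as belonging to team-a.
- action: keep
  source_labels: [tenant]
  regex: "team-a"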
Orchestrating that many compactors might be hard though, which is also why we've created https://github.com/thanos-community/thanos-operator/ cc: @philipgough
That's a great hint. But even with that, you'd have to manage multiple Compactor instances, which, like you said, is harder to maintain.
Thanks for mentioning the operator, I didn't know about it. It already looks great.
It seems like you also intend to deploy separate components for each tenant if different configs are required?
+1 for avoiding more complex Compactor sharding until it's needed! There's also an overhead resource cost to moving to that model.
I'm personally happy with how a single Compactor does the job (until you reach node limits). We don't need sharding at this point, but it'd be very convenient for a single Compactor, as @ricket-son suggested, to be able to handle multiple retention rules per tenant (preferably via regex).
@saswatamcode On the topic of the thanos-operator, it's looking good, good job! It definitely seems like a more flexible solution than using a good old Helm chart (or whatever other templating tool) to manage Thanos.
Right, so something that seems more reasonable to me is not basing this off of tenancy or just a tenant label, as there can be several flexible use cases for such a feature.
How about instead some selector config that allows us to set separate retention configs for a particular set of blocks?
Basically it would be a map of selector relabel config to retention config. We could call it a Retention Policy then.
So the config here would look like:
retention_policies:
  - action: drop
    regex: "A"
    source_labels:
      - tenant
    retention:
      resolution_raw: 5d
      resolution_5m: 10d
      delete_delay: 12h
This, to me, makes the system a lot more flexible. This way you could also easily group tenants with similar retention needs or from the same subset.
And you are also free to use this in any way needed, e.g. business metrics might have longer retention than SRE metrics, and so on.
Wdyt? 🙂
Receive already has write-level tenancy via headers, which can be set from some auth proxy in front.
StoreGW has selector configs if you want to run separate instances, but you don't really need to IMO, as Querier has flags to enforce the presence of labels, which you can use for tenancy as well 🙂 (or some proxy in front to do this for you, like prom-label-proxy).
Ruler is a bit tricky.
You have to enforce tenant labels in a couple of different places for rule config: once in the rule expression, and then in the rule labels as well, so that the generated series retains the tenant. We actually implemented this in Thanos Operator already; you only need to add config like so:
ruleTenancyConfig:
  tenantLabel: tenant_id
  tenantValueLabel: operator.thanos.io/tenant
and it will pick up PrometheusRule objects with those tenant labels and enforce tenancy there.
The blocks generated by Ruler would be mixed, but since you'll be enforcing Querier-level tenancy, it would only ever select rules for a particular tenant.
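For illustration, a hypothetical PrometheusRule that the config above would match (the tenant value team-a and the rule itself are made up; only the label key comes from tenantValueLabel above):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: team-a-rules
  labels:
    operator.thanos.io/tenant: team-a   # matches the tenantValueLabel above
spec:
  groups:
    - name: team-a.rules
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))

The operator would then, as described above, inject the tenant_id label into both the rule expression and the resulting series labels.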
The next step I want to implement in the operator is a stateless Ruler, where you can create remote-write targets so that rules of a particular tenant get remote-written to your Receive with a certain tenant label; that way you maintain separate tenant rule blocks too!
In our experience, having a centralized config means we have to encode a lot of operational logic into Thanos, which takes away from the core scope of this project (scalable metrics storage and querying).
This is why we started Thanos Operator, as a way to do all this as declaratively as possible, whilst keeping Thanos scoped to storage and still gradually adoptable over existing Prometheus setups.
Also a TODO for me to write a nice blog post for the operator, plus some docs on Thanos for the tenancy stuff, as it isn't immediately clear!
But yeah, having this selector relabel config on the Compactor would be cool @ricket-son, it makes the tenancy picture easier to manage 🙂
If you update the proposal with this, I can do one more round of review and merge!
Well, this shouldn't aim to configure tenancy in general @saswatamcode, but to configure overrides of the configuration for a tenant.
I mean, I don't care whether it's in a unified section or not; whatever is simpler to implement. As long as it's documented, I guess it doesn't matter.
I just wanted to know what you think about possible future tenant config overrides on other components, like Receive for ingestion limits / rate limits / etc., and how this would be implemented / declared.
I'll update the proposal soon.
@GiedriusS what's your opinion on that?
Hello, jumping into the discussion to share one of our use cases that seems to be covered by this proposal.
We are running a Receive cluster with multiple tenants. As of today, the --tsdb.retention flag is global and applies to all tenants' TSDBs, as per: https://thanos.io/tip/components/receive.md/#tenant-lifecycle-management
We would like to be able to configure tenant-specific retention the same way we override ingestion limits:
https://thanos.io/tip/components/receive.md/#understanding-the-configuration-file
One of our use cases is to allow bigger retention for a single tenant only, while all the others use the global config.
(Please note, we don't have object storage enabled here and use local disks only.)
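A purely illustrative sketch of how per-tenant retention could sit next to those limits (none of these fields exist today; the tenant name and values are made up, loosely following the shape of the limits file linked above):

# Hypothetical, not an existing Receive option: per-tenant retention overrides
default:
  retention: 15d            # would mirror the global --tsdb.retention
tenants:
  long-term-tenant:
    retention: 90d          # only this tenant keeps data longer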
Adds a document which proposes improving the flexibility of components' configuration on a per-tenant basis.
Related Issue: #8544