[draft] initial commit for clusterobservability CRD #4475
Conversation
frzifus
left a comment
Thanks @jinja2 for picking this up! That's really great stuff! Just left some comments.
```yaml
signals:
- traces
- metrics
- logs
```
Would it make sense to provide some specific use cases here instead of signals, or something like "prepared" configurations?
The doc outlines that the ClusterObservability CR also creates e.g. an Auto-Instrumentation CR. Maybe we want system logs but ignore k8s logs, or the other way around.
What comes to my mind would be something like this:
```yaml
signals:
- traces
- metrics
- logs
collections:
- all
# - metrics_system
# - metrics_ingestion
# - logs_system
# - logs_k8s
# - traces_ingestion
# - instr_all
# - instr_java
```
That way you could deploy a collector to collect metrics and logs from a host, while not accepting reported logs and not creating, e.g., the auto-instrumentation CR.
I have been pushing for fewer toggles, not more. Here is a bit of reasoning around it:
- I don't think we'll hear feedback on specific use cases (such as logs_system vs logs_k8s) until we put this in the hands of users.
- If we make those choices now, there's a good chance we get them wrong. It looks somewhat easy to list use cases right now, but I'd like them discussed in the RFC with descriptions, justifications, stability level, etc.
- I would decouple signals and use cases. Signals are technical plumbing toggles; use cases can work across multiple signals.
- Finally, I am very lazy and I want us to support fewer combinations of possible problems.

I want us to ship something that is too short, too clumsy, and gets us user feedback.
I totally see the point of keeping things simple. Would it then maybe make sense to skip the signals entry entirely and add something back in case we get a request?
The naming of the signals alone makes it somewhat opaque what each one actually does, which means I have to look up the docs. At the same time, I think it should be easy to block telemetry reported by internal services without needing to configure e.g. a network policy.
I agree if we can live without toggles for now we should.
Nice, then we simply get rid of it for now? 😃
I see a few different opinions (here is another comment on this) on how many knobs we should expose. I think we should keep the toggle for telemetry types at least. We could make this optional, with all signals enabled by default, so users don’t need to set it explicitly unless they want to customize it. Let me know if that sounds reasonable.
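The optional-with-defaults idea could be handled with a small defaulting step. A hedged Go sketch, assuming a helper that fills in the full signal set when the user leaves `spec.signals` empty (the helper name and its placement in a mutating webhook or the controller are assumptions, not part of this PR):

```go
package main

import "fmt"

// applySignalDefaults returns the full signal set when the user leaves
// spec.signals empty, so the field can stay optional with all signals
// enabled by default. Explicitly set signals are passed through as-is.
// (Hypothetical helper; signal names follow the quoted CR example.)
func applySignalDefaults(signals []string) []string {
	if len(signals) == 0 {
		return []string{"traces", "metrics", "logs"}
	}
	return signals
}

func main() {
	// Empty spec gets all signals; an explicit choice is respected.
	fmt.Println(applySignalDefaults(nil))                 // [traces metrics logs]
	fmt.Println(applySignalDefaults([]string{"metrics"})) // [metrics]
}
```

With this shape, omitting the field and listing all three signals are equivalent, which keeps the common case zero-config.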
I personally would still question what the benefit is of having the option to disable specific signal types, especially when a signal itself remains quite opaque: disabling metrics will disable collecting hostmetrics, metrics ingestion, and whatever else is behind it. Personally, I would advocate for leaving the option out completely for now. Adding knobs in the future should be easy. That way we can also move forward and have this discussion later.
Would be good to get some views from the others (@open-telemetry/operator-approvers) on this.
```go
// checkInstrumentationStatus checks the status of the single Instrumentation CR.
func checkInstrumentationStatus(ctx context.Context, cli client.Client, co *v1alpha1.ClusterObservability) componentStatus {
	instrumentationName := "default-instrumentation"
```
Can this lead to reconcile loops when two CRs are created in the same namespace? Maybe we could prefix this with the ClusterObservability CR name.
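The prefixing idea amounts to deriving the Instrumentation name from the owning CR instead of hardcoding it. A minimal sketch, assuming a hypothetical helper (the real controller would call it with `co.Name`):

```go
package main

import "fmt"

// instrumentationName derives a per-CR name for the Instrumentation
// resource by prefixing it with the owning ClusterObservability name,
// so two CRs in the same namespace don't fight over one object.
// (Hypothetical helper name; not part of the PR as written.)
func instrumentationName(crName string) string {
	return fmt.Sprintf("%s-instrumentation", crName)
}

func main() {
	fmt.Println(instrumentationName("prod")) // prod-instrumentation
	fmt.Println(instrumentationName("dev"))  // dev-instrumentation
}
```

The status check would then look up this derived name rather than the shared "default-instrumentation".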
swiatekm
left a comment
Some preliminary comments about the API and the logistics of introducing this CRD to the operator.
```go
// OTLPHTTPExporter defines OTLP HTTP exporter configuration.
// This structure mirrors the official OpenTelemetry Collector otlphttpexporter configuration.
```
There's a danger associated with doing this, where a breaking change in the exporter will force a breaking change on us. In addition, we'd need to keep up with additions made by the exporter. At that point, we may as well import the upstream config struct directly.
I wonder if it isn't safer to make this into an opaque type and just pass it down.
We also could trim it down to a more minimal set just to unblock this PR.
```go
// +required
// +kubebuilder:validation:MinItems=1
// +listType=set
Signals []ObservabilitySignal `json:"signals"`
```
I like the simplicity of disabling/enabling by signal type.
However, I would prefer if the CR design were future-proof and allowed us to add more detailed "knobs" later if we have to (once we collect feedback from the end users).
Perhaps we can change ObservabilitySignal from a string to a struct.
Got it, managed to point it to your branch: https://github.com/jinja2/opentelemetry-operator/pull/1/files
Signed-off-by: Benedikt Bongartz <[email protected]>
Thanks everyone for the feedback, and thanks @frzifus for the fixes! I am out of office until the start of December, and will pick this up again when I return.
Description:
Draft for Design Discussion - This PR introduces an initial implementation of the ClusterObservability controller as defined in the RFC in this repo.
The new ClusterObservability CR is meant to allow users to deploy a complete observability stack with a single Custom Resource:
This initial implementation creates the following resources:
The details of the current proposed design can be found in docs/cluster-observability.md in the draft PR. There is still more work to do, e.g. adding tests, making the controller more extensible for introducing more distro-specific implementations, more telemetry configs, etc. But I'd like to get some feedback on the fundamentals like design, CR config options, and code structure.
Link to tracking Issue(s):
Testing:
Tests are missing for now, will be added once we discuss the design/implementation details.
Documentation:
I have added the new controller's design doc here