
Conversation


@jinja2 jinja2 commented Nov 3, 2025

Description:

Draft for Design Discussion - This PR introduces an initial implementation of the ClusterObservability controller as defined in the RFC in this repo.

The new ClusterObservability CR is meant to allow users to deploy a complete observability stack with a single Custom Resource:

apiVersion: opentelemetry.io/v1alpha1
kind: ClusterObservability
metadata:
  name: cluster-observability
  namespace: opentelemetry-operator-system
spec:
  signals: ["traces", "metrics", "logs"]
  exporter:
    endpoint: "https://otel-backend.example.com:4318"
    headers:
      "Authorization": "Bearer token"

This initial implementation creates the following resources:

  • Agent Collector (DaemonSet): Node-level metrics, logs, and OTLP receiver
  • Cluster Collector (Deployment): Cluster-level k8s metrics and events
  • Instrumentation CR: Auto-instrumentation configuration
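
As a rough illustration of the first item, the agent collector would be an OpenTelemetryCollector CR in daemonset mode along these lines (the names and config values here are hypothetical, sketched from the description above, not the PR's actual rendered output):

```yaml
# Illustrative only: a possible shape of the agent collector the
# controller might render; all names and values are assumptions.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: cluster-observability-agent
  namespace: opentelemetry-operator-system
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
      hostmetrics:
        collection_interval: 30s
    exporters:
      otlphttp:
        endpoint: "https://otel-backend.example.com:4318"
    service:
      pipelines:
        metrics:
          receivers: [otlp, hostmetrics]
          exporters: [otlphttp]
```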

The details of the currently proposed design can be found in docs/cluster-observability.md in this draft PR. There is still more work to do, e.g. adding tests, making the controller more extensible to allow more distro-specific implementations, more telemetry configs, etc. But I'd like to get some feedback on the fundamentals, like the design, the CR config options, and the code structure.

Link to tracking Issue(s):

Testing:
Tests are missing for now; they will be added once we discuss the design/implementation details.

Documentation:
I have added the new controller's design doc here

@jinja2 jinja2 marked this pull request as ready for review November 3, 2025 16:40
@jinja2 jinja2 requested a review from a team as a code owner November 3, 2025 16:40
Member

@frzifus frzifus left a comment

Thanks @jinja2 for picking this up! That's really great stuff! Just left some comments.

Comment on lines +13 to +16
signals:
- traces
- metrics
- logs
Member

Would it make sense to provide some specific use cases here instead of signals, or something like "prepared" configurations?

The doc outlines that the ClusterObservability CR also creates e.g. an Auto-Instrumentation CR. Maybe we want system logs but want to ignore k8s logs, or the other way around.

What comes to my mind would be something like this:

Suggested change
signals:
- traces
- metrics
- logs
collections:
- all
# - metrics_system
# - metrics_ingestion
# - logs_system
# - logs_k8s
# - traces_ingestion
# - instr_all
# - instr_java

That way you could deploy a collector that collects metrics and logs from a host, while not accepting reported logs and not creating e.g. the auto-instrumentation CR.

Contributor

I have been pushing for fewer toggles, not more. Here is a bit of reasoning around it:

  • I don't think we'll hear feedback on specific use cases (such as logs_system vs logs_k8s) until we put this in the hands of users.
  • If we make those choices now, there's a good chance we get them wrong. It looks somewhat easy to list use cases right now, but I'd like them discussed in the RFC with descriptions, justifications, stability levels, etc.
  • I would decouple signals and use cases. Signals are technical plumbing toggles; use cases can work across multiple signals.
  • Finally, I am very lazy and I want us to support fewer combinations of possible problems.

I want us to ship something that is too short, too clumsy, and gets us user feedback.

Member

I totally see the point of keeping things simple. Would it then maybe make sense to skip the signals entry entirely, and add something back in case we get a request?

The naming of the signals alone doesn't make it transparent what each one actually does, which means I have to look up the docs. At the same time, I think it should be easy to block telemetry reported by internal services without needing to configure e.g. a network policy.

Contributor

I agree if we can live without toggles for now we should.

Member

Nice, then we simply get rid of it for now? 😃

Author

I see a few different opinions (here is another comment on this) on how many knobs we should expose. I think we should keep the toggle for telemetry types at least. We could make this optional, with all signals enabled by default, so users don’t need to set it explicitly unless they want to customize it. Let me know if that sounds reasonable.
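
One way to sketch the "optional, all enabled by default" idea with kubebuilder defaulting (an illustrative assumption, not code from this PR):

```go
// Sketch only: Signals made optional, defaulting to all three
// signal types when the user omits the field.
type ClusterObservabilitySpec struct {
	// +optional
	// +kubebuilder:default={"traces","metrics","logs"}
	// +listType=set
	Signals []ObservabilitySignal `json:"signals,omitempty"`
}
```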

Member

I personally would still question what the benefit is of having the option to disable specific signal types, especially when a signal itself remains quite opaque: disabling metrics will disable collecting hostmetrics, metrics ingestion, and whatever else is behind it. Personally, I would advocate for leaving the option out completely for now. Adding knobs in the future should be easy, and that way we can also move forward and have this discussion later.

Would be good to get some views from the others ( @open-telemetry/operator-approvers ) on this.


// checkInstrumentationStatus checks the status of the single Instrumentation CR.
func checkInstrumentationStatus(ctx context.Context, cli client.Client, co *v1alpha1.ClusterObservability) componentStatus {
	instrumentationName := "default-instrumentation"
Member

Can this lead to reconcile loops when two CRs are created in the same namespace? Maybe we could prefix this name with the ClusterObservability CR's name.
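
The prefixing idea could look like this (a minimal sketch with a hypothetical helper name, not code from this PR):

```go
package main

import "fmt"

// instrumentationNameFor derives the Instrumentation name from the
// owning ClusterObservability CR's name, so two CRs in the same
// namespace produce distinct Instrumentation objects instead of
// fighting over a shared "default-instrumentation".
func instrumentationNameFor(crName string) string {
	return crName + "-instrumentation"
}

func main() {
	// prints "cluster-observability-instrumentation"
	fmt.Println(instrumentationNameFor("cluster-observability"))
}
```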

Contributor

@swiatekm swiatekm left a comment

Some preliminary comments about the API and the logistics of introducing this CRD to the operator.

)

// OTLPHTTPExporter defines OTLP HTTP exporter configuration.
// This structure mirrors the official OpenTelemetry Collector otlphttpexporter configuration.
Contributor

There's a danger associated with doing this, where a breaking change in the exporter will force a breaking change on us. In addition, we'd need to keep up with additions made by the exporter. At that point, we may as well import the upstream config struct directly.

I wonder if it isn't safer to make this into an opaque type and just pass it down.
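
One shape the opaque variant could take (a sketch under the assumption that the config is passed through verbatim to the rendered collector config; the field names are illustrative, not this PR's code):

```go
import apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"

// Sketch only: the exporter config kept as opaque JSON rather than
// mirroring the upstream otlphttpexporter struct field by field, so
// upstream additions and breaking changes don't force API changes here.
type ExporterSpec struct {
	// Config holds an arbitrary otlphttp exporter configuration,
	// passed straight down into the generated collector config.
	// +kubebuilder:pruning:PreserveUnknownFields
	Config apiextensionsv1.JSON `json:"config,omitempty"`
}
```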

Member

We also could trim it down to a more minimal set just to unblock this PR.

// +required
// +kubebuilder:validation:MinItems=1
// +listType=set
Signals []ObservabilitySignal `json:"signals"`
Member

I like the simplicity of disabling/enabling by signal type.

However, I would prefer if the CR design were future-proof and allowed us to add more detailed "knobs" later if we have to (once we collect feedback from the end users).

Perhaps we can change the ObservabilitySignal from string to a struct.
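
A minimal sketch of that struct variant (illustrative, not this PR's code):

```go
// Sketch only: ObservabilitySignal as a struct instead of a string
// enum, leaving room for per-signal knobs without a breaking API change.
type ObservabilitySignal struct {
	// Type is the signal to enable.
	// +kubebuilder:validation:Enum=traces;metrics;logs
	Type string `json:"type"`
	// Future fields (e.g. collection sources or ingestion toggles)
	// could be added here as optional members.
}
```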

@frzifus
Member

frzifus commented Nov 14, 2025

@jinja2 I've been unable to create a PR against your branch, so I just created a draft addressing the comments above. Feel free to merge/cherry-pick changes:

I placed a summary into the PR description.

cc @atoulme

@frzifus
Member

frzifus commented Nov 14, 2025

Got it, I managed to point it to your branch: https://github.com/jinja2/opentelemetry-operator/pull/1/files

@jinja2
Author

jinja2 commented Nov 19, 2025

Thanks everyone for the feedback, and thanks @frzifus for the fixes!

I am out of office until the start of December and will pick this up again when I return.

@frzifus frzifus added the discuss-at-sig This issue or PR should be discussed at the next SIG meeting label Nov 19, 2025

Labels

discuss-at-sig This issue or PR should be discussed at the next SIG meeting


Development

Successfully merging this pull request may close these issues.

Create a first managed deployment of the collector
