
Conversation


@jinja2 jinja2 commented Nov 3, 2025

Description:

Draft for Design Discussion - This PR introduces an initial implementation of the ClusterObservability controller as defined in the RFC in this repo.

The new ClusterObservability CR is meant to allow users to deploy a complete observability stack with a single Custom Resource:

apiVersion: opentelemetry.io/v1alpha1
kind: ClusterObservability
metadata:
  name: cluster-observability
  namespace: opentelemetry-operator-system
spec:
  signals: ["traces", "metrics", "logs"]
  exporter:
    endpoint: "https://otel-backend.example.com:4318"
    headers:
      "Authorization": "Bearer token"

This initial implementation creates the following resources:

  • Agent Collector (DaemonSet): Node-level metrics, logs, and OTLP receiver
  • Cluster Collector (Deployment): Cluster-level k8s metrics and events
  • Instrumentation CR: Auto-instrumentation configuration
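
As a rough illustration of the first item, the agent collector would be an OpenTelemetryCollector CR in daemonset mode along these lines (the names and config values here are hypothetical, sketched from the description above, not the PR's actual rendered output):

```yaml
# Illustrative only: a possible shape of the agent collector the
# controller might render; all names and values are assumptions.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: cluster-observability-agent
  namespace: opentelemetry-operator-system
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
      hostmetrics:
        collection_interval: 30s
    exporters:
      otlphttp:
        endpoint: "https://otel-backend.example.com:4318"
    service:
      pipelines:
        metrics:
          receivers: [otlp, hostmetrics]
          exporters: [otlphttp]
```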

The details of the currently proposed design can be found in docs/cluster-observability.md in this draft PR. There is still more work to do, e.g. adding tests, making the controller more extensible to allow more distro-specific implementations, more telemetry configs, etc. But I'd like to get some feedback on the fundamentals, like the design, the CR config options, and the code structure.

Link to tracking Issue(s):

Testing:
Tests are missing for now; they will be added once we discuss the design/implementation details.

Documentation:
I have added the new controller's design doc here

@jinja2 jinja2 marked this pull request as ready for review November 3, 2025 16:40
@jinja2 jinja2 requested a review from a team as a code owner November 3, 2025 16:40
Member

@frzifus frzifus left a comment

Thanks @jinja2 for picking this up! That's really great stuff! Just left some comments.

Comment on lines +13 to +16
signals:
- traces
- metrics
- logs
Member

Would it make sense to provide some specific use cases here instead of signals, or something like "prepared" configurations?

The doc outlines that the ClusterObservability CR also creates e.g. an Auto-Instrumentation CR. Maybe we want system logs but want to ignore k8s logs, or the other way around.

What comes to my mind would be something like this:

Suggested change
signals:
- traces
- metrics
- logs
collections:
- all
# - metrics_system
# - metrics_ingestion
# - logs_system
# - logs_k8s
# - traces_ingestion
# - instr_all
# - instr_java

That way you could deploy a collector that collects metrics and logs from a host, while not accepting reported logs and not creating e.g. the auto-instrumentation CR.

Contributor

I have been pushing for fewer toggles, not more. Here is a bit of reasoning around it:

  • I don't think we'll hear feedback on specific use cases (such as logs_system vs logs_k8s) until we put this in the hands of users.
  • If we make those choices now, there's a good chance we get them wrong. It looks somewhat easy to list use cases right now, but I'd like them discussed in the RFC with descriptions, justifications, stability levels, etc.
  • I would decouple signals and use cases. Signals are technical plumbing toggles; use cases can work across multiple signals.
  • Finally, I am very lazy and I want us to support fewer combinations of possible problems.

I want us to ship something that is too short, too clumsy, and gets us user feedback.

Member

I totally see the point of keeping things simple. Would it then maybe make sense to skip the signals entry entirely, and add something back in case we get a request?

The naming of the signals alone doesn't make it transparent what each one actually does, which means I have to look up the docs. At the same time, I think it should be easy to block telemetry reported by internal services without needing to configure e.g. a network policy.

Contributor

I agree if we can live without toggles for now we should.

Member

Nice, then we simply get rid of it for now? 😃

Author

I see a few different opinions (here is another comment on this) on how many knobs we should expose. I think we should keep the toggle for telemetry types at least. We could make this optional, with all signals enabled by default, so users don’t need to set it explicitly unless they want to customize it. Let me know if that sounds reasonable.
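
One way to sketch the "optional, all enabled by default" idea with kubebuilder defaulting (an illustrative assumption, not code from this PR):

```go
// Sketch only: Signals made optional, defaulting to all three
// signal types when the user omits the field.
type ClusterObservabilitySpec struct {
	// +optional
	// +kubebuilder:default={"traces","metrics","logs"}
	// +listType=set
	Signals []ObservabilitySignal `json:"signals,omitempty"`
}
```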

Member

I personally would still question what the benefit is of having the option to disable specific signal types, especially when a signal itself remains quite opaque: disabling metrics will disable collecting hostmetrics, metrics ingestion, and whatever else is behind it. Personally, I would advocate for leaving the option out completely for now. Adding knobs in the future should be easy, and that way we can also move forward and have this discussion later.

Would be good to get some views from the others ( @open-telemetry/operator-approvers ) on this.


// checkInstrumentationStatus checks the status of the single Instrumentation CR.
func checkInstrumentationStatus(ctx context.Context, cli client.Client, co *v1alpha1.ClusterObservability) componentStatus {
	instrumentationName := "default-instrumentation"
Member

Can this lead to reconcile loops when two CRs are created in the same namespace? Maybe we could prefix this name with the ClusterObservability CR's name.
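
The prefixing idea could look like this (a minimal sketch with a hypothetical helper name, not code from this PR):

```go
package main

import "fmt"

// instrumentationNameFor derives the Instrumentation name from the
// owning ClusterObservability CR's name, so two CRs in the same
// namespace produce distinct Instrumentation objects instead of
// fighting over a shared "default-instrumentation".
func instrumentationNameFor(crName string) string {
	return crName + "-instrumentation"
}

func main() {
	// prints "cluster-observability-instrumentation"
	fmt.Println(instrumentationNameFor("cluster-observability"))
}
```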

Contributor

@swiatekm swiatekm left a comment

Some preliminary comments about the API and the logistics of introducing this CRD to the operator.

)

// OTLPHTTPExporter defines OTLP HTTP exporter configuration.
// This structure mirrors the official OpenTelemetry Collector otlphttpexporter configuration.
Contributor

There's a danger associated with doing this, where a breaking change in the exporter will force a breaking change on us. In addition, we'd need to keep up with additions made by the exporter. At that point, we may as well import the upstream config struct directly.

I wonder if it isn't safer to make this into an opaque type and just pass it down.
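
One shape the opaque variant could take (a sketch under the assumption that the config is passed through verbatim to the rendered collector config; the field names are illustrative, not this PR's code):

```go
import apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"

// Sketch only: the exporter config kept as opaque JSON rather than
// mirroring the upstream otlphttpexporter struct field by field, so
// upstream additions and breaking changes don't force API changes here.
type ExporterSpec struct {
	// Config holds an arbitrary otlphttp exporter configuration,
	// passed straight down into the generated collector config.
	// +kubebuilder:pruning:PreserveUnknownFields
	Config apiextensionsv1.JSON `json:"config,omitempty"`
}
```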

Member

We also could trim it down to a more minimal set just to unblock this PR.

// +required
// +kubebuilder:validation:MinItems=1
// +listType=set
Signals []ObservabilitySignal `json:"signals"`
Member

I like the simplicity of disabling/enabling by signal type.

However, I would prefer if the CR design were future-proof and allowed us to add more detailed "knobs" later if we have to (once we collect feedback from the end users).

Perhaps we can change the ObservabilitySignal from string to a struct.
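
A minimal sketch of that struct variant (illustrative, not this PR's code):

```go
// Sketch only: ObservabilitySignal as a struct instead of a string
// enum, leaving room for per-signal knobs without a breaking API change.
type ObservabilitySignal struct {
	// Type is the signal to enable.
	// +kubebuilder:validation:Enum=traces;metrics;logs
	Type string `json:"type"`
	// Future fields (e.g. collection sources or ingestion toggles)
	// could be added here as optional members.
}
```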

@frzifus
Member

frzifus commented Nov 14, 2025

@jinja2 I've been unable to create a PR against your branch, so I just created a draft addressing the comments above. Feel free to merge/cherry-pick changes:

I placed a summary into the PR description.

cc @atoulme

@frzifus
Member

frzifus commented Nov 14, 2025

Got it, I managed to point it to your branch: https://github.com/jinja2/opentelemetry-operator/pull/1/files

@jinja2
Author

jinja2 commented Nov 19, 2025

Thanks everyone for the feedback, and thanks @frzifus for the fixes!

I am out of office until the start of December and will pick this up again when I return.

@frzifus frzifus added the discuss-at-sig This issue or PR should be discussed at the next SIG meeting label Nov 19, 2025

Labels

discuss-at-sig This issue or PR should be discussed at the next SIG meeting


Development

Successfully merging this pull request may close these issues.

Create a first managed deployment of the collector
