Conversation
b2401a9 to
56ee1e8
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2610 +/- ##
==========================================
+ Coverage 39.57% 40.11% +0.54%
==========================================
Files 315 319 +4
Lines 27508 28309 +801
==========================================
+ Hits 10885 11355 +470
- Misses 15826 16125 +299
- Partials 797 829 +32
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 9 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
56ee1e8 to
1bd0750
Compare
1bd0750 to
94b4da5
Compare
94b4da5 to
98f90bb
Compare
b2a8245 to
e1688ef
Compare
e1688ef to
5155320
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 515532045d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
internal/controller/datadogagent/feature/servicediscovery/feature.go
Outdated
Show resolved
Hide resolved
internal/controller/datadogagent/feature/servicediscovery/feature.go
Outdated
Show resolved
Hide resolved
domalessi
left a comment
There was a problem hiding this comment.
Left a suggestion to improve clarity! Approving so you're not blocked on me.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
||
| // hasOtherSystemProbeFeatures returns true if any feature besides service discovery | ||
| // requires the full system-probe binary. When true, system-probe-lite cannot be used. | ||
| func hasOtherSystemProbeFeatures(features *v2alpha1.DatadogFeatures) bool { |
There was a problem hiding this comment.
How can we ensure that list is correct and complete and remains that way? It's not obvious that I have to patch feature/servicediscovery/feature.go if I'm adding some unrelated feature that uses system-probe, and that a failure to do so may cause my feature to not work in some cases.
tbavelier
left a comment
There was a problem hiding this comment.
Need to change the CRD logic: such logic should not be in the CRD as a field we don't want users to expose, see comments
Not sure about the command/args if it warrants changes
Will discuss with team for the other SP features if they have an idea
Conflict with main due to pointer refactor I believe
internal/controller/datadogagent/feature/servicediscovery/feature.go
Outdated
Show resolved
Hide resolved
| c.Command = []string{"/bin/sh", "-c"} | ||
| c.Args = []string{systemProbeLiteCommand(common.DefaultSystemProbeSocketPath, f.userExplicitlyEnabled)} |
There was a problem hiding this comment.
Are we 100% sure we want to change both cmd and args ? Cmd changes the entrypoint which might have un-intended consequences, see #2864 for a more detailed explanation
There was a problem hiding this comment.
The reason we change both is because we need the shell logic behind it. One problem we have with the Operator and Helm chart, is that we can't know which agent version we are running, which means if you modify only the args:
- you might try to rely on system-probe's fallback to SPL mechanism. But if you are on a system-probe version that doesn't have the support, it will crash because of an unknown CLI option.
- on supported versions, you spawned the full system-probe to have it check a config we already have in the operator, which means a huge spike in resource consumption which can be enough to trigger pods memory limits and defeats the purpose of having SPL in the first place.
- If only discovery is explicitly enabled, on non-supported versions, you might end up with system-probe starting up, and figuring out it doesn't need to run because it does not have discovery at all, and exits. In this case, you will have a crashloopbackoff.
By using a shell here, we can try to do best effort, run SPL directly when possible, or the full system-probe, and fallback to a sleep infinity on non-supported version, that will prevent the container from going into a CLB.
Of course, we are open to suggestions on improving this.
Conflict resolution: - defaults/datadogagent_default_test.go: kept EnabledByDefault field while adopting ptr.To() in place of apiutils.NewBoolPointer() (removed upstream) - feature/servicediscovery/feature.go: kept branch's Configure() logic that supports EnabledByDefault and userExplicitlyEnabled (used by ManageNodeAgent) Also updated factory.go to use slices.Contains (modernize lint requirement). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove EnabledByDefault from the CRD and instead use the nil/false/true tri-state on Enabled: - Enabled=nil → not configured by user (operator may auto-enable later) - Enabled=false → explicit user opt-out - Enabled=true → explicit user opt-in (system-probe fallback allowed) The fallback distinction is preserved: userExplicitlyEnabled is only true when Enabled=*true, so a future operator auto-enable path (nil) would still use the conservative sleep-infinity fallback. Enabled is intentionally not defaulted to false so that nil remains distinguishable from an explicit opt-out. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
9bff932 to
04225c5
Compare
What does this PR do?
Adds support for starting system-probe-lite (SPL) instead of the full system-probe in the operator when only discovery is enabled.
SPL is a privileged Rust binary that implements just the discovery module. Kubernetes needs a different approach than Docker/Linux since the operator is updated asynchronously to agent versions — we can't enable discovery by default and guarantee we won't start the full system-probe with default config.
How it works:
discovery.enabledis explicitly enabled and SPL doesn't exist, we fall back to the full system-probe binary (since the user opted in knowingly)discovery.enabledis unset (nil), this means the user did not make an explicit choice and the operator chooses whether or not to enable the feature automatically, we fall back tosleep infinityto avoid starting system-probe unexpectedly when using an older agent image that doesn't have SPL. Note that the operator does not activate discovery by default in this PR; that is a separate, future step.Motivation
Match agent configuration.
Additional Notes
Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required?
However, like in the Helm chart change, it checks first for the
sd-agentbinary to exists in the container image. This means using an older agent image will simply usesystem-probe's discovery implementation instead of sd-agent.Describe your test plan
features.serviceDiscovery.enabledset to false: no system-probe container.features.serviceDiscovery.enabledset to true:system-probecontainer starts; execssystem-probe-lite.features.npm.enabled):system-probecontainer starts;system-proberuns directly.system-probecontainer starts;system-proberuns directly.Checklist
bug,enhancement,refactoring,documentation,tooling, and/ordependenciesqa/skip-qalabel