Skip to content

discovery: add system-probe-lite support#2610

Open
Yumasi wants to merge 8 commits intomainfrom
guillaume.pagnoux/add-sd-agent-suppor
Open

discovery: add system-probe-lite support#2610
Yumasi wants to merge 8 commits intomainfrom
guillaume.pagnoux/add-sd-agent-suppor

Conversation

@Yumasi
Copy link
Copy Markdown
Member

@Yumasi Yumasi commented Feb 18, 2026

What does this PR do?

Adds support for starting system-probe-lite (SPL) instead of the full system-probe in the operator when only discovery is enabled.

SPL is a privileged Rust binary that implements just the discovery module. Kubernetes needs a different approach than Docker/Linux since the operator is updated asynchronously to agent versions — we can't enable discovery by default and guarantee we won't start the full system-probe with default config.

How it works:

  • When only discovery is explicitly enabled (no other system-probe features), the system-probe container runs SPL as the entry point
  • If discovery.enabled is explicitly enabled and SPL doesn't exist, we fall back to the full system-probe binary (since the user opted in knowingly)
  • If discovery.enabled is unset (nil), this means the user did not make an explicit choice and the operator chooses whether or not to enable the feature automatically, we fall back to sleep infinity to avoid starting system-probe unexpectedly when using an older agent image that doesn't have SPL. Note that the operator does not activate discovery by default in this PR; that is a separate, future step.
  • If other system-probe features are also enabled (NPM, USM, etc.), the regular system-probe binary is used directly.

Motivation

Match agent configuration.

Additional Notes

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: v7.78

However, like in the Helm chart change, it checks first for the sd-agent binary to exists in the container image. This means using an older agent image will simply use system-probe's discovery implementation instead of sd-agent.

Describe your test plan

  • Tested on an AWS cluster that it behaves as follow:
    • Default config i.e features.serviceDiscovery.enabled set to false: no system-probe container.
    • features.serviceDiscovery.enabled set to true: system-probe container starts; execs system-probe-lite.
    • The above + another system-probe feature (e.g features.npm.enabled): system-probe container starts; system-probe runs directly.
    • Discovery enabled + agent version without sd-agent: system-probe container starts; system-probe runs directly.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@Yumasi Yumasi added the enhancement New feature or request label Feb 18, 2026
@Yumasi Yumasi force-pushed the guillaume.pagnoux/add-sd-agent-suppor branch from b2401a9 to 56ee1e8 Compare February 18, 2026 14:31
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Feb 18, 2026

Codecov Report

❌ Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.11%. Comparing base (bceb41f) to head (04225c5).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...r/datadogagent/feature/servicediscovery/feature.go 93.93% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2610      +/-   ##
==========================================
+ Coverage   39.57%   40.11%   +0.54%     
==========================================
  Files         315      319       +4     
  Lines       27508    28309     +801     
==========================================
+ Hits        10885    11355     +470     
- Misses      15826    16125     +299     
- Partials      797      829      +32     
Flag Coverage Δ
unittests 40.11% <93.93%> (+0.54%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ller/datadogagent/defaults/datadogagent_default.go 87.03% <ø> (-0.05%) ⬇️
...r/datadogagent/feature/servicediscovery/feature.go 84.41% <93.93%> (+5.24%) ⬆️

... and 9 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bceb41f...04225c5. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Yumasi Yumasi force-pushed the guillaume.pagnoux/add-sd-agent-suppor branch from 56ee1e8 to 1bd0750 Compare March 18, 2026 16:09
@Yumasi Yumasi added this to the v1.25.0 milestone Mar 19, 2026
@Yumasi Yumasi force-pushed the guillaume.pagnoux/add-sd-agent-suppor branch from 1bd0750 to 94b4da5 Compare March 19, 2026 10:04
@Yumasi Yumasi changed the title discovery: add sd-agent support discovery: add system-probe-lite support Mar 19, 2026
@Yumasi Yumasi force-pushed the guillaume.pagnoux/add-sd-agent-suppor branch from 94b4da5 to 98f90bb Compare March 19, 2026 11:06
@Yumasi Yumasi force-pushed the guillaume.pagnoux/add-sd-agent-suppor branch 8 times, most recently from b2a8245 to e1688ef Compare March 25, 2026 13:16
@Yumasi Yumasi force-pushed the guillaume.pagnoux/add-sd-agent-suppor branch from e1688ef to 5155320 Compare March 25, 2026 14:14
@levan-m levan-m modified the milestones: v1.25.0, v1.26.0 Mar 25, 2026
@Yumasi Yumasi marked this pull request as ready for review March 25, 2026 15:55
@Yumasi Yumasi requested a review from a team March 25, 2026 15:55
@Yumasi Yumasi requested review from a team as code owners March 25, 2026 15:55
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 515532045d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@domalessi domalessi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a suggestion to improve clarity! Approving so you're not blocked on me.

Copy link
Copy Markdown

@domalessi domalessi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes!

Yumasi and others added 3 commits March 27, 2026 10:15
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

// hasOtherSystemProbeFeatures returns true if any feature besides service discovery
// requires the full system-probe binary. When true, system-probe-lite cannot be used.
func hasOtherSystemProbeFeatures(features *v2alpha1.DatadogFeatures) bool {
Copy link
Copy Markdown
Contributor

@vitkyrka vitkyrka Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we ensure that list is correct and complete and remains that way? It's not obvious that I have to patch feature/servicediscovery/feature.go if I'm adding some unrelated feature that uses system-probe, and that a failure to do so may cause my feature to not work in some cases.

@vitkyrka vitkyrka modified the milestones: v1.26.0, v1.27.0 Apr 6, 2026
Copy link
Copy Markdown
Member

@tbavelier tbavelier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to change the CRD logic: such logic should not be in the CRD as a field we don't want users to expose, see comments
Not sure about the command/args if it warrants changes
Will discuss with team for the other SP features if they have an idea
Conflict with main due to pointer refactor I believe

Comment on lines +165 to +166
c.Command = []string{"/bin/sh", "-c"}
c.Args = []string{systemProbeLiteCommand(common.DefaultSystemProbeSocketPath, f.userExplicitlyEnabled)}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we 100% sure we want to change both cmd and args ? Cmd changes the entrypoint which might have un-intended consequences, see #2864 for a more detailed explanation

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we change both is because we need the shell logic behind it. One problem we have with the Operator and Helm chart, is that we can't know which agent version we are running, which means if you modify only the args:

  • you might try to rely on system-probe's fallback to SPL mechanism. But if you are on a system-probe version that doesn't have the support, it will crash because of an unknown CLI option.
  • on supported versions, you spawned the full system-probe to have it check a config we already have in the operator, which means a huge spike in resource consumption which can be enough to trigger pods memory limits and defeats the purpose of having SPL in the first place.
  • If only discovery is explicitly enabled, on non-supported versions, you might end up with system-probe starting up, and figuring out it doesn't need to run because it does not have discovery at all, and exits. In this case, you will have a crashloopbackoff.

By using a shell here, we can try to do best effort, run SPL directly when possible, or the full system-probe, and fallback to a sleep infinity on non-supported version, that will prevent the container from going into a CLB.

Of course, we are open to suggestions on improving this.

Conflict resolution:
- defaults/datadogagent_default_test.go: kept EnabledByDefault field while
  adopting ptr.To() in place of apiutils.NewBoolPointer() (removed upstream)
- feature/servicediscovery/feature.go: kept branch's Configure() logic that
  supports EnabledByDefault and userExplicitlyEnabled (used by ManageNodeAgent)

Also updated factory.go to use slices.Contains (modernize lint requirement).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove EnabledByDefault from the CRD and instead use the nil/false/true
tri-state on Enabled:
- Enabled=nil  → not configured by user (operator may auto-enable later)
- Enabled=false → explicit user opt-out
- Enabled=true  → explicit user opt-in (system-probe fallback allowed)

The fallback distinction is preserved: userExplicitlyEnabled is only
true when Enabled=*true, so a future operator auto-enable path (nil)
would still use the conservative sleep-infinity fallback.

Enabled is intentionally not defaulted to false so that nil remains
distinguishable from an explicit opt-out.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@Yumasi Yumasi force-pushed the guillaume.pagnoux/add-sd-agent-suppor branch from 9bff932 to 04225c5 Compare April 9, 2026 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants