
Conversation

@wangyongfeng5

What type of PR is this?

/kind feature

Any specific area of the project related to this PR?

/area driver-kmod

/area driver-bpf

/area driver-modern-bpf

/area libscap-engine-bpf

/area libscap-engine-kmod

/area libscap-engine-modern-bpf

/area libscap

/area libpman

Does this PR require a change in the driver versions?

/version driver-API-version-major

/version driver-API-version-minor

What this PR does / why we need it:
Set the dropping sampling ratio for each syscall individually, so that users can apply different discard rates to syscalls of different importance.

@poiana
Contributor

poiana commented Aug 29, 2023

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poiana
Contributor

poiana commented Aug 29, 2023

Welcome @wangyongfeng5! It looks like this is your first PR to falcosecurity/libs 🎉

@poiana
Contributor

poiana commented Aug 29, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wangyongfeng5
Once this PR has been reviewed and has the lgtm label, please assign molter73 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana requested review from Molter73 and hbrueckner August 29, 2023 07:08
@github-actions

github-actions bot commented Aug 29, 2023

Please double check driver/SCHEMA_VERSION file. See versioning.

/hold

@FedeDP
Contributor

FedeDP commented Aug 29, 2023

Hi! Thanks for your contribution!
This is a rather interesting idea; I think you need to bump the driver/API_VERSION major (and SCAP_MINIMUM_API_VERSION).
Also, it would be great to have tests around the new feature; we already cover the sampling ratio feature: https://github.com/falcosecurity/libs/tree/master/test/drivers/test_suites/actions_suite; can you expand or fix them up?

@Andreagit97
Member

Seems like a nice improvement, tagging our sampling ratio expert @gnosek! Moreover, we are in the process of releasing a new libs version, not sure we can do this for the next tag but we will try to do our best!

@wangyongfeng5 wangyongfeng5 force-pushed the individual_dropping_ratio branch from 9bee2a4 to 5d1e8de on August 29, 2023 10:49
Contributor

@gnosek gnosek left a comment


Sorry to say, I'm not too excited about this PR as it currently stands. The issues I have with it are:

  • we really need DROP_[EX] events to contain the sampling ratio (getting rid of the sampling ratio obviously complicates this :))
  • replacing "the" sampling ratio with an array feels wrong. With your patch, the sampling ratio becomes an attribute of individual syscalls, so maybe it should be an extension of the ppm_sc_of_interest concept (accept/drop becomes accept / accept 1/n-th of the time / drop) rather than an extension of the global sampling ratio concept?

My suggestion would be to leave the global sampling ratio alone and add a separate per-syscall sampling step afterwards, so that the higher sampling ratio (global or per-syscall) wins, although you'd realistically only use one or the other.
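
A minimal sketch of that "higher ratio wins" step, under assumed names (consumer_settings, per_syscall_ratios, and effective_sampling_ratio are illustrative, not from this PR):

static u32 effective_sampling_ratio(const struct consumer_settings *settings, u32 syscall_id)
{
	u32 global = settings->sampling_ratio;                      /* the existing global ratio */
	u32 per_syscall = settings->per_syscall_ratios[syscall_id]; /* 1 means keep everything */

	/* a higher ratio drops more events, so the stricter one wins */
	return per_syscall > global ? per_syscall : global;
}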

Also, what's the end use case for this? What would you use the extra flexibility for?

}

vpr_info("new sampling ratio: %d\n", new_sampling_ratio);
vpr_info("new default sampling ratio: %d\n", new_sampling_ratio);
Contributor


This isn't really setting just the default; it's overwriting any previous per-syscall sampling ratios that were configured.

ret = 0;
goto cleanup_ioctl;
}
+case PPM_IOCTL_SET_DROPPING_RATIO:
Contributor


Bikeshedding names is always fun, but I'd rather keep the SAMPLING in the name here (e.g. PPM_IOCTL_SET_SYSCALL_SAMPLING_RATIO?). Otherwise I'd keep wondering if sampling_ratio and dropping_ratio are the same thing and why do we need two ioctls to manage them.

Also, please consider (not forcing this in any way, just please consider :)) passing a (two-u32) struct by pointer instead of unpacking the arguments from a single (by-value) u64.
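
A sketch of that struct-by-pointer alternative; the struct name and the handler wiring are assumptions for illustration, not the PR's code:

struct ppm_syscall_sampling {
	u32 syscall_id;
	u32 sampling_ratio;
};

/* in the ioctl handler, copy the struct from userspace instead of unpacking a u64 */
struct ppm_syscall_sampling req;
if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
	return -EFAULT;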

* ratio
*/
-return bpf_push_u32_to_ring(data, data->settings->sampling_ratio);
+return bpf_push_u32_to_ring(data, 0);
Contributor


If I'm correct, this is going to cause a lot of pain for us :( (we actually do rely on getting the sampling ratio in drop events) and it would require a major redesign once there's no single sampling ratio.

It feels like a way forward would be to keep the notion of a single sampling ratio and disable it only when a consumer uses the per-syscall ioctl (and we'd never do that then).

something like:

if(settings->sampling_ratio != PER_SYSCALL_SAMPLING_RATIO)
{
    /* global mode: one ratio shared by all syscalls */
    sampling_ratio = settings->sampling_ratio;
}
else
{
    /* per-syscall mode: look up this syscall's own ratio */
    sampling_ratio = settings->sampling_ratios[syscall_id];
}

and in the ioctl that sets per-syscall sampling ratios, just set settings->sampling_ratio = PER_SYSCALL_SAMPLING_RATIO too
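
Roughly, assuming PER_SYSCALL_SAMPLING_RATIO is a sentinel value no real ratio can take (the case label and field names here are illustrative):

case PPM_IOCTL_SET_SYSCALL_SAMPLING_RATIO:
	settings->sampling_ratios[syscall_id] = new_sampling_ratio;
	settings->sampling_ratio = PER_SYSCALL_SAMPLING_RATIO; /* switches reads into per-syscall mode */
	break;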

Author


If there are both global and per-syscall ratios, which one should be reported here? The global one? Or the ratio of the syscall being sampled? In the kmod driver it seems difficult to determine which syscall is being sampled, given a mechanism that may delay the insertion of drop events.

vpr_info("PPM_IOCTL_SET_DROPPING_RATIO, syscall(%u), ratio(%u), consumer %p\n", syscall_to_set, new_sampling_ratio, consumer_id);

if (syscall_to_set >= SYSCALL_TABLE_SIZE) {
	pr_err("invalid syscall %u\n", syscall_to_set);
Contributor


This will appear in the kernel log without any extra context, so maybe add a few words here (for example, that we're in this particular ioctl). I'm sure there's precedent for cryptic log messages but let's try to make things better one line at a time :)
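
For example, something along these lines (wording purely illustrative):

pr_err("PPM_IOCTL_SET_DROPPING_RATIO: invalid syscall %u (table size %u)\n",
       syscall_to_set, (u32)SYSCALL_TABLE_SIZE);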

/*=============================== COLLECT PARAMETERS ===========================*/

-ringbuf__store_u32(&ringbuf, maps__get_sampling_ratio());
+ringbuf__store_u32(&ringbuf, 0);
Contributor


Same as for the other engines, we can't work without a reliable sampling ratio report (as seen by the driver) :<

* ratio
*/
-res = val_to_ring(args, args->consumer->sampling_ratio, 0, false, 0);
+res = val_to_ring(args, 0, 0, false, 0);
Contributor


Same. We need it :(

Comment on lines +116 to +121
if(g_syscall_table[syscall_id].flags & (UF_NEVER_DROP | UF_ALWAYS_DROP)
   || g_syscall_table[syscall_id].flags == UF_NONE
   || !(g_syscall_table[syscall_id].flags & UF_USED))
{
	return 1;
}
Contributor


We don't check the event flags in pman_set_default_sampling_ratio, and I guess it's not critical here either (setting the sampling ratio for an unused event feels harmless?)

@wangyongfeng5
Author

wangyongfeng5 commented Aug 30, 2023

> Sorry to say, I'm not too excited about this PR as it currently stands. The issues I have with it are:
>
>   • we really need DROP_[EX] events to contain the sampling ratio (getting rid of the sampling ratio obviously complicates this :))
>   • replacing "the" sampling ratio with an array feels wrong. With your patch, the sampling ratio becomes an attribute of individual syscalls, so maybe it should be an extension of the ppm_sc_of_interest concept (accept/drop becomes accept / accept 1/n-th of the time / drop) rather than an extension of the global sampling ratio concept?
>
> My suggestion would be to leave the global sampling ratio alone and add a separate per-syscall sampling step afterwards, so that the higher sampling ratio (global or per-syscall) wins, although you'd realistically only use one or the other.
>
> Also, what's the end use case for this? What would you use the extra flexibility for?

My end use case: a user process makes some unreasonable system calls, such as calling accept on a non-blocking socket, which generates a large number of useless events, so we have to enable sampling. At the same time, we don't want to lose other useful syscalls, such as sendto, since they are important to the upper-level rules. There are two ways:

  1. Provide a method to protect certain syscalls from being discarded during sampling.
  2. Provide a method to set an individual sampling rate per syscall, setting the protected syscalls to 100% while also offering rates other than 100%.

This PR uses the second method (a sketch follows below); in fact, I have tried both.
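
A hedged userspace sketch of the second method (the device fd handling, the u64 packing order, and the ratio semantics are assumptions for illustration; per the review, the ioctl takes syscall id and ratio packed into a single by-value u64):

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>

/* driver_fd: an already-open fd on the driver device (setup omitted) */
void set_per_syscall_ratios(int driver_fd)
{
	/* keep sendto at full rate, sample the noisy accept 1-in-8 */
	uint64_t keep_sendto = ((uint64_t)SYS_sendto << 32) | 1; /* ratio 1: keep all */
	uint64_t drop_accept = ((uint64_t)SYS_accept << 32) | 8; /* keep 1 of every 8 */

	ioctl(driver_fd, PPM_IOCTL_SET_DROPPING_RATIO, keep_sendto);
	ioctl(driver_fd, PPM_IOCTL_SET_DROPPING_RATIO, drop_accept);
}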

So, I would like to know:
  1. Do you think the above usage scenario needs to be supported?
  2. If yes, which of the two methods do you think is more suitable?

@FedeDP
Contributor

FedeDP commented Feb 27, 2024

/remove-lifecycle stale

@poiana
Contributor

poiana commented May 27, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@Andreagit97
Member

/remove-lifecycle stale

@poiana
Contributor

poiana commented Aug 26, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@Andreagit97
Member

/remove-lifecycle stale

@poiana
Contributor

poiana commented Nov 25, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@poiana
Contributor

poiana commented Dec 25, 2024

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@FedeDP
Contributor

FedeDP commented Jan 2, 2025

/remove-lifecycle rotten

@poiana
Contributor

poiana commented Apr 2, 2025

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@poiana
Contributor

poiana commented May 2, 2025

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@FedeDP
Contributor

FedeDP commented May 5, 2025

/remove-lifecycle rotten

@poiana
Contributor

poiana commented Aug 3, 2025

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@FedeDP
Contributor

FedeDP commented Aug 4, 2025

/remove-lifecycle stale

@poiana
Contributor

poiana commented Nov 2, 2025

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale
