
Conversation

@pinoOgni
Contributor

This PR continues the work done in #801 and it:

  • adds support for the _msearch, _bulk and _doc Elasticsearch operations.
  • removes the extractDBQueryText function: we now just return the body without checking that it is valid JSON, because from an observability point of view it is still an Elasticsearch request/response. This way we gain performance and still provide the user with the query of the request.
  • updates the tests.
  • updates operation name extraction (a rough sketch of the idea follows this list).
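For context, Elasticsearch encodes the operation in the request path (/_msearch, /_bulk, /<index>/_doc/<id>, ...), so the operation name can be derived from the path rather than from the (possibly NDJSON) body. Below is a minimal Go sketch of that idea; elasticOperation is a hypothetical helper written for illustration, not the function used in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// elasticOperation returns the Elasticsearch operation ("_search", "_msearch",
// "_bulk", "_doc", ...) encoded in a request path, or "" when the path does
// not contain one. Simplified: query strings and URL escaping are ignored.
func elasticOperation(path string) string {
	for _, seg := range strings.Split(path, "/") {
		if strings.HasPrefix(seg, "_") {
			return seg
		}
	}
	return ""
}

func main() {
	for _, p := range []string{"/products/_search", "/_msearch", "/_bulk", "/products/_doc/1"} {
		fmt.Printf("%-22s -> %q\n", p, elasticOperation(p))
	}
}
```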

@pinoOgni requested a review from a team as a code owner October 31, 2025 13:03
@codecov

codecov bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 48.27586% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.16%. Comparing base (2dd80df) to head (2f0bb21).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
pkg/ebpf/common/http/elasticsearch.go | 48.27% | 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #853      +/-   ##
==========================================
+ Coverage   46.27%   55.16%   +8.88%     
==========================================
  Files         251      251              
  Lines       21483    21481       -2     
==========================================
+ Hits         9942    11850    +1908     
+ Misses      10899     8815    -2084     
- Partials      642      816     +174     
Flag | Coverage Δ
integration-test | 23.28% <0.00%> (?)
integration-test-arm | 0.00% <0.00%> (?)
integration-test-vm-${ARCH}-${KERNEL_VERSION} | 0.00% <0.00%> (?)
k8s-integration-test | 2.76% <0.00%> (?)
oats-test | 0.00% <0.00%> (?)
unittests | 46.27% <48.27%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@mmat11
Contributor

mmat11 commented Oct 31, 2025

Contributor

@grcevski left a comment

LGTM! Great work, good to merge as soon as tests pass.

@grcevski
Contributor

The shard-4 failure looks related, it's failing in the ES test.

@pinoOgni
Contributor Author

The shard-4 failure looks related, it's failing in the ES test.

Let me investigate it!

@pinoOgni marked this pull request as draft October 31, 2025 15:15
Contributor

@NimrodAvni78 left a comment

nice!

@pinoOgni
Contributor Author

pinoOgni commented Nov 3, 2025

Hi everyone, I've been doing some research on the tests that fail.
Locally, they always work, so I looked at the logs from the CI (thanks to @mmat11) and noticed that in some cases there seem to be conflicts with the kprobe/tcp_sendmsg probe. So I think the tests are flaky.

@mmat11
Contributor

mmat11 commented Nov 3, 2025

Hi everyone, I've been doing some research on the tests that fail. Locally, they always work, so I looked at the logs from the CI (and thanks to @mmat11) and noticed that in some cases there seems to be conflicts between the kprobe/unix_stream_sendmsg and kprobe/tcp_sendmsg probes. So I think tests are flaky.

edit: we are still looking for the root cause. This is very hard to investigate locally because it's somehow triggered more frequently in CI.
I suggest flagging this integration test as flaky and opening a follow-up issue to re-enable it once the root cause is fixed.

@grcevski
Contributor

grcevski commented Nov 4, 2025

Hi everyone, I've been doing some research on the tests that fail. Locally, they always work, so I looked at the logs from the CI (and thanks to @mmat11) and noticed that in some cases there seems to be conflicts between the kprobe/unix_stream_sendmsg and kprobe/tcp_sendmsg probes. So I think tests are flaky.

edit: we are still looking for the root cause. This is very hard to investigate locally because it's somehow triggered more frequently in CI. I suggest flagging this integration test as flaky and opening a follow-up issue to re-enable it once the root cause is fixed.

I'm good with that plan, let's disable the test if you'd like and continue on another PR.

@grcevski
Contributor

grcevski commented Nov 4, 2025

I can also take a look at the BPF logs to see if anything is weird. unix_stream_sendmsg should only be triggering for unix sockets; I wonder if some other traffic somehow passes through there and impacts this code path.

@pinoOgni
Contributor Author

pinoOgni commented Nov 4, 2025

I can also take a look at the BPF logs to see if anything is weird. unix_stream_sendmsg should only be triggering for unix sockets; I wonder if some other traffic somehow passes through there and impacts this code path.

Hi @grcevski, my comment on the unix_stream_sendmsg was wrong. I updated the comment today.

@grcevski
Contributor

grcevski commented Nov 4, 2025

No problem, let me know what you want to do. I re-ran the test one more time to see if it passes and it failed again in the same spot. I haven't had time today to look into it to see if I can spot the issue. I'm at a company offsite this week, but I'll try tomorrow.

@pinoOgni force-pushed the pgn/elasticsearch branch 2 times, most recently from 1a214e2 to 7c00bb1 on November 5, 2025 11:22
@grcevski
Contributor

grcevski commented Nov 5, 2025

Increasing the timeout seems to have helped. The integration tests passed, and the only remaining failure seems to be a compile issue in the unit tests. The k8s netolly test failure was a temporary issue and I've re-started it.

@pinoOgni
Contributor Author

pinoOgni commented Nov 5, 2025

Increasing the timeout seems to have helped. The integration tests passed, and the only remaining failure seems to be a compile issue in the unit tests. The k8s netolly test failure was a temporary issue and I've re-started it.

Hi @grcevski, we tried increasing the timeout but the tests still failed, so we decided to comment out the call to the testElasticsearchDoc function (I forgot to reset the timeout to its original value).

Basically, we discovered that the POST request of type _doc never returns, so OBI reports error 408 after the timeout. Without a response, OBI cannot establish that it is an Elasticsearch request/response and does not create the corresponding span, so the assert fails because there is nothing in Jaeger.

Unfortunately we were unable to understand why this happens. I ran the tests locally several times; they always pass and I can see the span for the _doc operation. I also checked the logs of the first PR, where the _doc request succeeds, which is strange because that part has remained the same.

I would like to try just one last thing to test the _doc operation; if it doesn't work I'll push everything as is.
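For reference, the _doc operation discussed above is a plain index-document request. Here is a hedged reproduction of roughly what such a call looks like; the endpoint, index name and document body are assumptions, not the test's actual values.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// POST /<index>/_doc indexes a document; Elasticsearch normally answers
	// with 201 Created. In CI this call appeared to get no response at all,
	// so OBI reported a synthetic 408 after its timeout and emitted no span.
	body := strings.NewReader(`{"title": "hello"}`)
	resp, err := http.Post("http://localhost:9200/products/_doc", "application/json", body)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```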

@grcevski
Contributor

grcevski commented Nov 5, 2025

Hm okay, one thing to try is to set high_request_volume to true for the test, which will make us terminate the response right away rather than waiting. I noticed a flaky behaviour caused by this last week but haven't been able to look into it yet. We have userspace code that checks for stuck requests, but I suspect it doesn't work for some reason.

@pinoOgni
Contributor Author

pinoOgni commented Nov 5, 2025

Apart from one test not related to this PR, everything works.

To confirm that the spans are created correctly, we now only check that the Elasticsearch server responds in a way that lets OBI inspect the header and label the traffic as ES; the spans are therefore still created correctly.

However, I still can't understand why the Elasticsearch server in CI doesn't respond to POSTs of the _doc type.
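As an illustration of that kind of check: Elasticsearch (7.14+) sets an X-Elastic-Product: Elasticsearch header on its responses, so a test can verify that the server answers in a recognisably Elasticsearch way. Whether OBI keys on this exact header is an assumption here; the snippet below is only a sketch of the idea.

```go
package main

import (
	"fmt"
	"net/http"
)

// looksLikeElasticsearch reports whether a response carries the product
// header that Elasticsearch sets on its replies.
func looksLikeElasticsearch(h http.Header) bool {
	return h.Get("X-Elastic-Product") == "Elasticsearch"
}

func main() {
	resp, err := http.Get("http://localhost:9200/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("looks like Elasticsearch:", looksLikeElasticsearch(resp.Header))
}
```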

@pinoOgni marked this pull request as ready for review November 5, 2025 14:42
@MrAlias merged commit 8cf8ec9 into open-telemetry:main Nov 5, 2025
63 of 64 checks passed