high mutex contention in metric sums #7037

@cursedquail

Description

We recently tried adding some application metrics with the OpenTelemetry metrics SDK to one of our Kafka consumers. When we rolled the change out to our staging environment, the consumers immediately fell behind and couldn't keep up with the throughput of the Kafka partition.

Some profiling revealed that we were spending a lot of time contended on this mutex.

We were also using delta temporality for these metrics; I haven't evaluated whether cumulative sums perform better. It is possible that what we were actually seeing is the cost of recreating the values every minute with newRes. I'm not 100% convinced of that, because we aren't emitting the counters very often and the slowdown is very substantial.

Environment

  • OS: linux arm64
  • Go Version: 1.24.4
  • opentelemetry-go version: 1.37.0

Steps To Reproduce

I suspect that calling sum.Add(1) from a large number of goroutines in a hot loop will reproduce this issue.

I'll note that in our case every goroutine writes to an independent set of attributes. If the map used a rwlock and each value in the map had its own mutex, I suspect we would see little to no contention.
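To make the suggestion concrete, here is a minimal stdlib-only sketch of that layout. It is not the SDK's code: `shardedSum`, `add`, `value`, and `runWorkers` are hypothetical names, a plain string key stands in for an attribute set, and I swapped the per-value mutex for a `sync/atomic` counter since the value is a single integer. It ignores delta resets on collection entirely.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// shardedSum is a hypothetical stand-in for a sum aggregator.
// The RWMutex guards only the map; each value is updated atomically.
type shardedSum struct {
	mu     sync.RWMutex
	values map[string]*atomic.Int64 // key stands in for an attribute set
}

func newShardedSum() *shardedSum {
	return &shardedSum{values: make(map[string]*atomic.Int64)}
}

// add increments the counter for key, creating it on first use.
func (s *shardedSum) add(key string, n int64) {
	// Fast path: an existing series only needs the shared read lock.
	s.mu.RLock()
	v, ok := s.values[key]
	s.mu.RUnlock()

	if !ok {
		// Slow path: take the write lock and re-check, because another
		// goroutine may have inserted the key in the meantime.
		s.mu.Lock()
		if v, ok = s.values[key]; !ok {
			v = new(atomic.Int64)
			s.values[key] = v
		}
		s.mu.Unlock()
	}
	v.Add(n)
}

// value returns the current sum for key, or 0 if the key is unknown.
func (s *shardedSum) value(key string) int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if v, ok := s.values[key]; ok {
		return v.Load()
	}
	return 0
}

// runWorkers mimics our workload: each goroutine adds to its own key
// in a hot loop.
func runWorkers(workers, iterations int) *shardedSum {
	s := newShardedSum()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			key := fmt.Sprintf("worker-%d", id)
			for i := 0; i < iterations; i++ {
				s.add(key, 1)
			}
		}(w)
	}
	wg.Wait()
	return s
}

func main() {
	s := runWorkers(8, 100000)
	fmt.Println(s.value("worker-0"))
}
```

With this shape, writers to existing keys only share the read lock, and the exclusive lock is taken once per new key. The read lock is still a shared point, so I would expect less contention rather than none; I haven't measured it against the SDK.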

Labels

  • area:metrics (Part of OpenTelemetry Metrics)
  • enhancement (New feature or request)
  • pkg:SDK (Related to an SDK package)
