Description:
I've been running InfluxDB 2 for several years and can't immediately move to InfluxDB 3 -- too much reliance on Flux. I've built a meta-monitoring stack (Influx monitoring itself) and realised that the storage_shard_write_err_sum field (before #25056 is fixed) increments every time there is a write timeout.
Steps to reproduce:
- Push data to InfluxDB
- Add latency while a batch is being processed -- I/O contention, memory pressure/swapping, stopping the process, etc. -- or, alternatively, set a very strict HTTP write timeout (e.g. 1 ms)
- Monitor the storage_shard_write_err_sum metric (a load-generator sketch follows this list)
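For concreteness, here is a minimal Go sketch of the first step: a load generator that pushes batches of line protocol at the v2 write API. The URL, org, bucket, token, and batch sizes are placeholders for your own deployment; pair it with the I/O pressure or strict HTTP write timeout from the second step.

```go
// Hypothetical load generator: hammers the v2 write API so that shard
// writes start timing out once the storage layer is under I/O pressure.
// The URL, org, bucket and token below are placeholders.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	url := "http://localhost:8086/api/v2/write?org=my-org&bucket=my-bucket&precision=ns"

	for i := 0; ; i++ {
		// Build a batch of line-protocol points per request.
		var body bytes.Buffer
		now := time.Now().UnixNano()
		for j := 0; j < 5000; j++ {
			fmt.Fprintf(&body, "repro,series=%d value=%d %d\n", j, i, now+int64(j))
		}

		req, err := http.NewRequest(http.MethodPost, url, &body)
		if err != nil {
			panic(err)
		}
		req.Header.Set("Authorization", "Token my-token")

		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("write failed:", err) // client-side timeouts surface here
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusNoContent {
			fmt.Println("unexpected status:", resp.Status)
		}
	}
}
```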
Expected behaviour:
The storage_shard_write_err field should only increase for true errors that the client cannot do anything about (e.g. out of disk space), not for client-retriable errors (like timeouts).
To be clear: I'm not really concerned about performance here; I'm happy for the timeouts to happen as a signal of backpressure/partial writes. They just shouldn't trigger monitoring alerts (cf. 4xx vs 5xx HTTP errors).
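For illustration, here is a minimal Go sketch of the kind of meta-monitoring check that trips today, assuming the counter is exposed on the standard Prometheus /metrics endpoint and summed across shards (the URL and 30-second interval are made up). Any increase fires an alert, which is exactly why counting retriable timeouts in this metric is a problem.

```go
// Minimal sketch of a meta-monitoring check, assuming InfluxDB's
// /metrics endpoint exposes one storage_shard_write_err_sum sample per
// shard. The endpoint URL and poll interval are placeholders.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// scrapeErrSum sums every storage_shard_write_err_sum sample in one scrape.
func scrapeErrSum(url string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Matches "storage_shard_write_err_sum{...} <value>" lines;
		// HELP/TYPE comment lines start with '#' and are skipped.
		if !strings.HasPrefix(line, "storage_shard_write_err_sum") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	const metrics = "http://localhost:8086/metrics"
	prev, _ := scrapeErrSum(metrics)
	for {
		time.Sleep(30 * time.Second)
		cur, err := scrapeErrSum(metrics)
		if err != nil {
			continue
		}
		if cur > prev {
			// Fires on every write timeout today, even though the
			// client can simply retry -- the behaviour this issue is about.
			fmt.Printf("ALERT: shard write errors increased by %v\n", cur-prev)
		}
		prev = cur
	}
}
```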
Actual behaviour:
The metric increases every time a timeout happens.
Environment info:
- System info: Ubuntu 24.04, running kernel 6.14.0-33-generic
- Deployed on Kubernetes
- Other relevant environment details: 2x 6-wide RAIDZ2 on ZFS, exposed to InfluxDB as a volume mount (via zfs-localpv). The pool is platter-based, so I/O-bound applications take a performance hit when IOPS are hammered.
Config:
The behaviour does not depend on any particular configuration.