[receiver/journald] add a watchdog goroutine to restart journalctl when it hangs #44018
+154
−5
Description
systemd-journald can crash or corrupt journal files while rotating them, which can leave the `ReadBytes()` method stuck forever.

This PR introduces a "watchdog" goroutine that takes a `read_timeout` config and cancels the currently running journalctl process. The input then restarts journalctl. Based on our observation, restarting journalctl after the crash or file corruption resumes log consumption.

The overhead on consumption is mainly updating `lastRead`, which should be negligible most of the time. The watchdog goroutine runs at an interval of `read_timeout / 2`.

`read_timeout` is 0 by default, which disables the watchdog check and keeps the behaviour backward compatible.

Link to tracking issue
Fixes #44007.
Testing
Add a simple bash script that prints a line every second, and load it into systemd.

`log_every_second.sh`:

`log.service`:

Start the otelcol with the following config. Take note that `read_timeout` is set to 5s.

Run `systemctl stop log.service` and observe the otelcol's behaviour. The journalctl process will be cancelled after the `read_timeout` (5s in this case). Once we resume `log.service`, it will start to consume from it:

Documentation