[receiver/journald] add a watchdog goroutine to restart journalctl when it hangs #44018

namco1992 · 2025-11-04T08:20:09Z

Description

systemd-journald can crash or corrupt journal files while rotating them, which can leave the ReadBytes() call stuck forever.

This PR introduces a "watchdog" goroutine that takes a read_timeout config and cancels the currently running journalctl process when a read exceeds that timeout. The input then restarts journalctl. In our observation, restarting journalctl after a crash or file corruption resumes log consumption.

The overhead on the read path is mainly updating lastRead, which should be negligible most of the time. The watchdog goroutine runs its check at an interval of read_timeout / 2.

read_timeout defaults to 0, which disables the watchdog check and keeps the behaviour backward compatible.

Link to tracking issue

Fixes #44007.

Testing

Add a simple bash script that prints a line every second, and load it into systemd as a service.

log_every_second.sh:

#!/bin/bash
# Print a timestamped line every second so journald has a steady stream of logs.
while true; do
    echo "Log message: $(date)"
    sleep 1
done

log.service:

[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Start the otelcol with the following config. Take note that the read_timeout is set to 5s.

service:
  extensions: [file_storage/state]
  telemetry:
    logs:
      level: info
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:
    storage: file_storage/state
    read_timeout: 5s

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 0

extensions:
  file_storage/state:
    directory: /tmp

Run systemctl stop log.service and observe otelcol's behaviour. The journalctl process is cancelled once read_timeout elapses (5s in this case). After log.service is started again, the receiver resumes consuming its logs:

2025-11-04T02:01:04.248Z	warn	journald/input.go:184	journalctl read exceeded timeout, canceling command	{"otelcol.component.id": "journald", "otelcol.component.kind": "receiver", "otelcol.signal": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "last_read": "2025-11-04T02:00:59.247Z", "read_timeout": 5}
2025-11-04T02:01:04.249Z	error	journald/input.go:104	journalctl command exited	{"otelcol.component.id": "journald", "otelcol.component.kind": "receiver", "otelcol.signal": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: killed"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/operator/input/journald/input.go:104
2025-11-04T02:01:07.904Z	info	Logs	{"otelcol.component.id": "debug", "otelcol.component.kind": "exporter", "otelcol.signal": "logs", "resource logs": 1, "log records": 9}

Documentation

Signed-off-by: Mengnan Gong <[email protected]>
