

@namco1992
Contributor

Description

systemd-journald can crash or corrupt journal files when rotating them, which can leave the `ReadBytes()` method stuck forever.

This PR introduces a "watchdog" goroutine that takes a `read_timeout` config and cancels the currently running journalctl process when no bytes have been read within that timeout. The input then restarts journalctl. Based on our observation, restarting journalctl after a crash or file corruption resumes log consumption.

The overhead on consumption is mainly updating `lastRead`, which should be negligible most of the time. The watchdog goroutine runs at an interval of `read_timeout / 2`.

`read_timeout` defaults to 0, which disables the watchdog check and keeps the behaviour backward compatible.
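
For illustration, here is a minimal sketch of the watchdog pattern described above. It is not the PR's actual code: the names (`watchdog`, `lastRead`, `cancelCmd`) and the journalctl flags are assumptions; the real implementation lives in the journald input operator.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"sync/atomic"
	"time"
)

// watchdog cancels the journalctl command's context if lastRead (unix nanos)
// has not advanced within readTimeout. It wakes up every readTimeout/2 and
// exits when ctx is done. A readTimeout of 0 disables the check entirely.
func watchdog(ctx context.Context, cancelCmd context.CancelFunc, lastRead *atomic.Int64, readTimeout time.Duration) {
	if readTimeout == 0 {
		return // backward-compatible default: no watchdog
	}
	ticker := time.NewTicker(readTimeout / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			last := time.Unix(0, lastRead.Load())
			if time.Since(last) > readTimeout {
				cancelCmd() // journalctl appears stuck; kill it so the caller can restart it
				return
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var lastRead atomic.Int64
	lastRead.Store(time.Now().UnixNano())

	// Run journalctl under a cancellable context; the watchdog kills it if no
	// bytes are read within 5s. A real reader would store time.Now().UnixNano()
	// into lastRead after every successful ReadBytes call.
	cmd := exec.CommandContext(ctx, "journalctl", "--follow", "--output=json")
	go watchdog(ctx, cancel, &lastRead, 5*time.Second)

	if err := cmd.Run(); err != nil {
		fmt.Println("journalctl exited:", err) // e.g. "signal: killed" after cancellation
	}
}
```

The only per-record cost on the hot path is the store into `lastRead`, which is why the overhead on consumption stays negligible.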

Link to tracking issue

Fixes #44007.

Testing

Add a simple bash script that prints a line every second, and load it into systemd as a service.

`log_every_second.sh`:

```bash
#!/bin/bash
# Emit one log line per second so journald always has something to read.
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:

```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```
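
If helpful, one way to install and start the test service (paths match the snippets above; adjust as needed):

```bash
# Install the script and unit, then start the service.
sudo install -m 755 log_every_second.sh /usr/local/bin/log_every_second.sh
sudo cp log.service /etc/systemd/system/log.service
sudo systemctl daemon-reload
sudo systemctl start log.service

# Confirm a line arrives every second.
journalctl -u log.service -f
```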

Start the otelcol with the following config. Note that `read_timeout` is set to `5s`.

```yaml
service:
  extensions: [file_storage/state]
  telemetry:
    logs:
      level: info
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:
    storage: file_storage/state
    read_timeout: 5s

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 0

extensions:
  file_storage/state:
    directory: /tmp
```
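
Then run the collector with that config. The binary name below is an assumption; any build that includes the journald receiver works:

```bash
# Assumes the config above is saved as config.yaml.
otelcol-contrib --config config.yaml
```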

Run `systemctl stop log.service` and observe the otelcol's behaviour. The journalctl process will be cancelled after the `read_timeout` (5s in this case). Once the `log.service` is resumed, the receiver starts consuming from it again:

```
2025-11-04T02:01:04.248Z	warn	journald/input.go:184	journalctl read exceeded timeout, canceling command	{"otelcol.component.id": "journald", "otelcol.component.kind": "receiver", "otelcol.signal": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "last_read": "2025-11-04T02:00:59.247Z", "read_timeout": 5}
2025-11-04T02:01:04.249Z	error	journald/input.go:104	journalctl command exited	{"otelcol.component.id": "journald", "otelcol.component.kind": "receiver", "otelcol.signal": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: killed"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/operator/input/journald/input.go:104
2025-11-04T02:01:07.904Z	info	Logs	{"otelcol.component.id": "debug", "otelcol.component.kind": "exporter", "otelcol.signal": "logs", "resource logs": 1, "log records": 9}
```

Documentation

