[receiver/journald] add a watchdog goroutine to restart journalctl when it hangs #44007

@namco1992

Component(s)

receiver/journald

What happened?

Description

We've been experiencing an odd bug where the journald receiver stops consuming logs. We took a goroutine profile and noticed that it was stuck in ReadBytes in the stdout reader goroutine:

goroutine 173 [IO wait, 11861 minutes]:
internal/poll.runtime_pollWait(0x7f1915a59820, 0x72)
	runtime/netpoll.go:351 +0x85
internal/poll.(*pollDesc).wait(0xc007b579e0?, 0xc007d6f000?, 0x1)
	internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc007b579e0, {0xc007d6f000, 0x1000, 0x1000})
	internal/poll/fd_unix.go:165 +0x27a
os.(*File).read(...)
	os/file_posix.go:29
os.(*File).Read(0xc00112acf0, {0xc007d6f000?, 0xc0019aa4b8?, 0x55f93f5fa660?})
	os/file.go:124 +0x4f
bufio.(*Reader).fill(0xc0010cff50)
	bufio/bufio.go:113 +0x103
bufio.(*Reader).ReadSlice(0xc0010cff50, 0xa)
	bufio/bufio.go:380 +0x29
bufio.(*Reader).collectFragments(0xc0010cff50, 0xa)
	bufio/bufio.go:455 +0x70
bufio.(*Reader).ReadBytes(0xc000beee10?, 0x90?)
	bufio/bufio.go:482 +0x17
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).runJournalctl.func2()
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.125.0/operator/input/journald/input.go:180 +0x21a
created by github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).runJournalctl in goroutine 170
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.125.0/operator/input/journald/input.go:174 +0x105

It had been waiting for 11861 minutes, roughly 8.2 days, even though logs were certainly produced during that period.
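For reference, the read path boils down to a blocking ReadBytes loop on journalctl's stdout pipe. The snippet below is a simplified, self-contained sketch of that pattern (not the actual input.go code); it shows why the goroutine parks forever when journalctl stays alive but stops emitting lines:

package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

func main() {
	// journalctl in follow mode only exits on error or signal.
	cmd := exec.Command("journalctl", "--follow", "--output=json")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	reader := bufio.NewReader(stdout)
	for {
		// Blocks until a full line arrives or the pipe is closed. There is
		// no deadline, so if journalctl hangs (e.g. on a corrupted journal
		// file) this call never returns and nothing upstream notices.
		line, err := reader.ReadBytes('\n')
		if err != nil {
			// Pipe closed: this is the case the crash-restart logic from
			// #35635 already handles.
			return
		}
		fmt.Print(string(line))
	}
}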

Since the journalctl process was still running, I took a look at systemd-journald.service:

● systemd-journald.service - Journal Service
     Loaded: loaded (/usr/lib/systemd/system/systemd-journald.service; static)
    Drop-In: /etc/systemd/system/systemd-journald.service.d
             └─override.conf
     Active: active (running) since Mon 2025-07-14 02:29:42 UTC; 1 week 1 day ago
 Invocation: ba97a48cca2b43c68eb0fa46f5efd2f9
TriggeredBy: ● systemd-journald.socket
             ● systemd-journald-dev-log.socket
       Docs: man:systemd-journald.service(8)
             man:journald.conf(5)
   Main PID: 16228 (systemd-journal)
     Status: "Processing requests..."
         IO: 25G read, 46.7G written
      Tasks: 1 (limit: 307720)
   FD Store: 37 (limit: 4224)
     Memory: 55.8M (peak: 87.3M)
        CPU: 4min 57.588s
     CGroup: /system.slice/systemd-journald.service
             └─16228 /usr/lib/systemd/systemd-journald

Jul 14 02:41:35 36ssds1683 systemd-journald[16228]: /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal: Jo>
Jul 14 02:41:35 36ssds1683 systemd-journald[16228]: Failed to create new system journal: No such file or directory
Notice: journal has been rotated since unit was started, output may be incomplete.

It seems something went wrong at Jul 14 02:41:35, which closely matches the time when the otel collector goroutine stopped receiving anything from journalctl.

I ran journalctl --verify and found a corrupted file too:

3b2eb0: Data object references invalid entry at 319d5d0
File corruption detected at /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal:52024480 (of 58720256 bytes, 88%).
FAIL: /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal (Bad message) 

To summarize, journal files can apparently become corrupted when journald rotates them, and that causes journalctl to hang rather than crash. We previously upstreamed a patch to catch a journalctl crash and restart it (#35635), but it cannot catch this case because journalctl is technically "running", just not producing any output.

Besides journal file corruption, another common cause we observed is machine reboots. In our environment, the journal files might not be closed properly, or the active journal file might be recreated during startup, which also leads to journalctl hanging. I haven't dug deeper on the systemd-journald side, though.
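A watchdog goroutine on the receiver side would cover both cases. Below is a minimal sketch of the idea, assuming a hypothetical lastRead timestamp stamped by the reader goroutine after every successful ReadBytes and a configurable timeout; the names are illustrative, not the receiver's actual fields. Killing the journalctl process turns the silent hang into the exit that the existing restart logic from #35635 already knows how to handle:

package journald

import (
	"context"
	"os/exec"
	"sync/atomic"
	"time"
)

// watchdog kills journalctl if the reader has not stamped lastRead for
// longer than timeout. lastRead holds a UnixNano timestamp that the stdout
// reader goroutine updates after every successful ReadBytes.
func watchdog(ctx context.Context, cmd *exec.Cmd, lastRead *atomic.Int64, timeout time.Duration) {
	ticker := time.NewTicker(timeout / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			idle := time.Since(time.Unix(0, lastRead.Load()))
			if idle > timeout {
				// journalctl is technically "running" but has produced
				// nothing for too long: kill it so the existing
				// wait-and-restart path spawns a fresh process.
				_ = cmd.Process.Kill()
				return
			}
		}
	}
}

One caveat: a legitimately quiet journal would also trip this check, so the timeout needs to be generous, or the restart needs to be cheap and cursor-preserving, to avoid needless churn.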

Steps to Reproduce

I have no reliable way to reproduce it, but we've observed it happening on many instances.

Expected Result

The journald receiver should restart journalctl and resume consuming from the journal when it has hung for too long.
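To avoid losing or duplicating entries on such a restart, the new journalctl process could resume from the last cursor the receiver has seen (journald exposes it in the __CURSOR field of each JSON entry, and journalctl accepts it via --after-cursor). A small illustrative sketch, with the cursor plumbing assumed rather than taken from the receiver's code:

package main

import (
	"fmt"
	"os/exec"
)

// newJournalctlCmd builds a follow-mode journalctl command, resuming right
// after the given cursor when one is available so a watchdog-triggered
// restart picks up where the previous process left off.
func newJournalctlCmd(cursor string) *exec.Cmd {
	args := []string{"--utc", "--output=json", "--follow"}
	if cursor != "" {
		args = append(args, "--after-cursor="+cursor)
	}
	return exec.Command("journalctl", args...)
}

func main() {
	// Placeholder cursor value, for illustration only.
	cmd := newJournalctlCmd("s=0123abcd;i=1f00;b=deadbeef")
	fmt.Println(cmd.Args)
}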

Actual Result

The journald receiver hangs indefinitely until the collector is restarted.

Collector version

v0.125.0

Environment information

Environment

OS: Linux, kernel 6.12.56
Compiler (if manually compiled): go 1.23

OpenTelemetry Collector configuration

Log output

Additional context

Related journald issues:

systemd/systemd#36247
systemd/systemd#27489
