Component(s)
receiver/journald
What happened?
Description
We've been experiencing a strange bug where the journald receiver stops consuming logs. We took a goroutine profile and noticed that it is stuck at ReadBytes in the stdout reader goroutine:
goroutine 173 [IO wait, 11861 minutes]:
internal/poll.runtime_pollWait(0x7f1915a59820, 0x72)
	runtime/netpoll.go:351 +0x85
internal/poll.(*pollDesc).wait(0xc007b579e0?, 0xc007d6f000?, 0x1)
	internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc007b579e0, {0xc007d6f000, 0x1000, 0x1000})
	internal/poll/fd_unix.go:165 +0x27a
os.(*File).read(...)
	os/file_posix.go:29
os.(*File).Read(0xc00112acf0, {0xc007d6f000?, 0xc0019aa4b8?, 0x55f93f5fa660?})
	os/file.go:124 +0x4f
bufio.(*Reader).fill(0xc0010cff50)
	bufio/bufio.go:113 +0x103
bufio.(*Reader).ReadSlice(0xc0010cff50, 0xa)
	bufio/bufio.go:380 +0x29
bufio.(*Reader).collectFragments(0xc0010cff50, 0xa)
	bufio/bufio.go:455 +0x70
bufio.(*Reader).ReadBytes(0xc000beee10?, 0x90?)
	bufio/bufio.go:482 +0x17
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).runJournalctl.func2()
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.125.0/operator/input/journald/input.go:180 +0x21a
created by github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).runJournalctl in goroutine 170
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.125.0/operator/input/journald/input.go:174 +0x105
It has been waiting for 11861 minutes, which is roughly 8.23 days. There were certainly logs produced during those 8 days.
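For context, the blocked goroutine is the one that reads journalctl's stdout line by line. A minimal sketch of that pattern (illustrative only; the flags and names below are my own simplification, not the upstream implementation) shows why a journalctl that stays alive but goes silent blocks the reader forever:

```go
package main

import (
	"bufio"
	"log"
	"os/exec"
)

func main() {
	// Follow the journal roughly the way the receiver does: one JSON entry per line.
	cmd := exec.Command("journalctl", "--follow", "--output=json")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	reader := bufio.NewReader(stdout)
	for {
		// ReadBytes blocks until a newline arrives or the pipe is closed.
		// If journalctl stays alive but stops emitting entries (e.g. after
		// the active journal file is corrupted during rotation), this call
		// blocks indefinitely and the goroutine sits in "IO wait", exactly
		// as in the profile above.
		line, err := reader.ReadBytes('\n')
		if err != nil {
			log.Printf("journalctl stdout closed: %v", err)
			return
		}
		_ = line // a real consumer would parse and forward the entry here
	}
}
```

Because nothing ever arrives on the pipe and journalctl never exits, neither the read error path nor the existing crash handling is triggered.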
Since the journalctl process is still running, I took a look at the systemd-journald.service:
● systemd-journald.service - Journal Service
Loaded: loaded (/usr/lib/systemd/system/systemd-journald.service; static)
Drop-In: /etc/systemd/system/systemd-journald.service.d
└─override.conf
Active: active (running) since Mon 2025-07-14 02:29:42 UTC; 1 week 1 day ago
Invocation: ba97a48cca2b43c68eb0fa46f5efd2f9
TriggeredBy: ● systemd-journald.socket
● systemd-journald-dev-log.socket
Docs: man:systemd-journald.service(8)
man:journald.conf(5)
Main PID: 16228 (systemd-journal)
Status: "Processing requests..."
IO: 25G read, 46.7G written
Tasks: 1 (limit: 307720)
FD Store: 37 (limit: 4224)
Memory: 55.8M (peak: 87.3M)
CPU: 4min 57.588s
CGroup: /system.slice/systemd-journald.service
└─16228 /usr/lib/systemd/systemd-journald

Jul 14 02:41:35 36ssds1683 systemd-journald[16228]: /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal: Jo>
Jul 14 02:41:35 36ssds1683 systemd-journald[16228]: Failed to create new system journal: No such file or directory
Notice: journal has been rotated since unit was started, output may be incomplete.

It seems something went wrong on Jul 14 02:41:35, which almost matches the time when the otel collector goroutine stopped receiving anything from journalctl.
I ran journalctl --verify and found a corrupted file too:
3b2eb0: Data object references invalid entry at 319d5d0
File corruption detected at /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal:52024480 (of 58720256 bytes, 88%).
FAIL: /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal (Bad message)

To summarize, it seems journal files can get corrupted when journald rotates them, which results in journalctl hanging rather than crashing. We had an upstream patch (#35635) to detect a journalctl crash and restart it, but it can't catch this case because journalctl is technically "running"; it just isn't doing anything.
Besides journal file corruption, another common cause we've observed is machine reboots. In our environment, the journal files might not be closed properly, or the active journal file might be recreated during startup, which also leads to journalctl hanging. I haven't dug deeper on the systemd-journald side, though.
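One possible direction (a rough sketch only, assuming an idle-timeout heuristic is acceptable; the maxIdle value, the watchdog loop, and all names here are hypothetical and not part of the current receiver) is to treat a long silence as a stall, kill journalctl so ReadBytes unblocks, and start it again:

```go
package main

import (
	"bufio"
	"log"
	"os/exec"
	"sync/atomic"
	"time"
)

func main() {
	// Hypothetical idle timeout: how long journalctl may stay silent before
	// we assume it is wedged (e.g. after journal file corruption).
	const maxIdle = 10 * time.Minute

	for {
		cmd := exec.Command("journalctl", "--follow", "--output=json")
		stdout, err := cmd.StdoutPipe()
		if err != nil {
			log.Fatal(err)
		}
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}

		var lastRead atomic.Int64
		lastRead.Store(time.Now().UnixNano())

		done := make(chan struct{})
		go func() {
			defer close(done)
			reader := bufio.NewReader(stdout)
			for {
				line, err := reader.ReadBytes('\n')
				if err != nil {
					return // pipe closed: journalctl exited or was killed
				}
				lastRead.Store(time.Now().UnixNano())
				_ = line // forward the entry downstream here
			}
		}()

		// Watchdog: if journalctl has been silent for longer than maxIdle,
		// kill it. That closes the pipe, unblocks ReadBytes, and lets the
		// outer loop start a fresh journalctl (ideally from a saved cursor).
		ticker := time.NewTicker(30 * time.Second)
	wait:
		for {
			select {
			case <-done:
				break wait
			case <-ticker.C:
				if idle := time.Since(time.Unix(0, lastRead.Load())); idle > maxIdle {
					log.Printf("no journal output for %s, restarting journalctl", idle)
					_ = cmd.Process.Kill()
				}
			}
		}
		ticker.Stop()
		_ = cmd.Wait()
	}
}
```

The trade-off is choosing maxIdle: a host that legitimately emits no journal entries for that long would get a spurious restart, so a real implementation would presumably make the timeout configurable and resume from a saved cursor to avoid dropping or duplicating entries.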
Steps to Reproduce
I have no reliable way to reproduce it, but we've observed this happening on many instances.
Expected Result
The journald receiver should retry consuming from the journal when it has been stalled for too long.
Actual Result
The journald receiver hangs forever until the collector is restarted.
Collector version
v0.125.0
Environment information
Environment
OS: Linux 6.12.56 kernel
Compiler (if manually compiled): go 1.23
OpenTelemetry Collector configuration
Log output
Additional context
Related journald issues:
systemd/systemd#36247
systemd/systemd#27489