[receiver/journald] add a watchdog goroutine to restart journalctl when it hangs #44007

@namco1992

Component(s)

receiver/journald

What happened?

Description

We've been experiencing an odd bug where the journald receiver stops consuming logs. We took a goroutine profile and noticed that it was stuck in ReadBytes in the stdout reader goroutine:

goroutine 173 [IO wait, 11861 minutes]:
internal/poll.runtime_pollWait(0x7f1915a59820, 0x72)
	runtime/netpoll.go:351 +0x85
internal/poll.(*pollDesc).wait(0xc007b579e0?, 0xc007d6f000?, 0x1)
	internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc007b579e0, {0xc007d6f000, 0x1000, 0x1000})
	internal/poll/fd_unix.go:165 +0x27a
os.(*File).read(...)
	os/file_posix.go:29
os.(*File).Read(0xc00112acf0, {0xc007d6f000?, 0xc0019aa4b8?, 0x55f93f5fa660?})
	os/file.go:124 +0x4f
bufio.(*Reader).fill(0xc0010cff50)
	bufio/bufio.go:113 +0x103
bufio.(*Reader).ReadSlice(0xc0010cff50, 0xa)
	bufio/bufio.go:380 +0x29
bufio.(*Reader).collectFragments(0xc0010cff50, 0xa)
	bufio/bufio.go:455 +0x70
bufio.(*Reader).ReadBytes(0xc000beee10?, 0x90?)
	bufio/bufio.go:482 +0x17
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).runJournalctl.func2()
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.125.0/operator/input/journald/input.go:180 +0x21a
created by github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).runJournalctl in goroutine 170
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.125.0/operator/input/journald/input.go:174 +0x105

It had been waiting for 11861 minutes, roughly 8.2 days, even though logs were certainly produced during that period.
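For reference, the read path boils down to a blocking ReadBytes loop on journalctl's stdout pipe. The snippet below is a simplified, self-contained sketch of that pattern (not the actual input.go code); it shows why the goroutine parks forever when journalctl stays alive but stops emitting lines:

package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

func main() {
	// journalctl in follow mode only exits on error or signal.
	cmd := exec.Command("journalctl", "--follow", "--output=json")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	reader := bufio.NewReader(stdout)
	for {
		// Blocks until a full line arrives or the pipe is closed. There is
		// no deadline, so if journalctl hangs (e.g. on a corrupted journal
		// file) this call never returns and nothing upstream notices.
		line, err := reader.ReadBytes('\n')
		if err != nil {
			// Pipe closed: this is the case the crash-restart logic from
			// #35635 already handles.
			return
		}
		fmt.Print(string(line))
	}
}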

Since the journalctl process was still running, I took a look at systemd-journald.service:

● systemd-journald.service - Journal Service
     Loaded: loaded (/usr/lib/systemd/system/systemd-journald.service; static)
    Drop-In: /etc/systemd/system/systemd-journald.service.d
             └─override.conf
     Active: active (running) since Mon 2025-07-14 02:29:42 UTC; 1 week 1 day ago
 Invocation: ba97a48cca2b43c68eb0fa46f5efd2f9
TriggeredBy: ● systemd-journald.socket
             ● systemd-journald-dev-log.socket
       Docs: man:systemd-journald.service(8)
             man:journald.conf(5)
   Main PID: 16228 (systemd-journal)
     Status: "Processing requests..."
         IO: 25G read, 46.7G written
      Tasks: 1 (limit: 307720)
   FD Store: 37 (limit: 4224)
     Memory: 55.8M (peak: 87.3M)
        CPU: 4min 57.588s
     CGroup: /system.slice/systemd-journald.service
             └─16228 /usr/lib/systemd/systemd-journald

Jul 14 02:41:35 36ssds1683 systemd-journald[16228]: /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal: Jo>
Jul 14 02:41:35 36ssds1683 systemd-journald[16228]: Failed to create new system journal: No such file or directory
Notice: journal has been rotated since unit was started, output may be incomplete.

It seems something went wrong at Jul 14 02:41:35, which closely matches the time when the otel collector goroutine stopped receiving anything from journalctl.

I ran journalctl --verify and found a corrupted file too:

3b2eb0: Data object references invalid entry at 319d5d0
File corruption detected at /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal:52024480 (of 58720256 bytes, 88%).
FAIL: /var/log/journal/f69d0000f59211f0800010ffe032c644/system.journal (Bad message) 

To summarize, journal files can apparently become corrupted when journald rotates them, and that causes journalctl to hang rather than crash. We previously upstreamed a patch to catch a journalctl crash and restart it (#35635), but it cannot catch this case because journalctl is technically "running", just not producing any output.

Besides journal file corruption, another common cause we observed is machine reboots. In our environment, the journal files might not be closed properly, or the active journal file might be recreated during startup, which also leads to journalctl hanging. I haven't dug deeper on the systemd-journald side, though.
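A watchdog goroutine on the receiver side would cover both cases. Below is a minimal sketch of the idea, assuming a hypothetical lastRead timestamp stamped by the reader goroutine after every successful ReadBytes and a configurable timeout; the names are illustrative, not the receiver's actual fields. Killing the journalctl process turns the silent hang into the exit that the existing restart logic from #35635 already knows how to handle:

package journald

import (
	"context"
	"os/exec"
	"sync/atomic"
	"time"
)

// watchdog kills journalctl if the reader has not stamped lastRead for
// longer than timeout. lastRead holds a UnixNano timestamp that the stdout
// reader goroutine updates after every successful ReadBytes.
func watchdog(ctx context.Context, cmd *exec.Cmd, lastRead *atomic.Int64, timeout time.Duration) {
	ticker := time.NewTicker(timeout / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			idle := time.Since(time.Unix(0, lastRead.Load()))
			if idle > timeout {
				// journalctl is technically "running" but has produced
				// nothing for too long: kill it so the existing
				// wait-and-restart path spawns a fresh process.
				_ = cmd.Process.Kill()
				return
			}
		}
	}
}

One caveat: a legitimately quiet journal would also trip this check, so the timeout needs to be generous, or the restart needs to be cheap and cursor-preserving, to avoid needless churn.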

Steps to Reproduce

I have no reliable way to reproduce it, but we've observed it happening on many instances.

Expected Result

The journald receiver should restart journalctl and resume consuming from the journal when it has hung for too long.
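To avoid losing or duplicating entries on such a restart, the new journalctl process could resume from the last cursor the receiver has seen (journald exposes it in the __CURSOR field of each JSON entry, and journalctl accepts it via --after-cursor). A small illustrative sketch, with the cursor plumbing assumed rather than taken from the receiver's code:

package main

import (
	"fmt"
	"os/exec"
)

// newJournalctlCmd builds a follow-mode journalctl command, resuming right
// after the given cursor when one is available so a watchdog-triggered
// restart picks up where the previous process left off.
func newJournalctlCmd(cursor string) *exec.Cmd {
	args := []string{"--utc", "--output=json", "--follow"}
	if cursor != "" {
		args = append(args, "--after-cursor="+cursor)
	}
	return exec.Command("journalctl", args...)
}

func main() {
	// Placeholder cursor value, for illustration only.
	cmd := newJournalctlCmd("s=0123abcd;i=1f00;b=deadbeef")
	fmt.Println(cmd.Args)
}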

Actual Result

The journald receiver hangs indefinitely until the collector is restarted.

Collector version

v0.125.0

Environment information

Environment

OS: Linux, kernel 6.12.56
Compiler (if manually compiled): go 1.23

OpenTelemetry Collector configuration

Log output

Additional context

Related journald issues:

systemd/systemd#36247
systemd/systemd#27489
