Auto-Heal Behavior in RabbitMQ 4.x (3-Node Cluster) #14932

uk1988 · 2025-11-11T09:17:07Z


Community Support Policy

I have read RabbitMQ's Community Support Policy

RabbitMQ version used

4.0.x

How is RabbitMQ deployed?

Kubernetes Operator(s) from Team RabbitMQ

Steps to reproduce the behavior in question

Hi,

We're trying to better understand the auto-heal functionality in RabbitMQ.
We are running a 3-node RabbitMQ cluster (v4.0.9) in an air-gapped environment that must survive node failures, such as unplanned shutdowns or complete network partitions. After extensive automated failover testing, we've determined that auto-heal mode is the best option for our use case. However, it exhibits one unexpected behavior.
Our main queues are quorum queues, and the application is a web frontend.

Here's the scenario:

1. The VM hosting Node 1 goes down unexpectedly. Users connected to Node 1 are notified of the failover (expected behavior).
2. After a few minutes, Node 1 rejoins the cluster, triggering auto-heal.
3. At this point, the user is connected to Node 2, which has been running normally alongside Node 3.
4. During auto-heal, RabbitMQ determines that Node 3 represents the "winning" partition and restarts Node 2.

This is the part we don't understand:

Why does RabbitMQ restart Node 2, when both Node 2 and Node 3 were operational and in full contact during the outage?

Any insight into the auto-heal partition resolution logic would be greatly appreciated.

Thanks in advance!

Answered by michaelklishin


MirahImage (Maintainer) · 2025-11-11T09:54:24Z

Hi, a few things to bring up:

- RabbitMQ 4.0 is no longer under community support.
- If you're running Khepri (for example, on a newly deployed 4.2.x cluster, or on any 4.x cluster where you have enabled Khepri), the partition handling strategy is no longer relevant, as it applies only to Mnesia. Starting with 4.3, this configuration will be removed because it no longer has any effect.

3 replies

michaelklishin Nov 11, 2025
Maintainer

It's somewhat relevant for as long as the partition handling strategies exist but the good news is that we might remove them as soon as Mnesia is gone (beyond the parts necessary for migration to Khepri), that is, 4.3.

uk1988 Nov 13, 2025
Author

So, as I understand it, the behavior described in the ticket is expected.
We tried 4.2.0 yesterday and were able to break it very quickly ... :)
We did not experience the extra failover described in this ticket, but the node had trouble rejoining the cluster.
I guess the best way forward is to open a bug ticket for that one.

We are testing network failure on our RHEL nodes by running:

sudo bash -c 'ip link set eth0 down; sleep 180; ip link set eth0 up'

lukebakken Nov 13, 2025
Maintainer

Do not open an issue, start a new discussion.

The information you've provided is incomplete as well. Please see the required information when discussing any RabbitMQ issue:

https://github.com/rabbitmq/support-tools/blob/main/docs/Reporting_RabbitMQ_Issues.md#gather-information

Attaching your complete RabbitMQ configuration files is also a must. Thanks.

michaelklishin (Maintainer) · 2025-11-13T21:33:59Z

@uk1988

"Why does RabbitMQ restart Node 2, when both Node 2 and Node 3 were operational and in full contact during the outage?"

Because that's how the autoheal partition healing strategy was designed to work: it restarts all nodes except for the "winning" one. It's a very straightforward way to address Mnesia's highly opinionated approach to recovery from partitions.

pause_minority works differently. Team RabbitMQ recommends it over autoheal for most users.

Like I said, with classic mirrored queues removed in 4.0 and Khepri becoming the only metadata store in 4.3, partition handling strategies will be gone from RabbitMQ fairly soon: Khepri, quorum queues, and the stream coordinator all recover the way the Raft consensus algorithm requires the recovery process to work.
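
To make the rule above concrete, here is a minimal Python sketch of the winner selection as the RabbitMQ partitions guide describes it: the partition with the most client connections wins, a tie goes to the partition with the most nodes, and any remaining tie is resolved in an unspecified way. This is a toy model, not RabbitMQ's implementation. The `Partition` class, both function names, and the connection counts are hypothetical, and the sketch takes the partition view as given rather than explaining how the nodes arrived at it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One side of the split as seen at heal time (hypothetical model)."""
    nodes: tuple[str, ...]
    client_connections: int


def pick_winner(partitions: list[Partition]) -> Partition:
    """Most client connections wins; a tie falls back to node count.

    The guide leaves any further tie unspecified. Here max() simply
    returns the first partition among the remaining candidates.
    """
    return max(partitions, key=lambda p: (p.client_connections, len(p.nodes)))


def nodes_to_restart(partitions: list[Partition]) -> list[str]:
    """Every node outside the winning partition gets restarted."""
    winner = pick_winner(partitions)
    return sorted(n for p in partitions if p is not winner for n in p.nodes)


# Assumed view at heal time: each node ends up in its own partition.
view = [
    Partition(("node1",), 0),
    Partition(("node2",), 3),
    Partition(("node3",), 5),
]
print(pick_winner(view).nodes)
print(nodes_to_restart(view))
```

Under that assumed view, a healthy Node 2 is restarted simply because it is not part of the winner, regardless of whether it stayed in contact with Node 3 during the outage.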

1 reply

michaelklishin Nov 13, 2025
Maintainer

Knowing that partition handling strategies as a subsystem will go away likely as early as RabbitMQ 4.3, we won't spend any time "fixing" them without a very significant justification.

There is nothing to "fix" for autoheal: it is very opinionated and restarts all nodes other than the winner. pause_minority is similar to classic queue mirroring: you can introduce small improvements, but the overall design (Mnesia's design when it comes to recovery) is considered to be a dead end by our team.
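
For readers still on Mnesia who want to follow the pause_minority recommendation above, the strategy is selected with a single rabbitmq.conf key. The fragment below is a minimal sketch, assuming Khepri has not been enabled. With the Kubernetes Operator, this line would typically go under `spec.rabbitmq.additionalConfig` of the RabbitmqCluster resource, which is worth verifying against the Operator version in use.

```ini
# rabbitmq.conf (the setting only has an effect while the metadata store is Mnesia)
cluster_partition_handling = pause_minority
```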
