We're trying to better understand the auto-heal functionality in RabbitMQ.
We are running a 3-node RabbitMQ cluster (v4.0.9) in an air-gapped environment that must survive node failures, such as unplanned shutdowns or complete network partitions. After extensive automated failover testing, we've determined that auto-heal mode is the best option for our use case. However, it exhibits one unexpected behavior.
Our main queues are quorum queues, and the application is a web frontend.
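For reference, auto-heal is enabled via the standard partition-handling setting in rabbitmq.conf (a minimal sketch of just the relevant line, not our full config):

```ini
# rabbitmq.conf — recover from network partitions automatically:
# the losing partition's nodes are restarted and rejoin the winner.
cluster_partition_handling = autoheal
```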
Here's the scenario:
1. The VM hosting Node 1 goes down unexpectedly.
   → Users connected to Node 1 are notified of the failover (expected behavior).
2. After a few minutes, Node 1 rejoins the cluster, triggering auto-heal.
3. At this point, those users are now connected to Node 2, which has been running normally alongside Node 3.
4. During auto-heal, RabbitMQ decides that the partition containing Node 3 is the "winning" partition and restarts Node 2.
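Our reading of the partitions guide is that the winner is picked by client-connection count, then partition size; the final tie-break is left unspecified in the docs. Sketched as illustrative Python (not the actual Erlang implementation; the node-name tie-break below is our own stand-in):

```python
def pick_winning_partition(partitions):
    """Pick the 'winning' partition from a list of partitions.

    Each partition is a list of (node_name, client_connection_count)
    pairs. Per our reading of the RabbitMQ partitions guide: the
    partition with the most connected clients wins, ties go to the
    partition with more nodes, and the last tie-break is unspecified
    (node-name sort order here is purely an illustrative stand-in).
    """
    def rank(partition):
        total_connections = sum(conns for _, conns in partition)
        node_count = len(partition)
        lowest_name = min(name for name, _ in partition)
        # Negate so that "more" sorts first under min().
        return (-total_connections, -node_count, lowest_name)

    return min(partitions, key=rank)


# Example: the side with Nodes 2 and 3 holds all the clients, so it
# should win and Node 1's side should be restarted.
winner = pick_winning_partition([
    [("rabbit@node2", 40), ("rabbit@node3", 35)],
    [("rabbit@node1", 0)],
])
```

By this heuristic we would expect Node 2's side to win, which is precisely why the observed restart of Node 2 surprises us.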
This is the part we don't understand:
Why does RabbitMQ restart Node 2, when both Node 2 and Node 3 were operational and in full contact during the outage?
Any insight into the auto-heal partition resolution logic would be greatly appreciated.
Thanks in advance!