Restore RabbitMQ cluster¶
If your RabbitMQ cluster has issues, this guide can help you recover from them.
It does not cover every case, but it provides a starting point.
Useful commands:
Check cluster status:
rabbitmqctl cluster_status
Check the number of disk nodes and the number of running nodes. Both should match the number of pods.
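The comparison can also be scripted. The following is a minimal sketch and not part of Yaook: check_node_counts is a hypothetical helper, and the commented wiring assumes kubectl access to the cluster, jq on your workstation, a rabbitmqctl version that supports --formatter json, and the example names used in this guide (namespace yaook, StatefulSet cinder-yaook-cinder-mq). If the pod has more than one container, kubectl exec may also need a -c option.

```shell
# Hypothetical helper: compare RabbitMQ node counts with the pod count.
# Arguments: <disk_nodes> <running_nodes> <pods>
check_node_counts() {
    if [ "$1" -eq "$3" ] && [ "$2" -eq "$3" ]; then
        echo "OK: $2 running / $1 disk nodes for $3 pods"
    else
        echo "MISMATCH: $2 running / $1 disk nodes for $3 pods"
        return 1
    fi
}

# Example wiring (assumed names, adjust to your deployment):
#   status=$(kubectl -n yaook exec cinder-yaook-cinder-mq-0 -- \
#       rabbitmqctl cluster_status --formatter json)
#   disk=$(echo "$status" | jq '.disk_nodes | length')
#   running=$(echo "$status" | jq '.running_nodes | length')
#   pods=$(kubectl -n yaook get statefulset cinder-yaook-cinder-mq \
#       -o jsonpath='{.spec.replicas}')
#   check_node_counts "$disk" "$running" "$pods"
```

A non-zero exit status signals a mismatch, so the helper can be used in a monitoring script.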
Rebuild everything¶
If all members of your cluster are down and no RabbitMQ communication works between the nodes and the clients using RabbitMQ, this is the easiest approach.
Warning
This approach loses all messages currently stored on the RabbitMQ cluster.
Delete all PVCs of the RabbitMQ cluster (e.g. data-cinder-yaook-cinder-mq-0).
Force delete all Pods of the MQ StatefulSet. Because all data is deleted anyway, the Pods do not need to shut down gracefully.
Wait until the StatefulSet recreates the Pods and check that they are healthy.
Restart all clients of this RabbitMQ cluster so that they recreate the required queues and exchanges.
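The first three steps above could be sketched with kubectl as follows. This is an illustration under assumptions, not an official Yaook procedure: rebuild_commands is a hypothetical helper, the PVC naming pattern data-<pod> is taken from the example above, and the namespace and StatefulSet name are placeholders. The helper only prints the commands so that they can be reviewed before running them.

```shell
# Print (do not run) the kubectl commands for a full rebuild.
# Arguments: <namespace> <statefulset> <replicas>
# The body runs in a subshell so the helper variables stay local.
rebuild_commands() (
    ns=$1; sts=$2; replicas=$3
    i=0
    while [ "$i" -lt "$replicas" ]; do
        # --wait=false: a PVC stays in Terminating while its pod still
        # uses it, so do not block on the deletion.
        echo "kubectl -n $ns delete pvc data-$sts-$i --wait=false"
        i=$((i + 1))
    done
    i=0
    while [ "$i" -lt "$replicas" ]; do
        echo "kubectl -n $ns delete pod $sts-$i --force --grace-period=0"
        i=$((i + 1))
    done
    # Watch the StatefulSet recreate the pods.
    echo "kubectl -n $ns get pods --watch"
)

# Example: rebuild_commands yaook cinder-yaook-cinder-mq 3
```

The PVCs are deleted first, without waiting, so that the recreated Pods get fresh volumes instead of reattaching the old ones.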
Rebuild from healthy node¶
As long as you have one healthy node in the cluster, you can try to rebuild from that one.
First approach: Restart pods and autojoin¶
Restart the remaining (unhealthy) pods:
Force delete all bad pods.
Wait and check if they come up and join the cluster again.
However, if the problem is that RabbitMQ itself fails to restart, this approach most probably won’t help.
Second approach: Remove nodes from the cluster¶
Instead of restarting the Pods and relying on the automatic join, remove the node from the MQ cluster and let it join again.
Stop the broken service:
rabbitmqctl -n <NODE> stop_app
This can be done either on the pod that is to be removed or from another pod of the cluster. <NODE> is something like
rabbit@cinder-yaook-cinder-mq-0.cinder-yaook-cinder-mq-rdy.yaook.svc.cluster.local
Drop the node from the cluster.
Run on the pod of the healthy node:
rabbitmqctl forget_cluster_node <NODE>
On the pod that is to be removed, reset the node.
This can be done by deleting the PVC of the pod and restarting it,
or via rabbitmqctl:
rabbitmqctl force_reset
Join the node to the cluster again.
If you deleted the PVC and restarted the pod, this should happen automatically.
If you used force_reset, run:
rabbitmqctl join_cluster <cluster>
rabbitmqctl start_app
<cluster> can be the name of one healthy node, e.g. rabbit@cinder-yaook-cinder-mq-1.cinder-yaook-cinder-mq-rdy.yaook.svc.cluster.local
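The whole sequence of the second approach, using the force_reset variant, can be summarized in a small sketch. The helpers node_name and rejoin_commands are hypothetical. They only print the rabbitmqctl commands from this section together with the pod that should run each one, and the node name pattern is taken from the examples above.

```shell
# Build the RabbitMQ node name for a pod.
# Arguments: <pod> <headless-service> <namespace>
node_name() {
    echo "rabbit@$1.$2.$3.svc.cluster.local"
}

# Print the rabbitmqctl sequence, labelled by the pod that runs it.
# Arguments: <broken-node> <healthy-node>
rejoin_commands() {
    echo "[any pod]     rabbitmqctl -n $1 stop_app"
    echo "[healthy pod] rabbitmqctl forget_cluster_node $1"
    echo "[broken pod]  rabbitmqctl force_reset"
    echo "[broken pod]  rabbitmqctl join_cluster $2"
    echo "[broken pod]  rabbitmqctl start_app"
}

# Example:
#   broken=$(node_name cinder-yaook-cinder-mq-0 cinder-yaook-cinder-mq-rdy yaook)
#   healthy=$(node_name cinder-yaook-cinder-mq-1 cinder-yaook-cinder-mq-rdy yaook)
#   rejoin_commands "$broken" "$healthy"
```

Printing instead of executing keeps the sketch safe to try. Each line still has to be run inside the indicated pod, for example via kubectl exec.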
Solve Pod stuck in failed startup probe¶
A RabbitMQ Pod can fail to become ready. Besides checking the pod logs, also check the pod description to see whether the startup probe is failing.
Yaook has a startup probe for MQ pods that checks whether the node has joined the cluster. If joining the cluster fails, the node comes up as a standalone node. If neither you nor Yaook detects this, you can end up with separate standalone nodes that go unnoticed until issues occur.
This can happen when the MQ PVC was deleted and the pod starts again. The cluster then still knows the old node with the same name and won’t allow the new one to join.
To resolve this, follow the steps above under “Second approach: Remove nodes from the cluster”.
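To check the pod description for a failing startup probe, a filter like the following may help. It is a sketch under assumptions: startup_probe_failing is a hypothetical helper, the pod and namespace names are the examples used in this guide, and the match relies on the kubelet event message starting with "Startup probe failed".

```shell
# Read `kubectl describe pod` output (or an event list) on stdin and
# succeed if it contains a failed startup probe event.
startup_probe_failing() {
    grep -q 'Startup probe failed'
}

# Example wiring (assumed names, adjust to your deployment):
#   if kubectl -n yaook describe pod cinder-yaook-cinder-mq-0 \
#           | startup_probe_failing; then
#       echo "startup probe is failing, node probably did not join"
#   fi
#
# The matching events can also be listed directly:
#   kubectl -n yaook get events --field-selector \
#       involvedObject.name=cinder-yaook-cinder-mq-0,reason=Unhealthy
```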