Restore RabbitMQ cluster

If your RabbitMQ cluster has issues, this guide can help you recover it.

This guide does not cover every case, but it is a good starting point.

Useful commands:

Check cluster status:

rabbitmqctl cluster_status

Check the number of disk nodes and the number of running nodes. Both should match the number of MQ pods.
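In a Kubernetes deployment, rabbitmqctl is typically run inside one of the MQ pods. The following is a minimal sketch, assuming kubectl access to the cluster. The helper name mq_ctl is made up for illustration, and the namespace yaook and the pod name are inferred from the example names used in this guide. If the pod has several containers, an additional -c <container> may be needed.

```shell
# Hypothetical helper (not part of Yaook): run rabbitmqctl inside an MQ pod.
# Usage: mq_ctl <namespace> <pod> <rabbitmqctl arguments...>
mq_ctl() {
    mq_ns="$1"
    mq_pod="$2"
    shift 2
    kubectl -n "$mq_ns" exec "$mq_pod" -- rabbitmqctl "$@"
}

# Example calls (namespace and pod name are assumptions):
#   mq_ctl yaook cinder-yaook-cinder-mq-0 cluster_status
#   kubectl -n yaook get pods   # compare the pod count with the node lists
```

Running the status check from more than one pod can also reveal nodes that disagree about the cluster membership.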

Rebuild everything

If all members of your cluster are down and RabbitMQ communication no longer works, neither between the nodes nor with the clients, this is the easiest approach.

Warning

This approach deletes all messages currently stored on the RabbitMQ cluster!

  1. Delete all PVCs of the RabbitMQ cluster (e.g. data-cinder-yaook-cinder-mq-0).

  2. Force-delete all Pods of the MQ StatefulSet. Because all data is deleted anyway, there is no need to shut the Pods down gracefully.

  3. Wait until the StatefulSet has recreated the Pods and check that they are healthy.

  4. Restart all clients of this RabbitMQ cluster so that they recreate the required queues and exchanges.
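Steps 1 to 3 above could be scripted roughly as follows. This is a sketch under several assumptions, not a tested Yaook procedure: PVCs are named data-<statefulset>-<ordinal> (as in the example above), pods are named <statefulset>-<ordinal>, and the StatefulSet uses the RollingUpdate strategy, which kubectl rollout status requires. The function name rebuild_mq and the 15 minute timeout are arbitrary choices.

```shell
# DESTRUCTIVE sketch: all messages stored in the cluster are lost.
# Usage: rebuild_mq <namespace> <statefulset> <replicas>
rebuild_mq() {
    rb_ns="$1"
    rb_sts="$2"
    rb_replicas="$3"

    # Step 1: delete all PVCs. --wait=false returns immediately; a PVC
    # stays in Terminating until the pod using it is gone.
    rb_i=0
    while [ "$rb_i" -lt "$rb_replicas" ]; do
        kubectl -n "$rb_ns" delete pvc "data-$rb_sts-$rb_i" --wait=false
        rb_i=$((rb_i + 1))
    done

    # Step 2: force-delete all pods without a graceful shutdown.
    rb_i=0
    while [ "$rb_i" -lt "$rb_replicas" ]; do
        kubectl -n "$rb_ns" delete pod "$rb_sts-$rb_i" --force --grace-period=0
        rb_i=$((rb_i + 1))
    done

    # Step 3: wait until the StatefulSet reports all replicas as ready.
    kubectl -n "$rb_ns" rollout status "statefulset/$rb_sts" --timeout=15m
}

# Example call (names are assumptions):
#   rebuild_mq yaook cinder-yaook-cinder-mq 3
```

Step 4, restarting the clients, depends on which services use this RabbitMQ cluster and is therefore left out of the sketch.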

Rebuild from healthy node

As long as you have one healthy node in the cluster, you can try to rebuild from that one.

First approach: Restart pods and autojoin

Restart the unhealthy pods:

  1. Force-delete all unhealthy pods.

  2. Wait and check whether they come up and join the cluster again.
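A possible way to script these two steps, assuming kubectl access and a StatefulSet with the RollingUpdate strategy. The function name restart_bad_pods, the 10 minute timeout, and the names in the example call are assumptions made for illustration.

```shell
# Sketch: force-delete the given pods and wait for the StatefulSet to recover.
# Usage: restart_bad_pods <namespace> <statefulset> <pod> [<pod>...]
restart_bad_pods() {
    rp_ns="$1"
    rp_sts="$2"
    shift 2
    for rp_pod in "$@"; do
        kubectl -n "$rp_ns" delete pod "$rp_pod" --force --grace-period=0 || return 1
    done
    # Blocks until all replicas report ready or the timeout expires.
    kubectl -n "$rp_ns" rollout status "statefulset/$rp_sts" --timeout=10m
}

# Example call (names are assumptions):
#   restart_bad_pods yaook cinder-yaook-cinder-mq cinder-yaook-cinder-mq-2
```

A ready pod does not by itself prove that the node rejoined the cluster, so running rabbitmqctl cluster_status on a healthy node afterwards is still worthwhile.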

If RabbitMQ itself has problems starting, this approach most likely will not help.

Second approach: Remove nodes from the cluster

Instead of restarting the Pods and relying on the automatic join, remove the node from the MQ cluster explicitly and let it join again.

  1. Stop the broken service:

    rabbitmqctl -n <NODE> stop_app
    

    This can be done either on the pod that is being removed or from another pod of the cluster. <NODE> is something like rabbit@cinder-yaook-cinder-mq-0.cinder-yaook-cinder-mq-rdy.yaook.svc.cluster.local

  2. Drop the node from the cluster.

    Run on the pod of the healthy node:

    rabbitmqctl forget_cluster_node <NODE>
    
  3. On the pod that is being removed, reset the node.

    This can be done either by deleting the PVC of the pod and restarting the pod, or via rabbitmqctl:

    rabbitmqctl force_reset
    
  4. Join the node to the cluster again.

    If you deleted the PVC and restarted the pod, this should happen automatically.

    If you used force_reset, run:

    rabbitmqctl join_cluster <cluster>
    rabbitmqctl start_app
    

    <cluster> can be the name of one healthy node, e.g. rabbit@cinder-yaook-cinder-mq-1.cinder-yaook-cinder-mq-rdy.yaook.svc.cluster.local
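The four steps above, using the force_reset variant, can be sketched as a single function that runs each rabbitmqctl command in the appropriate pod. This assumes kubectl access. The function name rejoin_node is hypothetical, and the namespace and pod names in the example call are inferred from the node names used in this guide.

```shell
# Sketch: remove a broken node from the cluster and let it join again.
# Usage: rejoin_node <namespace> <healthy-pod> <broken-pod> <broken-node> <healthy-node>
rejoin_node() {
    rj_ns="$1"
    rj_healthy_pod="$2"
    rj_broken_pod="$3"
    rj_broken_node="$4"
    rj_healthy_node="$5"

    # 1. Stop the RabbitMQ application on the broken node (from the healthy pod).
    kubectl -n "$rj_ns" exec "$rj_healthy_pod" -- \
        rabbitmqctl -n "$rj_broken_node" stop_app || return 1
    # 2. Drop the node from the cluster on the healthy node.
    kubectl -n "$rj_ns" exec "$rj_healthy_pod" -- \
        rabbitmqctl forget_cluster_node "$rj_broken_node" || return 1
    # 3. Reset the broken node on its own pod.
    kubectl -n "$rj_ns" exec "$rj_broken_pod" -- rabbitmqctl force_reset || return 1
    # 4. Join the cluster again and start the application.
    kubectl -n "$rj_ns" exec "$rj_broken_pod" -- \
        rabbitmqctl join_cluster "$rj_healthy_node" || return 1
    kubectl -n "$rj_ns" exec "$rj_broken_pod" -- rabbitmqctl start_app
}

# Example call (names are assumptions):
#   rejoin_node yaook cinder-yaook-cinder-mq-1 cinder-yaook-cinder-mq-0 \
#       rabbit@cinder-yaook-cinder-mq-0.cinder-yaook-cinder-mq-rdy.yaook.svc.cluster.local \
#       rabbit@cinder-yaook-cinder-mq-1.cinder-yaook-cinder-mq-rdy.yaook.svc.cluster.local
```

The function stops at the first failing command so that a node is not reset or rejoined while an earlier step is still unresolved.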

Resolve a Pod stuck on a failing startup probe

A RabbitMQ Pod may fail to become ready. Besides checking the pod logs, also check the pod description to see whether the startup probe is failing.
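One way to check the pod description for this is to search the events at the end of kubectl describe pod for the kubelet message "Startup probe failed". A minimal sketch, assuming kubectl access; the function name startup_probe_failed and the names in the example call are made up for illustration.

```shell
# Sketch: succeed (exit status 0) if the pod's description contains a
# "Startup probe failed" event, as reported by the kubelet.
# Usage: startup_probe_failed <namespace> <pod>
startup_probe_failed() {
    kubectl -n "$1" describe pod "$2" | grep -q "Startup probe failed"
}

# Example call (names are assumptions):
#   if startup_probe_failed yaook cinder-yaook-cinder-mq-0; then
#       echo "startup probe is failing"
#   fi
```

Events expire after some time, so a pod that has been stuck for a long while may no longer show the message even though the probe still fails.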

Yaook has a startup probe for MQ pods that checks whether the node has joined the cluster. If joining the cluster fails, the node comes up as a standalone node. If neither you nor Yaook notices this, you can end up with separate standalone nodes that go unnoticed until issues occur.

This can happen when the MQ PVC was deleted and the pod starts again. The cluster still knows the old node with the same name and will not allow the new one to join.

To resolve this, follow the steps in "Second approach: Remove nodes from the cluster" above.