Fix broken queues

MessageNacked errors / Queues accumulate messages

MessageNacked errors in AMQP client logs are usually caused by quorum queues that have reached their maximum size, which we configure to 5% of the total space available to RabbitMQ.
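As a rough illustration of that limit (the 5% figure is from above; the total space is a made-up assumption), the per-queue maximum for 200 GiB of space available to RabbitMQ would be:

```shell
# Assumption for illustration only: 200 GiB of space available to RabbitMQ.
total_bytes=$(( 200 * 1024 * 1024 * 1024 ))
# 5% of that is the configured per-queue limit.
limit_bytes=$(( total_bytes * 5 / 100 ))
echo "$limit_bytes"   # 10737418240 bytes = 10 GiB
```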

Identify the affected queue(s):

rabbitmqctl list_queues | grep "[1-9][0-9]*$" | grep -v "fanout"

Alternatively, the rabbitmq-exporter container will log the affected queues as well.
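To see what this kind of filter does, here it is applied to made-up sample output of rabbitmqctl list_queues (name and message count, tab-separated):

```shell
# Made-up sample of 'rabbitmqctl list_queues' output: name<TAB>messages.
printf 'notifications.info\t120\nconductor\t0\nbarbican.workers_fanout\t37\n' \
    | grep "[1-9][0-9]*$" \
    | grep -v "fanout"
# Only 'notifications.info 120' remains: a non-fanout queue with a backlog.
```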

Ensure that the queue’s consumer is running; its name usually corresponds to the name of the queue. Message accumulation can also be caused by an incorrect oslo_messaging_notifications configuration. Check whether all of the specified topics are actually consumed, or consider switching to the noop driver so that no notifications are published at all. Also see the section “Removing Ceilometer” of Ceilometer.
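As a sketch of what to look for (the file path and contents below are made-up stand-ins; substitute the configuration file of the affected service): with driver = noop, oslo.messaging discards notifications instead of publishing them, so no notification queue can fill up.

```shell
# Create a stand-in config file; on a real node, inspect the service's
# actual configuration (e.g. /etc/nova/nova.conf) instead.
cat > /tmp/sample-service.conf <<'EOF'
[oslo_messaging_notifications]
driver = noop
EOF
# Check which notification driver is configured.
grep -A1 '^\[oslo_messaging_notifications\]' /tmp/sample-service.conf
```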

If you want to permanently remove all messages from the queue without any possibility of recovery, run the following:

rabbitmqctl purge_queue <queue>

If you run into this issue, ensure that the alert RabbitMQQuorumQueuesAccumulatingMessages specified in RabbitMQ alerts is configured for the cluster; it detects the issue before the limit is reached.

Repair broken members

Before you investigate the queue, make sure that all nodes of the RabbitMQ cluster are up and running.

First, you need to retrieve the type of the queue to determine subsequent commands:

queue="<insert-here>"
rabbitmqctl list_queues name type | grep -P "^$queue\t"
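For illustration, here is the same grep applied to made-up sample output of rabbitmqctl list_queues name type:

```shell
# Made-up sample of 'rabbitmqctl list_queues name type' output; the grep
# matches the exact queue name at the start of a tab-separated line.
queue="barbican.workers_fanout"
printf 'conductor\tquorum\nbarbican.workers_fanout\tstream\n' \
    | grep -P "^$queue\t"
# -> barbican.workers_fanout  stream
```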

Afterwards, inspect the state of the queue:

Quorum queues
 rabbitmq-queues quorum_status <queue>
Stream queues
 rabbitmq-streams stream_status <queue>

A queue is only functional if it has one member in Raft state leader and a quorum of its members (⌊cluster_size/2⌋ + 1) is running. You can identify non-functional members by a Raft state that is neither leader nor follower. If the queue still has a leader, but one or more members are in an undesired Raft state like noproc, you can repair the queue by removing and re-adding these members one after another:

Quorum queues
 rabbitmq-queues delete_member <queue> <node>
 rabbitmq-queues add_member <queue> <node>
Stream queues
 rabbitmq-streams delete_replica <queue> <node>
 rabbitmq-streams add_replica <queue> <node>
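The quorum requirement mentioned above is simple majority arithmetic; for example, in a three-node cluster:

```shell
# Majority quorum for a 3-node cluster: more than half of the members.
cluster_size=3
quorum=$(( cluster_size / 2 + 1 ))
echo "$quorum"   # 2, so the queue tolerates one failed member
```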

If a member is missing entirely, run the rabbitmq-queues add_member / rabbitmq-streams add_replica command for the affected queue and member.

If the queue cannot be recovered, it has to be deleted and recreated:

rabbitmqctl delete_queue <queue>

If that command does not work either, try to run:

queue="<queue_name>"
rabbitmqctl eval 'rabbit_db_queue:delete({resource,<<"/">>,queue,<<"'${queue}'">>},normal).'
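To see how the shell quoting assembles the Erlang term passed to rabbitmqctl eval, you can echo the resulting string (the queue name below is a made-up example):

```shell
# Made-up queue name; ${queue} is spliced between the single-quoted
# fragments, producing one complete Erlang expression.
queue="barbican.workers_fanout"
echo 'rabbit_db_queue:delete({resource,<<"/">>,queue,<<"'${queue}'">>},normal).'
# -> rabbit_db_queue:delete({resource,<<"/">>,queue,<<"barbican.workers_fanout">>},normal).
```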

Afterwards, restart the consuming service so the queue and necessary bindings get recreated.

Repair the RabbitMQ stream coordinator

Warning

This guide uses Bash functions from the file /var/lib/rabbitmq/utils/troubleshooting-utils.sh, which is only shipped with newer infra operator versions. If this file is not available yet, you can copy it from the current devel branch. If you do not want to check out the branch locally, you can find the needed files here:

Afterwards, copy the files into the pod filesystem:

amqpserver="<amqpserver>"
pod=$(kubectl -n "$YAOOK_OP_NAMESPACE" get pod -l "state.yaook.cloud/parent-name=$amqpserver,state.yaook.cloud/parent-plural=amqpservers" -o name | head -n1 | cut -f2 -d'/')
kubectl -n "$YAOOK_OP_NAMESPACE" cp ./yaook/op/infra/templates/amqp-rabbitmqadmin-wrapper.sh "$pod":/tmp/rabbitmqadmin-wrapper.sh
kubectl -n "$YAOOK_OP_NAMESPACE" cp ./yaook/op/infra/templates/amqp-troubleshooting-utils.sh "$pod":/tmp/troubleshooting-utils.sh

# run the following inside the RabbitMQ pod
source /tmp/rabbitmqadmin-wrapper.sh
source /tmp/troubleshooting-utils.sh

If you encounter repeated OpenStack error messages like the following, the stream coordinator [1] may not be working properly:

"message": "Failed to consume message from queue: (0, 0): (541) INTERNAL_ERROR",

To confirm that the stream coordinator is unhealthy, run:

rabbitmq-diagnostics coordinator_status

With a healthy coordinator cluster, the command prints all cluster nodes, with exactly one of them in “Raft State” leader. If it returns an error or does not terminate, the coordinator cluster is broken and needs to be rebuilt. Individual noproc members will not cause stream operations to fail as long as there is still a leader, although they make a loss of quorum more likely. If the coordinator is functional but you still see the error message above, read the next section.

The following command restarts the stream coordinator by resetting the coordinator process on all nodes and then triggering a coordinator start by declaring a nonexistent stream queue. Since the existing stream queues often end up in an erroneous state after this reset, all of them are recreated automatically afterwards:

restart_stream_coordinator

Afterwards, confirm that the coordinator is running again:

rabbitmq-diagnostics coordinator_status

If the coordinator is still unhealthy, we recommend rebuilding the entire RabbitMQ cluster [2].

Resolve “stream_not_found” errors

This problem produces the same Failed to consume message from queue OpenStack log messages as a dysfunctional stream coordinator, but in addition, messages like the following can be found in the RabbitMQ logs:

errorContext: child_terminated
reason: {{stream_not_found,
            {resource,<<"/">>,queue,<<"barbican.workers_fanout">>}},

If you already tried to delete the queue using rabbitmqctl delete_queue, you might have noticed that this does not work. In that case, an internal delete call should still work and remove the queue data on all nodes. Afterwards, the queue has to be recreated together with its exchange and binding:

recreate_stream_queue "<queue>"

If you want to recreate all stream queues, you can run:

recreate_all_stream_queues

This problem is caused by inconsistent information inside the RabbitMQ database: for example, the queue might still exist inside MNESIA_DURABLE_TABLE but be missing inside the MNESIA_TABLE; there can, however, also be other causes. Because stream queues are not critical for OpenStack API operations, they can be deleted and recreated without backing up their messages.