Fix broken stream (fanout) queues¶
Beginning with Yaook operator version 0.20250227.0, all queues created by OpenStack whose names contain the suffix _fanout are of type stream, unless the override
rabbit_stream_fanout=false is set inside the OpenStack manifests. You can confirm the type by running:
rabbitmqctl list_queues name type | grep <queue_name>
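For illustration, the following sketch shows what filtering that output for stream queues could look like; the queue names and the two-column output format are sample assumptions standing in for a live rabbitmqctl call:

```shell
# Sample (hypothetical) output of `rabbitmqctl list_queues name type`;
# the second column is the queue type, so filtering on it lists only
# the stream-type queues.
printf 'barbican.workers_fanout\tstream\nnotifications.info\tclassic\n' \
  | awk '$2 == "stream" {print $1}'
```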
If you want to know more about what RabbitMQ streams are, head to the RabbitMQ documentation.
Commands provided in this guide need to be run inside the rabbitmq Kubernetes container.
Warning
This guide uses bash functions from the file /var/lib/rabbitmq/utils/troubleshooting-utils.sh, which is only
introduced with newer infra operator versions. If this file is not available yet, you can copy it from the current devel branch.
If you do not want to check out the branch locally, you can find the needed files here:
Afterwards copy the files to the pod filesystem:
amqpserver="<amqpserver>"
pod=$(kubectl -n "$YAOOK_OP_NAMESPACE" get pod -l "state.yaook.cloud/parent-name=$amqpserver,state.yaook.cloud/parent-plural=amqpservers" -o name | head -n1 | cut -f2 -d'/')
kubectl -n "$YAOOK_OP_NAMESPACE" cp ./yaook/op/infra/templates/amqp-rabbitmqadmin-wrapper.sh "$pod":/tmp/rabbitmqadmin-wrapper.sh
kubectl -n "$YAOOK_OP_NAMESPACE" cp ./yaook/op/infra/templates/amqp-troubleshooting-utils.sh "$pod":/tmp/troubleshooting-utils.sh
# run the following inside the RabbitMQ pod
source /tmp/rabbitmqadmin-wrapper.sh
source /tmp/troubleshooting-utils.sh
Repair the RabbitMQ stream coordinator¶
If you encounter repeated OpenStack error messages like the following, the stream coordinator [1] may not be working properly:
"message": "Failed to consume message from queue: (0, 0): (541) INTERNAL_ERROR",
To confirm that the stream coordinator is unhealthy, run:
rabbitmq-diagnostics coordinator_status
With a healthy coordinator cluster, the command prints all cluster nodes, with exactly one of them reporting
“Raft State” leader. If it returns an error or does not terminate, the coordinator cluster is broken and needs to be rebuilt.
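As a rough sketch of that health criterion, one could count the leaders in the status output; note that the exact format of coordinator_status varies between RabbitMQ versions, so the sample below is a simplified stand-in:

```shell
# Simplified, hypothetical coordinator_status output: node name and Raft state.
status='rabbit@node-0 leader
rabbit@node-1 follower
rabbit@node-2 follower'

# A healthy coordinator cluster has exactly one node in the leader state.
leaders=$(printf '%s\n' "$status" | awk '$2 == "leader"' | wc -l)
if [ "$leaders" -eq 1 ]; then
    echo "coordinator looks healthy"
else
    echo "coordinator broken"
fi
```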
Single noproc members will not cause stream operations to fail as long as there is still a leader, even though a quorum loss may be more likely in that case. If the coordinator is functional, but
you still see the error message above, read the next section.
The following command restarts the stream coordinator by resetting the coordinator process on all nodes and then triggering a coordinator start by declaring a nonexistent stream queue. Because stream queues often end up in an erroneous state after this reset, all of them are recreated automatically afterwards:
restart_stream_coordinator
Run the diagnostics command again, this time to confirm that the coordinator is running:
rabbitmq-diagnostics coordinator_status
If the coordinator is still unhealthy, we recommend rebuilding the entire RabbitMQ cluster [2].
Resolve “stream_not_found” errors¶
This problem produces the same Failed to consume message from queue OpenStack logs as a dysfunctional stream coordinator, but the following messages additionally appear in the RabbitMQ logs:
errorContext: child_terminated
reason: {{stream_not_found,
{resource,<<"/">>,queue,<<"barbican.workers_fanout">>}},
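For illustration, the affected queue name can be pulled out of such a log line with a small sed expression; the line below is the sample from above, and the exact log format may differ between versions:

```shell
# Extract the queue name from a stream_not_found log line (sample data).
line='reason: {{stream_not_found, {resource,<<"/">>,queue,<<"barbican.workers_fanout">>}},'
printf '%s\n' "$line" | sed -n 's/.*queue,<<"\([^"]*\)">>.*/\1/p'
```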
If you have already tried to delete the queue using rabbitmqctl delete_queue, you might have noticed that this does not work.
In that case, an internal delete call should still be functional and delete the queue data on all nodes. Afterwards, the queue has to be recreated manually
along with its exchange and binding:
recreate_stream_queue "<queue>"
If you want to recreate all stream queues, you can run:
recreate_all_stream_queues
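Conceptually, recreate_all_stream_queues can be thought of as iterating over every stream-type queue and recreating it. The following is only a hedged sketch of that idea: the sample data stands in for the real rabbitmqctl call, and the echo stands in for the actual helper invocation:

```shell
# Sample (hypothetical) output of `rabbitmqctl list_queues name type`,
# piped through a filter that keeps only stream-type queues.
printf 'barbican.workers_fanout\tstream\ncinder-scheduler_fanout\tstream\nnotifications.info\tclassic\n' \
  | awk '$2 == "stream" {print $1}' \
  | while read -r q; do
        # The real helper call would be: recreate_stream_queue "$q"
        echo "would recreate stream queue: $q"
    done
```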
This problem is caused by inconsistent information inside the RabbitMQ database: for example, the queue might still exist inside MNESIA_DURABLE_TABLE but be missing from MNESIA_TABLE;
there can, however, also be other causes. Because stream queues are not critical for OpenStack API operations, they can be deleted and recreated without backing up messages.