Troubleshoot stream (fanout) queues
===================================

Beginning with Yaook operator version ``0.20250227.0``, all queues that OpenStack creates with the suffix ``_fanout`` are of type stream, unless the override
``rabbit_stream_fanout=false`` is set in the OpenStack manifests. You can confirm the type of a queue by running:

.. code:: bash

    rabbitmqctl list_queues name type | grep <queue_name>

To learn more about RabbitMQ streams, see the `RabbitMQ documentation <https://www.rabbitmq.com/docs/streams>`_.
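
To audit all fanout queues at once instead of checking a single name, the output of ``rabbitmqctl list_queues name type`` can be filtered. The following is a minimal sketch: ``non_stream_fanout_queues`` is a hypothetical helper (not part of RabbitMQ or Yaook), and it assumes that ``rabbitmqctl`` prints one tab-separated ``name``/``type`` row per queue.

.. code:: bash

    # Hypothetical helper: reads "name<TAB>type" rows on stdin and prints the
    # names of queues ending in "_fanout" whose type is not "stream".
    # Header and informational lines do not match the pattern and are ignored.
    non_stream_fanout_queues() {
        awk -F '\t' '$1 ~ /_fanout$/ && $2 != "stream" { print $1 }'
    }

    # Usage inside the rabbitmq container:
    #   rabbitmqctl list_queues name type | non_stream_fanout_queues

An empty result suggests that every ``_fanout`` queue in the listed vhost is a stream.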


Commands provided in this guide need to be run inside the ``rabbitmq`` Kubernetes container.

Repair the RabbitMQ stream coordinator
--------------------------------------

If you encounter repeated OpenStack error messages like the following, the stream coordinator [1]_ may not be working properly:

.. code::

    "message": "Failed to consume message from queue: (0, 0): (541) INTERNAL_ERROR", 


To confirm that the stream coordinator is unhealthy, run:

.. code:: bash

   rabbitmq-diagnostics coordinator_status

With a healthy coordinator cluster, the command prints all cluster nodes, exactly one of which has the "Raft State" ``leader``. If the command returns an error or does not terminate, the coordinator cluster is broken and needs to be rebuilt.
Individual ``noproc`` members do not cause stream operations to fail as long as there is still a leader, although a quorum loss becomes more likely in that case. If the coordinator is functional but
you still see the error message above, read the next section.
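
The health criterion described above (exactly one leader) can be checked in a script. The following is a rough sketch under assumptions: the helper names are hypothetical, the status output is assumed to contain the word ``leader`` once per leader row and nowhere else, and the ``timeout`` utility is assumed to be available in the container.

.. code:: bash

    # Hypothetical helper: counts the lines on stdin that contain the word "leader".
    coordinator_leader_count() {
        # grep -c exits non-zero when nothing matches, but still prints 0.
        grep -cw 'leader' || true
    }

    # Succeeds only if the status output on stdin shows exactly one leader.
    coordinator_is_healthy() {
        [ "$(coordinator_leader_count)" -eq 1 ]
    }

    # Usage inside the rabbitmq container; timeout guards against a hanging command:
    #   timeout 60 rabbitmq-diagnostics coordinator_status | coordinator_is_healthy \
    #       && echo "coordinator healthy" || echo "coordinator unhealthy"

An error or a timeout produces no leader row, so both cases are reported as unhealthy.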

Probably the easiest and most reliable way to restart the coordinator is to reset the coordinator process on all nodes and then trigger a coordinator start by declaring a new, temporary stream queue:

.. code:: bash

    nodes=$(rabbitmqadmin list nodes -f bash -u "${RABBITMQ_DEFAULT_USER}" -p "${RABBITMQ_DEFAULT_PASS}")
    
    reset_command=""
    for node in $nodes; do
        reset_command+="ra:force_delete_server(coordination, {rabbit_stream_coordinator, '${node}'}),"
    done
    reset_command="${reset_command%?}."

    rabbitmqctl eval "${reset_command}"

    queue_name="dont-mind-me-just-triggering-a-coordinator-start"
    rabbitmqadmin declare queue name="${queue_name}" durable=true auto_delete=false arguments='{"x-queue-type": "stream"}' -u "${RABBITMQ_DEFAULT_USER}" -p "${RABBITMQ_DEFAULT_PASS}"
    rabbitmqctl delete_queue "${queue_name}"
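
Before running the reset, it can help to inspect the Erlang expression that is passed to ``rabbitmqctl eval``. The following is a minimal sketch: ``build_reset_command`` is a hypothetical helper (not part of RabbitMQ or Yaook) that builds the same expression as the loop above from the node names given as arguments.

.. code:: bash

    # Hypothetical helper: prints one ra:force_delete_server call per node,
    # joined by commas and terminated by the full stop that ends an Erlang
    # expression.
    build_reset_command() {
        if [ "$#" -eq 0 ]; then
            echo "Error: no node names given." >&2
            return 1
        fi
        cmd=""
        for node in "$@"; do
            cmd="${cmd}ra:force_delete_server(coordination, {rabbit_stream_coordinator, '${node}'}),"
        done
        # Drop the trailing comma and append the terminating full stop.
        printf '%s.\n' "${cmd%,}"
    }

    # Usage, with $nodes filled as in the script above:
    #   build_reset_command $nodes                          # preview only
    #   rabbitmqctl eval "$(build_reset_command $nodes)"    # run the reset
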

Run the status command again, this time to confirm that the coordinator is running:

.. code:: bash

   rabbitmq-diagnostics coordinator_status

If the coordinator is still unhealthy, we recommend rebuilding the entire RabbitMQ cluster [2]_.
Otherwise, confirm that no more ``Failed to consume message from queue`` messages appear in the logs. If they still appear, continue with the next section.

.. [1] The stream coordinator is a set of coordinator processes, one running on each RabbitMQ node, that use the Raft algorithm.
.. [2] If resolving the issue is not a matter of urgency, we instead recommend finding a fix for your case and contributing it to the documentation.

Resolve "stream_not_found" errors
---------------------------------

This problem produces the same ``Failed to consume message from queue`` messages in the OpenStack logs as a dysfunctional stream coordinator, but the RabbitMQ logs may additionally contain messages like the following:

.. code::

    errorContext: child_terminated
    reason: {{stream_not_found,
                {resource,<<"/">>,queue,<<"barbican.workers_fanout">>}},
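
To find out which queues are affected, the queue names can be extracted from such log messages. The following is a rough sketch: ``stream_not_found_queues`` is a hypothetical helper, and it assumes that the log text has the shape shown above, with the ``resource`` tuple on the same line as ``stream_not_found`` or on a later one. How you obtain the RabbitMQ logs depends on your deployment.

.. code:: bash

    # Hypothetical helper: reads RabbitMQ log text on stdin and prints each
    # queue name that follows a "stream_not_found" reason, once per name.
    stream_not_found_queues() {
        awk '
            /stream_not_found/ { pending = 1 }
            pending && match($0, /queue,<<"[^"]+">>/) {
                # Skip the 9 characters of queue,<<" and drop the closing ">>.
                print substr($0, RSTART + 9, RLENGTH - 12)
                pending = 0
            }
        ' | sort -u
    }

    # Usage, with the RabbitMQ log text piped to stdin:
    #   stream_not_found_queues < rabbitmq.log

For the example message above, the helper prints ``barbican.workers_fanout``.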


If you already tried to delete the queue using ``rabbitmqctl delete_queue``, you may have noticed that this does not work.
In that case, an internal delete call should still work and delete the queue data on all nodes. Afterwards, the queue must be recreated manually,
along with its exchange and binding:

.. code:: bash
    
    stream="insert-your-queue-name-here"

    # The routing key below is derived by stripping "_fanout", so require the full suffix.
    if [[ ${stream} != *_fanout ]]; then
        echo "Error: The recreation is only supported for stream queues created by OpenStack with the suffix '_fanout'. Not doing anything."
        echo "If this queue was created by OpenStack, the documentation you consulted may be out of date."
    else
        rabbitmqctl eval 'rabbit_db_queue:delete({resource,<<"/">>,queue,<<"'${stream}'">>},normal).'

        rabbitmqadmin declare queue name="${stream}" durable=true auto_delete=false arguments='{"x-queue-type": "stream"}' -u "${RABBITMQ_DEFAULT_USER}" -p "${RABBITMQ_DEFAULT_PASS}"
        rabbitmqadmin declare exchange name="${stream}" type=fanout durable=true auto_delete=true -u "${RABBITMQ_DEFAULT_USER}" -p "${RABBITMQ_DEFAULT_PASS}"
        rabbitmqadmin declare binding source="${stream}" destination="${stream}" routing_key="${stream%_fanout}" -u "${RABBITMQ_DEFAULT_USER}" -p "${RABBITMQ_DEFAULT_PASS}"
    fi
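
The script above relies on a naming convention: the queue and its exchange share the name ``<topic>_fanout``, and the binding uses ``<topic>`` as the routing key. The following is a minimal sketch of that derivation: ``fanout_routing_key`` is a hypothetical helper, not part of RabbitMQ or Yaook.

.. code:: bash

    # Hypothetical helper: prints the routing key for a "_fanout" stream queue,
    # or fails if the queue name does not carry the expected suffix.
    fanout_routing_key() {
        case "$1" in
            *_fanout)
                # Strip the "_fanout" suffix to obtain the routing key.
                printf '%s\n' "${1%_fanout}"
                ;;
            *)
                echo "Error: '$1' does not end in '_fanout'." >&2
                return 1
                ;;
        esac
    }

    # Usage:
    #   fanout_routing_key "barbican.workers_fanout"   # prints barbican.workers
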

If you want to recreate all stream queues, you can run:

.. code:: bash

    streams=$(rabbitmqctl eval "Q=rabbit_db_queue:list()." | cut -d "\"" -f 4 | grep "_fanout$")

    for stream in $streams; do
        # The routing key below is derived by stripping "_fanout", so require the full suffix.
        if [[ ${stream} != *_fanout ]]; then
            echo "Error: The recreation is only supported for stream queues created by OpenStack with the suffix '_fanout'. Skipping queue '${stream}'."
            echo "If this queue was created by OpenStack, the documentation you consulted may be out of date."
        else
            rabbitmqctl eval 'rabbit_db_queue:delete({resource,<<"/">>,queue,<<"'${stream}'">>},normal).'

            rabbitmqadmin declare queue name="${stream}" durable=true auto_delete=false arguments='{"x-queue-type": "stream"}' -u "${RABBITMQ_DEFAULT_USER}" -p "${RABBITMQ_DEFAULT_PASS}"
            rabbitmqadmin declare exchange name="${stream}" type=fanout durable=true auto_delete=true -u "${RABBITMQ_DEFAULT_USER}" -p "${RABBITMQ_DEFAULT_PASS}"
            rabbitmqadmin declare binding source="${stream}" destination="${stream}" routing_key="${stream%_fanout}" -u "${RABBITMQ_DEFAULT_USER}" -p "${RABBITMQ_DEFAULT_PASS}"
        fi
    done


This problem is caused by inconsistent information in the RabbitMQ database. For example, the queue might still exist in ``MNESIA_DURABLE_TABLE`` but be missing from ``MNESIA_TABLE``;
other causes are possible as well. Because stream queues are not critical for OpenStack API operations, they can be deleted and recreated without backing up messages.