Fix broken queues
=================

MessageNacked errors / Queues accumulate messages
-------------------------------------------------

``MessageNacked`` errors in AMQP client logs are usually caused by quorum
queues that reached their message size limit, which we configure to 5% of
the total available space of RabbitMQ.

Identify the affected queue(s):

.. code:: bash

   rabbitmqctl list_queues | grep "[1-9][0-9]*$" | grep -v "fanout$"

Alternatively, the ``rabbitmq-exporter`` container will log the affected
queues as well.

Ensure that the queue's consumer is running; its name usually corresponds to
the name of the queue.

This can also be caused by an incorrect ``oslo_messaging_notifications``
configuration. Check whether all of the specified topics are used or whether
you want to switch to the ``noop`` driver so that no notifications are
published. Also see the section "Removing Ceilometer" of
:doc:`../../ceilometer`.

If you want to permanently remove all messages from the queue without any
possibility of recovery, run the following:

.. code:: bash

   queue=""
   rabbitmqctl purge_queue "$queue"

If you run into this issue, ensure that the alert
``RabbitMQQuorumQueuesAccumulatingMessages`` specified in
:doc:`../../alerts/rabbitmq` is configured for the cluster, which detects
the issue before the limit is reached.

Repair broken members
---------------------

Before you investigate the queue, make sure that all nodes of the RabbitMQ
cluster are up and running.

First, you need to retrieve the type of the queue to determine the
subsequent commands:

.. code:: bash

   queue=""
   rabbitmqctl list_queues name type | grep -P "^$queue\t"

Afterwards, inspect the state of the queue:

.. code-block:: bash
   :caption: Quorum queues

   rabbitmq-queues quorum_status "$queue"

.. code-block:: bash
   :caption: Stream queues

   rabbitmq-streams stream_status "$queue"

A queue is only functional if it has one member in Raft state ``leader`` and
a quorum of its members (``⌊cluster_size/2⌋ + 1``) is running.
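As a quick check of the quorum rule: a queue with ``n`` members needs a majority, ``⌊n/2⌋ + 1``, of them online to stay available. A minimal sketch of that arithmetic (the ``quorum`` helper is purely illustrative, not a RabbitMQ command):

.. code-block:: bash

   # Majority (quorum) size for a Raft-based queue with n members
   quorum() { echo $(( $1 / 2 + 1 )); }

   quorum 3   # prints 2 -> a 3-member queue tolerates 1 member down
   quorum 5   # prints 3 -> a 5-member queue tolerates 2 members down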
You can identify non-functional members by a Raft state that is set to
neither ``leader`` nor ``follower``.

If the queue still has a leader, but one or more members are in an undesired
Raft state like ``noproc``, you can repair the queue by removing and
re-adding these members one after another:

.. code-block:: bash
   :caption: Quorum queues

   queue=""
   node=""
   rabbitmq-queues delete_member "$queue" "$node"
   rabbitmq-queues add_member "$queue" "$node"

.. code-block:: bash
   :caption: Stream queues

   queue=""
   node=""
   rabbitmq-streams delete_replica "$queue" "$node"
   rabbitmq-streams add_replica "$queue" "$node"

If a member is missing entirely, run the ``rabbitmq-queues add_member`` /
``rabbitmq-streams add_replica`` command for the affected queue and member.

If the queue cannot be recovered, it has to be deleted and recreated:

.. code-block:: bash

   queue=""
   rabbitmqctl delete_queue "$queue"

If that command does not work either, try to run:

.. code-block:: bash

   queue=""
   rabbitmqctl eval 'rabbit_db_queue:delete({resource,<<"/">>,queue,<<"'${queue}'">>},normal).'

Afterwards, restart the consuming service so the queue and the necessary
bindings get recreated.

Repair the RabbitMQ stream coordinator
--------------------------------------

.. warning::

   This guide uses bash functions from the file
   ``/var/lib/rabbitmq/utils/troubleshooting-utils.sh``, which is only
   introduced with newer infra operator versions. If this file is not
   available yet, you can copy it from the current ``devel`` branch. If you
   do not want to check out the branch locally, you can find the needed
   files here:

   * ``amqp-rabbitmqadmin-wrapper.sh``
   * ``amqp-troubleshooting-utils.sh``

   Afterwards, copy the files to the pod filesystem:

   .. code-block:: bash

      amqpserver=""
      pod=$(kubectl -n "$YAOOK_OP_NAMESPACE" get pod -l "state.yaook.cloud/parent-name=$amqpserver,state.yaook.cloud/parent-plural=amqpservers" -o name | head -n1 | cut -f2 -d'/')
      kubectl -n "$YAOOK_OP_NAMESPACE" cp ./yaook/op/infra/templates/amqp-rabbitmqadmin-wrapper.sh "$pod":/tmp/rabbitmqadmin-wrapper.sh
      kubectl -n "$YAOOK_OP_NAMESPACE" cp ./yaook/op/infra/templates/amqp-troubleshooting-utils.sh "$pod":/tmp/troubleshooting-utils.sh

      # run the following inside the RabbitMQ pod
      source /tmp/rabbitmqadmin-wrapper.sh
      source /tmp/troubleshooting-utils.sh

If you encounter repeated OpenStack error messages like the following, the
stream coordinator [1]_ is possibly not working properly:

.. code-block::

   "message": "Failed to consume message from queue: (0, 0): (541) INTERNAL_ERROR",

To confirm that the stream coordinator is unhealthy, run:

.. code-block:: bash

   rabbitmq-diagnostics coordinator_status

With a healthy coordinator cluster, the command will print all cluster nodes
with exactly one of them being in "Raft State" ``leader``. If it returns an
error or does not terminate, the coordinator cluster is broken and needs to
be rebuilt. Single ``noproc`` members will not cause stream operations to
fail as long as there is still a leader, even though a quorum loss may be
more likely in that case. If the coordinator is functional, but you still
see the error message above, read the next section.

The following command will restart the stream coordinator by resetting the
coordinator process on all nodes and triggering the coordinator start by
declaring a nonexistent stream queue. Afterwards, all stream queues are
recreated automatically, since they often end up in an erroneous state after
the reset:

.. code-block:: bash

   restart_stream_coordinator

Run the status check again, and this time confirm that the coordinator is
running:

.. code-block:: bash

   rabbitmq-diagnostics coordinator_status

If the coordinator is still unhealthy, we recommend rebuilding the entire
RabbitMQ cluster [2]_.

.. [1] The stream coordinator is a set of multiple coordinator processes
   that run on each RabbitMQ node and use the Raft algorithm.
.. [2] If resolving the issue is not a matter of urgency, we instead
   recommend finding a fix for your case and contributing it to the
   documentation.

Resolve "stream_not_found" errors
---------------------------------

This problem shows the same ``Failed to consume message from queue``
OpenStack logs as a dysfunctional stream coordinator, but the following
messages can additionally be found in the RabbitMQ logs:

.. code-block:: bash

   errorContext: child_terminated
   reason: {{stream_not_found, {resource,<<"/">>,queue,<<"barbican.workers_fanout">>}},

If you already tried to delete the queue using ``rabbitmqctl delete_queue``,
you might have noticed that this does not work. In that case, using an
internal delete call should still be functional and delete the queue data on
all nodes. Afterwards, we have to recreate the queue along with its exchange
and binding manually:

.. code-block:: bash

   recreate_stream_queue ""

If you want to recreate all stream queues, you can run:

.. code-block:: bash

   recreate_all_stream_queues

This problem is caused by inconsistent information inside the RabbitMQ
database, e.g. the queue might still exist inside ``MNESIA_DURABLE_TABLE``
but be missing inside the ``MNESIA_TABLE``; there can, however, also be
other causes. Because stream queues are not critical for OpenStack API
operations, they can be deleted and recreated without backing up messages.
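The internal delete call can be the same ``rabbit_db_queue:delete`` invocation via ``rabbitmqctl eval`` that is used for unrecoverable queues earlier in this guide. A sketch for the stream from the log excerpt above, assuming the default vhost ``/`` (substitute your own queue name):

.. code-block:: bash

   # Build the Erlang expression for the internal delete, then run it on a
   # RabbitMQ node with:  rabbitmqctl eval "$erl_expr"
   queue="barbican.workers_fanout"   # example name taken from the log excerpt
   erl_expr='rabbit_db_queue:delete({resource,<<"/">>,queue,<<"'${queue}'">>},normal).'
   echo "$erl_expr"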