RabbitMQ alerts

The RabbitMQ cluster-operator provides the following alerts which serve as a good basis: https://github.com/rabbitmq/cluster-operator/tree/v1.7.0/observability/prometheus/rules/rabbitmq

In addition, we recommend using the following alerts:

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq
spec:
  groups:
    - name: Rabbitmq
      rules:
        # critical alert for repeated failed attempts
        - alert: FailedAuthenticationAttemptsCritical
          expr: |
            rate(rabbitmq_auth_attempts_failed_total[5m:30s]) > 50
          for: 2m
          annotations:
            summary: |
              Critical rate of failed authentication attempts for RabbitMQ '{{ $labels.service }}',
              namespace '{{ $labels.namespace }}'.
            description: |
              Critical rate of failed authentication attempts for RabbitMQ '{{ $labels.service }}',
              namespace '{{ $labels.namespace }}'.
              Check which AMQPUsers are affected and check the AMQPUser status and infra-operator logs.
              Ensure that the configuration of the service uses the correct credentials and that the
              user exists inside RabbitMQ using `rabbitmqctl list_users`. If everything is correct,
              the RabbitMQ user might use a different password than specified inside the secret.
          labels:
            rulesgroup: rabbitmq
            severity: critical

        # minor alert for authentication issues that resolve on their own but might indicate a problem
        - alert: FailedAuthenticationAttemptsMinor
          expr: |
            rate(rabbitmq_auth_attempts_failed_total[5m:30s]) > 0
          for: 2m
          annotations:
            summary: |
              Detected failed authentication attempts for RabbitMQ '{{ $labels.service }}',
              namespace '{{ $labels.namespace }}'.
            description: |
              Detected failed authentication attempts for RabbitMQ '{{ $labels.service }}',
              namespace '{{ $labels.namespace }}'.
              Check which AMQPUsers are affected and check the AMQPUser status and infra-operator logs.
              Ensure that the configuration of the service uses the correct credentials and that the
              user exists inside RabbitMQ using `rabbitmqctl list_users`. If everything is correct,
              the RabbitMQ user might use a different password than specified inside the secret.
          labels:
            rulesgroup: rabbitmq
            severity: minor

        - alert: QueueNotRunning
          expr: |
            sum (rabbitmq_queue_not_running_count) by (service) > 0
          for: 5m
          annotations:
            description: |
              At least one queue inside cluster '{{ $labels.service }}' is not running. Check the logs of the rabbitmq-exporter container or run `rabbitmq-diagnostics list_unresponsive_queues`
              inside the mq pod to find out which queues are affected. Delete the queue with `rabbitmqctl delete_queue <queue>` so it can be recreated. If that does not work, try
              `queue="<queuename>"; rabbitmqctl eval 'rabbit_db_queue:delete({resource,<<"/">>,queue,<<"'${queue}'">>},normal).'`.
              You might have to restart the service that created the queue, unless it uses the suffix of a pod that does not exist anymore. To ensure the queue was recreated, run
              `rabbitmqctl list_queues | grep <queue>`.
          labels:
            rulesgroup: rabbitmq
            severity: major

        - alert: QueueMissingMembers
          expr: |
            sum (rabbitmq_queue_missing_members_count) by (service) * on (service) group_left(rabbitmq_cluster_permanent_id, pod) (count by (service, rabbitmq_cluster_permanent_id, pod) (label_replace(rabbitmq_identity_info, "pod", "$1", "pod", "^(.*)..$")))
            > 0
          for: 5m
          annotations:
            summary: |
              At least one queue inside cluster '{{ $labels.service }}' has missing members.
            description: |
              At least one queue inside cluster '{{ $labels.service }}' has missing members. Check the logs of the rabbitmq-exporter container inside the mq pods to find out which queues are affected.
              For more information, see https://docs.yaook.cloud/user/guides/rabbitmq/troubleshooting/queues.html#repair-broken-members
          labels:
            rulesgroup: rabbitmq
            severity: minor

        - alert: QueueOfflineMembers
          expr: |
            sum (rabbitmq_queue_offline_members_count) by (service) * on (service) group_left(rabbitmq_cluster_permanent_id, pod) (count by (service, rabbitmq_cluster_permanent_id, pod) (label_replace(rabbitmq_identity_info, "pod", "$1", "pod", "^(.*)..$")))
            > 0
          for: 15m  # offline members are expected during rolling restarts
          annotations:
            summary: |
              At least one queue inside cluster '{{ $labels.service }}' has offline members. If RabbitMQ is stuck during a restart, this alert is expected and can be ignored.
            description: |
              At least one queue inside cluster '{{ $labels.service }}' has offline members. If RabbitMQ is stuck during a restart, this alert is expected and can be ignored.
              For more information, see https://docs.yaook.cloud/user/guides/rabbitmq/troubleshooting/queues.html#repair-broken-members
          labels:
            rulesgroup: rabbitmq
            severity: minor

        - alert: RabbitMQQuorumQueuesAccumulatingMessages
          expr: |
            sum (rabbitmq_quorum_queue_accumulating_messages_count) by (service) * on (service) group_left(rabbitmq_cluster_permanent_id, pod) (count by (service, rabbitmq_cluster_permanent_id, pod) (label_replace(rabbitmq_identity_info, "pod", "$1", "pod", "^(.*)..$")))
            > 0
          for: 5m
          annotations:
            summary: |
              At least one queue inside cluster '{{ $labels.service }}' is accumulating messages, which means they are not being consumed.
            description: |
              At least one queue inside cluster '{{ $labels.service }}' is accumulating messages, which means they are not being consumed.
              For more information, see https://docs.yaook.cloud/user/guides/rabbitmq/troubleshooting/queues.html#messagenacked-errors-queues-accumulate-messages
          labels:
            rulesgroup: rabbitmq
            severity: major
          # In the future this should be critical, but currently it still causes false alarms, e.g. because OpenStack publishes to queues
          # of deleted compute services.

        - alert: RabbitMQClusterSizeFlappingCritical
          expr: changes((sum(rabbitmq_build_info * on(instance) group_left(service) rabbitmq_identity_info) by(namespace, service))[15m:30s]) > 6
          for: 2m
          annotations:
            description: |
              The RabbitMQ cluster behind service '{{ $labels.service }}' in namespace '{{ $labels.namespace }}' has changed size more than 6 times in the last 15 minutes.
              It is possible that one replica is in a single-node cluster. See this link for troubleshooting: https://docs.yaook.cloud/user/guides/rabbitmq/troubleshooting/amqp-restore.html#solve-pod-stuck-in-failed-startup-probe
              Use `rabbitmqctl reset` instead of `rabbitmqctl force_reset` for RabbitMQ 4.x.
              If the cluster is healthy, this alert could have been triggered by multiple rolling restarts in a row.
            summary: "The RabbitMQ cluster behind service '{{ $labels.service }}' in namespace '{{ $labels.namespace }}' has changed size more than 6 times in the last 15 minutes."
          labels:
            rulesgroup: rabbitmq
            severity: critical

        - alert: RabbitMQPodStuckInTerminating
          expr: count(kube_pod_deletion_timestamp{pod=~".*-mq-[0-4]"}) by (pod) > 0
          for: 10m
          annotations:
            summary: "The RabbitMQ pod '{{ $labels.pod }}' has been stuck in terminating for more than 10 minutes."
            description: |
              The RabbitMQ pod '{{ $labels.pod }}' has been stuck in terminating for more than 10 minutes.
              The pod will force terminate after 60 minutes. In that case some queues might become dysfunctional.
              This might be the case because the node is quorum critical. Run `rabbitmq-diagnostics check_if_node_is_quorum_critical` for more information.
              Otherwise, check whether the exporter container is still running while the rabbitmq container has already terminated. In that case, stop the exporter container
              using `kill 1`.
          labels:
            rulesgroup: rabbitmq
            severity: critical
...
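
The alert expressions above can be unit-tested with `promtool test rules` before deployment. The following is a minimal sketch for the `FailedAuthenticationAttemptsCritical` alert, assuming the contents of `spec.groups` have been extracted into a plain Prometheus rule file named `rabbitmq-rules.yaml`; the file name and the `service`/`namespace` label values are illustrative:

```yaml
# promtool-tests.yaml -- run with: promtool test rules promtool-tests.yaml
rule_files:
  - rabbitmq-rules.yaml  # plain rule file extracted from the PrometheusRule's spec.groups

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # Counter grows by 3000 every 30s -> rate of 100 failures/s,
      # which is above the critical threshold of 50/s.
      - series: 'rabbitmq_auth_attempts_failed_total{service="keystone-mq", namespace="yaook"}'
        values: '0+3000x40'
    alert_rule_test:
      # Evaluate well after the 5m rate window plus the 2m "for" duration.
      - eval_time: 10m
        alertname: FailedAuthenticationAttemptsCritical
        exp_alerts:
          - exp_labels:
              service: keystone-mq
              namespace: yaook
              rulesgroup: rabbitmq
              severity: critical
```

Note that the same input series also triggers `FailedAuthenticationAttemptsMinor` (threshold `> 0`); a separate `alert_rule_test` entry can assert that as well.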