RabbitMQ alerts
The RabbitMQ cluster-operator provides a set of alerts that serve as a good basis: https://github.com/rabbitmq/cluster-operator/tree/v1.7.0/observability/prometheus/rules/rabbitmq
In addition, we recommend using the following alerts:
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq
spec:
  groups:
    - name: Rabbitmq
      rules:
        # Critical alert for repeated failed authentication attempts
        - alert: FailedAuthenticationAttemptsCritical
          expr: |
            rate(rabbitmq_auth_attempts_failed_total[5m:30s]) > 50
          for: 2m
          annotations:
            summary: |
              Critical rate of failed authentication attempts for RabbitMQ '{{ $labels.service }}',
              namespace '{{ $labels.namespace }}'.
            description: |
              Critical rate of failed authentication attempts for RabbitMQ '{{ $labels.service }}',
              namespace '{{ $labels.namespace }}'.
              Check which AMQPUsers are affected and check the AMQPUser status and the infra-operator logs.
              Ensure that the configuration of the service uses the correct credentials and that the
              user exists inside RabbitMQ using `rabbitmqctl list_users`. If everything is correct,
              the RabbitMQ user might use a different password than the one specified inside the secret.
          labels:
            rulesgroup: rabbitmq
            severity: critical
        # Minor alert for authentication issues that resolve on their own but might indicate a problem
        - alert: FailedAuthenticationAttemptsMinor
          expr: |
            rate(rabbitmq_auth_attempts_failed_total[5m:30s]) > 0
          for: 2m
          annotations:
            summary: |
              Detected failed authentication attempts for RabbitMQ '{{ $labels.service }}',
              namespace '{{ $labels.namespace }}'.
            description: |
              Detected failed authentication attempts for RabbitMQ '{{ $labels.service }}',
              namespace '{{ $labels.namespace }}'.
              Check which AMQPUsers are affected and check the AMQPUser status and the infra-operator logs.
              Ensure that the configuration of the service uses the correct credentials and that the
              user exists inside RabbitMQ using `rabbitmqctl list_users`. If everything is correct,
              the RabbitMQ user might use a different password than the one specified inside the secret.
          labels:
            rulesgroup: rabbitmq
            severity: minor
        - alert: QueueNotRunning
          expr: |
            sum (rabbitmq_queue_not_running_count) by (service) > 0
          for: 5m
          annotations:
            description: |
              At least one queue inside cluster '{{ $labels.service }}' is not running. Check the logs of the
              rabbitmq-exporter container or run `rabbitmq-diagnostics list_unresponsive_queues` inside the
              mq pod to find out which queues are affected. Delete the queue with `rabbitmqctl delete_queue <queue>`
              so it can be recreated. If that does not work, try
              `queue="<queuename>"; rabbitmqctl eval 'rabbit_db_queue:delete({resource,<<"/">>,queue,<<"'${queue}'">>},normal).'`
              You might have to restart the service that created the queue, unless it uses the suffix of a pod
              that does not exist anymore. To ensure the queue was recreated, run
              `rabbitmqctl list_queues | grep <queue>`.
          labels:
            rulesgroup: rabbitmq
            severity: major
        - alert: QueueMissingMembers
          expr: |
            sum (rabbitmq_queue_missing_members_count) by (service) * on (service) group_left(rabbitmq_cluster_permanent_id, pod) (count by (service, rabbitmq_cluster_permanent_id, pod) (label_replace(rabbitmq_identity_info, "pod", "$1", "pod", "^(.*)..$")))
            > 0
          for: 5m
          annotations:
            summary: |
              At least one queue inside cluster '{{ $labels.service }}' has missing members.
            description: |
              At least one queue inside cluster '{{ $labels.service }}' has missing members. Check the logs of
              the rabbitmq-exporter container inside the mq pods to find out which queues are affected.
              For more information, see https://docs.yaook.cloud/user/guides/rabbitmq/troubleshooting/queues.html#repair-broken-members
          labels:
            rulesgroup: rabbitmq
            severity: minor
        - alert: QueueOfflineMembers
          expr: |
            sum (rabbitmq_queue_offline_members_count) by (service) * on (service) group_left(rabbitmq_cluster_permanent_id, pod) (count by (service, rabbitmq_cluster_permanent_id, pod) (label_replace(rabbitmq_identity_info, "pod", "$1", "pod", "^(.*)..$")))
            > 0
          for: 15m  # offline members are expected during rolling restarts
          annotations:
            summary: |
              At least one queue inside cluster '{{ $labels.service }}' has offline members. If RabbitMQ is
              stuck during a restart, this alert is expected and can be ignored.
            description: |
              At least one queue inside cluster '{{ $labels.service }}' has offline members. If RabbitMQ is
              stuck during a restart, this alert is expected and can be ignored.
              For more information, see https://docs.yaook.cloud/user/guides/rabbitmq/troubleshooting/queues.html#repair-broken-members
          labels:
            rulesgroup: rabbitmq
            severity: minor
        # In the future this should be critical, but currently it still causes false alarms, e.g. because
        # OpenStack publishes to queues of deleted compute services.
        - alert: RabbitMQQuorumQueuesAccumulatingMessages
          expr: |
            sum (rabbitmq_quorum_queue_accumulating_messages_count) by (service) * on (service) group_left(rabbitmq_cluster_permanent_id, pod) (count by (service, rabbitmq_cluster_permanent_id, pod) (label_replace(rabbitmq_identity_info, "pod", "$1", "pod", "^(.*)..$")))
            > 0
          for: 5m
          annotations:
            summary: |
              At least one queue inside cluster '{{ $labels.service }}' is accumulating messages, which means
              they are not being consumed.
            description: |
              At least one queue inside cluster '{{ $labels.service }}' is accumulating messages, which means
              they are not being consumed.
              For more information, see https://docs.yaook.cloud/user/guides/rabbitmq/troubleshooting/queues.html#messagenacked-errors-queues-accumulate-messages
          labels:
            rulesgroup: rabbitmq
            severity: major
        - alert: RabbitMQClusterSizeFlappingCritical
          expr: changes((sum(rabbitmq_build_info * on(instance) group_left(service) rabbitmq_identity_info) by(namespace, service))[15m:30s]) > 6
          for: 2m
          annotations:
            summary: "The RabbitMQ cluster behind service '{{ $labels.service }}' in namespace '{{ $labels.namespace }}' has changed size more than 6 times in the last 15 minutes."
            description: |
              The RabbitMQ cluster behind service '{{ $labels.service }}' in namespace '{{ $labels.namespace }}'
              has changed size more than 6 times in the last 15 minutes.
              It is possible that one replica is in a single-node cluster. See this guide for troubleshooting:
              https://docs.yaook.cloud/user/guides/rabbitmq/troubleshooting/amqp-restore.html#solve-pod-stuck-in-failed-startup-probe
              Use `rabbitmqctl reset` instead of `rabbitmqctl force_reset` for RabbitMQ 4.x.
              If the cluster is healthy, this alert could have been triggered by multiple rolling restarts in a row.
          labels:
            rulesgroup: rabbitmq
            severity: critical
        - alert: RabbitMQPodStuckInTerminating
          expr: count(kube_pod_deletion_timestamp{pod=~".*-mq-[0-4]"}) by (pod) > 0
          for: 10m
          annotations:
            summary: "The RabbitMQ pod '{{ $labels.pod }}' has been stuck in terminating for more than 10 minutes."
            description: |
              The RabbitMQ pod '{{ $labels.pod }}' has been stuck in terminating for more than 10 minutes.
              The pod will be force-terminated after 60 minutes; in that case some queues might become dysfunctional.
              This might happen because the node is quorum critical. Run
              `rabbitmq-diagnostics check_if_node_is_quorum_critical` for more information.
              Otherwise, check whether the exporter container is still running while the rabbitmq container has
              already terminated. In that case, stop the exporter container using `kill 1`.
          labels:
            rulesgroup: rabbitmq
            severity: critical
...
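To get an intuition for the authentication thresholds above, recall that `rate(rabbitmq_auth_attempts_failed_total[5m:30s])` yields the per-second increase of a counter over a 5-minute window. The following is a minimal Python sketch of that calculation (the function name and sample data are illustrative, not part of Prometheus; real `rate()` additionally handles counter resets and extrapolates to the window boundaries):

```python
# Sketch: approximate what PromQL rate() computes for a monotonic counter.
# Samples are (timestamp_seconds, counter_value) pairs, e.g. one scrape every 30s.
def approximate_rate(samples):
    """Per-second increase between the first and last sample of the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# 5-minute window sampled every 30s; the counter grows by 60 failures per sample,
# i.e. 2 failed authentication attempts per second.
window = [(30 * i, 60 * i) for i in range(11)]  # timestamps 0s..300s

CRITICAL_THRESHOLD = 50  # failures/s, as in FailedAuthenticationAttemptsCritical
MINOR_THRESHOLD = 0      # failures/s, as in FailedAuthenticationAttemptsMinor

rate = approximate_rate(window)
print(rate)                        # 2.0 failures per second
print(rate > CRITICAL_THRESHOLD)   # False: below the critical threshold
print(rate > MINOR_THRESHOLD)      # True: would still fire the minor alert
```

This illustrates why both alerts exist: the minor alert fires on any sustained non-zero failure rate, while the critical alert only fires once failures exceed 50 per second for 2 minutes.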