Metrics integration issues

Incident Report for CloudAMQP

Postmortem

Broker metrics and metrics integrations degradation

Window: 2026-04-23 03:25 – 05:35 UTC (2h 10m)

Impact

Collection of broker metrics (connections, channels, queues, consumers, message rates, node and netsplit state) was degraded, with a growing share of samples dropped until recovery at 05:35 UTC. Downstream effects:

  • Queue, Consumer, Connection, Channel, Connection flow, and Netsplit alarms: threshold breaches during the window may not have raised an alarm, or may have raised one late. Alarms already tripped
    before 03:25 UTC kept their state.
  • Metrics Integrations (Datadog, CloudWatch, New Relic, Splunk, Dynatrace, etc.): delivery of broker metrics was reduced for the duration of the window. Alarms configured on the receiving side may
    likewise have missed or lagged.

Not affected: server metrics (CPU, memory, disk) and their corresponding alarms, Notice alarms, broker availability, and the metric graphs shown in the CloudAMQP console (these use a separate data path
and continued to update normally).

Timeline (UTC)

  • 03:25 — broker-metrics collection begins degrading
  • 03:35 — ~30% of broker-metric samples missing
  • 03:50 — ~50% of broker-metric samples missing; loss plateaus at this level
  • 05:25 — collector workers cycled and re-initialised
  • 05:35 — full throughput restored

Root cause

Broker metrics are polled from each cluster's management HTTP API by a pool of collector workers and published to an internal message bus that both the Alarms service and Metrics Integrations consume
from. At 03:25 UTC the collector service became unresponsive without raising an error. Affected workers silently stopped polling while still appearing healthy to the platform's supervision, so no
automatic restart was triggered. On-call staff began investigating around 05:00 UTC; force-restarting the service restored metric flow.

The exact trigger has not been identified. Our focus is on ensuring the condition is detected and handled promptly if it recurs.
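
To make the data path above concrete, a minimal sketch in Python follows. It is an illustration only, not our implementation: the management API paths are standard RabbitMQ ones, while the cluster
address, credentials, poll interval, and the publish callback standing in for the internal message bus are placeholders.

    # Illustrative sketch of the collection path, not the production collector.
    # The publish callback stands in for the internal message bus.
    import time
    import requests

    MGMT_URL = "https://broker.example.com/api"  # placeholder cluster address
    POLL_INTERVAL = 30                           # seconds, illustrative

    def poll_broker(session, publish):
        """Fetch one round of broker metrics and hand them to the bus."""
        overview = session.get(f"{MGMT_URL}/overview", timeout=10).json()
        queues = session.get(f"{MGMT_URL}/queues", timeout=10).json()
        totals = overview.get("object_totals", {})
        sample = {
            "ts": time.time(),
            "connections": totals.get("connections"),
            "channels": totals.get("channels"),
            "consumers": totals.get("consumers"),
            "queues": len(queues),
        }
        publish("broker.metrics", sample)  # consumed by Alarms and Metrics Integrations

    def run(publish, auth=("monitoring-user", "password")):  # placeholder credentials
        session = requests.Session()
        session.auth = auth
        while True:
            try:
                poll_broker(session, publish)
            except requests.RequestException as exc:
                # A failed poll is at least visible; the incident involved polls
                # that stopped silently, which the changes below target.
                print(f"poll failed: {exc}")
            time.sleep(POLL_INTERVAL)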

What we are changing

  • Added metrics and alarms specific to this service, with high-urgency paging for on-call.
  • Reliability improvements in the service itself to detect and restart stalled work (a rough sketch of the idea follows below).
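
The second item is easiest to picture as a heartbeat watchdog: each worker records a timestamp after every successful poll, and a separate check restarts any worker that has gone quiet. The sketch
below is a rough illustration under assumed names and thresholds, not the actual change.

    # Rough sketch of stall detection via heartbeats; names and the 120 s
    # threshold are illustrative, not the production values.
    import threading
    import time

    STALL_AFTER = 120  # seconds without a successful poll before intervening

    class Watchdog:
        def __init__(self, restart_worker):
            self._heartbeats = {}
            self._lock = threading.Lock()
            self._restart_worker = restart_worker  # callback that cycles one worker

        def beat(self, worker_id):
            """Called by a worker after each successful poll/publish cycle."""
            with self._lock:
                self._heartbeats[worker_id] = time.monotonic()

        def check(self):
            """Run periodically; restart any worker that has gone quiet."""
            now = time.monotonic()
            with self._lock:
                stalled = [w for w, ts in self._heartbeats.items()
                           if now - ts > STALL_AFTER]
            for worker_id in stalled:
                # The stalled worker still looks alive to the platform, so an
                # external supervisor would not act; restart it explicitly.
                self._restart_worker(worker_id)

The point of keying on successful polls rather than process liveness is exactly the failure mode seen here: a worker that is running but no longer doing work.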

What you can do

CloudAMQP offers a new generation of metrics integrations based on Prometheus, with a re-engineered pipeline that does not rely on these centralised services — each server forwards data directly to your
endpoint. These have been running in production for some time and have proved very reliable.
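
If you want to confirm that broker metrics are flowing over the new path, a quick check against the Prometheus-format endpoint is usually enough. The sketch below assumes a placeholder URL (the actual
address for your servers is shown in the CloudAMQP console) and a standard metric name from the rabbitmq_prometheus plugin; adjust both to your setup.

    # Quick check that a node is serving broker metrics in Prometheus format.
    # METRICS_URL is a placeholder; use the address shown for your server.
    import requests

    METRICS_URL = "https://your-server.example.com/metrics"

    def broker_metrics_present():
        """Return True if the node exposes a standard broker metric."""
        body = requests.get(METRICS_URL, timeout=10).text
        return any(line.startswith("rabbitmq_queue_messages")
                   for line in body.splitlines())

    if __name__ == "__main__":
        print("broker metrics exposed:", broker_metrics_present())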

Posted Apr 24, 2026 - 08:48 UTC

Resolved

This incident has been resolved.
Posted Apr 23, 2026 - 06:00 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 23, 2026 - 05:25 UTC

Investigating

There appears to be an issue with sending metrics to all Legacy metrics integrations. Alarms based on broker metrics are also affected.

We are currently investigating the issue.

Prometheus-based metrics integrations (Datadog v3, etc.) are not affected.
Posted Apr 23, 2026 - 03:30 UTC
This incident affected: Metrics.