System metrics not forwarded to legacy integration

Incident Report for CloudAMQP

Postmortem

Legacy host metrics integrations degraded

Incident window: 2026-05-21 18:37 UTC – 2026-05-22 04:05 UTC (9h 28m)
Affected service: Legacy metric integrations (CloudWatch, Datadog, Librato, New Relic, Splunk, Stackdriver). Host metrics affected, not broker metrics, nor was internal monitoring or console graphs.

Summary

For ~9.5 hours, the service that collects host-level metrics (CPU, memory, disk, network) for legacy third-party integrations entered a crash loop and stopped shipping data.

Root Cause

An internal credential-signing service was migrated to a new container runtime the afternoon before. Due to a bug with environment variables the collector could not renew it's credentials. The collector's existing credential remained valid for several hours, so the failure only surfaced when it tried to renew.

Resolution

On-call staff detected the failure early morning, developers helped restore the service and collector recovered at 04:05 UTC.

Prevention

  • All services, both signing and collector report failed authentication attempts earlier and with higher severity
  • Metrics pipeline alarm thresholds was tightened so a similar drop in data triggers alarms faster.
Posted May 22, 2026 - 13:52 UTC

Resolved

This incident has been resolved.
Posted May 22, 2026 - 04:50 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 22, 2026 - 04:08 UTC

Investigating

Since around 8pm UTC last night, some metrics are not being forwarded to legacy integration. Prometheus integrations not affected. We are investingating.
Posted May 22, 2026 - 03:16 UTC
This incident affected: Metrics.