Incident window: 2026-05-21 18:37 UTC – 2026-05-22 04:05 UTC (9h 28m)
Affected service: Legacy metric integrations (CloudWatch, Datadog, Librato, New Relic, Splunk, Stackdriver). Host metrics affected, not broker metrics, nor was internal monitoring or console graphs.
For ~9.5 hours, the service that collects host-level metrics (CPU, memory, disk, network) for legacy third-party integrations entered a crash loop and stopped shipping data.
An internal credential-signing service was migrated to a new container runtime the afternoon before. Due to a bug with environment variables the collector could not renew it's credentials. The collector's existing credential remained valid for several hours, so the failure only surfaced when it tried to renew.
On-call staff detected the failure early morning, developers helped restore the service and collector recovered at 04:05 UTC.