us-east-1

Incident Report for CloudAMQP

Postmortem

Who was affected

Customers with dedicated clusters (not shared), in AWS US-East-1, where the clusters were created more than 2 months ago. New clusters and customers on the Little Lemur and Tough Tiger plans were not affected, nor was anyone else outside AWS US-East-1.

Long story short

All dedicated instances whose names started with "CloudAMQP-" in our AWS US-East-1 account were terminated, including their disks.
We stopped our provisioning service.
We rolled back a commit that changed the deprovisioning process, because we suspected that it contained a bug.
We recommended that users create a new cluster to resume messaging.
We wrote a script that created new clusters from the information in our database.
We hit the Amazon instance limit. We were below the limit, but it seems that Amazon included terminated instances in its count.
All clusters came back online, but without any queues or messages.
It turned out that we had a string interpolation bug.

The long story

On Friday I (Carl Hörberg, the CEO) was going through our list of AWS servers and comparing it to our records of what we should have. I noticed that we had more servers than we were supposed to. Over time we've had different ways of naming our servers, and when deprovisioning servers we look for these naming conventions. One naming convention was missing from that lookup, which is why we had too many servers.

So I went into our deprovisioning code and changed this code:

insts_old = @ec2.instances.with_tag('Name', "CloudAMQP-#@name-*")
insts_new = @ec2.instances.with_tag('Service', 'CloudAMQP').with_tag('Name', "#@name-*")
insts = insts_new.map{ |i| i } + insts_old.map{ |i| i }

to this

insts = [
  @ec2.instances.with_tag('Name', "CloudAMQP-#@name0*").to_a,
  @ec2.instances.with_tag('Name', "CloudAMQP-#@name-*").to_a,
  @ec2.instances.with_tag('Service', 'CloudAMQP').with_tag('Name', "#@name-*").to_a,
].flatten

(The code then terminates all instances found)

The cause was Ruby's shorthand string interpolation for instance variables (#@name, written without braces).

irb(main):001:0> @name = 'yadda'
=> "yadda"
irb(main):002:0> "CloudAMQP-#@name0*"
=> "CloudAMQP-*"

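Ruby reads #@name0 as the instance variable @name0, not as @name followed by a literal 0. That variable is unset, so it evaluates to nil and interpolates as an empty string. The resulting pattern, "CloudAMQP-*", matched every instance with that prefix, in this case with disastrous results.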
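The difference between the shorthand and the braced form can be reproduced without touching AWS. The following is a minimal sketch; the class name PatternDemo and its method names are hypothetical and are not part of our deprovisioning code.

```ruby
# Hypothetical demo class (not CloudAMQP code) that builds the same tag
# pattern with both interpolation styles.
class PatternDemo
  def initialize(name)
    @name = name
  end

  # Shorthand form: Ruby takes the longest valid identifier, so this looks up
  # @name0. That variable is unset (nil), and nil interpolates as "".
  def shorthand_pattern
    "CloudAMQP-#@name0*"
  end

  # Braced form: the braces end the variable name before the literal "0".
  def braced_pattern
    "CloudAMQP-#{@name}0*"
  end
end

demo = PatternDemo.new('yadda')
puts demo.shorthand_pattern # CloudAMQP-*
puts demo.braced_pattern    # CloudAMQP-yadda0*
```

Writing the braces every time costs two characters and removes the ambiguity about where the variable name ends.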

How we're preventing it from ever happening again

Servers are no longer terminated automatically. They are stopped, and are terminated only after a team member has manually confirmed their status.
All changes to our provisioning/deprovisioning code now require a code review.
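The stop-first, terminate-after-confirmation flow could look roughly like the sketch below. It is an illustration only, not our production code: FakeInstance, TwoPhaseDeprovisioner and the confirmed_by parameter are hypothetical names, and the instances are plain in-memory objects rather than AWS SDK objects.

```ruby
# Hypothetical in-memory stand-in for a server; not an AWS SDK class.
FakeInstance = Struct.new(:name, :state)

# Hypothetical two-phase deprovisioner: phase one only stops instances,
# phase two terminates them and requires an explicit human confirmation.
class TwoPhaseDeprovisioner
  def initialize(instances)
    @instances = instances
  end

  # Phase one is reversible: running instances with exactly this name
  # are stopped, nothing is deleted.
  def stop(name)
    matches = @instances.select { |i| i.name == name && i.state == :running }
    matches.each { |i| i.state = :stopped }
    matches
  end

  # Phase two is irreversible, so it refuses to run without a confirmation
  # and only touches instances that were already stopped.
  def terminate(name, confirmed_by: nil)
    raise ArgumentError, 'termination requires manual confirmation' if confirmed_by.nil?

    matches = @instances.select { |i| i.name == name && i.state == :stopped }
    matches.each { |i| i.state = :terminated }
    matches
  end
end

servers = [FakeInstance.new('yadda-01', :running), FakeInstance.new('other-01', :running)]
deprovisioner = TwoPhaseDeprovisioner.new(servers)
deprovisioner.stop('yadda-01')
deprovisioner.terminate('yadda-01', confirmed_by: 'team member')
```

The sketch matches on exact names instead of wildcard patterns, which is an assumption made to keep the example small. The point it shows is that a mistaken match in phase one leaves a stopped server with its disks, not a deleted one.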

Refunding

We will refund 50% of the cluster cost to all affected customers on their next bill.

Posted Nov 22, 2014 - 10:42 UTC

Resolved

All servers are back online, but unfortunately without the queues or messages. A full post mortem is coming later today. We are, needless to say, devastated over this accident. Don't hesitate to contact us at support@cloudamqp.com if you have any questions.

Posted Nov 22, 2014 - 03:13 UTC

Update

All but 3 servers are back online

Posted Nov 22, 2014 - 00:38 UTC

Monitoring

We are reprovisioning all servers. A very rough ETA is 30 min.

Posted Nov 21, 2014 - 22:52 UTC

Identified

Today we introduced a bug that deleted almost all dedicated servers in us-east-1. We are working on restoring all servers, but this will take some time. This is of course a disaster and we will post a full post mortem at a later time. The fastest way to get back up is to create a new instance, but we will restore all deleted servers.

Posted Nov 21, 2014 - 20:46 UTC

Update

Sorry for the faulty information, disks were also deleted.

Posted Nov 21, 2014 - 20:27 UTC

Update

All disks seem to be intact. We are working on contacting AWS and getting the servers online.

Posted Nov 21, 2014 - 20:13 UTC

Investigating

Almost all servers have been terminated from the us-east-1 datacenter

Posted Nov 21, 2014 - 20:12 UTC