The affected customers were those with dedicated (not shared) clusters in AWS US-East-1 whose clusters were created more than 2 months ago. New clusters and customers on the Little Lemur and Tough Tiger plans were not affected, nor was anyone outside AWS US-East-1.
On Friday I (Carl Hörberg, the CEO) was going through our list of AWS servers and comparing it to our records of what we should have. I noticed that we had more servers than we were supposed to. Over time we've had different ways of naming our servers, and when deprovisioning servers we look for these naming conventions. One naming convention was missing from that lookup, which is why we had too many servers.
So I went into our deprovisioning code and changed this code:
insts_old = @ec2.instances.with_tag('Name', "CloudAMQP-#@name-*")
insts_new = @ec2.instances.with_tag('Service', 'CloudAMQP').with_tag('Name', "#@name-*")
insts = insts_new.map{ |i| i } + insts_old.map{ |i| i }
to this:
insts = [
@ec2.instances.with_tag('Name', "CloudAMQP-#@name0*").to_a,
@ec2.instances.with_tag('Name', "CloudAMQP-#@name-*").to_a,
@ec2.instances.with_tag('Service', 'CloudAMQP').with_tag('Name', "#@name-*").to_a,
].flatten
(The code then terminates all instances found)
What went wrong here was Ruby's string interpolation with the instance variable shortcut (writing #@name instead of #{@name}):
irb(main):001:0> @name = 'yadda'
=> "yadda"
irb(main):002:0> "CloudAMQP-#@name0*"
=> "CloudAMQP-*"
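Ruby tries to resolve the variable @name0 instead of @name. That variable is not set, so it returns nil, which interpolates as an empty string. The pattern therefore became "CloudAMQP-*" and matched every server whose name starts with CloudAMQP-, in this case with disastrous results.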
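To make the difference concrete, here is a minimal sketch that contrasts the shortcut form with the braced form of interpolation. The class PatternDemo and its method names are hypothetical and exist only for illustration. They are not part of our deprovisioning code. The checked_pattern guard is one possible safeguard, shown as an assumption rather than as what we actually run.

```ruby
# Hypothetical demo class; only the interpolation behaviour comes from the post.
class PatternDemo
  def initialize(name)
    @name = name
  end

  # Shortcut form: Ruby reads "@name0" as the variable name.
  # @name0 is unset (nil), so it interpolates as an empty string.
  def shortcut_pattern
    "CloudAMQP-#@name0*"
  end

  # Braced form: the braces end the variable name before the "0".
  def braced_pattern
    "CloudAMQP-#{@name}0*"
  end

  # Assumed safeguard: refuse to build a pattern without a cluster name in it.
  def checked_pattern
    pattern = braced_pattern
    if @name.nil? || @name.empty? || !pattern.include?(@name)
      raise ArgumentError, "pattern #{pattern.inspect} does not contain a cluster name"
    end
    pattern
  end
end

demo = PatternDemo.new('yadda')
puts demo.shortcut_pattern # => CloudAMQP-*
puts demo.braced_pattern   # => CloudAMQP-yadda0*
```

With the braced form the pattern stays scoped to one cluster name, and a check like checked_pattern would stop before any instance lookup if the name were missing.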
We will refund 50% of the cluster cost to all affected customers on their next bill.