A typographical error made while reconfiguring Amazon Web Services caused an outage that took down millions of Internet sites for hours this week.
The Amazon Web Services Inc. (AWS) Simple Storage Service (S3) team was debugging a problem that caused the S3 billing system to run more slowly than expected. At 9:37 am PT, an S3 team member executed a command intended to remove a small number of servers from one of the S3 subsystems used for billing, according to a statement Thursday by Amazon. The team member mistyped the command and removed a larger number of servers than intended. Outages in the cloud service continued until 1:54 pm PT, but full recovery took longer because some services had to work through a backlog, Amazon says.
Affected sites included Netflix, Airbnb, Slack, and Light Reading.
Amazon apologized for the outage and says it is changing its procedures to prevent a recurrence. Among the changes: reducing the AWS Service Health Dashboard's dependency on Amazon S3. During the outage, Amazon had difficulty communicating status to the public because the status board itself depended on the affected S3 service.
The outage cost S&P 500 companies $150 million, and cost US financial services companies using the affected S3 infrastructure $160 million, according to Cyence, a firm that works with insurance companies to estimate cyber risk.