Ugh! Typo to blame for server blackout
Wrong coding caused 4-hour outage at AWS
@eweise, USA TODAY
The major outage that hit tens of thousands of websites using Amazon’s AWS cloud computing service on Tuesday was the result of a simple typo: just one incorrectly entered command.
The four-hour outage at Amazon Web Services’ S3 system, a giant provider of backend services for close to 150,000 websites, caused disruptions, slowdowns and failure-to-load errors across the U.S.
Amazon’s Simple Storage Service (S3) lets companies use the cloud to store files, photos, video and other information they serve up on their websites. When the system was down, websites could not access those photos, logos or data. While most of the sites didn’t go down, many had broken links and were only partly functional.
Thursday, Amazon published a public letter outlining what happened. Here’s the rundown:
Tuesday morning, an Amazon team was investigating a problem that was slowing down the S3 billing system. At 9:37 a.m. Pacific time, one of the team members executed a command that was meant to take a few of the S3 servers offline.
“Unfortunately,” Amazon said in its posting, one part of that command was entered incorrectly — it had a typo.
That mistake took more servers offline than the team had intended.
Those servers ran two important systems for the whole East Coast region, including the ones that let the trillions of files stored in S3 be placed into customers’ websites. To restore service, both systems required a full restart.
Amazon acknowledged that it hadn’t actually done a full restart on the main subsystems that went offline “for many years.”