Should We Worry About The S3 Outage?

Chris Evans | AWS, Cloud

Last week, Amazon Web Services' S3 (Simple Storage Service) suffered what was termed "high error rates" and essentially became unusable for several hours for users in the US-East-1 (Virginia) region.  Many stories have circulated about the number of SaaS-based applications that were affected, with commentators quick to question the wisdom of organisations keeping their data in only a single AWS region.  But is this really fair?

First of all, check out this post by Steve Chambers over at the ITSM.tools blog.  He makes some good points, including that S3 has dependencies on the US-East-1 region, and that US-East-1 is the default region when a bucket's region isn't specified explicitly.  So, depending on the actual technical fault, it may have been hard to avoid being affected in the first place.

Second, let's think for a moment about the cost of building availability into an application versus the impact of downtime.  Designing availability into any IT solution is always a trade-off between cost and risk.  Most banks can't afford even a few minutes' downtime.  The same applies to the airline industry, and these organisations have built their infrastructure (a) on that basis and (b) with knowledge of the impact of their systems being down.  That impact is directly measurable in terms of financial loss.

But what happens if you're a SaaS application built on AWS?  What's the impact to your business of every minute of downtime?  When you calculate that cost and offset it against the availability offered by the cloud provider, how much does that downtime mean financially?  Here's an example.  S3 is designed for 99.99% uptime in any given month; assuming a 30-day month, that's 43,200 minutes, and the remaining 0.01% translates to 4.32 minutes of downtime per month.  In this instance AWS clearly breached the SLA, but in general the service is expected to be down for fewer than five minutes in any month.  If your business loses $1,000 per minute when users can't access the system because S3 isn't working, the expected maximum loss is around $4,320 a month (call it $5,000), or roughly $60,000 over a year.  So if the cost of adding additional replication comes in at more than $60,000, there's no financial benefit in implementing a higher level of resiliency.  That is, of course, assuming AWS is as reliable as its SLA says it aims to be.
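As a quick sanity check of those figures, here's a minimal back-of-the-envelope sketch in Python.  The $1,000-per-minute loss is the same hypothetical figure used above, not a real number from any business.

```python
# Back-of-the-envelope downtime cost, assuming a 30-day month,
# a 99.99% availability target and a hypothetical $1,000/minute loss.

MINUTES_PER_MONTH = 30 * 24 * 60     # 43,200 minutes in a 30-day month
availability_target = 0.9999         # S3's stated design target
loss_per_minute = 1_000              # hypothetical business impact ($)

max_downtime_minutes = MINUTES_PER_MONTH * (1 - availability_target)
monthly_loss = max_downtime_minutes * loss_per_minute

print(f"Maximum downtime: {max_downtime_minutes:.2f} minutes/month")
print(f"Worst-case loss:  ${monthly_loss:,.0f}/month, "
      f"${monthly_loss * 12:,.0f}/year")
# Maximum downtime: 4.32 minutes/month
# Worst-case loss:  $4,320/month, $51,840/year
```

Rounding the allowance up to five minutes gives the $5,000-a-month, $60,000-a-year ceiling quoted above.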

Pricing downtime for a lot of SaaS applications is really hard.  Many end users will simply retry the service later, so to them the outage isn't a big deal.  Where monthly subscriptions are being paid, the loss may be hard to work out if users leave the service without providing a reason.  Obviously, sales-based sites that take regular payments will see the impact of downtime more immediately, because they know exactly how much money is taken on a daily and hourly basis.

Of course, the uptime of a SaaS application isn't a purely financial discussion; there's reputational damage to be considered.  If your application is deemed to be unreliable, customers may leave.  However, consider one other aspect: the five minutes of downtime we discussed isn't guaranteed to be a single outage.  Check out Amazon's S3 SLA and you can see that downtime is calculated from the rate of failed service calls in each five-minute interval.  Sensible application design will automatically retry failed service requests, and many transient S3 issues are resolved simply by retrying or delaying user requests.  So even defining and measuring downtime may be pretty difficult.
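To illustrate the retry pattern just described, here's a minimal sketch in Python.  The exception type and the commented-out `fetch_object` call are hypothetical stand-ins; in practice the AWS SDKs provide configurable retry behaviour of their own.

```python
import random
import time

class TransientServiceError(Exception):
    """Stand-in for a retryable failure (e.g. an HTTP 500/503 from S3)."""

def with_retries(operation, max_attempts=5, base_delay=0.5):
    """Retry a flaky operation with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServiceError:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure
            # Back off 0.5s, 1s, 2s, ... plus jitter, so thousands of
            # clients don't all retry in lockstep and re-overload S3.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

# Usage (hypothetical fetch standing in for an S3 GET):
# result = with_retries(lambda: fetch_object("my-bucket", "some/key"))
```

A client built this way can ride out a brief spike in error rates without users ever noticing, which is exactly why "downtime" as experienced may differ from downtime as measured by the SLA.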

The Architect’s View®

It's probably not unreasonable for many companies to have focused on a single AWS region for their S3 data, bearing in mind how infrequently failures of this scale occur.  As a company's revenue and size increase, there will naturally be justification for looking at increased resiliency.  What could be done in this instance?  S3 users could replicate their data between regions, rather than relying on the multiple availability zones within a single region; however, that introduces significant additional cost, which may (or may not) be justified.  The alternative is to look for solutions that solve the problem without increasing the solution cost excessively.  One example could be to use deduplication and a service like StorReduce, which could reduce S3 costs to the point where holding data in two (or more) regions is justified.  More data abstraction is a good idea.  I'm sure we will start to see more solutions come to market that reduce dependencies on a single region, or a single cloud provider.
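For a flavour of what cross-region replication involves, here's a sketch using boto3, the AWS SDK for Python.  The bucket names, account ID and IAM role ARN are all placeholders; both buckets need versioning enabled, the destination bucket must exist in another region, and the role must grant S3 permission to replicate on your behalf.

```python
import boto3

s3 = boto3.client("s3")

# Replication requires versioning on both source and destination buckets.
s3.put_bucket_versioning(
    Bucket="my-source-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every object in the source bucket to a bucket in another region.
s3.put_bucket_replication(
    Bucket="my-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",          # empty prefix = all objects
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
            }
        ],
    },
)
```

Note that replication is not free: you pay for the second copy, the inter-region transfer and the replication requests, which is precisely the cost/risk trade-off discussed above.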

Copyright (c) 2007-2022 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission. Post #134e.