So Your AWS-based Application is Down? Don't Blame Amazon

After a busy day in London I returned home to read the news of issues in one of Amazon’s US data centre locations causing problems with EC2 and database (RDS) instances. It seems the services of many Internet companies were affected, including Reddit, Quora, Hootsuite and FourSquare. Is it fair that Amazon should shoulder the blame for the loss of service to the customer, or is there an underlying issue of design here?

First of all, from an availability and resiliency standpoint, it’s worth having a look at Amazon’s definition of regions and availability zones. AWS is currently available in 5 regions: US East (Northern Virginia), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo). Within these regions there are multiple “availability zones” – separate locations which Amazon claim are “engineered” to be insulated from failures in other availability zones, presumably as physically separate data centres with independent networks, power delivery and so on. On the face of it, it seems reasonable to assume that if Amazon claim resiliency within a region by using availability zones, then designing an infrastructure that sits in a single zone should be acceptable. I disagree.

As far as I am aware, Amazon publishes no specific details on how their infrastructure is plugged together and inter-operates across geographic boundaries. It is therefore impossible to understand how availability zones actually work and how they have been engineered to isolate against failure. As we saw yesterday, the whole of the US East region was affected (and at the time of writing still is) regardless of location, making it obvious that availability zone protection isn’t guaranteed in all circumstances.

Design Planning

When organisations design their own data centres, they understand their business requirements and the infrastructure is based on that information, including how and where data centres should be sited. Financial organisations, for instance, are required to site their data centres a certain distance apart for resiliency. Features such as synchronous replication at the array level, high availability and application data replication can all be used to ensure service is not disrupted, because the infrastructure team have (hopefully) engaged with the application owners to understand their specific requirements. If that requirement were (for example) 100% data integrity, then data would need to be synchronously replicated to another location to ensure it could be accessed in a recovery scenario.

Amazon have, with AWS, provided generic infrastructure without publishing specifics on how that infrastructure is delivered. This is fine, as AWS is delivered as a service. However, availability zones are not guaranteed against all failures (merely engineered against them), and it would be foolish to assume any organisation could guarantee against all possible disaster scenarios.

If you are delivering a service using cloud infrastructure, it is your responsibility to determine the level of failure you are prepared to accept. That could mean running services across multiple providers, a subject I discussed 2.5 years ago in this post: http://www.thestoragearchitect.com/2008/12/16/redundant-array-of-inexpensive-clouds-pt-ii/. Although this post was more storage focused, the concepts still apply to application design. If you’re starting a business from scratch, then there’s no excuse these days not to engineer across multiple regions or even multiple providers (in fact, the effort of going multi-region will be comparable to that of going multi-provider). Obviously some applications will be more difficult to implement in a diverse manner than others. However, looking at the four web-based applications I quoted at the top of this article, I expect that all of them have a large degree of read-only traffic and a lot of “write-new” data, with only a small percentage being updates. That being the case, it would be relatively easy to distribute read I/O geographically and to stage writes in the same manner, synchronising data on a periodic basis.
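To make the idea of geographically distributed reads and staged writes a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any of the services mentioned are built: the class names (Region, MultiRegionApp), the region labels and the in-memory dictionaries are all hypothetical stand-ins for real data stores, and it assumes the workload is almost entirely “write-new” data, so no conflict resolution for updates is attempted.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """One deployment location holding its own local copy of the data."""
    name: str
    healthy: bool = True
    store: dict = field(default_factory=dict)   # data readable in this region
    staged: list = field(default_factory=list)  # local writes awaiting sync


class MultiRegionApp:
    """Hypothetical application layer spanning several regions or providers."""

    def __init__(self, region_names):
        self.regions = {name: Region(name) for name in region_names}

    def _pick(self, preferred):
        """Return the preferred region if healthy, otherwise any healthy one."""
        region = self.regions[preferred]
        if region.healthy:
            return region
        for candidate in self.regions.values():
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy region available")

    def read(self, preferred, key):
        """Serve the read locally, failing over if the region is down."""
        return self._pick(preferred).store.get(key)

    def write(self, preferred, key, value):
        """Apply the write locally and stage it for later synchronisation."""
        region = self._pick(preferred)
        region.store[key] = value
        region.staged.append((key, value))
        return region.name

    def synchronise(self):
        """Periodic job: copy staged writes to the other healthy regions.

        Assumption: keys are write-new, so existing keys are never
        overwritten. A region that is down during a sync misses those
        writes; a real design would need a catch-up step on recovery.
        """
        healthy = [r for r in self.regions.values() if r.healthy]
        for source in healthy:
            for key, value in source.staged:
                for target in healthy:
                    if target is not source:
                        target.store.setdefault(key, value)
            source.staged.clear()


# Usage: write in one region, synchronise, then lose that region.
app = MultiRegionApp(["us-east", "eu"])
app.write("us-east", "post:1", "hello")
app.synchronise()
app.regions["us-east"].healthy = False
print(app.read("us-east", "post:1"))  # read is served from the "eu" copy
```

The trade-off is the one described above: anything written since the last synchronisation exists in only one location, so the sync interval is effectively the amount of data you are prepared to lose or have temporarily unavailable.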

The Architect’s View™

Basing your infrastructure in “the cloud” is not a bad thing to do
You must understand your business service requirements and design to them
You must understand the service offering of the cloud provider
Design around availability, resiliency, and therefore mobility, at the application layer
Using multiple providers is a good thing
Don’t let cost savings blind you to a reduction in service quality

There’s one other thing to bear in mind (as the final bullet point above alludes to): US East is also the cheapest location for Amazon services (and, I presume, the largest). The cynic in me wonders if some of the service implementations have been based on cost rather than service level availability, especially where these services are free to the end user.