AWS Outage – There’s always a dependency somewhere

This week, AWS experienced another “wobble” that affected a range of services including the AWS Console and Alexa. The issue in us-east-1 also caused problems for Disney and gaming websites. Unfortunately, these recurring problems are an issue we’re just going to have to get used to if we want the benefits of public cloud.

It’s hard to believe that in 2021, we’re still talking about AWS outages and being surprised by the results. Here’s a post I wrote ten years ago, warning not to be dependent on a single region or provider. My final bullet point observations were:

Basing your infrastructure in “the cloud” is not a bad thing to do
You must understand your business service requirements and design to them
You must understand the service offering of the cloud provider
Design around availability, resiliency, and therefore mobility, at the application layer
Using multiple providers is a good thing
Don’t let cost saving blind you to reducing service quality

Of course, many businesses operate this way today, consuming multiple services and using availability zones to build resiliency into applications. At the same time, many clearly do not, possibly because the final bullet point introduces costs that negate a lot of the benefits used to justify the move to the public cloud in the first place.

Unfortunately, having a policy of multiple regions within AWS wasn’t enough to avoid an issue in this instance. The problems experienced this week are, yet again, with us-east-1, the oldest AWS region. Many services are dependent on this single location, including the AWS console (although it is possible to access the console via other routes).

The problem of us-east-1 highlights several issues for AWS and customers. First, there are clear dependencies on the use of that location for a range of services that should otherwise be fault-tolerant. Customers have unknown dependencies on us-east-1 around which they cannot design services, because AWS owns and manages those components with no transparency.

Second, this represents a challenge for IT organisations building resilient services. How can they build in resiliency when potential single points of failure (SPoFs) aren’t clear to the customer?

Third, we have to question why AWS hasn’t resolved the dependency issues that still exist with us-east-1. AWS itself, it appears, still has technical debt from early services built in that location.

Multi-Cloud

Could a multi-cloud approach have mitigated the AWS issues? It’s hard to say whether businesses running across multiple clouds could have avoided this issue. It is possible to design applications to run cross-vendor, but that would require data mobility, integrity, consistency, and replication to be solved within the boundaries of the SLAs the application must deliver. Building across clouds also requires developers to ensure application code updates are synchronised across multiple platforms at rollout time. Technically achievable, but also technically challenging.

Many businesses may conclude that building out an entire replica of their primary location in a second cloud isn’t justified, given the cost of design and deployment.

There’s Always a SPoF

Finally, there’s the question of the SPoF “elephant in the room”.

Every system has a Single Point of Failure somewhere. Most of the time we just haven’t worked out where it is.

As an example, we can build a resilient application that runs across two public clouds. However, DNS and traffic routing must be located somewhere. If DNS resolution fails, the dual cloud strategy is also down. As another example (which may be relevant to the current AWS outage), if we have resilient systems but a single management portal, then the portal becomes the SPoF. If that portal is a SaaS platform that the business doesn’t control, then the business has an immediate issue.
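To make the DNS and portal examples concrete, here is a minimal sketch of how we might search a dependency map for single points of failure. The component names, the shape of the map and the idea that both clouds rely on one management portal are illustrative assumptions, not a description of any real deployment. The hard part in practice is that the provider-side dependencies are the ones we cannot see, so they never make it into the map.

```python
from typing import Dict, List, Set

# Hypothetical dependency map (every name here is an assumption for illustration).
# Each component lists groups of alternatives. A component is up when it has not
# failed itself and at least one member of every group is up.
DEPENDENCIES: Dict[str, List[List[str]]] = {
    "app": [["cloud_a", "cloud_b"], ["dns"]],  # either cloud will do, DNS is mandatory
    "cloud_a": [["mgmt_portal"]],              # both clouds share one management portal
    "cloud_b": [["mgmt_portal"]],
    "dns": [],
    "mgmt_portal": [],
}


def is_up(component: str, failed: Set[str], deps: Dict[str, List[List[str]]]) -> bool:
    """Return True if the component survives the given set of failures.

    Assumes the dependency map has no cycles.
    """
    if component in failed:
        return False
    return all(
        any(is_up(alternative, failed, deps) for alternative in group)
        for group in deps.get(component, [])
    )


def single_points_of_failure(root: str, deps: Dict[str, List[List[str]]]) -> List[str]:
    """List every component whose failure on its own takes the root down."""
    components = set(deps)
    for groups in deps.values():
        for group in groups:
            components.update(group)
    components.discard(root)
    return sorted(c for c in components if not is_up(root, {c}, deps))


if __name__ == "__main__":
    print(single_points_of_failure("app", DEPENDENCIES))
```

In this toy map, losing either cloud leaves the application running, while losing DNS or the shared portal does not. The check is only as good as the map we feed it.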

The Architect’s View™

We can continue to write opinion pieces and column inches on the problem of resiliency and cloud. However, the challenges of running IT infrastructure boil down to a few simple truths, whether they’re based in the public cloud or not.

All systems have a single point of failure somewhere. Many are subtle, complex and not always obvious.
Designing for failure will eliminate 99.9999% of problems but cannot guarantee 100% availability.
Businesses need to balance risk against cost. A small outage may be more acceptable than spending millions of dollars extra each year – or not. The business must decide, based on advice from IT.

The last point is an interesting one. Ten years ago, we might have suggested that a business needing guaranteed uptime should build its own infrastructure. However, the public cloud has proved surprisingly reliable. Cloud outages are rare, but generally have huge impact in terms of the breadth of businesses affected. A shared service level model means compensation won’t be comparable to any loss experienced. As a result, perhaps IT organisations should be doing more analysis of single points of failure and then taking a business decision on which services should be in-house or self-managed, compared to those in the public cloud. Even then, there is no guarantee that in-house managed services will outperform the public cloud.
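The risk-versus-cost decision can be roughed out with some simple availability arithmetic. The sketch below is one way to do it, and every figure in it (availability percentages, cost per minute of outage, extra annual spend) is a made-up assumption. It also assumes components fail independently, which is precisely what a hidden shared dependency breaks, so we should treat the results as optimistic.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def series(*availabilities: float) -> float:
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component is enough, so unavailabilities multiply.

    Assumes failures are independent (an optimistic assumption).
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of outage per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def annual_saving(baseline: float, improved: float, cost_per_minute: float) -> float:
    """Expected outage cost avoided each year by moving to the improved design."""
    avoided_minutes = downtime_minutes_per_year(baseline) - downtime_minutes_per_year(improved)
    return avoided_minutes * cost_per_minute


if __name__ == "__main__":
    cloud, dns = 0.999, 0.9999          # hypothetical availabilities
    cost_per_minute = 1_000.0           # hypothetical cost of an outage minute
    extra_annual_spend = 2_000_000.0    # hypothetical cost of running a second cloud

    single = series(cloud, dns)
    dual = series(parallel(cloud, cloud), dns)
    saving = annual_saving(single, dual, cost_per_minute)

    print(f"single cloud: {downtime_minutes_per_year(single):.0f} minutes down per year")
    print(f"dual cloud:   {downtime_minutes_per_year(dual):.0f} minutes down per year")
    print(f"second cloud pays for itself: {saving > extra_annual_spend}")
```

With these invented inputs, the second cloud avoids roughly half a million in outage cost per year, well short of the extra spend, and the remaining downtime is dominated by the single DNS dependency. Different inputs give a different answer, which is why the business has to make the call.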

There’s no easy answer here, other than to point out what we said at the very beginning of this post – there’s always a dependency somewhere.