The Risk of Shared Service Level Agreements

Wikipedia defines a service-level agreement (SLA) as an “official commitment that prevails between a service provider and a client. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user”. In my experience, an SLA has always been a contractual agreement for services, with penalties or redress if those services aren’t delivered within the agreed terms. Imagine my surprise when, during a Tech Field Day presentation, I saw that IBM was using the term “SLA policy” within their backup software. What exactly is the relationship between service provider and client here?

SLO

I think IBM really meant SLO policy in this instance. A service-level objective sets out goals or targets to be achieved within data protection (and other parts of IT). These are then associated with standard backup terms like RTO (recovery time objective) and RPO (recovery point objective). If my data protection SLO is to take a backup every 4 hours, I know my RPO is going to be (at best) 4 hours. Obviously, there’s a wider range of definitions here that aren’t specifically set by the software. For example, internal terms agreed by the business and IT teams might say that a daily backup must be taken and if this fails, the backup has to be repeated and complete successfully within 12 hours. This is clearly a service-level agreement that IT has to aim to achieve. The impact of failing to deliver this SLA may be financial or simply reputational within the company.
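To make the link between backup frequency and RPO concrete, here is a minimal sketch in Python. The function name and the sample schedule are my own illustration and have nothing to do with IBM's software; the only assumption carried over from the text is a backup every 4 hours.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Return the span of data lost if a failure happens at failure_time.

    The loss is the gap between the failure and the most recent backup
    that completed before it.
    """
    completed = [t for t in backup_times if t <= failure_time]
    if not completed:
        raise ValueError("no backup completed before the failure")
    return failure_time - max(completed)


# Hypothetical schedule: a backup every 4 hours, starting at midnight.
start = datetime(2024, 1, 1, 0, 0)
schedule = [start + timedelta(hours=4 * i) for i in range(6)]

# A failure one minute before the 08:00 backup loses almost the full interval.
print(data_loss_window(schedule, datetime(2024, 1, 1, 7, 59)))
```

With a 4-hour interval the loss window approaches 4 hours in the worst case, which is why the software's SLO puts a floor under the RPO you can promise.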

Public Cloud SLAs

A quick exchange on Twitter this morning made me think that many people assume the only type of SLA available is one that delivers recourse in service credits. I think this is based mainly on experience with public cloud, so the assumption is perfectly reasonable. Look at Amazon’s S3 SLA and you see that, at best, AWS will refund a service credit of 25%, and only when uptime falls below 99%. Uptime is very specifically defined and of course there are the usual force majeure exclusions like “things outside our control”.

However, bear in mind that a service credit from a public cloud provider returns only a fraction of your costs, does nothing to address your losses, and applies only if you continue to use the service. If you’re totally unhappy and move on, you get nothing.
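The gap between a service credit and a real loss is easy to see with some rough arithmetic. The sketch below is illustrative only: the credit tiers, the 30-day month, the monthly bill and the hourly cost of downtime are all assumptions of mine, loosely based on the 25% figure quoted above, and should not be read as the terms of any provider's current SLA.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month

# (uptime threshold in %, fraction of the bill credited).
# Illustrative tiers only; check the provider's actual SLA.
CREDIT_TIERS = [(99.0, 0.25), (99.9, 0.10)]


def uptime_percent(downtime_minutes, total_minutes=MINUTES_PER_MONTH):
    """Monthly uptime as a percentage."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def service_credit(monthly_bill, uptime_pct, tiers=CREDIT_TIERS):
    """Credit owed for the month: the lowest tier the uptime falls under."""
    for threshold, fraction in sorted(tiers):
        if uptime_pct < threshold:
            return monthly_bill * fraction
    return 0.0


# Hypothetical customer: 2,000 a month in fees, 10,000 lost per hour of outage.
outage_hours = 8
uptime = uptime_percent(outage_hours * 60)
credit = service_credit(2000, uptime)
business_loss = outage_hours * 10_000

print(f"uptime {uptime:.2f}%, credit {credit:.0f}, loss {business_loss}")
```

Under these made-up numbers, an 8-hour outage takes uptime to about 98.89%, earns a credit of 500 and costs the business 80,000. The credit is also only worth something if you stay and pay next month's bill.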

Business Risk for the Provider

It’s understandable that public cloud providers don’t offer much in return for service failure. After all, these are highly shared services, so any failure can impact thousands if not millions of customers. If real money were involved, the public cloud provider would be out of business after the first major incident. This is the risk every one of us accepts in using the public cloud, and it should be built into the application design. This situation isn’t going to change any time soon.

Business Risk for the Customer

To date, public cloud has been highly reliable, but if there’s a catastrophe with one provider and your business runs entirely on that cloud, there will be no financial claim to help rebuild it. What if you build a private cloud? Vendors selling hardware or software might be prepared to offer better terms for an SLA. In the storage world, we have the myth of five, six or even seven 9’s availability. We discussed this on a recent Storage Unpacked podcast. A single storage appliance can’t guarantee 99.999% availability, but on average, across the systems deployed by the vendor, this level (or higher) can be achieved. If any single storage array fails, then the vendor can afford to compensate the customer as the cost of compensation is spread across all the other customers with working products.

In this instance, the vendor is self-insuring, by adding a small premium to every array sold that covers the payout in the case of any failure. By the way, I don’t think that vendors literally add a premium into their calculations, but it is there implicitly. They simply have a markup that includes coverage for these kinds of costs. So with a well-worded SLA, you may see more money back if your business runs on dedicated hardware rather than on a shared service. Incidentally, this would also be true if services were delivered on premises by a service provider, because the equipment isn’t being used by multiple parties.
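The fleet-average and self-insurance arguments above can be sketched with a few lines of Python. Every figure here (fleet size, outage length, failure probability, payout) is hypothetical and chosen only to show the shape of the calculation, not to describe any vendor's real numbers.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_pct):
    """Downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)


def fleet_availability(downtime_minutes_per_array):
    """Average availability (%) across a fleet, given each array's yearly downtime."""
    total = len(downtime_minutes_per_array) * MINUTES_PER_YEAR
    return 100.0 * (total - sum(downtime_minutes_per_array)) / total


def implied_premium(failure_probability, payout_per_failure):
    """Expected payout per array sold, i.e. the premium hidden in the markup."""
    return failure_probability * payout_per_failure


# Hypothetical fleet: 1,000 arrays, one of which suffers an 8-hour outage.
fleet = [0] * 999 + [8 * 60]

print(allowed_downtime_minutes(99.999))  # yearly budget at five 9's, in minutes
print(fleet_availability(fleet))         # fleet average
print(fleet_availability([8 * 60]))      # the one customer who was hit
print(implied_premium(1 / 1000, 50_000))
```

Five 9's allows a little over five minutes of downtime a year. In this made-up fleet the average still clears that bar comfortably (roughly 99.9999%), while the one affected customer saw about 99.91%. A 1-in-1,000 chance of a 50,000 payout works out at 50 per array, which is the kind of cost that disappears into the markup.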

The Architect’s View

Whether you get money back or service credits, neither is likely to cover the cost of lost business or of losing your business entirely. Good infrastructure and application design are still needed to build in resiliency. However, SLAs may be a factor in design, such as whether to spread risk across multiple cloud providers or to build a highly resilient on-premises architecture. Either way, please don’t call it an SLA when it’s not.
