Instapaper (the SaaS-based solution for keeping track of bookmarks and interesting websites) suffered a catastrophic outage to their service last month. I’ve only just got round to reading the post-mortem account of the incident and it doesn’t make pretty reading. First of all, it’s worth saying that we’ve all been there and experienced that awful prickly heat feeling of realising either an application is down, or worse, that you just took it down with a fat-fingered command (à la AWS). So I have some sympathy here; however, there are also lots of lessons to learn.
If you want the background to the problem, check out the post by Brian Donohue, ex-Instapaper CEO and now a product engineer at Pinterest following its 2016 acquisition of Instapaper. From the details in the post, it appears that Instapaper’s entire system is based on one large MySQL database run on AWS’s RDS service. The problem was caused by the database hitting the 2TB file size limit that is in place on older MySQL instances running on RDS. What seems strange is that the limit was reached with no apparent warnings until the hard failure occurred.
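The kind of early warning that was missing can be approximated with a very simple headroom check. What follows is a minimal sketch rather than anything Instapaper or AWS actually runs: the function name, the 80% threshold and the sample sizes are all made up for illustration, and the limit is assumed to be 2 TiB (the post only says "2TB"). In practice the per-table sizes might come from MySQL's `information_schema.TABLES` (`data_length` plus `index_length`), although that only approximates the size of the file on disk.

```python
from typing import Dict, List, Tuple

# Assumption: the "2TB" limit is treated as 2 TiB for this sketch.
FILE_SIZE_LIMIT_BYTES = 2 * 1024**4


def tables_near_limit(
    table_sizes: Dict[str, int],
    limit_bytes: int = FILE_SIZE_LIMIT_BYTES,
    warn_fraction: float = 0.8,
) -> List[Tuple[str, float]]:
    """Return (table, fraction of limit used) for every table at or above
    warn_fraction of the limit, largest first."""
    flagged = [
        (name, size / limit_bytes)
        for name, size in table_sizes.items()
        if size / limit_bytes >= warn_fraction
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical table sizes in bytes, not real Instapaper figures.
    sizes = {"bookmarks": int(1.9 * 1024**4), "users": 50 * 1024**3}
    for name, used in tables_near_limit(sizes):
        print(f"WARNING: {name} is at {used:.0%} of the file size limit")
```

Run on a schedule and wired into whatever alerting is already in place, a check like this turns a hard architectural limit into something that is seen coming months in advance.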
Instapaper didn’t have an adequate disaster recovery plan in place, as the backup regime was based on using file system snapshots. We all know that snapshots are a “lightweight” backup and not a robust solution for recovering from a disaster. Without an adequate plan or proper backups, there was no way to work out how long a database rebuild would take. In the end, and only with the help of AWS engineers, the entire database was copied to a new ext4 file system that didn’t have the 2TB file limit.
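Even without a tested recovery plan, a back-of-envelope figure for rebuild time is just data size divided by sustained copy throughput. The sketch below assumes exactly that simple model; the function name and the 50 MiB/s throughput are hypothetical, not numbers from the post-mortem, and a real estimate should come from timing an actual restore.

```python
def estimated_restore_hours(data_bytes: int, throughput_bytes_per_sec: float) -> float:
    """Rough restore time in hours: size divided by sustained throughput.

    Ignores index rebuilds, replication catch-up and verification, so treat
    the result as a lower bound rather than a prediction.
    """
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return data_bytes / throughput_bytes_per_sec / 3600


if __name__ == "__main__":
    # Hypothetical inputs: a 2 TiB database copied at a sustained 50 MiB/s.
    hours = estimated_restore_hours(2 * 1024**4, 50 * 1024**2)
    print(f"Estimated restore time: {hours:.1f} hours")
```

With those made-up inputs the answer is a little under 12 hours, which is the sort of number that needs to be known before the outage rather than discovered during it.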
I find it really scary that applications like Instapaper can be operated without any data management processes in place. It certainly doesn’t help when the “managed” database service doesn’t provide feedback or warnings when getting close to architectural limits. However, data management is still the responsibility of the data owner, even with a managed service. Reading the RDS product description, it’s not clear whether the DB snapshots are stored on the same infrastructure as the primary data, or what levels of data protection are in place. Assuming the RDS instances are stored on standard EBS volumes, then the snapshots are thin and would share the same physical storage as the primary. EBS volumes have a high level of availability and durability, but aren’t 100% guaranteed. So there’s always a risk of losing the volume completely.
The Architect’s View™
Even with public cloud and managed services, you can’t avoid good old-fashioned data management processes. Remember that “cloud” is just somebody else’s computer running pretty much the same stuff as you would in your own data centre. That means all the standard processes for data protection still need to be done – don’t assume the cloud provider is doing them all on your behalf.
- Instapaper Outage Cause and Recovery (Medium, retrieved 26 March 2017)
Copyright (c) 2009-2017 – Post #b39a – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.