I can only feel sympathy for the team at GitHub, after they suffered an outage recently, due to issues with database replication. The problem was caused by a mere 43-second break in network connectivity. We’ve all been there – keeping replicated data in sync is a thankless task and one that can be a nightmare to unpick when it goes wrong.
In the early nineties, I worked for a large rail operator (not the UK, thankfully). We experienced an issue where a database re-organisation job overlapped with a storage cleanup task, resulting in a hidden corruption that wasn’t exposed for 30 days. When it did appear, the system had a mix of the current database state and that from 30 days previously, resulting in tens of thousands of duplicate train reservations. The problem took weeks to diagnose and resolve, with a team from IBM and me working forensically through multiple logs and timelines to work out the root cause of the problem.
In the end, we determined the issue was a software bug, where a storage cleanup task took an incorrect decision to re-catalog an old file, leading to the re-import of old data. The software was changed to provide the user with a choice of re-cataloging or reporting an error. Thankfully, I didn’t have to unpick the database duplications or deal with thousands of confused or irate customers.
One thing that is apparent from the GitHub issue is the problem of complexity. When solutions grow organically, it’s easy to simply tack another component on here or there, without thinking through the implications of making that change. What started out as simple can easily become complex and no longer fit for purpose.
Keeping data consistent within a rapidly expanding ecosystem is a problem. Application-based replication gives transactional consistency but at the expense of complexity and a need to have very accurate documentation. Imagine operating thousands of SQL Server instances and having to know the individual relationships between databases and database servers in the event one gets out of sync or crashes.
Another alternative is to look at storage-based protection in the form of synchronous replication. We’ve talked about sync replication on recent podcasts and I get the feeling it’s fallen out of favour due to the use of virtualisation and, of course, the cost. Synchronous replication requires relatively close duplicate storage, with good networking. It’s possible to mitigate some of the array cost issues; however, if good performance is needed, you can’t cut costs on the network or extend the distance between systems too far. Alternative solutions include having closely coupled systems and a third async replica, or using a technology like Axxana.
- #39 – Garbage Collection: Storage Mythbusters Part I
- Soundbytes #009: FlashArray Update with Ivan Iannaccone at Pure Accelerate
- INFINIDAT InfiniSync – Infinite Sync Replication
Synchronous replication allows all related server volumes or datastores to be replicated consistently but immediately starts putting restrictions on design. However, these technologies do work, which is why so many large enterprises are slow to move away from them.
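To put a rough number on the distance restriction, here is a minimal back-of-envelope sketch. It assumes light travels through optical fibre at roughly 200,000 km/s (about 5 microseconds per kilometre, one way) and that a synchronous write has to wait for at least one full round trip before it is acknowledged. The function name and the `round_trips` parameter are purely illustrative, and the figures ignore switching, protocol overhead and array processing time, so treat the output as a lower bound rather than a prediction.

```python
# Rough lower bound on the latency that distance adds to a synchronous write.
# Assumption: ~5 microseconds per kilometre of fibre, one way.
FIBRE_US_PER_KM = 5.0


def added_write_latency_ms(distance_km: float, round_trips: int = 1) -> float:
    """Propagation delay added to each acknowledged write, in milliseconds.

    round_trips is a hypothetical knob: some replication protocols need
    more than one exchange between sites before a write is acknowledged.
    """
    one_way_us = distance_km * FIBRE_US_PER_KM
    return 2 * one_way_us * round_trips / 1000.0


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(f"{km:>5} km: +{added_write_latency_ms(km):.2f} ms per write")
```

Even on these optimistic assumptions, every extra 100 km of fibre adds around a millisecond to each write, which is significant next to the sub-millisecond response times of modern flash storage.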
It may sound controversial, but I wonder how many enterprises like GitHub have robust processes around platform architecture changes. IT organisations need people who can assess the impact of making changes and determine what else will break as a consequence. However, there is a limit to how far humans can go in understanding and quantifying the risk of making architectural changes to applications and infrastructure. One solution could be to use digital twins. The technology comes more from the IoT world; however, companies like Hitachi Vantara are looking to bring digital twin technology to the data centre.
The idea is to create a digital version of infrastructure and applications that can be stress tested and have “what if” scenarios applied to them. This could enable testing of scenarios like component failure or unexpected increases in traffic. Increasingly, as we build more diverse architectures with data at the core, edge and in the public cloud, it will be harder to grasp the implications of component failure and these technologies will be essential.
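As a toy illustration of a “what if” test, the sketch below models a small, entirely hypothetical estate as a dependency graph and asks which components are affected when one of them fails. The component names and the `impacted_by` helper are invented for the example. A real digital twin would also need to model redundancy, capacity and timing, so this is only the simplest form of the idea, not how any vendor implements it.

```python
from __future__ import annotations

from collections import deque

# Hypothetical topology: each component maps to the things that depend on it.
DEPENDENTS = {
    "network-link": ["db-primary", "db-replica"],
    "db-primary": ["orders-api"],
    "db-replica": ["reporting"],
    "orders-api": ["web-frontend"],
    "reporting": [],
    "web-frontend": [],
}


def impacted_by(failed: str, dependents: dict[str, list[str]]) -> set[str]:
    """Return every component downstream of a failed one (breadth-first walk)."""
    seen: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    for component in ("db-replica", "db-primary", "network-link"):
        print(component, "->", sorted(impacted_by(component, DEPENDENTS)))
```

Even in a model this small, a failure of the shared network link reaches every other component, which is the kind of consequence that becomes hard to hold in your head once an estate has thousands of relationships.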
Digital Twin technology gets discussed briefly by Bob Madaio in the Innovation presentation from Hitachi NEXT 2018 (scroll to the bottom of the page for the video).
The Architect’s View
There are always going to be unexpected consequences from corner-case failure scenarios and, as my good friend Howard Marks says in this recent post, we should be open and talk about potential problems in a constructive manner. No man (or woman) is an island, whether on social media or within an architecture team. Governance should give us the ability to question and ensure that as IT infrastructure grows, we’re not building in failure in the future.
Comments are always welcome; please read our Comments Policy. If you have any related links of interest, please feel free to add them as a comment for consideration.
Copyright (c) 2007-2019 – Post #7643 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.