It’s been reported in a few places that Barclays (the UK bank) suffered an issue yesterday with a “disc array” (presumably they mean disk array) that took out their ATM and online banking systems. See the comments here and here.
Allegedly, Barclays now use USP-V arrays as their back-end storage devices, so presumably HDS USP-Vs were involved in yesterday’s problems. Systems seem to have been down for several hours before normal service was resumed.
The first thing to say is that “stuff” happens. Hardware fails; arrays fail, and it’s the same for all vendors. No vendor can claim that their hardware never fails. We all know that RAID is not infallible; in fact, a hardware failure isn’t even necessary to cause a service outage, as many problems are down to human error.
What surprises me about this story is how long Barclays appears to have taken to recover from the original incident. If a storage array supports a number of critical applications, including online banking and ATMs, then surely a high degree of resilience has been built in that caters for more than just simple hardware failures? Surely the data and servers supporting the ATMs and the web are replicated (in real time) with automated clustered failover or similar technology?
We shouldn’t be focusing here on the technology that failed. We should be focusing on the process, design and support of the environment that wasn’t able to manage the hardware failure and “re-route” around the problem.
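To make that concrete, here is a minimal sketch in Python of the kind of automated re-routing you would expect in front of a real-time replicated pair. The names and the health check are entirely hypothetical, an illustration of the principle rather than anything we know about Barclays’ actual environment: a monitor notices the active side has stopped responding and moves I/O to the replica, instead of leaving services down for hours.

```python
# Hypothetical sketch -- names like "usp-v-primary" are illustrative
# stand-ins, not Barclays' or HDS's actual configuration.

class StorageEndpoint:
    """Toy stand-in for a storage array (or the site it serves)."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def is_healthy(self) -> bool:
        # A real monitor would issue a test I/O or a management-API
        # probe here; we just return a flag for the simulation.
        return self.healthy


def route_io(active: StorageEndpoint, standby: StorageEndpoint) -> StorageEndpoint:
    """Send I/O to the active side; fail over to the standby the
    moment the active side stops responding."""
    if active.is_healthy():
        return active
    if standby.is_healthy():
        print(f"FAILOVER: {active.name} is down, re-routing to {standby.name}")
        return standby
    raise RuntimeError("both sides down -- full outage, manual recovery needed")


if __name__ == "__main__":
    primary = StorageEndpoint("usp-v-primary")
    replica = StorageEndpoint("usp-v-replica")

    # Normal operation: I/O goes to the primary array.
    assert route_io(primary, replica) is primary

    # Simulated array failure: traffic moves to the replica within one
    # health-check cycle instead of an hours-long manual recovery.
    primary.healthy = False
    assert route_io(primary, replica) is replica
```

In a real deployment this decision would of course be made by clustering software or array-level high-availability features rather than a hand-rolled loop, but the principle is the same: the failover has to be automatic if downtime is to be measured in seconds rather than hours.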
One other thought: I wonder whether this problem could have been avoided with a bit of Hitachi HAM (High Availability Manager).