StorPool Review - Part 5 - Resiliency and Recovery

This is the fifth in a series of posts looking at the StorPool software-defined storage platform. In this post, we will look at failure modes and data recovery.

Background

Persistent storage forms the basis for ensuring the long-term retention of data in the enterprise. Before the development and formalisation of RAID in 1987, enterprise customers (mainly mainframe-based) had to rely on data recovery from backups in the event of a hardware failure. Modern IT systems offer highly resilient storage using RAID or other data redundancy techniques to ensure close to 100% uptime, with little or no need to recover data from backups after hardware failures.

Are RAID Rebuild Times Still Relevant?

Keeping systems running after a system or component failure is only one aspect of data resiliency. Although modern disk drives and SSDs are incredibly reliable, both types of devices can produce intermittent errors, failing to read disk sectors or producing unrecoverable bit errors (UBE), where a read request fails.

Media Resilience

Modern hard drives offer reliability ratings based on AFR (Annualised Failure Rate), typically around 0.44%. This means that in a set of 1000 unprotected drives, 4-5 devices will fail each year. This is, of course, an average, and some customers will see greater or lower failure rates in their infrastructure. Unrecoverable bit error rates (UBERs) are typically around one error per 10^15 bits read. Since 10^15 bits is 125TB, that equates to roughly one failed sector read in every 62 full reads of a 2TB HDD. The UBER risk may seem small but isn’t evenly distributed across drives and sectors, so some drives may see many more errors than others and in a much shorter time frame.

Solid-state disks offer similar levels of AFR to hard drives. UBERs are generally much better, at around one error per 10^17 bits read, although SSDs have limited write endurance compared to HDDs.
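The AFR and UBER figures above are easy to sanity-check with a few lines of arithmetic. The following is a minimal sketch using the round numbers quoted in this section; the function names are illustrative only. Drive datasheets normally quote unrecoverable error rates per bits read, so the sketch shows the per-bit and per-byte readings of the 10^15 figure side by side, since the two differ by a factor of eight.

```python
# Back-of-envelope media reliability figures. The inputs are the round
# numbers quoted in the text; the helper names are illustrative only.

TB = 10**12  # decimal terabyte, as printed on drive labels


def expected_annual_failures(drive_count: int, afr: float) -> float:
    """Average number of drive failures per year across a fleet."""
    return drive_count * afr


def full_reads_per_error(units_per_error: float, bytes_per_unit: float,
                         capacity_bytes: float) -> float:
    """Whole-drive reads expected, on average, per unrecoverable read error."""
    bytes_per_error = units_per_error * bytes_per_unit
    return bytes_per_error / capacity_bytes


fleet_failures = expected_annual_failures(1000, 0.0044)

# 10^15 read as bits (the usual datasheet convention) and as bytes.
hdd_reads_if_bits = full_reads_per_error(1e15, 1 / 8, 2 * TB)
hdd_reads_if_bytes = full_reads_per_error(1e15, 1, 2 * TB)

# The SSD figure of 10^17, read as bits, for a drive of the same capacity.
ssd_reads_if_bits = full_reads_per_error(1e17, 1 / 8, 2 * TB)

print(f"Expected failures per year in 1000 drives: {fleet_failures:.1f}")
print(f"2TB HDD, 1 error per 10^15 bits:  {hdd_reads_if_bits:.1f} full reads")
print(f"2TB HDD, 1 error per 10^15 bytes: {hdd_reads_if_bytes:.1f} full reads")
print(f"2TB SSD, 1 error per 10^17 bits:  {ssd_reads_if_bits:.1f} full reads")
```

These are fleet-wide averages. As noted above, errors are not evenly distributed, so an individual drive can fare much better or much worse than the arithmetic suggests.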

In addition to device/component failures and media read errors, a third recovery scenario occurs with distributed storage solutions such as StorPool. In a distributed architecture, nodes communicate over the network to replicate and share data. If a node drops out of the system for a short time, either for planned work or due to a network or server error, any changed or new data is immediately out of synchronisation. Distributed systems must re-establish consistency without compromising data integrity. This scenario can also happen when individual drives are unexpectedly removed, either in error or due to a systems fault.

Data Management Processes

From the challenges already described in managing media and server nodes, we can summarise the expected tasks that need to be performed in a distributed storage solution.

Data resiliency – implementation of RAID or data mirroring.
Data recovery – automated recovery from device failure using redundant data.
Data consistency – recovering from node failure.
Data integrity – validating written data is correctly stored on persistent media.

We will look at each of these four requirements and show how StorPool addresses each of them.

Data Resiliency

The standard process for protecting data in modern storage systems is to use data redundancy, either through a RAID architecture or through data mirroring. In the first post of this series, we looked at data resiliency and placement across placement groups and disk sets.

This placement process ensures that logical volumes use all available storage performance (a process called wide striping) while maintaining resiliency. Data is distributed across nodes, so any single media or node failure will not result in data loss.
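To make the idea concrete, here is a minimal sketch of one way to spread mirrored chunks across every drive while keeping each copy of a chunk on a different node. It is a simple round-robin scheme written for illustration; it is not StorPool's placement algorithm, and the names (`place_chunks`, `drives_by_node`) are hypothetical.

```python
from collections import Counter


def place_chunks(chunk_count, drives_by_node, replicas):
    """Round-robin placement: each chunk's copies land on distinct nodes."""
    nodes = sorted(drives_by_node)
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = []
    for chunk in range(chunk_count):
        copies = []
        for r in range(replicas):
            # Consecutive copies go to consecutive nodes, so no node
            # ever holds two copies of the same chunk.
            node = nodes[(chunk + r) % len(nodes)]
            drives = drives_by_node[node]
            # Rotate through the node's drives so all of them take load.
            drive = drives[(chunk // len(nodes)) % len(drives)]
            copies.append((node, drive))
        placement.append(copies)
    return placement


# Hypothetical three-node cluster with two drives per node.
cluster = {
    "node1": ["d11", "d12"],
    "node2": ["d21", "d22"],
    "node3": ["d31", "d32"],
}
layout = place_chunks(12, cluster, replicas=2)

usage = Counter(drive for copies in layout for _, drive in copies)
print("copies per drive:", dict(usage))

# Losing any single node still leaves at least one copy of every chunk.
for failed in cluster:
    assert all(any(node != failed for node, _ in copies) for copies in layout)
```

Every drive ends up holding a share of the volume, which is what lets reads, writes, and rebuilds draw on the performance of the whole cluster.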

However, as clusters grow and become more complex, individual workloads may need additional protection, whereas some temporary workloads could run with no data protection in place. StorPool provides all these capabilities and allows changes to be made dynamically using Placement Groups and Templates.

It’s also possible to drop physical drives from logical volumes. This ability could be needed where a drive is known to be failing and exhibiting errors. This drive can be removed from critical or highly active volumes, while redundancy for that data is rebuilt elsewhere.

In the following screenshots, we reduce redundancy on a test volume, fiovol-hybrid, which has three mirrors. The drive currently stores around 51GB, which is moved to be distributed across a single NVMe drive and an HDD. We then re-establish the 3-way resiliency by increasing the replication count. All of the rebuild work completes within a few minutes.

fiovol-hybrid replication dropped to 2 mirrors, 281GB of data to be synced.

replication increased to 3 mirrors – protection now degraded and in recovery – data also being moved

fiovol-hybrid now returned to 3 mirror protection

Data Recovery

Physical media will eventually fail or exhibit enough errors to justify replacement. Although “hard” failures can’t be avoided, a controlled media replacement process is preferred for several reasons. First, the extent of a problem can be understood and the relative impact on performance managed. Second, the recovery process can be scheduled to have the least amount of impact on production workloads. In some scenarios, for example, it may be desirable to rebuild data outside of a busy production window, so either recovery occurs more quickly, or performance is not affected (or both).

StorPool provides the capability to “soft” eject drive media, with the same impact as if a drive had been physically removed or had simply failed in place. At this point, logical volumes will be in a “degraded” state where redundancy is below the required level. The StorPool administrator can then rebalance workloads across the remaining drives to restore redundancy to previous levels.

In this demonstration video, we “soft-fail” a single hard drive on one of three nodes in a StorPool complex. The failure is assumed to represent the loss of a drive, which then gets replaced. Two processes are followed. The first rebalancing restores redundancy for all volumes in a degraded state. The second rebalancing re-distributes data once a new drive is in place. The rebalancer process is a manual task that examines a cluster and makes recommendations on how to efficiently rebalance workloads. As demonstrated in the two videos in this post, the recommendations of the rebalancer task are committed by the administrator, essentially to validate that a correct assessment has been made and any changes won’t result in excessive data movement.

The wide striping process and data layout ensure that many drives are involved in the rebuild process, as both sources and targets. This shortens the recovery time and reduces the risk of a second failure occurring while the rebuild is in progress. Drives are not paired or placed in RAID groups, which provides much greater flexibility in recovery while optimising the use of storage capacity.
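A crude model shows why spreading a rebuild over many drives matters. The sketch below assumes the rebuild is limited only by how fast the target drives can absorb writes, and that each drive contributes a fixed rebuild bandwidth. The 4TB and 100MB/s inputs are illustrative assumptions, not measurements from the StorPool test cluster.

```python
def rebuild_hours(data_tb: float, per_drive_mb_s: float,
                  target_drives: int) -> float:
    """Hours to re-create data_tb of redundancy across target_drives."""
    total_mb = data_tb * 1_000_000  # decimal units: 1TB = 1,000,000MB
    aggregate_mb_s = per_drive_mb_s * target_drives
    return total_mb / aggregate_mb_s / 3600


# Illustrative inputs (assumptions): 4TB to rebuild, 100MB/s of rebuild
# bandwidth per drive.
single_target = rebuild_hours(4, 100, target_drives=1)   # one spare drive
wide_striped = rebuild_hours(4, 100, target_drives=20)   # twenty targets

print(f"single target drive: {single_target:.1f} hours")
print(f"twenty target drives: {wide_striped:.2f} hours")
```

In this model the rebuild window shrinks in proportion to the number of participating drives. Real systems will see less than linear scaling because of network limits and competing production I/O, but the direction of the effect is the same.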

Note: the manual rebalancing process does require administrator intervention to recover from failing media. However, the same process also allows storage in a cluster to be expanded or replaced with no downtime. Operationally, an entire cluster could be replaced without disturbing running applications or impacting normal operations.

Data Consistency

In distributed systems, data occasionally becomes inconsistent across a cluster. This can occur when media temporarily fails or when a server/node becomes unavailable, whether through a network failure or simply a server reboot. Distributed storage systems must provide the capability to validate and ensure data is consistent across all mirrors and all servers.

StorPool uses internal metadata to track the consistency of data across a cluster. In the event of an outage, stale data can be rebuilt to put a cluster back into a consistent state without a full rebuild. This last point is extremely important. Transient errors are a fact of life in complex storage systems, so a full rebuild after every one would be impractical. StorPool provides the capability to validate and rebuild only the stale data across a device or cluster after an outage.
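The general technique behind this kind of partial recovery can be sketched in a few lines. The toy model below records which blocks an offline mirror missed and copies only those when it returns. It illustrates the idea only: the article does not describe StorPool's internal metadata format, and the class and method names here are hypothetical.

```python
class MirroredVolume:
    """Toy mirrored volume that tracks the blocks an offline replica missed."""

    def __init__(self, replica_names, block_count):
        self.replicas = {name: [b""] * block_count for name in replica_names}
        self.online = set(replica_names)
        self.stale = {name: set() for name in replica_names}

    def write(self, block, data):
        for name, blocks in self.replicas.items():
            if name in self.online:
                blocks[block] = data
            else:
                # Metadata only: remember what this replica has missed.
                self.stale[name].add(block)

    def eject(self, name):
        self.online.discard(name)

    def rejoin(self, name):
        """Copy only the missed blocks; return how many were copied."""
        # Any online replica is current, so it can act as the source.
        source = next(n for n in self.replicas if n in self.online)
        missed = self.stale[name]
        for block in missed:
            self.replicas[name][block] = self.replicas[source][block]
        copied = len(missed)
        self.stale[name] = set()
        self.online.add(name)
        return copied


vol = MirroredVolume(["nvme-a", "nvme-b", "hdd-c"], block_count=1000)
for i in range(1000):
    vol.write(i, b"v1")

vol.eject("hdd-c")              # drive drops out of the cluster
for i in range(25):
    vol.write(i, b"v2")         # writes continue on the remaining mirrors

copied = vol.rejoin("hdd-c")
print(f"blocks copied on rejoin: {copied} of 1000")
```

Only the 25 blocks written during the outage are copied when the drive returns, so the cost of recovery tracks the amount of change rather than the size of the drive.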

In the following video, we test this process by ejecting a single HDD which is storing data for an active test volume receiving a large amount of write I/O. In this scenario, the data on the ejected disk becomes increasingly out of date as new data is written to the logical volume. The recovery process reinstates the drive, then ensures only the changed data is replicated to restore redundancy.

One important aspect to note in this demonstration is the rate at which the recovery occurred. The rebuild of data continued as a background task that didn’t affect the front-end performance. The rebuild also continued to occur, even though the primary devices were NVMe drives (with higher performance), while the failed drive was a hard disk (with lower random performance).

Data Integrity

Both SSDs and HDDs can generate transient errors and unrecoverable read errors. StorPool provides several techniques to mitigate these risks. As we’ve already discussed, logical data is mirrored across multiple drives, with the capability to dynamically increase or decrease this protection per volume.

Data scrubbing is implemented as a background task, running periodically to validate the content of data on disk and highlight potential drive failures. All data is written to disk with sequence numbers and 64-bit checksums that provide integrity checking capabilities. With these processes, inconsistent data can be restored from other replicas, while data writes are only confirmed if the media provides the correct checksum data.
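As an illustration of how sequence numbers and checksums support scrubbing, the sketch below stores each block with both, then repairs any copy that fails verification from the newest valid replica. The article does not name StorPool's checksum algorithm or on-disk layout, so the use of BLAKE2b truncated to 64 bits and the `StoredBlock` structure are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


def checksum64(payload: bytes) -> bytes:
    """64-bit digest; BLAKE2b is a stand-in for the real algorithm."""
    return hashlib.blake2b(payload, digest_size=8).digest()


@dataclass
class StoredBlock:
    sequence: int
    data: bytes
    checksum: bytes


def store(sequence: int, data: bytes) -> StoredBlock:
    # The checksum covers the sequence number as well as the data.
    return StoredBlock(sequence, data,
                       checksum64(sequence.to_bytes(8, "big") + data))


def is_valid(block: StoredBlock) -> bool:
    expected = checksum64(block.sequence.to_bytes(8, "big") + block.data)
    return block.checksum == expected


def scrub(replicas: list) -> list:
    """Repair bad or out-of-date copies in place; return repaired indexes."""
    good = [b for b in replicas if is_valid(b)]
    if not good:
        raise RuntimeError("no valid replica left to repair from")
    newest = max(good, key=lambda b: b.sequence)
    repaired = []
    for i, block in enumerate(replicas):
        if not is_valid(block) or block.sequence < newest.sequence:
            replicas[i] = StoredBlock(newest.sequence, newest.data,
                                      newest.checksum)
            repaired.append(i)
    return repaired


copies = [store(7, b"payload") for _ in range(3)]
copies[1].data = b"pay1oad"      # simulate silent corruption on one drive

repaired = scrub(copies)
print("repaired replica indexes:", repaired)
```

The checksum detects the corrupted copy, and the sequence number lets the scrubber tell a current replica from one that is valid but out of date.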

The Architect’s View™

Maintaining data integrity is the ultimate goal of all storage systems. Although modern media is highly reliable, complex storage systems need processes and tools to maintain 100% data integrity and availability. StorPool has implemented a range of architectural tools and processes to deliver data consistency.

Two important factors in these processes are:

Observability – the StorPool platform provides detailed information on the impact of a media or node failure, including reduced redundancy and the status of rebuilds.
Control – administrators have the ability to control the recovery processes, ensuring minimal impact to normal operations.

Where many modern storage systems look to hide the degree of control available to administrators, StorPool has chosen to make recovery the responsibility of system administration. While small-scale storage appliances are ideally suited to automated recovery, multi-petabyte systems represent a different case entirely and justify the need for more manual control of recovery processes.

This work has been made possible through sponsorship from StorPool.