Making the Case for SAN 2.0

Chris Evans | All-Flash Storage, Cloud, Data Management, Enterprise, Storage, Storage Hardware

Pretty much everything in IT works in cycles or moves through a pendulum effect.  Over the years we’ve seen major trends in centralisation and de-centralisation as companies moved away from mainframes and deployed departmental systems.  We’ve seen outsourcing (or systems management) come and go, with a big trend for using public cloud in the current market.

The same paradigm applies to storage and the move to and from centralisation using storage area networks or SANs.  From the early 2000s, centralised storage in the form of Fibre Channel and iSCSI SANs were hugely popular.  In recent years, IT organisations have moved away from centralisation to a more distributed model, as micro-services and open source computing have increased in popularity.

However, there is a case to be made for the re-emergence of SAN technology, albeit in an evolved form that meets the requirements of modern application deployments.

The Evolution of SAN

Storage Area Networks became hugely popular in the late 1990s and early 2000s as IT organisations started to deploy smaller and more agile servers into the data centre.  This wasn’t the only adoption model, but in general the trend towards centralised storage was driven by a number of factors, including the spread of storage across hundreds, if not thousands, of individual servers.  Managing server farms with independent disks was a big headache, so introducing a SAN resolved many issues and brought the following benefits:

  • Consolidation.  Persistent storage (typically everything except boot volumes) was now in one place.  As HDDs grew in capacity, it was easy to waste storage resources that were effectively stranded in each server.  Combining storage into a single hardware appliance pooled that capacity and reduced waste.
  • Performance.  Consolidation had a secondary benefit in that data from one server could be spread across multiple HDD spindles.  In the days when a single drive could deliver perhaps 200 random IOPS, distributing I/O across many devices had a significant performance benefit (a rough worked example follows this list).  It’s also worth remembering that shared storage improves performance by using DRAM (and now NAND) for I/O read and write caching.
  • Resiliency.  SANs provide the ability to improve resiliency for applications.  Most obviously this is through the use of RAID for data protection, but also by having multiple storage controllers that mitigate hardware outages.  In many cases, the array’s five-nines availability was greater than that offered by the application server.
  • Availability.  Extending these resiliency benefits, SAN storage offers the capability to do in-place upgrades and to replace failed components (drives, controllers) with minimal or no impact to the application.  In clustered environments, the SAN presents a single image of the data, without complex server-to-server replication.  This capability has been exploited particularly effectively by server virtualisation.
  • Management.  Probably one of the greatest benefits has been in reducing management overhead.  Servicing thousands of servers takes a lot of time and effort, and carries significant risk.  It’s easy to remove the wrong failed disk from a server, or even choose the wrong server altogether.  With hot-spare disks shared across multiple servers, disk replacements can be batched weekly or even monthly, reducing the risk of engineers working in the data centre.
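
As a rough worked example of the performance point above, here is a minimal sketch of how striping and controller caching multiply the IOPS available to a single host.  All of the figures are assumptions for illustration, not measurements.

```python
# Rough illustration only; all figures are assumptions, not measurements.
HDD_RANDOM_IOPS = 200    # random IOPS from a single spindle of that era
SPINDLES = 24            # drives the host's data is striped across
CACHE_HIT_RATE = 0.30    # assumed fraction of I/O served from controller DRAM

backend_iops = HDD_RANDOM_IOPS * SPINDLES
# If cache hits are effectively free, total IOPS T satisfies T * (1 - hit_rate) = backend_iops.
effective_iops = backend_iops / (1 - CACHE_HIT_RATE)

print(f"Single drive: {HDD_RANDOM_IOPS} IOPS")
print(f"Striped across {SPINDLES} spindles: {backend_iops} IOPS")
print(f"With {CACHE_HIT_RATE:.0%} cache hits: ~{effective_iops:.0f} IOPS")
```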

The Success of SAN

It’s no surprise that SANs were rapidly adopted, used both for single application servers running Linux and Windows and for larger mid-range servers that ran proprietary operating systems like Solaris, HP-UX and AIX.

Fibre Channel switches and optical cabling overcame the physical restrictions of putting storage close to the server and provided the capability to implement remote replication.  In EMEA, synchronous replication was (and still is) popular and used in metro or short-range scenarios to seamlessly protect data between data centres.

Storage appliances themselves evolved.  Tiering introduced the ability to optimise storage performance and cost.  Data could be automatically moved between cheap and expensive storage depending on access patterns.  Data protection was implemented using snapshots and replication.  This also provided the capability to take secondary data copies for backup or test/development work.
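
To make the tiering behaviour concrete, here is a minimal, hypothetical sketch of an access-frequency policy.  The thresholds, names and data model are invented for illustration; real arrays use far more sophisticated heuristics driven by telemetry.

```python
from dataclasses import dataclass

@dataclass
class Extent:
    """A slice of a volume tracked for tiering (hypothetical model)."""
    extent_id: int
    tier: str               # "flash" or "nearline"
    accesses_last_day: int

# Illustrative thresholds; real systems tune these continuously.
PROMOTE_THRESHOLD = 1000    # hot: move up to flash
DEMOTE_THRESHOLD = 10       # cold: move down to cheap capacity

def plan_moves(extents):
    """Return (extent, target_tier) pairs for the next tiering cycle."""
    moves = []
    for e in extents:
        if e.tier == "nearline" and e.accesses_last_day >= PROMOTE_THRESHOLD:
            moves.append((e, "flash"))
        elif e.tier == "flash" and e.accesses_last_day <= DEMOTE_THRESHOLD:
            moves.append((e, "nearline"))
    return moves

extents = [Extent(1, "nearline", 5000), Extent(2, "flash", 2)]
for extent, target in plan_moves(extents):
    print(f"Move extent {extent.extent_id} to {target}")
```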

Data services were introduced to optimise capacity and included thin provisioning, de-duplication and compression.  None of these features could have worked effectively on individual servers because of the additional CPU/memory requirements and lack of consolidated data.
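
As a toy illustration of one of these services, the sketch below de-duplicates fixed-size chunks by content hash.  It also hints at why consolidation matters: duplicate chunks can only be found when data from many sources lands in one pool.  It ignores real-world concerns such as metadata layout, variable-length chunking and compression.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real arrays may use variable-length chunking

def dedupe(data: bytes):
    """Toy block-level de-duplication: store each unique chunk once."""
    store = {}     # chunk hash -> chunk data
    recipe = []    # ordered list of hashes to reconstruct the original
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe

data = b"A" * 16384 + b"B" * 4096      # highly redundant sample data
store, recipe = dedupe(data)
print(f"Logical size: {len(data)} bytes, unique chunks stored: {len(store)}")
```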

Probably one of the biggest benefits of introducing SAN storage has been in taking advantage of the Wisdom of the (Storage) Crowd.  Vendors rigorously tested solutions and components, while using telemetry from the field to identify trends and potential corner case issues.  Today, this part of the market is highly developed and provides feedback on product reliability as well as customer data on growth patterns and performance issues. 

The Legacy of SAN

However, SANs do have issues.  The technology was always seen as expensive and inflexible.  Configuring and mapping storage to host servers could take days or weeks because of the manual nature of provisioning.  Tuning and managing Fibre Channel and shared storage required skilled staff who could be expensive to recruit.  These factors mean that SAN storage doesn’t scale well if it isn’t correctly deployed.

In the same way that server virtualisation takes a physical construct (the server) and abstracts it, SANs implemented virtual versions of physical storage, retaining legacy constructs like LUNs and volumes that were tied to dedicated physical attachments such as Fibre Channel HBAs.

None of these issues plays well with modern application development where DevOps delivery models demand much more agility from storage.  Cost also becomes a huge factor when developers want to run many small or temporary application instances. 

The Need for Persistence

Modern application development is changing across many levels.  Developers are adopting new tools that allow applications to run as containers, as lightweight virtual machines, or serverless, as just code.

The initial received wisdom was that containerisation would produce ephemeral applications that could be restarted and scaled on demand.  Dedicated persistent storage wouldn’t be needed, because applications would handle data persistence and resilience themselves.  While this approach can work at a small scale, as organisations mature in their adoption of containers it quickly becomes apparent that persistent storage is still needed.

To be fair, platforms like open-source Ceph do allow developers to build persistent storage without SANs.  However, these platforms also require significant skills and experience, echoing some of the initial issues and objections to adopting SAN technology.

Any business where data is managed under strict compliance rules simply can’t countenance the idea of information living only in running applications.  At some point, the persistence offered by storage is still required.  The big question is whether that needs to be in the form of a SAN or some other solution.

Changing Application Designs

If we look at how applications (typically databases) have changed, there has been a move to build resiliency into the database layer.  NoSQL databases such as Cassandra and MongoDB are designed to work with local storage, as they shard and replicate data at the application layer.  Each server node becomes a failure domain: if a node (or component) is lost, the data remains redundantly available elsewhere.
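
A minimal sketch of that placement idea follows, assuming a simple hash-based assignment of replicas to distinct nodes.  It is illustrative only and is not how Cassandra or MongoDB actually implement placement.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]   # each node is a failure domain
REPLICATION_FACTOR = 3                             # copies kept of every record

def replicas_for(key: str):
    """Place each copy on a different node so losing one node loses no data."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

print(replicas_for("customer:42"))   # three distinct nodes for this key
```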

However, recovering from a media or node failure means creating lots of east-west network traffic, or I/O between nodes, to rebuild the lost redundancy.  This can negatively impact the host (due to the processor load) and the application (due to increased network traffic and load on local media).
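
To give a sense of the scale of that east-west traffic, here is a back-of-the-envelope calculation.  All the figures are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope only; all figures are assumptions, not measurements.
failed_node_tb = 10        # usable data held on the failed node
network_gbit_s = 10        # per-node network bandwidth
rebuild_share = 0.25       # fraction of bandwidth the rebuild is allowed to consume

bytes_to_copy = failed_node_tb * 1e12
rebuild_bytes_per_s = network_gbit_s / 8 * 1e9 * rebuild_share
hours = bytes_to_copy / rebuild_bytes_per_s / 3600
print(f"Re-replicating {failed_node_tb} TB at {rebuild_share:.0%} of a "
      f"{network_gbit_s} Gbit/s link takes roughly {hours:.1f} hours")
```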

There is also the risk of an additional failure occurring while resiliency is re-established.  In large enterprise environments, data recovery is likely to be in process somewhere in the infrastructure all the time.  At these high levels of scale, minimising the impact of recovery is essential. 

SAN 2.0 Challenges

The next generation of shared storage solutions needs to offer features that meet both the requirements of today’s modern applications and the capabilities of new media.  Solid-state storage is replacing legacy hard drives for all but the most long-term archive needs.  NAND flash and persistent memory products like 3D XPoint can offer hundreds of thousands of IOPS from a single device, easily enough to swamp a traditional dual-controller design.

Today’s applications demand ultra-low latency, measured in single-digit microseconds, that until now could only be achieved with locally attached storage.  IT organisations are building at rack scale, deploying large numbers of generic servers per rack.  This promises a future of disaggregated computing, where CPU, memory and storage are separated out and aggregated into pools.  An entire rack effectively becomes a single large “mainframe”.

SAN 2.0 Requirements

So, with all this in mind, what are the requirements for “SAN 2.0”?

Automation. Automation is essential: SAN 2.0 simply can’t operate with storage administrators performing manual provisioning tasks.  Instead, system administrators need to focus on higher-level issues like resolving failures and managing capacity and performance growth.  Automation should provide the ability to dynamically create storage resources through APIs and CLIs, and ensure those resources can be made available across a network wherever they are required.
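
As an illustration of the kind of API-driven provisioning this implies, here is a hedged sketch against a hypothetical REST endpoint.  The URL, fields and token are invented for the example and do not correspond to any particular vendor’s API.

```python
import requests  # third-party library: pip install requests

# Hypothetical storage-management endpoint and token; not a real vendor API.
API = "https://storage.example.internal/api/v1"
HEADERS = {"Authorization": "Bearer <token>"}

def provision_volume(name: str, size_gb: int, host: str) -> str:
    """Create a volume and map it to a host in one automated call (illustrative)."""
    resp = requests.post(f"{API}/volumes", headers=HEADERS, json={
        "name": name,
        "size_gb": size_gb,
        "map_to_host": host,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["volume_id"]

# Example: a CI pipeline requesting scratch capacity for a test environment.
# volume_id = provision_volume("ci-scratch-001", 100, "k8s-worker-07")
```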

Reduced I/O Path. Modern applications demand low latency.  Solid-state media offers performance levels of around 100µs for NAND flash and 10µs or less for 3D XPoint.  Storage protocols and I/O stacks need to add as little latency as possible on top of these figures, otherwise the benefit of fast media in shared storage is lost.  This means getting out of the way of I/O, using hardware-focused approaches (like FPGAs) and disaggregating the I/O overhead.
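
A simple latency-budget calculation, using the media figures above and assumed stack overheads, shows why the I/O path has to shrink.  The overhead numbers here are assumptions for illustration only.

```python
# Media figures from the text; stack overheads are assumed for illustration.
media_us = {"NAND flash": 100, "3D XPoint": 10}
stack_us = {"heavyweight I/O stack (assumed)": 200, "lean fabric path (assumed)": 10}

for media, m in media_us.items():
    for stack, s in stack_us.items():
        total = m + s
        print(f"{media} via {stack}: ~{total} µs "
              f"({s / total:.0%} of the total is stack overhead)")
```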

Scalability. Shared storage needs to scale, not just in capacity but in performance.  In modern architectures, that means scaling internal and external bandwidth linearly and increasing compute in step.  Dual-controller architectures of the past just won’t cut it.

Commoditisation. Today’s customers simply won’t accept huge margins and mark-ups on storage media.  The market for NAND flash and other solid-state media is growing rapidly, both up, with larger capacities, and out, with a broader range of performance profiles.  Customers will want to use any and all available persistent media options (subject to reasonable HCL requirements).

Lightweight Data Services. Data services still have value in SAN 2.0.  Data protection is best done at the array level, as it reduces east-west network traffic, maintains performance and exposes the application to less risk.  Snapshots, whether used for protection or for cloning, also have value.  However, higher-level functions like array replication are moving to the application and have less relevance in modern application design.

The Architect’s View

Shared storage provided immense value in the enterprise, but like all technologies, it needs to evolve.  There is still huge value in centralising storage, especially as part of a composable or disaggregated, rack-scale infrastructure design.

Storage needs to be invisible yet deliver on the requirements of modern applications.  It will be essential in delivering a robust enterprise cloud strategy.

