Weka 4 - The Data Platform?

Weka has announced Weka 4, the next iteration of the distributed file system first introduced in the summer of 2017. The company now describes its solution as a “data platform”. Is that accurate, and what features lift a distributed file system into the data management category?

Background

We’ve followed Weka since our initial introduction to the technology start-up at AWS re:Invent in November 2017. The company’s core solution, originally branded as Matrix (now Data Platform), is a scale-out, distributed file system (WekaFS). The design bypasses much of the standard Linux kernel storage and networking I/O paths, interfacing directly with PCIe-connected NVMe drives and using SR-IOV for networking.

At the host level, WekaFS appears as a local POSIX file system implemented through a virtual file system driver. The solution also supports NVIDIA GPUDirect Storage, NFS v3 (v4.1 in Weka 4), SMB (through SMB-W, a more efficient Weka-specific implementation) and S3 (the Amazon Simple Storage Service API).

Performance

The performance credentials of WekaFS aren’t in doubt. We’ve discussed the high throughput and low latency capabilities in many blogs and podcasts (with some examples shown here).

These attributes make WekaFS suitable for a wide range of applications, but most notably, AI and analytics processes that create widespread random I/O across a mix of small and large-sized files. For a current view of the Weka Data Platform and use cases, check out this video from a recent Tech Field Day event, with all the videos available here.

New and Improved

Weka 4 introduces new features and improves upon existing ones. The most significant changes in Weka 4 include:

Additional cloud support. Weka 4 provides native integration with AWS, GCP, Azure and OCI (Oracle Cloud). In this instance, “native” refers to the ability of the Weka software to understand the deployed environment, operate with platform-specific storage, deploy with solutions like CloudFormation and scale up/down automatically on demand.
Improved data efficiency. Weka 4 supports QLC and TLC drives and has improved internal data reduction capabilities.
Improved and redesigned UI. A revision of the GUI for greater operational efficiency, with the ability to quick-start a Weka installation from the Self-Service Portal (https://start.Weka.io/) – currently AWS only.
Additional protocol support. Weka 4 now supports NFS v4.1 and SMB through SMB-W, a Weka-optimised implementation.

These two videos (featuring Joel Kaufman) give a good idea of the dynamic deployment capabilities, demonstrating how the software is installed and how data is migrated between clouds using S3 snapshots.

Data Platform

The introduction of Weka 4 repositions the technology as a data platform. Although the rebranding sounds good from a marketing perspective, what does it mean to be a true data platform, and what features does Weka 4 have to justify this stance?

First, we need a definition. Excluding the obvious ones relating to a dais or railway platform, the most appropriate explanation we have is as follows:

“A platform is a group of technologies that are used as a base upon which other applications, processes or technologies are developed.”

Typically, we think of operating systems or hardware as platforms. Increasingly, the definition is becoming more abstract as the public cloud is treated as a platform on which applications are developed and deployed.

What does it mean to be a data platform? The most obvious capability of any data platform is the ability to ingest, serve and manage data for many different I/O and usage profiles, in parallel and with multi-tenant isolation. I recently met up with the Weka team in San Jose, where we discussed this specific point. I learned Weka has many customers that routinely operate processes of data ingest, followed by AI model training. These customers also choose to operate data isolation using multiple file systems, all sitting on the same physical infrastructure.

“Requirement 1 – ingest, manage and deliver data from a diverse set of application use cases, with multi-tenant isolation and zero performance impact.”

This capability of Weka 4 is incredibly powerful because it separates the management of hardware and software from the way data is presented to the end user. System and storage administrators no longer have to build out separate environments for each line of business. Instead, a single platform can be built that spans all requirements. The only reasons to deploy multiple Weka 4 solutions are geographic diversity, resilience or the use of the public cloud.

One important aspect of the Weka design is the ability to scale up and down. This feature is essential in the public cloud, where computing resources are elastic and storage resources (both capacity and performance) are expected to be the same.

Data Anywhere

With the ability to snapshot data to S3, Weka 4 provides the first level of mobility for multi-cloud. Data owners can snapshot data from one cloud provider to an S3 bucket and re-hydrate it in another cloud (see Joel’s second video above to understand just how easy this is).

“Requirement 2 – data should be freed from the dependency of physical hardware and made available wherever it is needed, while ensuring data consistency, data efficiency and data protection.”

As we discussed in this post from 2019, businesses need to “own the data” and “rent the cloud”. This is only possible if data can be moved easily from one provider to another, for example, to take advantage of services and solutions only available on one platform. Note: just because data can be moved easily doesn’t mean we envisage continuous data mobility. Cloud egress still attracts a charge, so data mobility between providers requires an awareness of the cost implications.
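
To put the egress point in perspective, a rough back-of-the-envelope calculation helps. The sketch below is purely illustrative: the function name and the per-GB rate are our own assumptions, not figures from Weka or any cloud provider, and real pricing is usually tiered and varies by region and destination.

```python
def egress_cost_usd(data_tb: float, rate_per_gb_usd: float) -> float:
    """Estimate the one-off charge for moving data out of a cloud provider.

    Assumes a single flat per-GB rate and decimal units (1 TB = 1000 GB).
    Real provider pricing is typically tiered, so treat this as a rough guide.
    """
    if data_tb < 0 or rate_per_gb_usd < 0:
        raise ValueError("data size and rate must be non-negative")
    data_gb = data_tb * 1000
    return data_gb * rate_per_gb_usd


# Hypothetical flat rate, chosen only for illustration.
ASSUMED_RATE_PER_GB_USD = 0.09

for size_tb in (1, 50, 500):
    cost = egress_cost_usd(size_tb, ASSUMED_RATE_PER_GB_USD)
    print(f"{size_tb} TB -> ${cost:,.2f}")
```

Even with a modest assumed rate, the charge grows linearly with the size of the data set, which is why we see snapshot-based mobility as an occasional, planned activity rather than a continuous one.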

Self-Service

The public cloud has taken the concept of technology self-service to new levels. Two decades ago, businesses built service catalogues (and still do) that offer a menu of standardised solutions across storage, networking, compute, virtualisation and applications. The catalogue itself doesn’t translate to self-service. Self-service is delivered through workflow and automation, which have been implemented within the enterprise with varying degrees of success.

Modern infrastructure consumers (both developers and businesses) want to acquire and consume resources, pay for consumption, and give back resources when they’re no longer used. Making this process work in the on-premises data centre (as well as it does in the public cloud) is a challenge, especially in the storage arena.

Historically, scale-out file systems have been designed and deployed with the deep technical skills of experts. Once in place, the fragility of deployments means upgrades or amendments are approached with trepidation. In modern IT, this simply shouldn’t be the case. The public cloud vendors can clearly scale, upgrade, and improve services with little or no user downtime, so why should on-premises be any different?

“Requirement 3 – the deployment, operation and management of data platforms should be as frictionless as possible, while the user experience should enable resources to be consumed and returned on demand.”

How has Weka met this requirement? Since the last time we saw a demonstration of the Weka solution, the company has implemented a self-service portal that helps customers build and deploy directly into AWS. This capability includes assistance in picking the right AWS instance and an automated build process using CloudFormation.

The Architect’s View®

In this post, we’ve highlighted three requirements we expect to see in modern data platforms. In each case, the definition focuses on the content and user experience rather than the hardware and infrastructure. This is where we see the differentiation between a distributed file system and a data platform. A data platform should hide the complexity of implementation from the user and be focused on delivering features that make the best use of the data being stored. In that respect, Weka 4 meets the brief.

But we think there’s an opportunity to go further. Remember that Weka owns the file system and, as such, can modify or extend the way the file system works. This could mean adding extra metadata, building workflow that is triggered by specific types of content, discovering ransomware or auto-archiving PII; the list is endless. AWS does this for S3 with Lambda, while Hammerspace implements metadata extensions into its platform. In this respect, Weka 4 truly is a platform: a base onto which the company itself can build new features and functionality. Weka has barely started on a journey that could transform the way we look at data management in the future.