Data migration is a thankless task that needs to be performed periodically during hardware refreshes, and of course, regularly to keep expensive storage systems tidy. Fortunately, for traditional block-based workloads, the introduction of technologies like server virtualisation has taken most of the pain out of the process. However, migrating unstructured data still seems to represent a challenge. As a result, we see solutions from companies like Hammerspace, Komprise and Datadobi looking to both ease the process and take advantage of hybrid multi-cloud configurations.
Why is the migration of unstructured data so hard? I’ve performed many data migrations and optimisations over the years, including unstructured and block-based storage. The challenges for unstructured data seem to fall into several categories.
- Access – file-based unstructured content on NAS (NFS/SMB) tends to get referenced by physical location – IP addresses and hard-coded physical server DNS names. Tracking down every one of these references can be hard, as people have a habit of including links in Office documents or emails.
- Standards – many NAS platforms have minor discrepancies in file naming standards and acceptable file formats. The most obvious is the difference between SMB and NFS with upper/lower case file names.
- Security – NFS and SMB use different security models, and the implementation of users/groups across systems can differ. Migrations require thinking ahead on security challenges to ensure data doesn’t get orphaned.
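As an illustration of the naming issue above, a pre-migration check can flag paths that are distinct on a case-sensitive NFS export but would collide on a target that treats names case-insensitively, as SMB shares typically do. This is a minimal sketch rather than any vendor's method; the function name `find_case_collisions` and the sample paths are hypothetical.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group paths that would collide on a case-insensitive target.

    Hypothetical helper: keys each path on its case-folded form, so
    directories that differ only by case are treated as the same place.
    """
    groups = defaultdict(list)
    for raw in paths:
        groups[raw.casefold()].append(raw)
    # Only groups with more than one member represent a collision.
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Sample paths are made up for illustration.
    sample = [
        "/export/projects/Report.docx",
        "/export/projects/report.docx",
        "/export/projects/notes.txt",
    ]
    for group in find_case_collisions(sample):
        print("collision:", group)
```

Running a check like this against a file listing before the first copy gives the data owner time to rename the conflicting files, rather than discovering the problem at cutover.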
The above list references technical issues, but we see other operational challenges too. In many cases, I’ve found it hard to identify the owner of data. File shares and other repositories become used by many teams over time, with the original owner lost to history. File and object stores can also become dumping grounds for data, with lots of content being unused from one month (or year) to the next. This sprawl makes it hard to do timely re-organisations or migrations without having an impact on the business.
When storage was expensive, there was justification for spending time trawling through file shares and optimising the content. However, as storage capacity has become cheaper, there’s a delicate balance to be struck between manual management and realising savings from equipment that already has sunk costs.
One strategy used by enterprises has been to offload data to the public cloud. This appears to be an excellent solution for storing infrequently used data. However, the regular price reductions that AWS, Azure and GCP offered early on have dried up, and cloud storage costs are now pretty much stable. Instead, the new challenge is to use cheaper services like Wasabi or Backblaze B2.
There are two approaches IT organisations can follow to optimise their unstructured data costs.
- Migrate inactive content to a cheaper platform/medium.
- Migrate all data to a more economic platform.
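For the first approach, the starting point is usually a scan that identifies content nobody has touched for some time. The sketch below assumes that file access and modification timestamps are a reasonable proxy for activity, which is not always true (for example, file systems mounted with access-time updates disabled). The function name `find_inactive` and the idle threshold are hypothetical.

```python
import os
import time


def find_inactive(root, max_idle_days, now=None):
    """Yield (path, size_bytes) for files idle for longer than max_idle_days.

    Assumption: a file is "inactive" when both its access time and its
    modification time are older than the cutoff.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            last_used = max(st.st_atime, st.st_mtime)
            if last_used < cutoff:
                yield path, st.st_size


if __name__ == "__main__":
    # Report the capacity held by files idle for more than a year.
    idle_bytes = sum(size for _path, size in find_inactive(".", 365))
    print(f"inactive capacity: {idle_bytes} bytes")
```

Summing the sizes gives a rough estimate of how much capacity a move to a cheaper tier could release, which helps decide whether the exercise is worth the effort.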
The migration of inactive content has been a perennial IT problem. I first started managing data archiving back in 1989 with DFHSM on the IBM mainframe. Information Lifecycle Management (ILM) has been a recurring topic ever since. Moving individual files from a set of data can introduce problems.
- How does the system keep track of where files are located? (stubs or links)
- How is metadata kept consistent for data protection and searches?
- How is security maintained consistently across multiple platforms?
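The first question, tracking where files have gone, is often answered by leaving a link behind at the original path. The sketch below shows the generic idea using a symbolic link on a POSIX file system. It is an assumption-laden illustration rather than how any particular product works, and the function name `archive_with_link` and the directory layout are hypothetical.

```python
import os
import shutil


def archive_with_link(path, archive_root, source_root):
    """Move a file under archive_root and leave a symbolic link at its old path.

    Assumes a POSIX system where the caller may create symlinks, and that
    archive_root is reachable from every client that follows the link.
    """
    rel = os.path.relpath(path, source_root)
    target = os.path.join(archive_root, rel)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # shutil.move renames within a file system, or copies and then deletes
    # across file systems.
    shutil.move(path, target)
    # The link keeps the original path resolvable for users and applications.
    os.symlink(target, path)
    return target
```

A plain symbolic link answers only the location question. Keeping metadata and security consistent across platforms, the other two questions, still needs separate handling.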
Many of these problems were resolved by retaining data on a single (logical) system. However, with the introduction of public cloud storage, APIs and security frameworks are very diverse (although many companies have standardised on protocols like S3).
Lift and Shift
An alternative strategy is simply to move all data from one location to another. An entire “lift and shift” from one platform to another can be an invasive process as, somehow, the migration needs to handle the requirement to maintain access to data while ensuring any updates get reflected in the target location.
With so much unstructured data created by non-human sources, we can expect to see anomalies in data like unusual or malformed file names. Log data, as an example, can create millions of small files, with no obvious way to assign ownership or detect corrupt or missing files.
Over the years there have been attempts to virtualise the file space, with solutions like Microsoft DFS, which essentially abstracts the file system address space from the physical location. File Area Networks were popular for a while, although that technology has all but disappeared. Today we’re looking more at solutions like Nasuni, CTERA and Panzura to provide hybrid, globally distributed data.
Earlier this year, we recorded a podcast with Hammerspace, a startup that grew out of work done at Primary Data. The Hammerspace global file platform implements a single global namespace, ingesting and abstracting data on existing storage platforms, or built on new hardware from scratch.
I particularly like the metadata aspects of the Hammerspace solution. In any file abstraction technology, we need to retain metadata to track the mapping of logical files to physical storage. Hammerspace has gone a step further and added a rich metadata “engine” into their platform. This enhancement allows users to extend the metadata model and build workflow around data for data management and performance optimisation purposes.
With these capabilities in place, customers can minimise (or substantially eliminate) the challenges of data migration. Instead, the administrator simply sets policies in place that make the best use of resources and dynamically move data to the right location based on security, regulatory and performance requirements. You can listen to the embedded podcast here, and another with Hammerspace SVP Douglas Fallstrom that talks about some of the more generic challenges of global file systems.
Komprise is another data management company that focuses on optimising large pools of unstructured data. Here, the approach is to sit adjacent to the storage platform, rather than be in the data path (as implemented with Hammerspace). The Komprise system deploys data movers (typically as virtual machines) that watch for inactive data and execute policies to move data to cheaper storage media, including the public cloud.
Komprise implements dynamic links to replace data that has been migrated to a new location. The link provides all of the underlying metadata required to do file searches and inquiries. If a file is then re-accessed, a local data mover either redirects the I/O or recovers the required file locally to satisfy the file request. Through policy, this content can then be kept local or discarded as required. This feature is known as Transparent Move Technology or TMT.
The use of links is an elegant solution as it adheres to file system standards in both SMB and NFS. Rather than blindly recovering the data automatically, the Komprise system can “triage” that request and choose whether a full restore is really necessary. This technique reduces I/O overhead but, more importantly, eliminates false restores and keeps metadata consistent. The trade-off is in the additional resources required by the data movers.
Komprise operates as a hybrid service through an online SaaS portal and data movers that get deployed on-premises. Recent feature upgrades have included support for object storage and public cloud file storage platforms.
Although Komprise doesn’t sit in the data path, the platform does collect large amounts of metadata that enable “deep analytics”. Indexing and analysis of large data lakes are managed in the public cloud in a process that is more scalable than the customer could achieve onsite.
You can listen to a Storage Unpacked podcast we recorded just over 12 months ago with Krishna Subramanian from Komprise. I’d also recommend checking out the most recent presentations at Tech Field Day, which provide updates on cloud support and data migrations.
What if you don’t want to deploy and manage new tools and infrastructure? Datadobi, a startup from Belgium, could be the answer. The company was started by the founders of FilePool, which became EMC Centera, arguably one of the first object or content-based storage platforms for the enterprise.
The Datadobi “DobiMigrate” service offering is a suite of tools and processes that enable customers to move large volumes of data from one platform to another. This task might seem simple; after all, we can move data using existing tools built into Windows and Linux. Unfortunately, large-scale migrations are not that simple. Moving data from one location to another takes time and network bandwidth, and risks interrupting normal operations. Therefore, efficient data management tools need to take these factors into consideration.
DobiMigrate uses a similar concept to Komprise through the deployment of proxies or data movers into the customer environment. These agents are responsible for both moving data and tracking changes to existing content. A migration workflow will consist of an initial full scan and transfer, followed by “incremental” updates of ongoing changed data. During the migration process, the customer has the opportunity to correct any file system anomalies in content that would otherwise fail to migrate.
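The full-scan-then-incremental pattern can be sketched generically as a comparison between two snapshots of the source file system. This is not DobiMigrate's implementation. It is a minimal illustration that assumes a change in size or modification time is enough to detect an update, and the function names `scan` and `diff` are hypothetical.

```python
import os


def scan(root):
    """Return {relative_path: (size, mtime_ns)} for every file under root."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full, follow_symlinks=False)
            except OSError:
                continue  # file disappeared between listing and stat
            snapshot[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return snapshot


def diff(previous, current):
    """Compare two snapshots and return (paths_to_copy, paths_to_delete)."""
    # New files and files whose size or mtime changed need to be (re)copied.
    to_copy = sorted(p for p, sig in current.items() if previous.get(p) != sig)
    # Files present last time but missing now were deleted at the source.
    to_delete = sorted(p for p in previous if p not in current)
    return to_copy, to_delete
```

The first pass is simply `diff({}, scan(root))`, which marks everything for transfer. Each later pass compares the new scan against the previous one, so the volume of data left to move shrinks as the migration approaches cutover.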
At “cutover”, access to the primary system stops and is redirected to the new target. DobiMigrate has the capability to keep track of ongoing updates to the new target, should a scenario arise that requires reversion back to the source.
Datadobi recently introduced S3-to-S3 object migration, for customers looking to repatriate data from the public cloud or to migrate to another, cheaper provider.
Data migration is a complex process. We’ve described just three solutions here, but there are other techniques and products available in the market. Earlier this year, we discussed InfiniteIO, which also offers migration capabilities. NetApp recently acquired Talon Storage; we recorded a podcast in 2018 with CEO Shirish Phatak that provides more background on migration processes and challenges.
The Architect’s View®
The holy grail of ubiquitous data mobility and access isn’t quite here (yet). There are many solutions on the market that either integrate into enterprise environments to optimise storage or put in place a framework for a distributed architecture.
Copyright (c) 2007-2020 Brookend Limited. No reproduction without permission in part or whole. Post #f535.