Building a Golden Data Repository

Chris Evans

Businesses are increasingly seeing data as one of their most valuable assets.  At the most basic level, all organisations need to keep records on customer transactions and the products and services they sell.  However, data management in the modern enterprise is about much more than simple bookkeeping.  Information collected from online customer interactions or data from edge and IoT devices (to pick just two examples) can be used to develop insights that enable business advantage or the delivery of more efficient services.  With so much valuable data in play, businesses need to develop a strategy for storing, protecting and using that data.  One solution is to build a golden repository that centralises and optimises business data assets.

Data as an Asset

Most businesses will have three sources of data. 

  • Current Data – typically data used in ongoing business transactions, such as customer records.  This is likely to be a mix of structured and unstructured content.  This data typically sits in online systems rather than a single large repository.
  • Historical Data – information from previous customer engagements, product development or other sources.  This could also be a mix of structured and unstructured data.
  • New Data – information being collected from new sources that could include sensors, logs, platforms used to deliver services and engage with customers, or a range of other IoT and edge devices. 

As businesses identify and create new data sources, we’re increasingly seeing much more unstructured data being created.  Much of this will be stored in file systems and object stores.

Machine learning and Artificial Intelligence (ML/AI) provide tools for analysing large volumes of data. Typically, with more data available to process, AI models increase in accuracy.  Therefore, businesses will want to keep as much data as possible.  Of course, there’s no way to predict which data sources will be valuable in the future, so businesses may follow a strategy of “keep everything” and never throw content away.

Challenges

With so many data sources available, there are challenges in implementing efficient data management.  Data sprawl results in content being kept across multiple locations, sometimes in many data centres and public clouds.  Sprawl leads to inevitable issues.  First, it introduces the risk of having no single point of truth, when multiple copies of the same data set have been updated in parallel in disparate systems.  Second, duplication means increased cost, a problem that copy data management was expected to solve.  Third, multiple copies of data introduce more attack vectors for accessing content and can make it difficult to apply a consistent security policy.  This is especially true when many different data platforms are involved.

Golden Repository

A great solution to these data management challenges is to create a golden, or master, repository.  Rather than spreading data across multiple platforms and solutions, data is centralised in one system.  A single system doesn’t mean one monolithic rack of servers and storage, as implementations can vary.  What it does mean is a consolidated view of metadata and a single logical namespace. 

The Importance of Metadata

Why should we worry about our metadata strategy?  Metadata describes data and holds valuable information on the content.  This can include simple characteristics like file size, date created and so on.  It can also be extended to include specific data attributes that relate directly to the content itself.  For example, x-ray data could include patient name, age and hospital.  Media content could include media format, date created and copyright information.

Efficient metadata makes content search much simpler, and holding metadata in a single location makes search easier still: one query against one index, rather than querying multiple repositories and somehow aggregating the results. 
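
To make this concrete, here’s a minimal sketch of what a central metadata index might look like, using SQLite purely as a stand-in for the repository’s metadata layer.  The field names (patient name, content type and so on) are illustrative, not a prescribed schema.

```python
import sqlite3

# A single index describing every object in the logical namespace.
conn = sqlite3.connect("golden_repo_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS objects (
        object_key   TEXT PRIMARY KEY,   -- location within the single namespace
        size_bytes   INTEGER,
        created_utc  TEXT,
        content_type TEXT,
        patient_name TEXT                -- example domain-specific attribute
    )
""")

# Register an object and its descriptive metadata in one place.
conn.execute(
    "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?, ?)",
    ("imaging/2019/scan-0001.dcm", 52_428_800,
     "2019-06-01T10:15:00Z", "application/dicom", "J. Smith"),
)
conn.commit()

# One query against one index, instead of searching several repositories.
for row in conn.execute(
    "SELECT object_key, size_bytes FROM objects WHERE patient_name = ?",
    ("J. Smith",),
):
    print(row)
```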

Single Source of Truth

From a data management perspective, a single repository ensures there is only ever one “source of truth” for content.  By this we mean that whenever data is processed, the current copy is (and should always be) taken from the repository. 

Naturally, this doesn’t mean always processing data directly in the repository.  Copies can be taken and used elsewhere, as long as procedures are in place for incorporating updates and managing multiple “check-outs”.  This process is much easier when most data is effectively read-only and used for analytics purposes.
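
As an illustration of the “check-out” idea, the sketch below tracks who has taken a working copy of an object from the repository.  The ledger structure and names are hypothetical; in practice this bookkeeping would sit in the repository’s metadata layer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CheckoutLedger:
    """Tracks which consumers currently hold working copies of an object."""
    open_checkouts: dict = field(default_factory=dict)  # object_key -> [(consumer, when)]

    def check_out(self, object_key: str, consumer: str) -> None:
        entry = (consumer, datetime.now(timezone.utc))
        self.open_checkouts.setdefault(object_key, []).append(entry)

    def check_in(self, object_key: str, consumer: str) -> None:
        entries = self.open_checkouts.get(object_key, [])
        self.open_checkouts[object_key] = [e for e in entries if e[0] != consumer]

    def holders(self, object_key: str) -> list:
        return [consumer for consumer, _ in self.open_checkouts.get(object_key, [])]

ledger = CheckoutLedger()
ledger.check_out("sales/2018/q4.parquet", "analytics-team")
ledger.check_out("sales/2018/q4.parquet", "ml-pipeline")
print(ledger.holders("sales/2018/q4.parquet"))  # both consumers hold copies
ledger.check_in("sales/2018/q4.parquet", "ml-pipeline")
```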

Dispersed Data Generation

Another benefit of a golden copy can be seen with businesses that generate large amounts of dispersed data or content at the edge.  Content generated by edge devices, sensors and other IoT devices may not initially be incorporated into a core repository, either because of bandwidth constraints or because only some of the data is valuable to retain.  However, eventually moving the data to a central repository makes it possible to index the content and ensure multiple copies are not being stored.
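
One simple way to avoid holding multiple copies when edge data finally lands in the central repository is content addressing.  The sketch below assumes a SHA-256 digest is sufficient to identify duplicates; the in-memory set stands in for the repository’s index.

```python
import hashlib
from pathlib import Path

known_digests = set()   # in practice, backed by the repository's metadata index

def ingest(path: Path) -> bool:
    """Ingest an edge-generated file, skipping content already stored."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in known_digests:
        return False     # duplicate payload -- nothing to transfer or store
    known_digests.add(digest)
    # ...upload to the central repository and record its metadata here...
    return True
```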

Cloud or On-Premises

With increasing data volumes, businesses need to choose where data should be kept.  When building a single repository, IT organisations have a choice in whether to use public cloud or store data on-premises. 

Public Cloud

In the early days of cloud computing, there were plenty of negative stories about data loss, data exposed to public access and public cloud outages.  It is inevitable, with so many customers using public cloud services, that there will be some issues.  However, we can be confident in saying that the major cloud storage providers offer safe and reliable data storage solutions.  There are, though, some aspects that need to be considered when choosing public cloud storage.

  • Storage online is not cheap.  A petabyte of object storage will cost around $22,000 per month to store, with additional charges for data access and data egress from the cloud provider.  These costs continue to accrue, even if data is not being used (a rough cost sketch follows this list).
  • Online storage offers lower availability than standard enterprise solutions.  Although public cloud storage solutions offer high durability, services are typically designed for, at best, 99.99% availability, with 99.9% guaranteed through service level agreements.  SLAs usually only offer service credits for downtime and offer no compensation for data loss. 
  • Cloud object stores are not designed for high performance.  Naturally, the definition of performance is relative; however, time-to-first-byte in online object stores can range from a few milliseconds to 100 milliseconds depending on the data type.  Cloud service providers are extending their services with native and 3rd party file solutions that offer better latency and throughput characteristics.
  • Data transfer between clouds can be expensive.  Egress charges make it a challenge to cost-effectively move between clouds to take advantage of different services offered by cloud service providers, without keeping multiple copies of data.
  • Data placement in public cloud may present some governance, compliance or regulatory challenges that require special treatment or management by the cloud provider.
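
The arithmetic behind the storage figure above is straightforward.  The per-GB rates below are assumptions for illustration only; actual list prices vary by provider, region and storage class.

```python
# Assumed rates for illustration -- not any provider's published pricing.
STORAGE_PER_GB_MONTH = 0.022   # USD per GB per month, object storage
EGRESS_PER_GB = 0.09           # USD per GB, data egress to the internet

capacity_gb = 1_000_000        # roughly one petabyte (decimal)
egress_gb = 50_000             # hypothetical 50 TB read back out per month

print(f"Storage: ${capacity_gb * STORAGE_PER_GB_MONTH:,.0f}/month")  # ~$22,000
print(f"Egress:  ${egress_gb * EGRESS_PER_GB:,.0f}/month")           # ~$4,500
```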

Taking a positive view, public cloud offers a solution with no infrastructure management overhead.  Storage space is effectively unlimited and replicated by default across multiple redundant data centres.  There’s no need to go through arduous capital purchase cycles, and charging usage back to lines of business is relatively easy.  Compute and analytics solutions in the public cloud are evolving quickly, and arguably in a more agile and cost-effective way than buying on-premises hardware. 

On-Premises

There are many software and hardware solutions for storing data on-premises.  Object storage platforms are designed to grow on-demand by adding compute and storage nodes (or servers).  Vendors offer a mix of capacity and performance-biased products that scale into multi-petabyte address spaces.  Characteristics to look at in on-premises solutions include:

  • Scalability – what is the limit of a single namespace and platform?  Can nodes/servers be added dynamically and what is the impact on the system?  For example, will data across nodes rebalance automatically or affect overall performance while the rebalancing takes place?
  • Granularity – what level of scaling is on offer?  Can individual drives or servers be added or is it necessary to add groups of servers in each update?
  • Upgrades and Replacements – how are software upgrades managed?  Do they affect performance and incur downtime?  Can servers/nodes be easily replaced for upgrades or to fix faults?
  • Licensing Policies – is charging based on nodes, servers or capacity?  Is charging based on monthly, annual or perpetual licences?
  • Geo-dispersal – can data be protected across data centres?  What features are available?  Typically, solutions either replicate data or use erasure coding to reduce the capacity overhead of redundancy, at the expense of performance (a worked comparison follows this list). 
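
To illustrate the trade-off in the last point, the sketch below compares the raw capacity needed for three-way replication against an 8+3 erasure-coding scheme.  The scheme parameters are examples, not any specific vendor’s implementation.

```python
usable_tb = 1000                         # data the business needs to keep

replicated_raw = usable_tb * 3           # three full copies across sites
ec_data, ec_parity = 8, 3                # 8 data fragments plus 3 parity fragments
erasure_raw = usable_tb * (ec_data + ec_parity) / ec_data

print(f"3x replication: {replicated_raw:.0f} TB raw (200% overhead)")
print(f"EC 8+3        : {erasure_raw:.0f} TB raw "
      f"({(erasure_raw / usable_tb - 1) * 100:.0f}% overhead)")
```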

We can see that on-premises solutions require more planning than public cloud and naturally more ongoing management.  Platform costs will be a mix of capital and operational expenditure, depending on the licensing model.  Some vendors are starting to offer storage-as-a-service solutions on-premises, and this may be one way to mitigate capital costs.

[Image: On-premises repository, multi-cloud support]

Own Your Data, Rent the Cloud

If multiple clouds are likely to be an architectural strategy, one solution is to keep data on-premises and rent compute from the public cloud.  In this scenario, data is copied temporarily to the public cloud to be processed, then discarded when no longer needed.  Only derived or changed data is retained.  This model works well where, for example, portions of a data lake are processed over time and only a subset of the data needs to be kept in the public cloud at any point in time.
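
A minimal sketch of that workflow is shown below, assuming object storage as the temporary staging area.  The bucket name, the boto3 client and the analytics callback are all illustrative; the point is that staged copies are deleted once the derived results have been retained.

```python
import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "example-staging-bucket"   # hypothetical bucket name

def process_batch(local_paths, run_analytics):
    """Stage a subset of on-premises data, process it in the cloud, keep only results."""
    staged_keys = []
    for path in local_paths:                              # iterable of pathlib.Path
        key = f"staging/{path.name}"
        s3.upload_file(str(path), STAGING_BUCKET, key)    # temporary cloud copy
        staged_keys.append(key)

    results = run_analytics(STAGING_BUCKET, staged_keys)  # rented cloud compute

    for key in staged_keys:                               # discard the staged copies
        s3.delete_object(Bucket=STAGING_BUCKET, Key=key)

    return results                                        # only derived data is kept
```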

The Architect’s View

Building a golden repository is about ensuring data assets can be tracked and managed successfully.  As new data sources are developed, having a single archive makes it easier to make these assets available to the entire business.  Effective management reduces cost and risk and facilitates greater business value.  This is an outcome that all CIOs and CTOs are incentivised to deliver. 


Copyright (c) 2007-2019 Brookend Ltd. No reproduction in part or whole without permission. Post #a526.