Object Storage Essential Capabilities #1 - Scalability

This is the first in a series of posts covering critical features for object storage platforms. The series expands on this post from the beginning of 2017.

Object storage has been positioned as a platform for storing vast amounts of unstructured data, the main area in which we’re seeing significant industry growth. Systems need to have the capability to scale to multi-petabytes of capacity, while at the same time offering a reasonable entry point for adoption. When we talk about specific features around scalability, exactly what do we mean?

Object Size

Objects may range from a few kilobytes to multi-terabytes in size. Digital X-rays may be around 20MB each, whereas a one hour 4K video could be anything from 27-45GB depending on the compression codecs in use. Although it may seem surprising, smaller objects can be more difficult to manage, because data protection schemes like erasure coding work better with larger objects that can be chunked up and spread across multiple storage nodes. As an example, Cloudian’s HyperStore uses Cassandra for storing small objects, whereas larger objects are stored as normal on the HyperStore File System on disk.
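To see why small objects are awkward for erasure coding, here is a minimal Python sketch. The 4+2 scheme, the 64KB cut-off and the function names are assumptions chosen for illustration, not values taken from Cloudian or any other product.

```python
import math

def fragment_layout(object_bytes, data_fragments=4, parity_fragments=2):
    """Return (bytes per fragment, total bytes stored) for a k+m erasure-coded object."""
    fragment_bytes = math.ceil(object_bytes / data_fragments)
    total_bytes = fragment_bytes * (data_fragments + parity_fragments)
    return fragment_bytes, total_bytes

def choose_store(object_bytes, small_object_limit=64 * 1024):
    """Route small objects to a database and larger ones to the file system.

    The 64KB limit is a hypothetical threshold, not a vendor default.
    """
    return "database" if object_bytes <= small_object_limit else "filesystem"

# A 20MB X-ray splits into six useful 5MB fragments, whereas a 4KB object
# becomes six 1KB fragments, each costing a separate write on a separate node.
for size in (4 * 1024, 20 * 10**6):
    print(size, fragment_layout(size), choose_store(size))
```

The storage overhead is the same ratio in both cases. What changes is the number of I/O operations per byte stored, which is why a platform might keep small objects somewhere other than the erasure-coded pool.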

Capacity Limits

Pretty much every object store on the market claims infinite scalability, with limits only on what has been tested in the lab or at their biggest customers. However, there’s more to think about than purely how much capacity a system can support. There are multiple aspects to consider.

Physical Storage Space Scaling – how can storage capacity be expanded? Can I extend an existing node with more disk/flash drives or do I need to add more nodes with extra CPU/memory performance?
Performance Scaling – to support more throughput do I need extra nodes?

As we look at scaling, cost factors start to come into the equation. Take open source Ceph as an example. One immediate statement made in the Hardware Recommendations section is:

Ceph metadata servers dynamically redistribute their load, which is CPU intensive. So your metadata servers should have significant processing power.

Look further and you can see recommendations like 1GB of DRAM per daemon per server, with 1GB of DRAM per 1TB of storage for OSDs (daemons managing storage devices), which immediately puts a limiting factor on the physical capacity a single node can support. Obviously these are recommended numbers and in reality, with actual data, the results may vary. However, scaling performance and capacity are interlinked and can’t be treated independently.
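As a rough worked example, the sketch below reads the guideline as about 1GB of DRAM per 1TB of storage managed by OSDs and turns it into arithmetic. The function names, the drive counts and the reserved-memory figure are assumptions for illustration only.

```python
def osd_ram_gb(drives_per_node, drive_tb, gb_per_tb=1.0):
    """DRAM suggested for the OSDs on one node under a GB-per-TB rule of thumb."""
    return drives_per_node * drive_tb * gb_per_tb

def max_capacity_tb(node_ram_gb, reserved_gb=0.0, gb_per_tb=1.0):
    """Largest raw capacity one node can hold before DRAM becomes the limit."""
    return max(node_ram_gb - reserved_gb, 0.0) / gb_per_tb

# Twelve 8TB drives suggest 96GB of DRAM for the OSDs alone.
print(osd_ram_gb(drives_per_node=12, drive_tb=8))
# A 64GB node with 4GB held back for the OS tops out at about 60TB.
print(max_capacity_tb(node_ram_gb=64, reserved_gb=4))
```

Under this reading, adding bigger drives to an existing node eventually means adding memory too, which is where the cost of capacity and the cost of performance start to overlap.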

Many systems such as Cloudian and Scality use a RING architecture to distribute metadata and content across multiple nodes. The addressable range of object IDs (used to reference a piece of content) is divided across the nodes. As the system is expanded, new nodes re-divide the address space and take ownership of part of the range of object IDs. Where necessary, content can be rebalanced. One aspect to be aware of is what impact this rebalancing has on existing performance. Does the process occur immediately, does it run in the background, or can it be scheduled?
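The idea of dividing an address space across nodes can be sketched with a generic consistent-hash ring. This is not the Cloudian or Scality implementation; the hash function, the use of 64 virtual tokens per node and the class name are all assumptions.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position in a 64-bit address space."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node takes several positions so ownership is spread evenly.
        for i in range(self.vnodes):
            bisect.insort(self._tokens, (token(f"{node}#{i}"), node))

    def owner(self, object_id):
        # The owner is the first token at or after the object's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._tokens, (token(object_id),))
        return self._tokens[idx % len(self._tokens)][1]

ring = Ring(["node-1", "node-2", "node-3", "node-4"])
ids = [f"object-{i}" for i in range(2000)]
before = {i: ring.owner(i) for i in ids}
ring.add_node("node-5")
moved = [i for i in ids if ring.owner(i) != before[i]]
print(f"{len(moved)} of {len(ids)} objects change owner")
```

Only the objects that fall into the new node's share of the range change owner. That is the rebalancing traffic whose timing and performance impact the questions above are asking about.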

Other platforms such as OpenIO use a directory architecture to keep track of data distributed across nodes. This relies on a load balancer, known as the Conscience, which keeps track of available storage capacity. The state of the services on each available host is calculated to produce a rating from 0 to 100, with higher numbers indicating better choices. New hosts and capacity can simply be added to the list of available hosts, where they receive a higher rating value.
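A score-driven placement scheme of this kind might look like the sketch below. The scoring formula and both function names are hypothetical and are not OpenIO's actual expression or API; the sketch only shows how a 0 to 100 rating can steer new writes towards emptier hosts.

```python
import random

def service_score(free_space_pct, cpu_idle_pct):
    """Hypothetical rating from 0 to 100: the weaker of two resources wins."""
    return max(0, min(100, round(min(free_space_pct, cpu_idle_pct))))

def pick_hosts(scores, count, rng=random):
    """Pick distinct hosts, weighted by score. Hosts scoring 0 are never used."""
    candidates = {host: s for host, s in scores.items() if s > 0}
    chosen = []
    while candidates and len(chosen) < count:
        hosts = list(candidates)
        weights = [candidates[h] for h in hosts]
        host = rng.choices(hosts, weights=weights)[0]
        chosen.append(host)
        del candidates[host]
    return chosen

scores = {
    "old-1": service_score(20, 70),
    "old-2": service_score(25, 60),
    "new-1": service_score(95, 98),  # freshly added, nearly empty host
    "full": service_score(0, 90),    # no free space, so score 0
}
print(pick_hosts(scores, count=2))
```

Because selection is weighted rather than strict, a newly added host attracts most new writes without existing data having to be moved.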

Object stores also have logical scaling limits. Typically data in object stores is divided up into logical units such as buckets. A bucket could represent a specific department or application. Platforms may put limits on the number of buckets that can be created in a single system (AWS limits accounts by default to 100 buckets).

Tiering and Caching

Initially, object storage was seen as a great platform for relatively inactive data such as backups and archives. While these are great use cases, they’re not really indicative of the way in which object storage is used today. Many businesses use active archives that are constantly storing and retrieving data that previously may have existed in a scale-out NAS solution. As a result, content goes through a lifecycle that has active and inactive periods.

Imagine an insurance company issuing PDF versions of policy documents. At the point of creation, these documents are likely to be active, either for the customer to log in and download or for printing and posting. Over the course of a few days, the documents may get amended (and so be created and deleted many times), then eventually move to a period of inactivity after a few weeks. At some point as the policy is succeeded, the documents move to archive status to be retained for a statutory period. This example shows that initially, data needs to sit on fast storage and eventually tier down to cheaper long term media.

Depending on document size, there may be performance advantages in initially using RAID protection and converting documents to erasure coding for long term storage. How does your object store manage this? Amazon Web Services’ S3 platform uses only three tiers – Standard, Standard – Infrequent Access and Glacier. Each tier is priced differently, with considerable savings between each level (Infrequent Access is about half the price of Standard, and Glacier is about two-thirds cheaper again than IA). The three tiers offer the same level of durability (risk of data loss), but have different availability SLAs, meaning there is more potential downtime with Glacier and IA than with Standard. However, in many cases this reduction in availability is offset by the cost saving.
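Taking the rough ratios quoted above (IA at about half of Standard, Glacier about two-thirds below IA), the saving over a document's lifecycle can be estimated as follows. The prices are relative units rather than actual AWS rates, the function name is hypothetical, and retrieval and request charges are ignored.

```python
from fractions import Fraction

# Relative storage price per GB-month, with Standard as the baseline.
RELATIVE_PRICE = {
    "standard": Fraction(1),
    "ia": Fraction(1, 2),
}
RELATIVE_PRICE["glacier"] = RELATIVE_PRICE["ia"] * (1 - Fraction(2, 3))

def lifecycle_cost(months_per_tier):
    """Relative cost of keeping 1GB for the given number of months in each tier."""
    return sum(RELATIVE_PRICE[tier] * months for tier, months in months_per_tier.items())

# One active month, two quieter months, then nine months archived.
tiered = lifecycle_cost({"standard": 1, "ia": 2, "glacier": 9})
flat = lifecycle_cost({"standard": 12})
print(float(tiered), float(flat))
```

On these assumptions, a year of tiered storage costs 3.5 units against 12 for leaving everything on Standard.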

Object stores should support tiering, so data can be moved between media of different cost. Unlike traditional block systems, which try to be as proactive as possible, object stores don’t need to move data at millisecond or even hourly intervals; tiering could be processed on a daily basis. However, some process of re-promoting active content needs to be available. The actual process of tiering can be driven by metadata associated with each object. This may be system based (e.g. tier down everything not accessed for 30 days) or from user metadata (auto tier this object to cheaper storage after 1 week).
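A daily, metadata-driven tiering pass could be sketched like this. The two tier names, the "tier-after-days" metadata key and the record layout are assumptions made for the example, not any product's schema.

```python
from datetime import datetime, timedelta

def target_tier(obj, now, idle_days=30):
    """Return "hot" or "cold" for one object."""
    # User metadata rule: tier down a fixed number of days after creation.
    rule = obj.get("user_metadata", {}).get("tier-after-days")
    if rule is not None and now - obj["created"] >= timedelta(days=int(rule)):
        return "cold"
    # System rule: tier down anything not accessed within idle_days.
    if now - obj["last_accessed"] >= timedelta(days=idle_days):
        return "cold"
    # Recently accessed content stays on, or is re-promoted to, fast storage.
    return "hot"

def daily_pass(objects, now):
    """Return (object id, current tier, wanted tier) for every object that should move."""
    moves = []
    for obj in objects:
        wanted = target_tier(obj, now)
        if wanted != obj["tier"]:
            moves.append((obj["id"], obj["tier"], wanted))
    return moves

now = datetime(2017, 6, 1)
objects = [
    {"id": "policy-a.pdf", "tier": "hot",
     "created": datetime(2017, 1, 1), "last_accessed": datetime(2017, 4, 1)},
    {"id": "policy-b.pdf", "tier": "cold",
     "created": datetime(2017, 1, 1), "last_accessed": datetime(2017, 5, 30)},
]
print(daily_pass(objects, now))
```

The same pass handles both directions: idle objects are queued to move down, and archived objects that have been read again are queued for re-promotion.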

Metadata Management

In object stores, metadata provides information on the content being stored. Objects can have system metadata, which typically covers items such as the file size and date created, or user metadata that provides additional information. Examples are the user creating the object, the application creating the object, and details of the object format or content. Metadata can also be used to indicate how an object should be stored, so could be used to drive tiering or data protection. This represents an interesting difference between object stores and traditional storage systems, where these settings are assigned by the administrator.

With very large object stores, the use of metadata is critical in finding content and in many cases the amount of metadata can be significant. For this reason, object stores need to be efficient at handling metadata, including the ability to supplement metadata content once an object has been stored. We will discuss search in more detail in post #3, however one example of metadata storage is the use of Cassandra in Cloudian’s HyperStore. Two separate tables keep metadata details, with one optimised specifically for read requests.
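To make the distinction between system and user metadata concrete, here is a small in-memory sketch in which metadata can be supplemented after an object is stored and then used to find content. The class and field names are hypothetical, and this does not represent the HyperStore table layout.

```python
from collections import defaultdict

class MetadataCatalogue:
    def __init__(self):
        self._records = {}              # object id -> system and user metadata
        self._index = defaultdict(set)  # (key, value) -> set of object ids

    def put(self, object_id, size_bytes, created, user_metadata=None):
        self._records[object_id] = {
            "system": {"size_bytes": size_bytes, "created": created},
            "user": {},
        }
        self.add_user_metadata(object_id, user_metadata or {})

    def add_user_metadata(self, object_id, extra):
        """Supplement or overwrite user metadata on an object that is already stored."""
        record = self._records[object_id]
        for key, value in extra.items():
            if key in record["user"]:
                # Drop the stale index entry before recording the new value.
                self._index[(key, record["user"][key])].discard(object_id)
            record["user"][key] = value
            self._index[(key, value)].add(object_id)

    def find(self, key, value):
        """Look up object ids by a user metadata pair without scanning every record."""
        return sorted(self._index.get((key, value), set()))

catalogue = MetadataCatalogue()
catalogue.put("policy-001.pdf", 182_000, "2017-03-01", {"application": "policy-issuer"})
catalogue.add_user_metadata("policy-001.pdf", {"status": "archived"})
print(catalogue.find("status", "archived"))
```

Keeping a separate lookup structure alongside the records is the design choice that matters here: it trades extra work on write for cheap reads, which is the same trade-off a read-optimised table makes.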

Summary

There’s more to scalability than just going big. As object stores become more pervasive, handling small objects will be just as relevant. Scaling means performance and capacity, and should be done with minimal or no end user impact. Today there are no real object storage benchmarks to help customers resolve some of these questions. This is an area the industry needs to address going forward.
