This is the third in a series of posts covering object storage requirements. Other posts in this series can be found at Object Storage Capabilities Series.
Object stores are intended to store large volumes of data. Capacities are usually measured in hundreds of terabytes or petabytes rather than smaller units. As a result, an object store can hold millions or billions of individual objects. Naturally, this creates an issue both for naming those objects and for finding them again in the future.
There are typically two approaches to naming objects. The first is to use a human-readable form, which essentially looks like a file name. This could actually be the name of a file uploaded into the object store, but remember that objects are typically stored in large, flat collections (generally called buckets), so there is no file hierarchy to speak of. Platforms such as AWS S3 allow special characters like “/” in an object name; however, this doesn’t imply any enforced file system hierarchy. Creating unique names can be a problem when there are thousands of objects to be stored, so it’s important to think about a naming standard that avoids name clashes, because storing an object with the same name as an existing one will simply overwrite it.
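As a sketch of the kind of naming standard discussed above, the following Python snippet builds object names from an application prefix, a date-based pseudo-hierarchy using “/” separators, and a random suffix so that two uploads of the same file name can never clash. The field layout (application/category/date) is purely a hypothetical convention, not a platform requirement.

```python
from datetime import datetime, timezone
from uuid import uuid4

def make_object_key(app: str, category: str, filename: str) -> str:
    """Build a collision-resistant object name.

    The "/" separators create a pseudo-hierarchy that tools such as
    the S3 console can display as folders, while the random suffix
    stops a later upload from silently overwriting an existing
    object that happens to share the same file name.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    return f"{app}/{category}/{stamp}/{uuid4().hex}-{filename}"

key = make_object_key("billing", "invoices", "acme-2018-01.pdf")
```

Any scheme along these lines works; the important point is that uniqueness is designed in rather than left to chance.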
The second approach is for the object store itself to issue a system-generated object ID, or OID. An OID is a long string of characters and numbers, pseudo-generated by the object store. We use the term “pseudo” because OIDs may be generated using some information from the object store itself, such as the node on which the data is initially stored or created.
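Python’s standard-library version-1 UUIDs illustrate this “pseudo-generated” idea: each ID embeds a timestamp plus the generating host’s node identifier, so every ID is unique but IDs minted on the same node share a common component. This is only an analogy for how an object store might derive OIDs, not the scheme any particular platform uses.

```python
import uuid

# uuid1() embeds the generating host's node identifier and a
# timestamp, so two IDs minted on the same machine differ overall
# but share their node component -- a simple illustration of an ID
# that carries information about where it was created.
oid_a = uuid.uuid1()
oid_b = uuid.uuid1()

assert oid_a != oid_b            # every OID is unique...
assert oid_a.node == oid_b.node  # ...but the node part is shared
```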
Object name standards will be dictated by the platform. For example, AWS allows object names of up to 1024 bytes long, with many special characters not permitted.
One disadvantage of using OIDs is the need to track exactly what each OID refers to. With millions of objects in a store, no one will ever remember what each individually issued ID was for. To a certain extent this problem also exists with user-assigned names, because inevitably an object store with millions of items will end up with many that have undecipherable names. The answer here is metadata. Metadata is commonly described as “data about data”: essentially, information or attributes stored with an object that help locate that object in the future.
Object stores implement both system metadata and user metadata. System metadata covers attributes such as object size, access permissions, storage tier and date stored. User metadata extends the information stored with each object by adding application-specific information; for example, application name, content format and the user creating the object could all be added as user metadata. Metadata itself is generally specified as a name/value pair and added to an object at creation time. Some platforms allow metadata to be edited; others, such as S3, do not, and force a copy of the object in order to change existing user metadata content. It is possible to add new metadata, though.
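The copy-to-edit behaviour described above can be modelled with a toy in-memory store. This is a deliberately simplified sketch, not any real platform’s API: user metadata is a name/value dictionary fixed at `put` time, and the only way to change it is to rewrite the object via `copy` with replacement metadata.

```python
class ObjectStore:
    """Toy in-memory model of S3-style immutable user metadata."""

    def __init__(self):
        self._objects = {}  # key -> (data, metadata dict)

    def put(self, key, data, metadata=None):
        # Name/value metadata is fixed at creation time.
        self._objects[key] = (data, dict(metadata or {}))

    def head(self, key):
        """Return the stored metadata, as a HEAD request would."""
        return self._objects[key][1]

    def copy(self, src, dest, metadata=None):
        # Changing metadata means rewriting the object as a copy,
        # mirroring platforms that disallow in-place metadata edits.
        data, old_meta = self._objects[src]
        new_meta = dict(metadata) if metadata is not None else dict(old_meta)
        self._objects[dest] = (data, new_meta)

store = ObjectStore()
store.put("report.pdf", b"...", {"app": "finance", "owner": "chris"})
# To change "owner", copy the object over itself with new metadata:
store.copy("report.pdf", "report.pdf", {"app": "finance", "owner": "jane"})
```

The same pattern applies on real platforms, where the copy is a server-side operation rather than a re-upload.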
Most object stores allow large amounts of metadata to be stored with any individual object, and the more detailed the metadata, the better search results will be. With AWS, user metadata is limited to 2KB, passed within the HTTP headers of the PUT request used to store an object.
Retrieving data from an object store means either knowing the name of the object or finding the object through search. With detailed metadata, search can be very specific in nature. The performance of search on object store metadata is a critical capability because scanning individual objects is a totally impractical process. Naturally, search performance should not be a limiting factor on the size of an object store and should scale as the object store grows in size.
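One reason metadata search can scale while object scanning cannot is that the name/value pairs can be inverted into an index that is consulted instead of the objects themselves. The sketch below shows the idea in miniature; production systems use dedicated engines (Cassandra, Elasticsearch and so on, as described next), but the principle is the same.

```python
from collections import defaultdict

def build_index(objects):
    """Invert name/value metadata pairs into a lookup table so a
    search touches only the index, never the stored objects."""
    index = defaultdict(set)
    for key, metadata in objects.items():
        for name, value in metadata.items():
            index[(name, value)].add(key)
    return index

def search(index, **criteria):
    """Return object keys matching ALL supplied name/value pairs."""
    matches = [index.get(pair, set()) for pair in criteria.items()]
    return set.intersection(*matches) if matches else set()

# Hypothetical objects with user metadata attached at creation time.
objects = {
    "a.pdf": {"app": "finance", "format": "pdf"},
    "b.csv": {"app": "finance", "format": "csv"},
    "c.pdf": {"app": "hr", "format": "pdf"},
}
index = build_index(objects)
```

For example, `search(index, app="finance", format="pdf")` intersects two index entries and returns `{"a.pdf"}` without touching any object data.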
Vendors have started to introduce the ability to query metadata using SQL-like functionality. Amazon Athena is an AWS feature for querying data within S3 using a SQL schema. The Zenko multi-cloud object storage controller from Scality has a feature called Clueso that provides high-speed metadata search. Cloudian HyperStore uses Cassandra as the back-end metadata store, with two column families used so that read performance for metadata searches can be optimised. Caringo uses Elasticsearch to query metadata on the Swarm object store platform, as well as a number of operational system metrics.
Metadata, and the ability to search content effectively, is key to delivering a scalable object store. Key questions to ask your vendor include:
- How much metadata can I store, per object?
- Is there an overall system metadata limit?
- What tools are available for querying metadata (including export to search/query tools)?
- What scaling/performance impacts are there with large amounts of metadata and content?
Remember, without good metadata features, an object store is next to worthless for storing unstructured content.
- AWS S3 Object Key and Metadata (AWS documentation, retrieved 10 January 2018)
- Zenko Multi-cloud data controller (Zenko.io website, retrieved 10 January 2018)
- Amazon Athena (AWS website, retrieved 10 January 2018)
Last Review: 10 January 2018
Comments are always welcome; please read our Comments Policy. If you have any related links of interest, please feel free to add them as a comment for consideration.
Copyright (c) 2007-2020 – Post #7096 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.