Amazon S3 Object Lambda provides dynamic access to data

AWS recently announced a new capability for S3 storage called S3 Object Lambda. S3 users can now attach Lambda functions to S3 GET requests, and those functions modify content before it is returned to the calling application. This small feature could prove hugely powerful as a more dynamic way to apply highly granular controls to data access.

Background

S3 was one of the original Amazon Web Services offerings, first released in 2006. We’ve written many things on S3 over the years (see the links at the end of this post) that discuss enhancements and updates to the platform. AWS used to publish details on the volume of data stored in S3, estimated in the trillions of objects in 2012. The service is so popular that the S3 API has become the de facto standard across the industry.

Lambda, the ability to create and run code without managing infrastructure, was announced by AWS in late 2014 and became generally available in 2015. Typical use-cases include responding to events, such as log processing, data triggers in S3 and DynamoDB, or HTTP requests. It’s possible, for example, to build a web service entirely from Lambda functions. The beauty of the solution is in the flexibility to write custom actions in code and only be charged for usage when that code runs.

Lambda support within S3 has been available for some time and provides the capability to trigger a function when a new object is created or an existing one is deleted.
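As a rough illustration of this existing trigger model, the sketch below reads an S3 event notification and works out which objects were created or deleted. The record layout (Records, eventName, s3.bucket.name, s3.object.key) follows the S3 notification message format; the function name summarize_s3_events and the printed message are hypothetical and only here for illustration.

```python
from urllib.parse import unquote_plus


def summarize_s3_events(event: dict) -> list:
    """Return (kind, bucket, key) tuples for each record in an S3 notification."""
    actions = []
    for record in event.get("Records", []):
        name = record["eventName"]  # e.g. "ObjectCreated:Put"
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification payload.
        key = unquote_plus(record["s3"]["object"]["key"])
        if name.startswith("ObjectCreated:"):
            kind = "created"
        elif name.startswith("ObjectRemoved:"):
            kind = "deleted"
        else:
            kind = "other"
        actions.append((kind, bucket, key))
    return actions


def handler(event, context):
    """Lambda entry point: react to each change without altering the object."""
    for kind, bucket, key in summarize_s3_events(event):
        # Placeholder action (an assumption); real code might start a transcode job.
        print(f"{kind}: s3://{bucket}/{key}")
```

Note that the function only reacts to the change; the stored object itself is left as it was.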

Object Triggers

The idea of executing a piece of code in response to objects arriving into an object store isn’t a new one. OpenIO implemented serverless event processing for objects over five years ago. IBM Cloud has similar functionality that monitors for changes to object buckets. Scality implements event management through CloudServer (a Zenko feature).

In the computational storage world, vendors have developed solutions that trigger actions as content is written to physical storage media. We spoke to NGD Systems in 2019 on how their Newport architecture could process incoming data on demand.

In each of these examples, the processing occurs on objects either added to or deleted from a bucket (or SSD). The object itself isn’t changed; however, the functions may choose to create new objects as part of their processing. For example, as a video hits an object store, a trigger function could transcode that data into multiple formats.

Transformation

S3 Object Lambda operates differently to the examples we’ve described in that the Lambda functions have the ability to modify the requested content before returning it to the calling application. This process could amend the data in the following ways:

Change the format or structure. At the most elementary level, this could be a function that transforms XML to JSON or returns a compressed image file from an uncompressed original.
Redact content. Internal details of the returned content can be modified, for example, to hide sensitive information like PII. This process can depend on the authority of the calling application and be validated by a secondary piece of information.
Enhance content. Content could be modified to add watermarks or encrypt the data based on encryption keys already provided by the application.
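To make the first case concrete, here is a minimal sketch of an Object Lambda function that converts a flat XML document to JSON on the way out. It assumes the event carries a getObjectContext section with inputS3Url, outputRoute and outputToken, and that the result is returned through the S3 WriteGetObjectResponse call (write_get_object_response in boto3). The xml_to_json helper and its flat-document assumption are hypothetical, not part of any AWS API.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def xml_to_json(xml_text: str) -> str:
    """Convert a flat XML document (one level of child elements) to JSON."""
    root = ET.fromstring(xml_text)
    record = {child.tag: child.text for child in root}
    return json.dumps({root.tag: record}, sort_keys=True)


def handler(event, context):
    """Object Lambda entry point: fetch the original, transform it, return it."""
    # Imported lazily so xml_to_json can be exercised without the AWS SDK.
    import boto3

    ctx = event["getObjectContext"]
    # inputS3Url is a presigned URL for the original, unmodified object.
    with urllib.request.urlopen(ctx["inputS3Url"]) as response:
        original = response.read().decode("utf-8")

    body = xml_to_json(original)

    # Hand the transformed content back to the caller of the GET request.
    boto3.client("s3").write_get_object_response(
        Body=body.encode("utf-8"),
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"status_code": 200}
```

The stored object is never rewritten; only the response seen by the calling application changes.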

The Lambda functions don’t have to modify the content; they could simply add enhanced security checking or additional auditing and logging, such as registering customer downloads in a secondary database. All of this processing is based on information passed in the originating request, plus any other data sources available to the Lambda function.
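A check-and-audit step of this kind might look like the sketch below. It assumes the event includes userIdentity and userRequest sections, as described in the Object Lambda event context. The TRUSTED_ARNS allow-list, the example ARN and the shape of the audit record are all hypothetical; a real function would write the record to a database or log stream of its choosing.

```python
import json

# Hypothetical allow-list of callers permitted to see unredacted content.
TRUSTED_ARNS = {"arn:aws:iam::111122223333:role/analyst"}


def access_decision(event: dict) -> dict:
    """Build an audit record and decide whether the caller needs redaction."""
    identity = event.get("userIdentity", {})
    request = event.get("userRequest", {})
    caller = identity.get("arn", "unknown")
    return {
        "caller": caller,
        "url": request.get("url", ""),
        # Anyone not explicitly trusted receives the redacted view.
        "redact": caller not in TRUSTED_ARNS,
    }


def audit_line(decision: dict) -> str:
    """Serialise the decision so it can be stored in a secondary system."""
    return json.dumps(decision, sort_keys=True)
```

The design choice here is to default to redaction, so an unrecognised caller never sees more than intended.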

Exits

The implementation of S3 Object Lambda harks back to my mainframe days and IBM’s use of system exits. An exit is a replaceable module that alters the behaviour of some fundamental function within the IBM z/OS operating system. For example, the DADSM pre- and post-allocation exits provide additional validation on the creation and access of files. RACHECK exits enable Systems Programmers to implement custom credentials checking, for example, to hide access to confidential content, even if the user has the OPERATIONS attribute.

Exits are challenging because they are implemented in assembler and loaded dynamically as supervisor-level calls (kernel functions). A badly written exit has the ability to take down the entire system. However, exits can be incredibly useful. I wrote one in the early 1990s for the IBM InfoMan (Information/Management) system that enabled processing to be managed via a script rather than rewriting assembler code. I subsequently sold the code to a UK financial institution.

Why Is Object Lambda So Important?

The features offered by S3 Object Lambda are deceptively important as we look at data mobility and hybrid models of access. There are several aspects to this.

Dynamic – the dynamic nature of making decisions in code allows much greater flexibility than could otherwise be achieved. For example, data can be kept unencrypted (from the user perspective) on an S3 bucket and encrypted on demand as it is accessed. If that user is then blocked, only that one specific encryption key has to be dropped. Lambda functions can respond dynamically based on other actions, changes in the environment or the request of the user.
Granularity – Object Lambda offers the ability to modify data at the sub-object level. For larger, more complex objects, this is a powerful way to share data without incurring large amounts of duplication. For example, human genome data could be redacted to obfuscate specific components. Today this is generally (with some exceptions) a manual process that can be impractical with multi-gigabyte files.
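Sub-object redaction can be sketched as a pure function that masks selected fields in a structured record and leaves everything else untouched. The function name redact_fields, the tuple-based field paths and the sample record are hypothetical; real genome formats are far more involved than a nested dictionary, so treat this purely as an illustration of the idea.

```python
import copy


def redact_fields(record: dict, paths: list, mask: str = "[REDACTED]") -> dict:
    """Return a copy of record with the value at each field path masked.

    Each path is a tuple of keys, e.g. ("donor", "name"). Paths that do not
    exist in the record are ignored.
    """
    result = copy.deepcopy(record)  # never modify the stored original
    for path in paths:
        if not path:
            continue
        node = result
        for key in path[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        else:
            # Reached the parent of the target field without a missing key.
            if isinstance(node, dict) and path[-1] in node:
                node[path[-1]] = mask
    return result
```

A function like this could run inside an Object Lambda handler, so a single stored copy serves several audiences with different views.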

The benefit of dynamic redaction is a topic we’ve covered before – see this post looking at the Hedvig acquisition by Commvault. We’ve also talked about the need for data-level APIs in storage with this post back in 2019. Object Lambda is one way to address these requirements.

Gaps

Where are the gaps in the S3 Object Lambda model? First, there’s the obvious lock-in. The APIs offered by Object Lambda are AWS proprietary, and this capability isn’t widely available from other object storage vendors.

Second, there’s the point at which the processing occurs. The Lambda function is a one-time execution process. If the security model changes after the event, for example, the application or end-user already has the data, and the new rules can’t be applied retrospectively. This scenario may be a challenge in environments where data is copied for sharing rather than accessed from an object bucket, so as usual, the usage model affects the choice of solution.

The Architect’s View™

Data usage has always been dynamic and processed in response to the needs of the application and end-users. In an IT world that is increasingly defined by code, setting data policies with the flexibility offered by Object Lambda is a move in the right direction. Today, all of the power to provide dynamic access is centralised within S3. We need to see API standards developed that can expand this ecosystem to file systems and offload some of the processing to the client, so policies can be applied after the data leaves an object store.