Data-Centric Spotlight: PetaGene

Our ongoing series on data-centric architectures recently discussed data mobility and the ability of S3 Object Lambda to address dynamic content access. In this article, we take a look at PetaGene’s PetaSuite Cloud and Protect platforms as another way to securely access content and accelerate remote access.

Background

Few achievements in human history compare with the identification and mapping of our own DNA. The Human Genome Project, started in 1990 and completed in 2003, represents the culmination of decades of work, including the pivotal 1953 discovery of DNA's double-helix structure.

The human genome contains around 3 billion base pairs (pairs of complementary nucleobases) spread across 23 chromosomes. Using a simple encoding, where 2 bits represent each of the four nucleobases (Cytosine, Guanine, Adenine and Thymine, usually written by their initial letters), a single genome would occupy around 750MB: 3 billion bases at 2 bits each is 6 billion bits. Of course, we know from typical semi-structured media formats such as video and audio that a raw bit mapping alone isn’t sufficient to document complex information. As a result, the bioinformatics industry has developed formats such as FASTA, FASTQ and SAM/BAM. These de facto standards add metadata and quality information to supplement the raw encoding.
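The 2-bit encoding and the 750MB figure can be checked with a minimal sketch. The bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions made for illustration; they are not taken from any genomic file format.

```python
# Assumed, arbitrary mapping of each nucleobase to a 2-bit code.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string equals the 2-bit code


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_bases)
    )


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed to hold n_bases at 2 bits per base, rounded up."""
    return (2 * n_bases + 7) // 8


print(unpack(pack("GATTACA"), 7))        # GATTACA
print(packed_size_bytes(3_000_000_000))  # 750000000
```

The packed form has to carry the sequence length separately, because the final byte may be only partly used. That is one small example of why real formats add metadata on top of the raw encoding.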

Each person has an individual genome, and the files that describe it can run to gigabytes in size. This makes manipulating these file objects a challenge, both in terms of storage capacity (achieving efficient compression) and performance (achieving quick compression/decompression) when reviewing large sets of data.

One other aspect to bear in mind as we dig into the PetaGene technology is the requirement to provide secure access when sharing genomic data. This can be specific to individual sequences rather than the entire genomic file. As with any valuable asset, the access granted to content may change over time.

PetaGene Platforms

PetaGene has three main solutions.

PetaSuite – efficient compression and distribution of genomic data.
PetaLink Cloud Edition – virtual file access to cloud object stores with data efficiency.
PetaSuite Protect – encrypted data access management system.

Compression improves markedly when the compressor understands the format of the data it is handling. By exploiting this application awareness, PetaSuite achieves better compression than general-purpose tools such as gzip. In addition, the compression modules are pluggable and can be adapted for a wide range of industry use-cases.
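To illustrate why format awareness helps, the sketch below compares a general-purpose compressor (zlib, the algorithm behind gzip) with a trivial 2-bit packing on a synthetic random sequence. This is not PetaSuite's algorithm, which isn't described here; the function names, sequence length and seed are assumptions, and real genomic files also carry quality scores and metadata that this toy ignores.

```python
import random
import zlib


def pack_2bit(seq: str) -> bytes:
    """Format-aware toy encoder: four A/C/G/T bases per byte."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= codes[base] << (2 * (i % 4))
    return bytes(out)


def compare(n_bases: int = 100_000, seed: int = 0) -> dict:
    """Return byte counts for raw ASCII, zlib and 2-bit packed forms."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    seq = "".join(rng.choice("ACGT") for _ in range(n_bases))
    raw = seq.encode("ascii")
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),
        "packed": len(pack_2bit(seq)),
    }


sizes = compare()
print(sizes)
```

The general-purpose compressor has to discover that only four symbols occur, whereas the format-aware encoder knows it in advance. On random input like this, the packed form comes out smaller than the zlib output.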

PetaLink Cloud Edition extends a local file system to provide remote access to object storage buckets. This capability includes compression and efficient data transfers.

PetaSuite Protect provides a centralised data access management solution, where individual pieces of content and sub-components of the content are assigned access rights. Permissions can be customised per individual on a project basis and revoked at any time.

Implementation Specifics

The key technology underlying the PetaGene solution is the use of LD_PRELOAD in Linux. The LD_PRELOAD environment variable tells the dynamic linker to load nominated shared libraries ahead of all others, so that functions defined in those libraries take precedence over the standard library functions of the same name. We’ve discussed this technique most recently in conjunction with MemVerge, where LD_PRELOAD is used to intercept memory management functions.
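LD_PRELOAD itself acts on native shared libraries, so it can't be reproduced directly in a short script. As a loose analogy only, the following sketch shows the same interposition pattern inside a Python process: a wrapper replaces the built-in open function, records the call and then delegates to the original, much as a preloaded shim would look up the real function with dlsym(RTLD_NEXT, ...). The names interpose_open and demo are hypothetical, and nothing here reflects PetaGene's code.

```python
import builtins
import contextlib
import os
import tempfile


@contextlib.contextmanager
def interpose_open(log: list):
    """Temporarily replace builtins.open with a wrapper that logs each call."""
    real_open = builtins.open  # keep a handle on the original function

    def shim(file, *args, **kwargs):
        log.append(str(file))  # extra behaviour added by the shim
        return real_open(file, *args, **kwargs)  # delegate to the original

    builtins.open = shim
    try:
        yield
    finally:
        builtins.open = real_open  # always restore the original


def demo():
    """Write a file normally, then read it back through the shim."""
    log = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w") as fh:
            fh.write("ACGT")
        with interpose_open(log):
            with open(path) as fh:  # unmodified calling code
                data = fh.read()
    return data, log


data, log = demo()
print(data, len(log))
```

The calling code is unchanged, which is the point of the technique: the application keeps using its normal file functions while the shim decides what happens underneath.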

Without the preloaded PetaGene libraries, any PetaSuite-managed data appears as an encrypted file. With PetaSuite in place (and the correct access token), the data appears unencrypted and is freely usable.

There are multiple benefits to this implementation model:

Data is always encrypted at rest and only decrypted as it is accessed. Of course, a user could copy the decrypted data elsewhere, but that risk is mitigated by the second benefit.
Any data access function is auditable. As each file access is processed through the PetaSuite software, a detailed audit trail is collected and centrally stored.
Entire files or file fragments can be selectively opened up for access. This feature is highly desirable for data types like genomic data, where sharing may be limited to specific sequences.
Access can be revoked at any time. Credentials are constantly validated, and data is never left unencrypted on disk.
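The four benefits above can be sketched together in a toy model. Everything in it is hypothetical: the ProtectedFile class, its methods and the byte-range grants are illustrative assumptions rather than the PetaSuite Protect interface, and the XOR keystream is a stand-in for real encryption that should never be used to protect data.

```python
import hashlib


def _keystream(key: bytes, start: int, length: int) -> bytes:
    """Toy keystream from SHA-256 over a block counter (illustration only)."""
    out = bytearray()
    block = start // 32
    skip = start % 32
    while len(out) < skip + length:
        out += hashlib.sha256(key + block.to_bytes(8, "big")).digest()
        block += 1
    return bytes(out[skip:skip + length])


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


class ProtectedFile:
    """Holds only ciphertext; decrypts granted byte ranges on each read."""

    def __init__(self, plaintext: bytes, key: bytes):
        self._key = key
        self._ciphertext = _xor(plaintext, _keystream(key, 0, len(plaintext)))
        self._grants = {}  # user -> list of (start, end) byte ranges
        self.audit = []    # (user, start, end, allowed) for every attempt

    def grant(self, user: str, start: int, end: int) -> None:
        self._grants.setdefault(user, []).append((start, end))

    def revoke(self, user: str) -> None:
        self._grants.pop(user, None)

    def read(self, user: str, start: int, end: int) -> bytes:
        # Permissions are checked on every read, so revocation is immediate.
        allowed = any(s <= start and end <= e
                      for s, e in self._grants.get(user, []))
        self.audit.append((user, start, end, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not read bytes {start}-{end}")
        chunk = self._ciphertext[start:end]
        return _xor(chunk, _keystream(self._key, start, len(chunk)))


pf = ProtectedFile(b"ACGTACGTTTGACCA", b"demo-key")
pf.grant("alice", 4, 10)        # share one fragment, not the whole file
print(pf.read("alice", 4, 8))   # b'ACGT'
pf.revoke("alice")              # any later read raises PermissionError
```

Because decryption happens per read and nothing is written back in the clear, revoking a grant takes effect on the next access, and every attempt, allowed or not, lands in the audit list.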

Data-Centric Principles

Looking back at our data-centric principles, we can see that the PetaSuite and PetaLink solutions provide a constant metadata view but don’t directly offer a global file system (and aren’t designed as such). The solutions do deliver a standardised and consistent security model. Application-awareness is built-in and customisable for data mobility and efficiency (compression/decompression). Abstraction from physical media is provided through the modification of file system function calls and in the protocol translation from object to file.

The Architect’s View

The big question for PetaGene is how widespread and applicable the technology can become. The use of LD_PRELOAD means the solutions are currently limited to Linux systems, so a Windows implementation would be a good move. The ability to protect data to the extent that encrypted copies can be freely circulated is a powerful option that would be applicable across multiple industries.

I’m most interested in the application awareness capabilities, both from a data efficiency standpoint and for data mobility over distance. As we grapple with the idea of sharing petabytes of data over a wide area, a golden repository of data starts to become a reality. At present, all the pieces are available to build a logical repository. We just have to do the assembly work ourselves. As vendors co-operate to integrate their solutions, this task should become easier over time.

Disclaimer: Chris M Evans is an advisor to PetaGene Ltd.
