Cloud Data Migration – Database Import Tools

Chris Evans | Cloud Storage, Databases

This is one of a series of posts discussing techniques for moving data to and from public cloud.  Previous posts have covered physical appliances and the shipping of virtual machines.

Amazon Web Services

Another option for moving data into the public cloud, specifically structured data, is to use a database import service.  Amazon Web Services (AWS) provides the Database Migration Service (DMS), a set of tools that can import data to/from on-premises locations (or other providers, as long as there's a public IP address).  DMS runs within a dedicated instance and can move data in either homogeneous (like to like) or heterogeneous (like to unlike) modes.  For transfers from one database to another, the process is pretty simple: create a DMS replication instance and point it at the source and target servers.  Replication instances don't show up in EC2, but are simply standard instances (in this case dms.t2.medium) which you pay for by the hour.  There's a charge for the instance itself, for any storage needed to cache data during the transfer and, obviously, for network traffic.
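If you want to script that first step rather than click through the console, something like the following boto3 sketch would create the replication instance.  The identifier, region and sizing below are placeholders, not necessarily what you'd use.

```python
import boto3

# Hypothetical names and region; substitute your own.
dms = boto3.client("dms", region_name="eu-west-1")

response = dms.create_replication_instance(
    ReplicationInstanceIdentifier="blog-migration",
    ReplicationInstanceClass="dms.t2.medium",  # billed by the hour
    AllocatedStorage=50,                       # GB used to cache data in flight
    PubliclyAccessible=True,                   # needed to reach an external source
)
print(response["ReplicationInstance"]["ReplicationInstanceStatus"])
```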

Trying it Out

I created a DMS replication instance and used it to copy data from this blog (running MariaDB) into an RDS instance running the Aurora engine.  Obviously there are a few things around security that need to be established: the source database needs an open firewall connection on port 3306 and credentials assigned for the replication task.  Similarly, the target database needs credentials too, which are established when the RDS instance is created.
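Both ends are described to DMS as endpoints.  The sketch below shows roughly what those definitions look like in boto3; the hostnames, users and passwords are placeholders rather than the values used for this blog.

```python
import boto3

dms = boto3.client("dms", region_name="eu-west-1")

# Source: the externally reachable MariaDB server behind the blog.
source = dms.create_endpoint(
    EndpointIdentifier="blog-mariadb-source",
    EndpointType="source",
    EngineName="mariadb",
    ServerName="blog.example.com",
    Port=3306,
    Username="dms_user",
    Password="********",
)

# Target: the RDS instance running the Aurora engine.
target = dms.create_endpoint(
    EndpointIdentifier="blog-aurora-target",
    EndpointType="target",
    EngineName="aurora",
    ServerName="blog.cluster-abc123.eu-west-1.rds.amazonaws.com",
    Port=3306,
    Username="admin",
    Password="********",
)
```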

The whole process took me around 30 minutes and went smoothly, apart from issues with networking access for the DMS instance, which needs to be connected to an Internet Gateway with external routing defined.  Replication can be set up between any source and target, and because the definitions are all DNS-name or IP-based, I would expect the service could easily be used for non-AWS transfers; however, this is not a supported configuration: at least one of the source/target databases must be in RDS or EC2.  Replication can be left in place and continually updated, so the solution could be used for DR, albeit with the overhead of paying for the DMS instance, although multiple tasks can be run from one replication instance.
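The replication itself is defined as a task on the instance.  A rough sketch of a task that does a full load and then keeps applying changes (the ongoing mode the DR scenario above relies on) might look like this; the ARNs are placeholders.

```python
import boto3
import json

dms = boto3.client("dms", region_name="eu-west-1")

# Select every table in every schema; tighten the rules for a partial migration.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="blog-full-load-and-cdc",
    SourceEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy plus ongoing changes
    TableMappings=json.dumps(table_mappings),
)
```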

Unlike Database Types

[Image: Schema conversion tool options]

For heterogeneous environments where the source and target databases are different, AWS offers a Schema Conversion Tool (SCT) to ensure any views, stored procedures and functions are handled correctly between the two platforms.  AWS maintains a list of the currently supported platform combinations (correct at the time of publication, although if you're reading this post weeks or months later, check this link for an updated list).  SCT is available for a range of platforms (Mac, Windows, Linux) and is downloaded from AWS.

Overall I'd say DMS is pretty easy to use.  Replication tasks can be created via API, so a degree of automation can be put in place.  Obviously, coming back on-premises (or out of AWS) means building out the reverse replication procedures, as there is no bi-directional support.  This could be quite tedious to implement, because endpoints can only be a source or a target and not both, so duplicate definitions are needed.  Currently DMS supports Oracle, Microsoft SQL Server, MySQL, MariaDB, PostgreSQL, MongoDB and SAP ASE as sources.  Specific versions can be found here (link).
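The same APIs can be used to monitor progress as part of an automated workflow; a small boto3 sketch (no specific task names assumed) might poll task status like this:

```python
import boto3

dms = boto3.client("dms", region_name="eu-west-1")

# List all replication tasks on the account and report their progress.
for task in dms.describe_replication_tasks()["ReplicationTasks"]:
    stats = task.get("ReplicationTaskStats", {})
    print(task["ReplicationTaskIdentifier"],
          task["Status"],
          stats.get("FullLoadProgressPercent", "n/a"))
```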

Microsoft Azure

Microsoft's offerings for getting data into Azure aren't quite as slick.  The options are either to manually export data using standard tools or to "stretch" an on-premises database to Azure, effectively archiving cold or warm data to a cheaper tier of storage.  Looking first at the manual process, a lot of the prerequisite tasks for migration can be performed with the Data Migration Assistant (DMA).  This is a software tool that runs close to the source SQL Server database and performs a number of validation checks on the source data against the managed Azure platform.  The latest release of DMA (3.2, released 25 August 2017) also appears to perform the migration itself, although I haven't yet tested it.  Alternatively, the source SQL database can be exported to a BACPAC file, which is then shipped to Azure and imported manually.
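As a rough sketch of that export step, the BACPAC can be produced with the SqlPackage utility (here wrapped in Python for consistency with the other examples); the server, database and credentials are placeholders, and this assumes SqlPackage is installed and on the path.

```python
import subprocess

# Export the source database to a BACPAC file ready for upload to Azure.
subprocess.run([
    "sqlpackage",
    "/Action:Export",
    "/SourceServerName:onprem-sql01",   # placeholder server
    "/SourceDatabaseName:SalesDB",      # placeholder database
    "/SourceUser:migration_user",
    "/SourcePassword:********",
    "/TargetFile:SalesDB.bacpac",
], check=True)
```

The resulting .bacpac file is then uploaded and imported into the Azure SQL database.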

Obviously this manual migration process has one big flaw: from the point the export is taken, the source database needs to be down or in read-only mode to ensure updates aren't lost.  There is no obvious incremental update feature here (other than manually tracking the changes with sqlpackage), which makes the whole process a lot more complicated than with AWS.

Google Cloud Platform

GCP supports structured data through Cloud SQL, including MySQL and PostgreSQL.  There are also other options available such as Cloud Bigtable and Datastore for NoSQL and Cloud Spanner for large-scale mission critical databases that need to span multiple data centres and geographies.  Like Azure, GCP data migration is basically manual, with even fewer options and features to support the migration workflow.  The recommended process from Google is to create either a CSV or dump file from the source database and upload this to a Cloud Storage bucket.  From here the data can be imported using the Import option on the Cloud SQL Instances page of GCP or through the CLI.
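As a sketch of that final import step, the same operation can be driven programmatically through the Cloud SQL Admin API; the project, instance, bucket and database names below are placeholders, and this assumes the dump file has already been uploaded to the bucket.

```python
from googleapiclient import discovery

# Build a client for the Cloud SQL Admin API (uses application default credentials).
sqladmin = discovery.build("sqladmin", "v1beta4")

body = {
    "importContext": {
        "fileType": "SQL",                                # or "CSV"
        "uri": "gs://my-migration-bucket/blog-dump.sql",  # placeholder bucket/object
        "database": "blog",                               # placeholder target database
    }
}

operation = sqladmin.instances().import_(
    project="my-project",
    instance="my-cloudsql-instance",
    body=body,
).execute()
print(operation["status"])
```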

The Architect’s View

If these tools are aimed at seeding environments such as test/development, then networking and storage costs could be quite high, because there's no inherent de-duplication or other optimisation in place.  Twenty 100GB databases will be charged as 2TB of capacity, even if 99% of the data is the same.  That's a side effect of using native cloud technologies, where the cloud vendor benefits from most of the storage optimisations.

Getting structured data into the public cloud is still a fairly manual process, although AWS has the best set of vendor tools to date, as well as the widest platform support.  Obviously there's nothing to stop end users from using the native replication tools of the data platform itself (including replicating to an instance running the database directly), but that still means a lot of manual setup that you'd hope the cloud provider tools would avoid.  GCP and Azure still have some way to go, it seems; however, if you have seen other tools or techniques offered by the cloud providers (not 3rd party), then do let me know.

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2017 – Post #66F9 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.