Use EMR Spark Relational Cache to Synchronize Data Across Clusters

Much like how you can use relational cache to accelerate EMR spark in data analysis, you can also use EMR Spark relational cache to synchronize data across clusters. This article shows you how.

Before we get too deep into things, let’s first discuss a few things. First of all, let’s look into what Relational Cache actually is. Well, it works as an important feature in EMR Spark that accelerates data analysis mainly by pre-organizing and pre-computing data and provides functionality similar to the materialized view in traditional data warehousing. In addition to speeding up data processing, Relational Cache can also be applied in many other scenarios. This article will mainly describe how to use Relational Cache to synchronize data tables.

Managing all data through a unified data lake is a goal for many companies. However, the existence of different data centers, network regions, and departments in reality inevitably leads to different big data clusters, and data synchronization across different clusters is a common need. In addition, synchronization of older and newer data after cluster migration and site migration is also a common problem. Data synchronization is very complicated and a laborious task. Lots of customized development and manual intervention are required for tasks. This includes migration tool development, incremental data processing, synchronous read and write, and subsequent data comparison. With all of these problems and needs considered, Relational Cache can simplify data synchronization and allow you to implement cross-cluster data synchronization at a relatively low cost.

In the following section, a specific example is given to show how to use EMR Spark Relational Cache to implement cross-cluster data synchronization.

Use Relational Cache to Synchronize Data

Assume that we have two clusters (for convenience purposes, let’s call them Cluster A and B) and you need to synchronize the data in the activity_log table from Cluster A to Cluster B. Also, another thing is that, during the migration process, new data is continuously inserted into the activity_log table. Create an activity_log table in Cluster A:

Now, insert two entries as historical information:

Then, create a Relational Cache for activity_log:

In the above code, note that REFRESH ON COMMIT indicates that the cache data will be automatically updated when the source table is updated. Next, LOCATION specifies the storage location of the cache data. We point the cache location to HDFS in Cluster B to synchronize data from Cluster A to Cluster B. The Cache field and Partition information remain the same as the source table.

In Cluster B, we also create an activity_log table:

Following this, run MSCK REPAIR TABLE activity_log to automatically repair the corresponding meta information, and then run the query statement. We can see that the two entries inserted into the table in Cluster A can be queried in Cluster B.

Continue to insert new data in Cluster A:

Run MSCK REPAIR TABLE activity_log in Cluster B, and then query the activity_log table again. The result shows that the newly inserted data has been automatically synchronized to the activity_log table in Cluster B. For a partition table, when new partition table data is inserted, Relational Cache can synchronize new partition table data incrementally instead of synchronizing all the data again.

If the new data in activity_log of Cluster A is added by using Hive or other means to import to Hive tables externally instead of being inserted through Spark, you can run the REFRESH TABLE activity_log_sync statement to trigger data synchronization manually or through scripts. If new data is bulk exported by partitions, you can also use statements like REFRESH TABLE activity_log_sync WITH TABLE activity_log PARTITION (d_year=2018) to synchronize partition table data incrementally.

Relational Cache can ensure the data consistency of the activity_log tables in Cluster A and Cluster B. Downstream tasks or applications that are dependent on the activity_log table in Cluster A can be switched to Cluster B at any time. At the same time, you can suspend applications or services that write data to the activity_log table in Cluster A at any time, point them to the activity_log table in Cluster B and restart them to complete migration of upper-layer applications or services. After the migration is completed, you only need to clear activity_log and activity_log_sync in Cluster A.


In this article, we have described how to use Relational Cache to easily synchronize data in data tables of different big data clusters. In addition to this application scenario, Relational Cache can be applied in many other scenarios, for example, creating OLAP platforms with response in seconds, interactive BI, as well as Dashboard applications and accelerating ETL. In the future, we will share the Relational Cache best practices as it applies to more scenarios.

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store