Most people who use cloud services use more than one product.These products can be from different cloud providers or may vary in terms of the underlying technology.
Therefore, it has become a basic necessity for users to transfer data among multiple cloud products.
• 1. A user migrates data from the Oracle database in an offline data center to RDS PPAS on the cloud.
• 2. A user uses RDS MySQL as the database for business transactions and HybridDB for PostgreSQL for their data warehouse.
• 3. A user imports massive data from ODPS into HybridDB for PostgreSQL for real-time analysis.
Data migration to the cloud is an inevitable necessity for the above situations. This article will present possible solutions for the above scenarios that I hope will prove helpful as you use cloud products.
Before we even begin migrating data, we need to gain a certain understanding of the related hardware so that we can find the most appropriate solutions.
1. Same zone
If your data is on the cloud and in the same zone as the destination system, congratulations! You’re in the ideal situation. Do your best to only migrate data within the same zone.
Current cloud products are commonly configured with Ethernet connections of at least 1 gigabit, and usually around 10 gigiabits. Intra-zone data migration has low latency, high bandwidth, and passes through the minimum required number of VSwitches, making throughput speeds ideal.
Therefore, we should do our best to make sure that the backend database, front-end ECS, and the OSS (which stores massive amounts of data) are in the same zone.
2. Cross-zone deployment and deployment in zones between cities
A portion of customers with high availability demands tend to choose multi-zone deployment within a single city, or even deployment in zones across cities. Furthermore, Alibaba Cloud has many data products to support native multi-zone deployment solutions.
Alibaba Cloud zones in the same city or between cities are connected through a leased line network. While transferring data under these conditions may not be as efficient as transferring within the same zone, network quality is still quite high.
Network communication efficiency is best in the following conditions (listed in descending order):
The same zone > Multi-zones in the same city > Multi-zones between cities
(Within East China I Zone B) > (Between East China I Zone B and East China I Zone C) > (Between East China I Zone B and North China I Zone B)
3. Internet and VPN network
This is the least efficient and unfortunately most common scenario for data migration to the cloud, as described in the background reading. The reason is that the data is transferred over a physical channel that is public and cannot be controlled. Latency is usually high and quality can fluctuate.
We can use software to make up for these deficiencies. We usually recommend that users select software and services with the following features.
•Has a support mechanism, supports resumable data transfers, and prevents large tasks from failing due to a single failure.
•Allows for high concurrency to increase throughput.
•Uses incremental data migration to reduce service downtime.
Now let’s talk about the data format in data exchanges.
There are usually two methods to transfer data between different data products:
1.Data migration on the cloud
Connect the software or service simultaneously to the source and target databases, pull data from the source database, convert it to the format recognized by the target database, and then immediately write it to the target database.
This method does not require an intermediate data storage space, but does require high network quality. If the data is too big, e.g. at the petabyte level, migration could take a very long time.
Alibaba Cloud open-source product rds_dbsync, CDP (dataX), and Data Transmission Service (DTS) all fall into this category.
2. Data migration using a common file format
If your data size is large, e.g. a data warehouse with dozens of terabytes of data, it is recommended to use offline migration to transfer your data.
Offline migration refers to the process of exporting all data from the source database into a common data format, and then importing it to the target database.
Offline data migration offers the following advantages over migration on the cloud:
•Data exported offline is usually compressed by a ratio of between 1:2 and 1:5, which can significantly save network expenses and improve overall efficiency.
•Offline migration can easily be done through parallel processing, which is the most effective way to increase efficiency.
For file-based data migration, data format is key. Data formats serve to clearly express the method by which data is organized in a file.
The following are the most commonly used file formats: XT/CSV, TSV, ORC, Parquet, protobuf, etc.
Some of these data formats, specifically ORC and Parquet, are data compression formats. For uncompressed data, e.g. CSV, we can select our preferred data compression format, e.g. gzip, bzip2, snappy, etc.
3. Data transfer through TEXT/CSV files
•For structured data, the ideal data format is CSV. CSV is a commonly used, simple and transparent data format. For more information, refer to: Common Format and MIME Types for Comma-Separated Values (CSV) Files
•PostgreSQL CSV parameters are specified in reference document 2. Applicable to the community and Alibaba Cloud’s PostgreSQL, Greenplum, and HybridDB for PostgreSQL.
CSV is a relatively simple text format, the advantage being that it has clearly defined semantics, which can easily be applied to even complex scenarios.
•The line delimiter of CSV files is ’n’, or the line break
•DELIMITER is the delimiter that defines columns
o When user data contains DELIMITER, QUOTE need to be used as well.
o The recommended column delimiters are ‘,’, ‘t’ , ‘|’ or other uncommon characters.
•The QUOTE character is used to enclose user data that contains special characters by column
o Strings that contain special characters will be enclosed by QUOTE to differentiate user data and the control characters.
o If not necessary, e.g. with integers, the data will not be enclosed by QUOTE characters (to optimize efficiency).
o The QUOTE cannot be the same as the DELIMITER, and double quotation marks are the default QUOTE.
o If the user data contains QUOTE characters, they need to be differentiated with an escape character.
•ESCAPE special characters
o Place an escape character before a special character that needs to be escaped to indicate that it is not to be counted as a special character.
o The default ESCAPE character is the same as the QUOTE character, that is, double quotation marks.
o You can use’’(default escape character for MySQL) or another character..
The following are the default control characters for TEXT and CSV files.
All control characters must be single-byte characters
4. Data transmission through Object Storage Service (OSS)
Similar to S3 of AWS, Object Storage Service (OSS) is a low price storage service that opens the communication channels between all cloud products. It is recommended for use in transferring massive data.
OSS supports transferring data between zones (cross-zone data replication), allowing users to efficiently transfer data from one zone to another.
So far, both PostgreSQL and HybridDB for PostgreSQL on the cloud support reading and writing the OSS data source.
We hope that the simple tips and tricks for data transfers on the cloud contained in this article make your experience on the cloud more productive and enjoyable! Our products are always evolving, we appreciate you feedback and questions as they help drive constant improvement.
1.Common Format and MIME Type for Comma-Separated Values (CSV) Files
3.PostgreSQL + OSS oss_fdw
4.HybridDB for PostgreSQL COPY
5.HybridDB for PostgreSQL + OSS oss_ext
6.Alibaba Cloud open-source software rds_dbsync
7.SLS supports delivery of CSV data to OSS