Fast and Cost-effective MySQL Analysis Solution Based on Alibaba Cloud DLA
By Shen Hong and Zuoshang
As the most popular open-source database, MySQL is used by many enterprises in various business systems. In addition to online business logic reading and writing, MySQL databases require data analysis, such as BI reports, visual dashboards, and big data applications. However, MySQL databases are struggling to cope with data analysis needs due to the MySQL architecture. The industry offers many solutions to solve this problem. Data Lake Analytics (DLA) is a good choice for low-cost scenarios. This article first introduces the DLA solution for readers who might be unfamiliar with it. The DLA service is a serverless, interactive, and federated query service. You may use standard SQL statements to analyze and integrate data from data sources, such as Object Storage Service (OSS) devices, databases (such as PostgreSQL and MySQL), and NoSQL, such as Table store.
MySQL databases are widely used in various business systems. In addition to online business logic reading and writing, MySQL databases require data analysis, such as BI reports, visual dashboards, and big data applications. When the data volume of a single MySQL database reaches a certain level, using MySQL for data analysis leads to poor performance, which will affect the read and write performance of online businesses. In this case, we need a new data analysis solution.
MySQL data needs to be analyzed with log data. Some companies use open-source big data systems (such as Hive, Hadoop, and Spark) to build data warehouses. This solution works but involves extremely high human and resource costs. We must find a way to analyze MySQL data with data from other systems at a low cost.
When the data volume in a MySQL database exceeds the limit of a single database instance, database administrators (DBAs) split the data in a database into multiple tables in multiple databases. This makes data analysis very complicated. Therefore, a new analysis solution is required.
Solution Evaluation Factors
Which of the following factors must be considered to resolve the problems in the MySQL analysis scenarios? Which solution should we select? The key factors to keep in mind while answering these questions include:
1) Costs: The costs here involve the cost in time, labor, and risk. While selecting a solution, consider its cost-effectiveness.
2) Capabilities: Capabilities include functions and performance. In terms of functions, check whether the solution provides a complete set of analysis capabilities and scalability. In terms of performance, check whether the solution meets the requirements for timeliness and parallelism, especially in the case of massive data.
3) Maintainability: A good product is maintainable. This means you can use it in a simple way, and when a problem occurs, it is easy to troubleshoot.
4) Ease of Use: The product is easy to use. You can use data analysis services easily.
A variety of solutions are available for MySQL data analysis, including direct analysis on read-only MySQL instances, user-created open-source data warehouses, and DLA construction solutions. Let’s take a closer look at the advantages and disadvantages of these solutions.
Analysis Based on Read-only MySQL Instances
Purchase an additional server to build a read-only secondary MySQL instance, and then perform data analysis based on the read-only instance. The advantages and disadvantages of this solution are as follows:
- This solution cannot meet the requirements of scenarios 2 and 3. Even in scenario 1, when the data volume increases, the analysis performance based on read-only instances is very poor (see the following SQL statement time consumption of Transaction Processing Performance Council Benchmark H (TPC-H) 10 GB data).
- The costs of additional read-only instances are high.
- This solution is simple and prevents impacts on online businesses. It has good usability and compatibility.
User-created Open-source Data Warehouse
Use open-source big data systems such as Hive, Hadoop, and Spark to build a data warehouse. Then, synchronize data from MySQL databases to the data warehouse and analyze the data based on Spark or Hive.
- Poor Usability: Open-source big data systems are difficult to learn and must be operated and maintained by dedicated big data engineers. Sqoop-based data synchronization does not support table structure changes, and adding or deleting columns will cause synchronization failures.
- Highest Cost: You need to purchase additional servers to build a system, which increases the hardware costs. The overall cost of this solution is the highest.
- This solution solves the problems of scenarios 1 and 2 with good analysis performance.
Use open-source or commercial AnalyticDB to synchronize data through data synchronization tools, and then perform SQL-based data analysis.
- This solution is less maintainable and requires dedicated O&M personnel.
- Additional resources need to be purchased, leading to high costs.
- This solution supports the analysis of large amounts of data.
DLA Construction Solution
Alibaba Cloud DLA construction solution supports MySQL data analysis at a low cost.
- Easy to Use: You can easily synchronize the entire database to DLA in one click.
- Strong Analysis Capabilities: DLA provides the same user experience as MySQL. The increase in data volume has little impact on analysis performance.
- Extremely Low Cost: You do not need to purchase any servers. You are billed based on the number of query requests, and no maintenance costs are charged.
- No Impact on Source Databases: Data analysis has no impact on online businesses.
Evaluation Data and Technical Principles of the DLA Construction Solution
Let’s take a closer look at the evaluation data and technical principles of the DLA construction solution.
Low Cost and High Performance
As shown in the following figure, the monthly subscription fee for an additional high-performance ApsaraDB RDS for MySQL instance is RMB 2,344. We can use TPC-H with 10 GB of data as an example. If you run 22 SQL statements for TPC-H once a day, the monthly cost of using DLA is only RMB 26.64, an average daily cost of less than RMB 1. You only need to pay 1% of the cost for high-performance analysis. In addition, the columnar storage of DLA only uses 3 GB, while the native MySQL storage may need about 20 GB.
After the DLA construction solution synchronizes data from the source database, the data is stored in the compressed column-storage format. For example, 10 GB of TPC-H data consumes about 20 GB of storage space in MySQL, but only about 3 GB of storage space in the compressed column-storage format. DLA helps achieve efficient analytics at a very low cost. We can use TPC-H with 10 GB of data as an example. It takes 5.5s to run the 22 SQL statements for TPC-H in DLA, but 345.5s in MySQL with four SQL statements failing. The following figure compares the time consumed by 22 SQL statements for TPC-H in MySQL and DLA.
Ease of Use
Compatibility with Multiple Data Sources
The DLA construction solution supports multiple data sources, including user-created MySQL, SQL Server, PostgreSQL, Oracle, ApsaraDB RDS for MySQL, Apsara PolarDB, and AnalyticDB data sources. Being different from traditional data warehouses, DLA is simple to use. Synchronize data to DLA through simple configurations and build data warehouses in one click.
Automatic Synchronization for Data Consistency
The DLA construction solution supports the automatic synchronization of updated data and metadata operations, such as adding and deleting tables and adding, modifying, and deleting columns. In database sharding and table sharding scenarios, use the DLA construction solution to merge the logical tables distributed across multiple databases into one table so that you analyze data in one table. In addition, this solution allows synchronizing an unlimited number of tables.
The DLA team is developing an incremental DLA construction solution to support the incremental synchronization of data from source databases, which eliminates all impact on the source databases and significantly improve the timeliness of data analysis. The incremental construction solution will be available soon.
Impact on Source Databases
DLA-based queries have no impact on the source databases. When the DLA solution is used to synchronize data from source databases, the impact on the source databases is maintained at less than 10%. The following figure shows the CPU utilization for different specifications of source databases. As the specifications of a database increase, the number of connections automatically increases, but the average CPU utilization of the source database is kept below 10%.
We comprehensively optimized the DLA construction solution to minimize the impact of synchronization on the source database:
- This solution automatically adjusts the number of connections based on the specifications of the source databases to guarantee that the impact on the source databases is less than 10%.
- During concurrent synchronization of a table, sharding is performed by the index column to quickly locate a data range based on indexes. This reduces the impact of synchronization on the source databases.
- By default, the DLA solution synchronizes data at off-peak hours to prevent impacting online businesses.
This minimizes the impact on the source databases. You may also manually increase the number of connections to speed up the synchronization.
Complete the following steps for using DLA to build a low-cost MySQL analytics solution. (For more information, visit the official documentation)
1) Log on to the DLA console. In the upper-left corner of the page, select the region where your DLA service is located.
2) In the left-side navigation pane, click Solutions. On the Solutions page, click Wizard in One-click Data Warehouse.
3) Configure the parameters as prompted.
4) Click Create. You can now use DLA.