Performance Breakthrough — Reconstruction of IoT System Based on Kafka + OTS + MaxCompute
The entrepreneurial team at SUYUN Technology focuses on supporting the operation of the hydrogen fuel cell eco-chain. Currently, our major business consists of real-time operations monitoring and analysis of new energy vehicles and hydrogen refueling stations, and vehicle safety operations support.
The major challenges facing our Internet of Things (IoT) system are the real-time parsing, storage, and analysis of high-frequency data. For example, in real-time vehicle monitoring, each vehicle reports raw packets at 1 kbit/s, and the system must parse, respond to, and store them within seconds. The system must also support fast queries and further analysis of the parsed packets, which each vehicle produces at 33 kbit/s. Because the number of connected vehicles will keep growing, we need the most economical design that still guarantees performance. At 33 kbit/s of parsed packets, each vehicle is expected to generate 30 GB of packet data per month (assuming that each vehicle runs for 10 hours per day).
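The monthly volume can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 30-day month and decimal units (1 GB = 10**9 bytes), neither of which is stated in the text. Under those assumptions, a rate of 33 kilobits per second gives only a few GB per month, while 33 kilobytes per second gives a figure of the same order as the quoted 30 GB, so the quoted rates may well be in kilobytes.

```python
# Back-of-the-envelope check of the per-vehicle monthly data volume.
# Assumptions (not stated in the text): a 30-day month and decimal units
# (1 GB = 10**9 bytes).

SECONDS_PER_DAY = 10 * 3600   # each vehicle runs 10 hours per day
DAYS_PER_MONTH = 30           # assumed


def monthly_gb(bytes_per_second: float) -> float:
    """Return the monthly volume in GB for a sustained byte rate."""
    return bytes_per_second * SECONDS_PER_DAY * DAYS_PER_MONTH / 1e9


# Rate read literally as 33 kilobits per second (divide by 8 for bytes).
as_kilobits = monthly_gb(33_000 / 8)
# Rate read as 33 kilobytes per second.
as_kilobytes = monthly_gb(33_000)

print(f"33 kbit/s -> {as_kilobits:.2f} GB per month")
print(f"33 kB/s   -> {as_kilobytes:.2f} GB per month")
```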
Problems that exist in the original system are listed as follows:
- The boundary between the online analytical processing (OLAP) and online transaction processing (OLTP) systems is not clearly defined in the system architecture. Scheduled Java programs compute statistics over Open Table Service (OTS) tables. (Note that OTS has been renamed Table Store in Alibaba Cloud.) The code is complex and performs extremely poorly, which affects the normal operation of other OLTP systems on the same server.
- The stored parsed packet data is not optimized for OTS pricing rules. Each record is a large JSON string with many redundant keys, and the keys themselves are excessively long (about 30 characters on average).
- OTS tables are designed per company, so the number of tables created under an instance may exceed the OTS limit of 64 tables per instance.
- The table is partitioned by vehicle and month. A single partition can reach 30 GB, far larger than the 1 GB recommended by OTS.
- The partitions of a given vehicle are laid out contiguously rather than hashed, so concurrency is not optimal at the physical machine level.
- The code is not optimized for the most important read scenario, which is querying packets by day and vehicle.
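The partitioning problems above suggest one possible key layout. The OTS model reconstruction is not covered in this article, so the following is only a hypothetical sketch and not the design that was actually adopted. It assumes a partition key per vehicle and day (about 1 GB per partition if a vehicle produces 30 GB per month), prefixed with a short hash so that neighboring keys are scattered. The names `partition_key` and `row_key`, and the key format itself, are assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone
from typing import Tuple


def partition_key(vehicle_id: str, ts: datetime) -> str:
    """Build a hypothetical partition key: <hash prefix>_<vehicle>_<yyyymmdd>."""
    day = ts.strftime("%Y%m%d")
    # A short hash prefix scatters consecutive days of one vehicle, and
    # vehicles with similar IDs, across the key space.
    prefix = hashlib.md5(f"{vehicle_id}_{day}".encode()).hexdigest()[:4]
    return f"{prefix}_{vehicle_id}_{day}"


def row_key(vehicle_id: str, ts: datetime) -> Tuple[str, int]:
    """Primary key = (partition key, timestamp in ms) for range scans within a day."""
    return partition_key(vehicle_id, ts), int(ts.timestamp() * 1000)


# The key for "vehicle V001 on 2024-01-01" can be recomputed at query time,
# so a day-and-vehicle query touches exactly one partition.
ts = datetime(2024, 1, 1, 8, 0, 0, tzinfo=timezone.utc)
print(row_key("V001", ts))
```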
Reorganizing the System Architecture
Before optimizing the system, we need to reorganize the system architecture and clearly define the applicable scope of each middleware product used in the data solution. The following figure shows how data flows through each phase:
We identified three key improvements:
- Introducing Kafka to provide basic support for the asynchronous decoupling of multiple phases and to speed up the response to packets from the terminals.
- Bringing in Alibaba Cloud MaxCompute, formerly known as Open Data Processing Service (ODPS), as the basic support for the OLAP system, and transferring complex business analysis to MaxCompute for processing.
- Reconstructing OTS models in accordance with OTS pricing principles (this improvement is not discussed in this document).
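The asynchronous decoupling that Kafka provides can be illustrated without a broker. The sketch below uses Python's standard `queue.Queue` as an in-process stand-in for a Kafka topic. It is not Kafka client code, and `receive` and `parse_worker` are hypothetical names. The point is only that the receiving side acknowledges a packet as soon as it is enqueued, while parsing happens later in a separate consumer.

```python
import queue
import threading

# In-process stand-in for a Kafka topic; a real deployment would use a Kafka
# producer and consumer instead of queue.Queue.
raw_packets = queue.Queue()
parsed = []
_STOP = object()  # sentinel that tells the worker to exit


def receive(packet: bytes) -> str:
    """Gateway side: enqueue the raw packet and acknowledge immediately."""
    raw_packets.put(packet)
    return "ACK"


def parse_worker() -> None:
    """Downstream side: parse packets at its own pace."""
    while True:
        item = raw_packets.get()
        if item is _STOP:
            break
        # Hypothetical parsing step: render the bytes as hex text.
        parsed.append(item.hex())


worker = threading.Thread(target=parse_worker)
worker.start()
acks = [receive(bytes([i, i + 1])) for i in range(3)]
raw_packets.put(_STOP)
worker.join()
print(acks, parsed)
```

Because the terminal only waits for the enqueue, a slow parser or storage layer delays downstream processing but not the response to the vehicle.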
We were already familiar with MaxCompute, Alibaba Cloud's data analysis platform, from previous work. In this reconstruction, we therefore focus on performance, costs, maintainability, and related aspects.
Optimizing the New Architecture
First, let’s talk about costs. Based on how frequently data is used, we divide it into three types: online, offline, and archived. Archived data is the raw packet data reported by vehicles, which is stored on Object Storage Service (OSS). Online data has a preset life cycle of N months and mainly consists of the parsed packet data that must be queryable in real time. Offline data mainly consists of the intermediate results and report data produced by offline analysis and statistics over the parsed packet data.
After defining data types by usage scenario, the main question is whether to store offline data on OSS, MaxCompute, or OTS. The following table roughly compares the storage and computing costs of these three options.
We have already considered compressing the data stored on OTS to reduce storage costs. MaxCompute charges are calculated on the actual data size after compressed storage. According to the official MaxCompute documentation, the compression ratio is 5:1. Owing to the characteristics of our data, the actual compression ratio is in the range of 7:1 to 8:1. As a result, storing offline data directly on MaxCompute is the most cost-effective option. MaxCompute also complies with edge computing principles.
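To make the effect of the compression ratio concrete, the following sketch applies the ratios mentioned above to the 30 GB per vehicle per month quoted earlier. It only computes the billed size, because no unit prices are given here. The function name `stored_gb` is made up for this example.

```python
RAW_GB_PER_VEHICLE_MONTH = 30  # figure quoted earlier in the article


def stored_gb(raw_gb: float, ratio: float) -> float:
    """Size after compression for a ratio expressed as ratio:1."""
    return raw_gb / ratio


# 5:1 is the documented ratio; 7:1 and 8:1 are the range observed on our data.
for ratio in (5, 7, 8):
    size = stored_gb(RAW_GB_PER_VEHICLE_MONTH, ratio)
    print(f"{ratio}:1 -> {size:.2f} GB billed per vehicle per month")
```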
Our tests showed that OTS external tables are not a satisfactory data carrier for computing. (Currently, MapReduce computing over OTS external tables on MaxCompute is split by OTS fragments rather than by partitions, so the entire table is scanned each time. This can be observed in the MaxCompute task details.)
Having selected MaxCompute, we need to think about how to use it to provide reliable and stable data services for the business. We pay special attention to data warehouse modeling, data integration, and job maintenance.
Data integration mainly means two-way synchronization between MySQL and MaxCompute, which must be designed so that running the same synchronization again is safe. Job maintenance covers monitoring the running status of tasks and supporting task reruns.
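One common way to make a synchronization safe to repeat is to replace a whole day's slice inside a single transaction, so that a rerun overwrites the rows instead of duplicating them. The sketch below shows the idea with Python's built-in `sqlite3` as a stand-in for the MySQL side. The table `daily_report`, its columns, and the function `sync_day` are hypothetical, and this is not necessarily how our jobs are implemented.

```python
import sqlite3

# sqlite3 stands in for the MySQL target; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_report (vehicle_id TEXT, day TEXT, mileage_km REAL, "
    "PRIMARY KEY (vehicle_id, day))"
)


def sync_day(conn, day, rows):
    """Replace all rows of one day in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_report WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_report (vehicle_id, day, mileage_km) "
            "VALUES (?, ?, ?)",
            [(vehicle_id, day, km) for vehicle_id, km in rows],
        )


rows = [("V001", 120.5), ("V002", 98.0)]
sync_day(conn, "2024-01-01", rows)
sync_day(conn, "2024-01-01", rows)  # rerun: same result, no duplicates
count = conn.execute("SELECT COUNT(*) FROM daily_report").fetchone()[0]
print(count)
```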
For data warehouse modeling, we need to consider costs and model reuse. First, we apply layered modeling to the massive, low-quality underlying data and ensure that the upper-level business models rely only on intermediate results. The direct benefit is a significant reduction in computing costs. (It breaks our hearts each time we discover that a developer is querying an original table of hundreds of GBs.) Second, the intermediate models speed up data backfill, especially when some reports need to be rerun because of business needs or data corrections. Without them, we would have to scan the original data again, which is very costly and time-consuming.
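The layering idea can be sketched in a few lines. The raw layer is scanned once to build a small intermediate layer, and every report reads only that intermediate layer. The data, the field names, and the functions `build_daily_layer` and `monthly_avg_speed` are invented for this example. In practice each layer would be a MaxCompute table rather than a Python dictionary.

```python
from collections import defaultdict

# Raw layer: hypothetical parsed packets (vehicle_id, day, speed_kmh).
raw = [
    ("V001", "2024-01-01", 40.0),
    ("V001", "2024-01-01", 60.0),
    ("V001", "2024-01-02", 50.0),
    ("V002", "2024-01-01", 30.0),
]


def build_daily_layer(records):
    """Intermediate layer: one (count, sum) row per vehicle and day."""
    daily = defaultdict(lambda: [0, 0.0])
    for vehicle_id, day, speed in records:
        daily[(vehicle_id, day)][0] += 1
        daily[(vehicle_id, day)][1] += speed
    return {key: tuple(value) for key, value in daily.items()}


def monthly_avg_speed(daily, month):
    """Report layer: reads only the intermediate rows, never the raw data."""
    totals = defaultdict(lambda: [0, 0.0])
    for (vehicle_id, day), (n, s) in daily.items():
        if day.startswith(month):
            totals[vehicle_id][0] += n
            totals[vehicle_id][1] += s
    return {vehicle_id: s / n for vehicle_id, (n, s) in totals.items()}


daily = build_daily_layer(raw)                # the only scan of the raw layer
report = monthly_avg_speed(daily, "2024-01")  # can be rerun cheaply
print(report)
```

Storing counts and sums rather than averages in the intermediate layer keeps the rows additive, so a rerun of a monthly report never needs the raw packets.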
After the reconstruction of offline statistical analysis, the system makes full use of MaxCompute’s parallel computing capability and its powerful functions, especially window functions, to achieve impressive analysis performance. For example, when we helped a customer with the statistical analysis of a core component, a specialist used to spend a day analyzing a single component, and mistakes were easy to make. With MaxCompute’s analytical capability, we can finish the data analysis of nearly 1,000 components in 10 minutes. The following curve chart shows the average value during each data fluctuation period. These values could hardly be calculated manually before, and computing them in Java required very complex code. With MaxCompute, the system now accomplishes this task with ease.
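As an illustration of how window functions simplify this kind of analysis, the sketch below computes the average value of each period with two window functions. It makes several assumptions that are not from our production jobs: a "fluctuation period" is taken to be a run of consecutive samples on the same side of a threshold, the data is invented, and SQLite (via Python's `sqlite3`, which needs SQLite 3.25 or later for window functions) stands in for MaxCompute SQL, whose syntax may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (ts INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, 10), (2, 12), (3, 30), (4, 32), (5, 34), (6, 11), (7, 9)],
)

# Assumed definition: a period is a run of consecutive samples that are all
# above, or all not above, the threshold.
QUERY = """
WITH flagged AS (
    SELECT ts, value,
           -- 1 when the side of the threshold differs from the previous row
           CASE WHEN (value > :threshold)
                     = LAG(value > :threshold) OVER (ORDER BY ts)
                THEN 0 ELSE 1 END AS is_start
    FROM samples
),
numbered AS (
    -- The running sum of the start flags numbers the periods 1, 2, 3, ...
    SELECT ts, value, SUM(is_start) OVER (ORDER BY ts) AS period
    FROM flagged
)
SELECT period, MIN(ts), MAX(ts), AVG(value)
FROM numbered
GROUP BY period
ORDER BY period
"""

periods = conn.execute(QUERY, {"threshold": 20}).fetchall()
for period, first_ts, last_ts, avg_value in periods:
    print(period, first_ts, last_ts, avg_value)
```

The same logic written as a hand-rolled loop has to track the previous sample and the current period explicitly, which is where much of the complexity in procedural code tends to come from.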