Performance Breakthrough — Reconstruction of IoT System Based on Kafka + OTS + MaxCompute

The entrepreneurial team at SUYUN Technology focuses on supporting the operation of the hydrogen fuel cell eco-chain. Currently, our major business consists of real-time operations monitoring and analysis of new energy vehicles and hydrogen refueling stations, and vehicle safety operations support.

The major challenges that our Internet of Things (IoT) system is facing are the real-time parsing, storage, and analysis of high-frequency data. For example, during the real-time operations monitoring and analysis of vehicles, each vehicle reports its original packets at 1 kbit/s, and the system is required to parse, respond to, and store the packets with a delay of seconds. At the same time, the system needs to quickly query and further analyze the parsed packets of each vehicle at a transmission rate of 33 kbit/s. Considering the growing number of accessed vehicles in the future, we need to design the system in the most economical way while guaranteeing performance. Based on the transmission rate of 33 kbit/s for the parsed packets of each vehicle, each vehicle is expected to generate packet data of 30 GB per month (assuming that each vehicle is running for 10 hours per day).

Problems that exist in the original system are listed as follows:

  1. The scope of the online analytical processing (OLAP) and online transaction processing (OLTP) systems is not clearly defined in the system architecture. Java programs are used to collect task statistics for Open Table Service (OTS) tables on a regular basis. (Note that OTS is renamed Table Store in Alibaba Cloud.) The code is complex and the performance is extremely poor, affecting the normal operation of other OLTP systems on the server.
  2. The stored and parsed packet data is not optimized specifically in accordance with OTS pricing rules. A large JSON string contains too many redundant keys, each of which is also excessively long (with 30 character strings on average).
  3. OTS supports each company in designing its own tables. There is a risk that the number of tables created under an instance may exceed the OTS limit, which is 64 tables per instance.
  4. OTS divides a table into partitions based on vehicle-month. The size of a partition can be 30 GB, which is too large and exceeds the 1 GB recommended by OTS.
  5. OTS partitions of a vehicle are continuously distributed but not hashed, and accordingly, concurrent performance is not optimal at the physical machine level.
  6. The code is not optimized for the most important read scenario where packets can be queried by day and vehicle.

Reorganizing the System Architecture

We identified three key components for improvement, which include:

  1. Introducing Kafka to provide basic support for the asynchronous decoupling of multiple phases and speed up the response to packets from the terminals.
  2. Bringing in Alibaba Cloud MaxCompute, formerly known as Open Data Processing Service (ODPS), as the basic support for the OLAP system. Transfer complex business analysis to MaxCompute for processing.
  3. Reconstructing OTS models in accordance with OTS pricing principles (this improvement is not discussed in this document).

As a powerful data analysis tool of Alibaba Cloud, MaxCompute is familiar to us based on our previous cooperation. Therefore, we are giving more consideration to performance, costs, maintainability, and other aspects during this reconstruction.

Optimizing the New Architecture

After defining data types based on data usage scenarios, we mainly consider whether offline data is stored on OSS, MaxCompute, or OTS. In the following table, we roughly compare the storage and computing costs of these three data solutions.

We have already considered compressing data stored on OTS to reduce the storage costs. MaxCompute charges are calculated based on the actual data size after compressed storage. According to official MaxCompute documentation, the compression ratio is 5:1. Owing to the characteristics of our data, the actual compression ratio can be in the range of 7:1 to 8:1. As a result, the direct storage of offline data on MaxCompute prevails as the most cost-effective data solution. MaxCompute also complies with edge computing principles.

After testing, we found that OTS external tables are not satisfactory data carriers for computing. (Currently, the MapReduce computing of OTS external tables on MaxCompute is based on OTS fragments, but not partitions. The entire table is scanned each time, which can be observed from the task details of MaxCompute.)

After MaxCompute has been determined through technical selection, we need to think about how to use MaxCompute to provide reliable and stable data services for the business. We pay special attention to data warehouse modeling, data integration, and work maintenance.

Data integration is mainly reflected in the two-way synchronization between MySQL and MaxCompute. We need to take into account the design of repeated data synchronization. Work maintenance refers to the monitoring of task running status and support for tasks that are running for a second time.

For data warehouse modeling, we need to consider the reuse of costs and models. First, we apply layered modeling to massive and low-quality underlying data and ensure that the upper-level business model relies only on intermediate results. The direct benefit is the significant reduction in computing costs. (It breaks our hearts each time we discover that some development colleagues are querying data based on an original table of hundreds of GBs.) Second, the intermediate model improves the performance of the system data complement, especially in scenarios where some reports need to be run again based on business needs or data requirements. Otherwise, we have to scan the original data again, which is very costly and time-consuming.



Follow me to keep abreast with the latest technology news, industry insights, and developer trends.