How to Design a Storage Layer for Structured Data Storage Requirements
By Zhou Zhaofeng, nicknamed Muluo at Alibaba.
In today’s world, data processing is indispensable to any application system. Data is the core that drives business innovation and intelligent development, and holding true to this, data processing technology is at the core of several innovations and is the key to sustaining a company’s competitiveness. A complete technical architecture consists of an application system and a data system. The application system processes business logic, whereas the data system processes the remaining data. This article was written to inspire research and development engineers and data system architects in the field of big data and database systems.
It’s clear that the technologies behind big data are continuously changing and evolving. In recent years, with business innovation across related industries, the explosive growth of data, and the wide application of open-source technologies in big data and data analytics scenarios, the core components and technical architecture behind big data have come a long way. Among other things, the development of cloud computing has lowered many of the obstacles to using big data technology. We expect that, in the near future, data will continue to drive new levels of business innovation. With big data technology at the core of this, it will gradually become a lightweight, intelligent technology that is applied virtually everywhere. As a result, understanding cloud computing, where big data will live and continue to evolve, will become an increasingly necessary skill for research and development engineers in a variety of industries.
As big data has evolved, application systems and data systems have also gradually become more integrated with each other. Nowadays, data systems are no longer hidden behind application systems but run through the entire business interaction logic. Traditional application systems focus on interaction, whereas modern application systems get to know their users during the interaction itself. So, one thing we can learn from this transition is that the development of data systems drives the development of business systems, in a transition from business-oriented systems to large-scale, intelligent systems.
- Business-oriented: The most basic business interaction logic is complete.
- Large-scale: The use of distributed and big data technologies meets business scale growth and data accumulation requirements.
- Intelligent: The use of artificial intelligence (AI) technology extracts value from data and drives business innovation.
Today there still exists a need to solve certain technical difficulties for the business system to become large-scale and intelligent. The application of mature open-source technologies makes the construction of big data systems simple, and big data architectures become common. For example, the well-known Lambda architecture decreases technical difficulties to a certain extent. However, for subsequent maintenance of data systems, such as large-scale application, operations and maintenance control, and cost reduction of big data components, you need to master big data technologies, distributed technologies, and fault locating in complex environments, which can still be very difficult.
The core components of a data system include data pipeline, distributed storage, and distributed computing. Businesses generally use combinations of these components to build a data system architecture. Each component performs its specific duties, and upstream and downstream components exchange data with each other. For architects, it is highly challenging to choose and combine components.
Building on this general discussion, this article discusses the data system used at Alibaba Group to provide inspiration to industry research and development engineers and architects. We first introduce the open-source components and the Alibaba Cloud products and services involved in each core component of our data system. Then, we analyze the storage technology for structured data in the data system and introduce the design principle that Alibaba Cloud Tablestore uses to meet the requirements for structured data storage.
Data System Architecture
The above figure shows a typical technical architecture, including an application system and a data system. This architecture is not tied to any specific business. Rather, it reflects the core components of a data application system and the data stream relationships among those components.
The application system implements the main business logic of applications and processes business data or application metadata. The data system summarizes and processes business data and other data and integrates it with business intelligence (BI), recommendation, or risk control systems.
The system architecture contains the following common core components:
- A Relational Database: This stores data for main business operations and processes transaction data, and is the core data storage of the application system.
- High-Speed Cache: This caches the results of operations that are complex or costly to recompute, to accelerate access.
- The Search Engine: This provides complex criteria-based querying and full-text retrieval.
- The Queue: This makes data processing asynchronous and connects upstream and downstream components for real-time data exchange. Queues interconnect heterogeneous data storage systems, for example, database systems with cache systems or search systems. They are also used for real-time data extraction and real-time archiving from online storage to offline storage.
- Unstructured Big Data Storage: It stores a large amount of unstructured data, such as images or videos, and supports online query or offline computing data access.
- Structured Big Data Storage: This refers to an online database that leans toward connecting online and offline systems. It features high-throughput data writing and large-scale data storage, and its storage and query performance scale linearly. It stores non-relational data for online queries, or archives historical data from relational databases to meet large-scale and linear expansion requirements. It also stores data written in real time for offline analysis.
- Batch Computing: This analyzes unstructured data and structured data. Batch computing is categorized into interactive analysis and offline computing. Offline computing performs complex analysis on large-scale datasets, whereas interactive analysis performs real-time analysis on medium-scale datasets.
- Stream Computing: This performs stream analysis on unstructured data and structured data to generate real-time views with low latency.
Further analysis of data storage components shows that various data storage components are designed to meet the data storage requirements of different scenarios, and provide different data models and different optimization preferences for online and offline. The following table lists comparisons in detail:
Derived Data System
The data system architecture contains multiple storage components. Some of the data in these storage components are directly written from applications, and some are replicated from other storage components. For example, data in the business relational database is usually from businesses, whereas data in the high-speed cache and search engine is usually synchronized and replicated from business databases.
Storage components for different purposes have upstream and downstream data links of different types. You can classify these components into primary storage and secondary storage. The two storage types are designed for different objectives:
- Primary Storage: This stores data generated by businesses or computing, and is usually where data is first stored. Transaction features such as ACID may be a strong requirement, and the primary storage must serve the low-latency business data queries required by online applications.
- Secondary Storage: This stores data synchronized and replicated from the primary storage. As a view of the primary storage, secondary storage generally optimizes data query, retrieval, and analysis.
At this point, it’s critical to address a few significant questions: Why are primary and secondary storage available? And, is it possible to store, read, and write data in a unified manner to meet the demands of all scenarios?
Currently, data cannot be stored, read, and written in a unified manner. Storage engines are built on many competing technology choices: row store or columnar store; B+Tree or log-structured merge-tree (LSM-tree); storage of immutable data, frequently updated data, or time-partitioned data; and designs for high-speed random queries or high-throughput scans. Moreover, database products are divided into the categories of transaction processing (TP) and analytical processing (AP). Even though products are moving toward Hybrid Transaction/Analytical Processing (HTAP), the underlying storage is still divided into row store and columnar store.
An example of the primary storage and secondary storage dichotomy in the actual architecture is the relationship between the primary table and a secondary index table in a relational database, which may be regarded as a primary-secondary relationship.
Data in the index table changes along with data in the primary table with strong consistency, and the index table is optimized for queries by specific combinations of conditions. The relationship between the relational database and the high-speed cache and search engine is also a primary-secondary relationship, providing high-speed querying and retrieval with eventual data consistency.
The relationship between an online database and the data warehouse is also a primary-secondary relationship. Data in the online database is replicated to the data warehouse in a centralized manner for efficient BI analysis. The architecture design of primary-secondary storage components that complement each other is called a derived data system. In the system, the greatest technical challenge is to synchronize and replicate data between the primary and secondary components.
The common data replication modes shown in the preceding figure are described as follows:
- Multi-writing at the Application Layer: This is the easiest implementation method with the fewest dependencies. In this mode, the application code writes data first to the primary storage and then to the secondary storage. However, this mode is not very reliable, so it is generally applied only in scenarios where the data reliability requirement is not high. It has three main problems. First, it does not ensure consistency between data in the primary and secondary storage, and it does not deal with data writing failures. Second, the cost of data writing accumulates at the application layer, increasing code complexity and computing workload there and resulting in a poorly decoupled architecture. Third, scalability is poor because the data synchronization logic is fixed in code, making it difficult to add secondary storage flexibly.
- Asynchronous Queue Replication: This is a widely used architecture. The application layer uses queues to make the writing of derived data asynchronous and decoupled. This architecture has two variants: either writes to both the primary and secondary storage go through the queue, or only writes to the secondary storage do. The first variant requires that the primary storage can tolerate asynchronous writes. Otherwise, only the second variant can be used. With the second variant, the application layer still writes to both the primary storage and the queue, so it encounters problems similar to those of multi-writing at the application layer, although the queue does address the multi-writing and scalability problems of the secondary storage.
- Change Data Capture (CDC) Technology: In this mode, data is written only to the primary storage, which then synchronizes the data to the secondary storage. This mode is the most friendly to the application layer, which only needs to deal with the primary storage. Synchronization from the primary storage to the secondary storage can itself use asynchronous queue replication. However, the primary storage needs to support CDC technology in this mode. A typical example is the combination of MySQL and Elasticsearch: Elasticsearch data is synchronized from the MySQL binlog, which serves as MySQL’s CDC mechanism.
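The CDC mode described above can be sketched in a few lines. This is a minimal illustration rather than how any particular product implements it: `PrimaryStore`, `SecondaryStore`, `change_log`, and `checkpoint` are hypothetical names, and the in-memory list stands in for a durable, ordered change log such as a binlog.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryStore:
    """Hypothetical primary storage that records every write in an ordered change log."""
    rows: dict = field(default_factory=dict)
    change_log: list = field(default_factory=list)  # entries: (sequence, key, value)

    def put(self, key, value):
        self.rows[key] = value
        self.change_log.append((len(self.change_log), key, value))

    def changes_since(self, sequence):
        """Return all changes with a sequence number >= `sequence`, in write order."""
        return self.change_log[sequence:]


@dataclass
class SecondaryStore:
    """Hypothetical secondary storage that is written only by the replicator."""
    rows: dict = field(default_factory=dict)
    checkpoint: int = 0  # next sequence number to consume

    def sync_from(self, primary):
        # Apply changes in order and advance the checkpoint after each one.
        for sequence, key, value in primary.changes_since(self.checkpoint):
            self.rows[key] = value
            self.checkpoint = sequence + 1


primary = PrimaryStore()
secondary = SecondaryStore()

primary.put("order:1", {"status": "created"})
primary.put("order:1", {"status": "paid"})
secondary.sync_from(primary)  # application code never writes to `secondary`
```

Because the application writes only to the primary store, adding another secondary store means adding another consumer of the change log, with no change to application code.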
The data derivation system is an important technical architecture design principle. The CDC technology is the key means to better drive the data stream. The storage components with CDC technology effectively support the data derivation system, making the entire data system architecture more flexible and reducing the complexity of data consistency design for a rapid iteration design.
However, most storage components, such as HBase, do not support CDC technology. Alibaba Cloud Tablestore supports mature CDC technology, and applying it has driven innovations in our architecture, which are described in detail in the following sections.
A good product uses a data derivation architecture to continuously expand its capabilities, making the derivation process transparent and solving data synchronization, consistency, and resource ratio problems. In reality, most technical architectures use derived architectures built from product combinations and need to manage data synchronization and replication themselves, such as the combinations of MySQL with Elasticsearch and HBase with Solr.
The biggest problems with these combinations are:
- How can we ensure data consistency?
- How can we track data synchronization latency?
- How can we ensure that the secondary storage has the same data writing capability as the primary storage after the CDC technology replicates data in real time?
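As one way to approach the second question, synchronization latency can be estimated from the consumer's position in the change log. The sketch below assumes each change carries a write timestamp. The function name, the log layout, and the numbers are illustrative assumptions, not part of any product API.

```python
def replication_lag(change_log, checkpoint, now):
    """Return (pending_changes, lag_seconds) for a consumer at `checkpoint`.

    `change_log` is a list of (sequence, write_time_seconds) in write order,
    and `checkpoint` is the next sequence number the consumer will read.
    """
    pending = change_log[checkpoint:]
    if not pending:
        return 0, 0.0
    # Lag is measured against the oldest change that has not been replicated yet.
    oldest_unreplicated_time = pending[0][1]
    return len(pending), now - oldest_unreplicated_time


log = [(0, 100.0), (1, 101.5), (2, 104.0)]
pending, lag = replication_lag(log, checkpoint=1, now=105.0)
```

Here the consumer has two changes left to apply, and the oldest of them was written 3.5 seconds before `now`.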
Selection of Storage Components
When an architect designs a specific architecture, one of the biggest challenges is the selection and combination of computing and storage components. Computing engines of the same type differ only slightly. Generally, mature computing engines with a well-developed ecosystem are preferred, such as Spark for batch computing and Flink for stream computing.
And, the selection of storage components is also challenging. Storage components include databases (SQL and NoSQL databases, with NoSQL databases divided into multiple types according to various data models), object storage, file storage, and high-speed cache. The main reason for the complexity of storage options on the market is that architects must comprehensively consider various factors such as data layering, cost reduction, and online and offline query optimization preferences.
In addition, the current technology is developing in a diversified way. No storage product meets the requirements of data writing, storage, query, and analysis in all scenarios. Consider the following observations, for example.
- Data models and query languages are the most significant differences between databases. Relational models and document models are relatively abstract models, whereas other non-relational models, such as time series, graph, and key-value models, are relatively concrete abstractions. For example, if a scenario matches a concrete model such as the graph model, the selection scope narrows considerably.
- You need to divide storage components into different data layers, with optimization preferences for the scale, cost, query, and analysis performance. During selection, the core metrics required for the data storage must be clear.
- The data replication relationship must be clearly organized to distinguish between primary and secondary storage, as introduced in the preceding section.
- You need to build a flexible data exchange channel to achieve rapid data migration and switch between storage components. Building a fast iteration ability is more important than improving the scalability for unknown requirements.
The final trend of data storage architecture is as follows:
- Data must be layered.
- Data must be stored in the Object Storage Service (OSS).
- A unified analysis engine unifies the analysis portals and provides a unified query language.
Structured Big Data Storage
The structured big data storage, a critical component in the data system, assumes a major role in connecting online and offline data. As a structured data summary storage in the data mid-end, it summarizes data in online databases for offline data analysis and stores offline data analysis result sets to support online query or data derivation. Based on the positioning, the key requirements for structured big data storage are summarized as follows.
- Large-scale Data Storage: The structured big data storage is positioned as centralized storage. As a summary of online databases (in large wide table mode) or the input and output of offline computing, structured big data storage must support petabytes of data storage.
- High-throughput Writing Capability: Data is converted from online storage to offline storage with the ETL tool in T+1 synchronization or real-time synchronization mode. The structured big data storage needs to support the import of data from multiple online databases as well as the export of a large number of result sets from the big data computing engine. Therefore, structured big data storage must support high-throughput data writing. Generally, a storage engine for writing optimization is used.
- Rich Data Query Capabilities: The structured big data storage, as secondary storage for the data derivation system, needs to optimize efficient online queries. Common query optimizations include high-speed caches, high-concurrency and low-latency random queries, complex field-combined queries, and data retrieval. The technical means of the query optimizations are caching and indexing, where indexing support is diversified and different types of indexes are provided for different query scenarios, for example, B+Tree-based secondary indexes for queries by fixed combinations, R-Tree or BKD-Tree-based spatial indexes for location queries, or inverted indexes for multi-field queries and full-text retrieval.
- Separation of Storage and Computing Costs: Storage and computing separation is currently a popular architecture. General applications find it difficult to benefit from this architecture, but in cloud-based big data systems it shows its full advantages. The biggest advantage of storage and computing separation in a distributed architecture is flexible management of storage and computing resources, which greatly improves the scalability of both. For cost management, storage and computing costs can be separated only in products built on a storage and computing-separated architecture. The advantage of separating storage and computing costs is more obvious in a big data system. For example, the amount of data in structured big data storage increases as data accumulates, but the data writing volume is relatively stable. Therefore, storage must be expanded constantly, whereas the computing resources required to support data writing or temporary data analysis are relatively fixed and can be provisioned on demand.
- Data Derivation Capability: Multiple storage components must co-exist in a complete data system architecture. According to different requirements for query and analysis capabilities, the secondary storage needs to be dynamically expanded in the data derivation system. Therefore, structured big data storage also needs a derivation capability that expands secondary storage in order to expand its data processing capability. Whether a storage component has a good data derivation capability depends on whether it has mature CDC technology.
- Computing Ecosystem: The value of data is extracted through computing. Currently, computing is divided into batch computing and stream computing. The requirements for structured big data storage are as follows:
- It must be able to connect to mainstream computing engines, such as Apache Spark and Flink, as input or output.
- It must have the data derivation capability to convert its data into analysis-oriented data of the columnar-store format and store the data in the data lake system.
- It must provide interactive analysis capabilities to discover data value more quickly.
The first requirement is the most basic, and the second and third ones are the bonus items.
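To illustrate why the query requirement above calls for different index types, here is a minimal sketch of an inverted index answering a multi-field query. The row data and the function names `build_inverted_index` and `query` are made up for illustration. A production search index would also handle tokenization, updates, and persistence.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each (column, value) pair to the set of row keys that contain it."""
    index = defaultdict(set)
    for key, columns in rows.items():
        for column, value in columns.items():
            index[(column, value)].add(key)
    return index


def query(index, **conditions):
    """Return the row keys that match every column=value condition."""
    postings = [index.get((column, value), set()) for column, value in conditions.items()]
    return set.intersection(*postings) if postings else set()


rows = {
    "r1": {"city": "Hangzhou", "status": "paid", "channel": "app"},
    "r2": {"city": "Hangzhou", "status": "created", "channel": "web"},
    "r3": {"city": "Beijing", "status": "paid", "channel": "app"},
}
index = build_inverted_index(rows)
matches = query(index, status="paid", channel="app")
```

Any combination of columns can be queried this way, because each condition is resolved independently and the results are intersected. A B+Tree-style secondary index, by contrast, only helps for the column order it was built with.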
Currently, HBase and Cassandra are well-known structured big data storage products in the open-source field. Cassandra is the top product of the wide column model under the NoSQL category and is widely used outside China. However, let’s focus on HBase as it is more popular than Cassandra in China.
HBase is a wide column model database based on the HDFS storage and computing separation architecture. HBase has excellent scalability and supports large-scale data storage, with the following advantages:
- Storage and Computing Separation Architecture: It is based on HDFS at the underlying layer. The separated architecture supports elastic scaling of storage and computing and shares computing resources with computing engines such as Spark to reduce costs.
- LSM Storage Engine: It is designed for writing optimization and provides high-throughput data writing.
- Mature Developer Ecosystem and Access to Mainstream Computing Engines: HBase is an open-source product that has been developed for many years. The developer community is mature and connects to several mainstream computing engines.
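The write-optimized behavior of an LSM storage engine can be sketched as follows. This toy `TinyLSM` class is an assumption for illustration only and is not how HBase is implemented: it omits the write-ahead log, compaction, and on-disk files, and keeps "segments" as in-memory sorted lists.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: an in-memory memtable plus immutable sorted segments."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.segments = []  # each segment is a sorted list of (key, value); newest last

    def put(self, key, value):
        # Writes only touch the memtable, which keeps the write path cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A flush writes the memtable out once, in sorted order, as a new segment.
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Reads may have to check several segments, newest first.
        for segment in reversed(self.segments):
            position = bisect.bisect_left(segment, (key,))
            if position < len(segment) and segment[position][0] == key:
                return segment[position][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # reaches the limit and triggers a flush
db.put("a", 3)  # the newer value stays in the memtable
```

The sketch also hints at the trade-off: writes are cheap, while a read may need to consult the memtable and several segments, which is one reason LSM-based stores rely on indexes for richer queries.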
However, HBase has several major defects that we cannot ignore:
- Weak Query Capability: HBase provides efficient single-row random queries and range-based scans. Queries by complex combinations of conditions must use scan+filter, which in the worst case scans the full table and is extremely inefficient. Phoenix for HBase provides secondary indexes to optimize queries. However, like a secondary index in MySQL, a Phoenix secondary index optimizes a query only when the query criteria meet the leftmost matching principle, so the number of query conditions that can be optimized is limited.
- Weak Data Derivation Capability: As mentioned above, CDC technology is the core technology that supports the data derivation system. HBase does not provide a general-purpose CDC technology. HBase Replication has a CDC-like capability, but it is only a mechanism for data synchronization between a primary and a standby HBase cluster. Some open-source components, such as Lily Indexer for synchronization with Solr, build on this built-in replication capability to try to extend it into a CDC technology for HBase. Unfortunately, judging from both theory and mechanism, these components do not meet the core requirements of CDC technology, such as preserving the order of data changes and guaranteeing eventual consistency.
- High Costs: As mentioned above, one of the key requirements for structured big data storage is the separation of storage and computing costs. The cost of HBase depends on the CPU core cost and disk storage cost for computing. In a deployment mode based on a fixed ratio of physical resources, the minimum ratio of CPU resources to storage resources cannot be reduced. That is, the CPU core cost increases accordingly with the storage space but is not computed based on the actual computing resources. Only the cloud-based serverless service mode achieves complete separation of storage and computing costs.
- Complex O&M: HBase is a standard Hadoop component whose core dependencies are ZooKeeper and HDFS. It is difficult to operate and maintain HBase without a professional O&M team.
- Poor Capability of Processing Hotspot Issues: An HBase table is partitioned by range. Compared to hash partitioning, the biggest defect of range partitioning is that it is prone to serious hotspot issues. HBase provides many best practices to guide developers in avoiding hotspots when designing row keys, such as using hash keys or salted tables. However, these two approaches ensure even data distribution but do not ensure evenly distributed data access. Access popularity depends on the business, so an automatic mechanism that splits or moves a region based on its popularity is required.
Most experienced HBase users in China perform secondary development based on HBase. They work on various solutions to make up for the weak query capability of HBase and develop their own index solutions based on their business query features, for example, self-developed secondary index solutions, full-text retrieval solutions that connect to Solr, and bitmap index solutions for low-cardinality datasets. In general, HBase is an excellent open-source product, with many excellent design ideas worth learning.
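The hotspot point above can be made concrete with a small sketch of salted row keys. The bucket count, the key format, and the `salted_key` helper are hypothetical. The example only shows that salting spreads stored data evenly while a single hot row still concentrates reads in one bucket.

```python
import hashlib
from collections import Counter

BUCKETS = 4  # hypothetical number of salt buckets


def salted_key(row_key, buckets=BUCKETS):
    """Prefix the row key with a stable, hash-derived bucket number."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return f"{digest[0] % buckets:02d}|{row_key}"


# Sequential keys, which would all fall into the same key range if left unsalted.
keys = [f"20240101-{i:06d}" for i in range(1000)]
data_per_bucket = Counter(salted_key(k).split("|")[0] for k in keys)

# Access pattern: one hot row receives most of the reads.
reads = ["20240101-000042"] * 900 + keys[:100]
reads_per_bucket = Counter(salted_key(k).split("|")[0] for k in reads)
```

`data_per_bucket` comes out roughly balanced, whereas `reads_per_bucket` shows at least 900 of the 1,000 reads landing in the bucket that holds the hot row. This is why access-aware splitting or moving of partitions is still needed.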
Alibaba Cloud Tablestore
Tablestore is a structured big data storage product developed by Alibaba Cloud. For more information, visit the official website and read the corresponding guide. The design principle of Tablestore takes into account the requirements for structured big data storage in the data system, and designs and implements some distinctive features based on the design principle of the data derivation system.
The design of Tablestore absorbs the design ideas of excellent open-source products and adds some special features based on actual business needs. The technical principles of Tablestore are summarized as follows:
- Storage and Computing-separated Architecture: The storage and computing-separated architecture is used, with the underlying layer based on Apsara Distributed File System, which is the basis for separating storage and computing costs.
- LSM Storage Engine: LSM and B+Tree are two mainstream storage engines. LSM is specially optimized for high-throughput data writing and effectively supports hot and cold data separation.
- Serverless Product Form: The most critical factor for cost separation based on the storage and computing-separated architecture is serverless services. Only serverless services achieve the separation of storage and computing costs. In a big data system, the structured big data storage usually requires regular large-scale data imports from online databases or offline computing engines. In this case, it requires sufficient computing capability to achieve high-throughput writing, whereas only a small computing capability is required in normal cases. Therefore, computing resources must be elastic enough. In addition, in the data derivation system, the primary storage and the secondary storage are usually heterogeneous engines with different read and write capabilities. In some scenarios, you need to flexibly adjust the ratio of the primary storage to the secondary storage. In this case, storage and computing resources must be elastically adjustable.
- Index-based Query Capabilities: The LSM engine has shortcomings in query capabilities and requires indexes for query optimization. Different query scenarios require different types of indexes. Therefore, Tablestore provides diverse indexes to meet data query requirements in different types of scenarios.
- CDC Technology: The CDC technology of Tablestore is called Tunnel Service, which supports the real-time subscription of full and incremental data, and seamlessly integrates with the Flink stream computing engine for real-time stream computing of table data.
- Open-source Computing Ecosystem: Tablestore connects not only to computing engines developed by Alibaba Cloud, such as MaxCompute and Data Lake Analytics (DLA) but also to the Flink and Spark mainstream computing engines, without the need for data migration.
- Stream-batch Computing Integration: Tablestore connects to Spark. It allows Spark to perform batch computing on full data in a table and uses the CDC technology to interconnect with Flink for stream computing on new data in the table, implementing the integration of batch and stream computing.
Tablestore Features
1. Diversified Indexes
Alibaba Cloud Tablestore provides a variety of index types, including the global secondary index and search index. Similar to secondary indexes of traditional relational databases, global secondary indexes optimize condition-based queries that meet the leftmost matching principle and provide low-cost storage and efficient random queries and range-based scans. The search index feature provides more query capabilities, including query by combinations of conditions in any columns, full-text retrieval, and spatial query. The search index feature also supports lightweight data analysis and provides basic statistical aggregate functions.
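The leftmost matching principle mentioned above can be illustrated with a sorted list standing in for a secondary index. The table contents and the `scan_by_prefix` helper are assumptions for illustration and do not reflect the Tablestore API.

```python
import bisect

rows = {
    "u1": {"city": "Beijing", "age": 30},
    "u2": {"city": "Hangzhou", "age": 25},
    "u3": {"city": "Hangzhou", "age": 41},
    "u4": {"city": "Shanghai", "age": 25},
}

# Index entries are sorted by (city, age, primary_key).
index = sorted((r["city"], r["age"], pk) for pk, r in rows.items())


def scan_by_prefix(index, prefix):
    """Range-scan the index for entries whose leading columns equal `prefix`."""
    start = bisect.bisect_left(index, prefix)
    matches = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break  # sorted order guarantees there are no further matches
        matches.append(entry[-1])
    return matches


by_city = scan_by_prefix(index, ("Hangzhou",))         # served by the index
by_city_age = scan_by_prefix(index, ("Hangzhou", 41))  # served by the index
# A condition on `age` alone has no usable prefix, so it falls back to a full scan.
by_age = [pk for pk, r in rows.items() if r["age"] == 25]
```

Queries on `city`, or on `city` and `age` together, become a short range scan. A query on `age` alone cannot use this index, which is the kind of gap that a search index over arbitrary column combinations is meant to fill.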
2. Tunnel Service
Tunnel Service is the CDC technology of Tablestore and is the core function supporting the data derivation system. Use Tunnel Service for data synchronization, event-driven programming, real-time subscription of incremental data in tables, and stream computing between heterogeneous storage components. Currently, you can seamlessly connect Tablestore to Blink in the cloud, and it is the only structured big data storage that is directly used as a stream source of Blink.
Big Data Processing Architecture
The big data processing architecture, a part of the data system architecture, has been developed over many years and has produced some fundamental design ideas, the most influential of which is the Lambda architecture. Lambda is a relatively basic architecture with some defects, and newer architectures such as Kappa and Kappa+ were gradually introduced to solve some of its problems. Tablestore combines with computing engines through CDC technology to form a new Lambda Plus architecture based on the Lambda architecture. The following figure shows the Lambda Plus architecture.
The core ideas of Lambda architecture are as follows:
- Write immutable data in parallel to batch and stream processing systems in append mode.
- Implement the same computational logic in the stream and batch computing systems, respectively.
- Merge and display both the stream and batch computing views in the query phase.
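The three ideas above can be sketched with a simple page-view counter. The event data and function names are illustrative assumptions. A real deployment would use separate batch and stream engines rather than two code paths in one script.

```python
from collections import Counter


def batch_view(events):
    """Recompute counts from the full, immutable event history up to the batch cutoff."""
    return Counter(e["page"] for e in events)


def merge_views(batch, realtime):
    """Query-time merge: batch results plus whatever arrived after the batch cutoff."""
    return batch + realtime


history = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
recent = [{"page": "home"}, {"page": "pay"}]

batch = batch_view(history)

# The stream layer implements the same counting logic a second time, incrementally.
realtime = Counter()
for event in recent:
    realtime[event["page"]] += 1

merged = merge_views(batch, realtime)
```

Note that the counting logic appears twice, once per layer. That duplication is one of the defects that later architectures try to remove.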
Based on Tunnel Service, Tablestore is completely integrated with Blink and can serve as a stream source, dimension table (dim), and sink of Blink. Let’s take a quick look at the essentials of the Lambda Plus architecture.
- In Lambda Plus architecture, data must be written only to Tablestore. The Blink stream computing framework directly reads the real-time updated data in tables through the Tunnel Service API, without the need for double queue writing or self-implementation of data synchronization.
- In terms of storage, Lambda Plus architecture directly uses Tablestore as the master dataset. Tablestore supports low latency read and write updates in online systems and provides indexing functions for efficient data queries and retrieval, resulting in high data utilization.
- In terms of computing, Lambda Plus architecture uses the Blink stream-batch integrated computing engine to unify stream-batch code.
- At the presentation layer, Tablestore provides diversified indexes, allowing you to freely combine multiple indexes to meet query requirements in different scenarios.
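The essentials above can be sketched with a single table that exposes both its current rows and an ordered change log. The `Table` class and `total_amount` function are hypothetical stand-ins and do not use the Tablestore or Blink APIs. The sketch also assumes append-only rows with distinct keys, so that summing new changes does not double count updates.

```python
class Table:
    """Hypothetical table that keeps current rows and an ordered change log."""

    def __init__(self):
        self.rows = {}
        self.changes = []

    def put(self, key, value):
        self.rows[key] = value
        self.changes.append((key, value))


def total_amount(values, start=0):
    """Shared computation logic, used unchanged by the batch and stream paths."""
    return start + sum(v["amount"] for v in values)


table = Table()
table.put("o1", {"amount": 10})
table.put("o2", {"amount": 5})

# Batch path: scan the full table.
batch_total = total_amount(table.rows.values())
cursor = len(table.changes)  # the stream starts right after the batch snapshot

table.put("o3", {"amount": 7})  # the application writes to the table only

# Stream path: consume only the changes recorded after the cursor.
new_values = [value for _, value in table.changes[cursor:]]
stream_total = total_amount(new_values, start=batch_total)
```

Compared with the classic Lambda layout, the application performs a single write, and the same function serves both the full scan and the incremental path.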
This article has described the core components of the data system architecture, the selection of storage components, and the design principle of the data derivation system. The data derivation system helps to effectively sort out the data stream relationships between storage components, based on which the article raises several key requirements for the structured big data storage component. Alibaba Cloud Tablestore is designed based on this principle and has launched some special features. In the future, we will continue to follow this principle to develop more capabilities to facilitate analysis for structured big data in Tablestore. Tablestore will be more integrated with the open-source computing ecosystem and connected to more mainstream computing engines.