How to Design a Storage Layer for Structured Data

Storage Requirements

  • Business-oriented: The most basic business interaction logic is in place.
  • Large-scale: Distributed and big data technologies are used to keep up with business scale growth and data accumulation.
  • Intelligent: Artificial intelligence (AI) technology is used to mine data value and drive business innovation.

Data System Architecture

Core Components

  • A Relational Database: This stores data for main business operations and processes transaction data, and is the core data storage of the application system.
  • High-Speed Cache: This caches the results of operations that are complex or costly to recompute, in order to accelerate access (a cache-aside sketch follows this list).
  • The Search Engine: This provides complex criteria-based querying and full-text retrieval.
  • The Queue: This synchronizes data processing and integrates upstream and downstream components for real-time data exchange. Queues interconnect heterogeneous data storage systems, for example, synchronizing data between database systems and cache or search systems. They are also used for real-time data extraction and real-time archiving from online storage to offline storage.
  • Unstructured Big Data Storage: This stores a large amount of unstructured data, such as images and videos, and supports data access for both online queries and offline computing.
  • Structured Big Data Storage: This refers to an online database geared toward online-offline connections, featuring high-throughput data writing and large-scale data storage, with storage capacity and query performance that scale linearly. Structured big data storage stores non-relational data for online queries, archives historical data from relational databases to meet large-scale and linear-scaling requirements, and stores data written in real time for offline analysis.
  • Batch Computing: This analyzes unstructured and structured data. Batch computing is divided into interactive analysis and offline computing. Offline computing performs complex analysis on large-scale datasets, whereas interactive analysis performs real-time analysis on medium-scale datasets.
  • Stream Computing: This performs stream analysis on unstructured data and structured data to generate real-time views with low latency.
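
To make the role of the high-speed cache concrete, the following is a minimal cache-aside sketch: read from the cache first, fall back to the relational database on a miss, and populate the cache with a TTL. It assumes a Redis cache accessed through redis-py and uses SQLite as a stand-in for the relational database; the table and key naming are hypothetical.

```python
import json
import sqlite3            # stand-in for the relational primary database

import redis              # assumed: redis-py client for the high-speed cache

cache = redis.Redis(host="localhost", port=6379, db=0)
db = sqlite3.connect("app.db")          # hypothetical application database

CACHE_TTL_SECONDS = 300

def get_order(order_id: int) -> dict | None:
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    cache_key = f"order:{order_id}"     # hypothetical key scheme
    cached = cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)       # cache hit: skip the costly database read

    # Cache miss: read from the relational primary storage.
    row = db.execute(
        "SELECT id, status, amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return None

    order = {"id": row[0], "status": row[1], "amount": row[2]}
    # Populate the cache so the costly read is not repeated within the TTL.
    cache.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(order))
    return order
```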

Derived Data System

  • Primary Storage: This stores data generated by businesses or computing and is usually where data first lands. Transactional features such as ACID, along with the low-latency queries required by online applications, are often hard requirements for primary storage.
  • Secondary Storage: This stores data synchronized and replicated from the primary storage. As a view of the primary storage, secondary storage generally optimizes data query, retrieval, and analysis.
  • Multi-writing at the Application Layer: This is the easiest implementation method with the least dependency. In this mode, the application code writes data first to the primary storage and then to the secondary storage. This mode is not very reliable, however, so it is generally applied only where data reliability requirements are low. It has several problems. First, it cannot ensure consistency between the primary and secondary storage and cannot handle data writing failures. Second, the cost of writing to multiple stores accumulates at the application layer, increasing code complexity and computing workload there and producing a poorly decoupled architecture. Third, its scalability is poor: the data synchronization logic is fixed in code, making it difficult to add secondary storage flexibly.
  • Asynchronous Queue Replication: This is a widely used architecture in which the application layer uses queues to asynchronize and decouple the writing of derived data. It comes in two variants: in the first, data is written asynchronously, via queues, to both the primary and secondary storage; in the second, data is written directly to the primary storage and only the write to the secondary storage goes through a queue. The first variant is possible only if the business can tolerate asynchronous writes to the primary storage; otherwise, only the second variant can be used. The second variant has problems similar to multi-writing at the application layer, since the application still multi-writes to the primary storage and the queue; the queue mainly addresses the multi-writing and scalability problems of the secondary storage (see the sketch after this list).
  • Change Data Capture (CDC) Technology: In this mode, data is written to the primary storage, which then synchronizes it to the secondary storage. This is the most user-friendly mode for the application layer, which only needs to deal with the primary storage. Asynchronous queue replication is still used to synchronize data from the primary to the secondary storage, but the primary storage must support CDC. A typical example is the combination of MySQL and Elasticsearch: Elasticsearch data is synchronized from MySQL binlog files, and the binlog is MySQL's CDC mechanism.
  • How can we ensure data consistency?
  • How can we track data synchronization latency?
  • How can we ensure that, once CDC technology replicates data in real time, the secondary storage provides the same data writing capability as the primary storage?
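
As a concrete illustration of asynchronous queue replication, the sketch below has the application write to the primary storage and publish a change event to a queue, while a separate consumer replays the events into the secondary storage. It is a minimal sketch that assumes Kafka (via kafka-python) as the queue, SQLite as a stand-in for the primary storage, and Elasticsearch (via elasticsearch-py) as the secondary storage; the topic, index, and table names are hypothetical.

```python
import json
import sqlite3                                    # stand-in for the primary storage

from kafka import KafkaProducer, KafkaConsumer    # assumed: kafka-python
from elasticsearch import Elasticsearch           # assumed: elasticsearch-py

# --- Application side: write to the primary storage, then publish a change event ---
db = sqlite3.connect("app.db")
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def save_order(order: dict) -> None:
    db.execute(
        "INSERT OR REPLACE INTO orders (id, status, amount) VALUES (?, ?, ?)",
        (order["id"], order["status"], order["amount"]),
    )
    db.commit()
    # Only the write to the secondary storage is asynchronous and decoupled.
    producer.send("order-changes", order)         # hypothetical topic name

# --- Replication side: consume change events and update the secondary storage ---
def run_replicator() -> None:
    es = Elasticsearch("http://localhost:9200")
    consumer = KafkaConsumer(
        "order-changes",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        order = message.value
        # An idempotent upsert keyed by the primary key tolerates replays
        # and keeps the secondary storage eventually consistent.
        es.index(index="orders", id=order["id"], document=order)
```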

Selection of Storage Components

  • Data models and query languages are the most significant differences between databases. The relational model and the document model are relatively abstract, whereas non-relational models, such as time series, graph, and key-value models, are relatively concrete abstractions. If a scenario matches a concrete model, such as a graph model, that immediately narrows down the selection scope.
  • You need to divide storage components into different data layers, with optimization preferences for the scale, cost, query, and analysis performance. During selection, the core metrics required for the data storage must be clear.
  • The data replication relationships must be clearly organized to distinguish between primary and secondary storage, as described in the data derivation system above.
  • You need to build a flexible data exchange channel to achieve rapid data migration and switching between storage components. Building the ability to iterate quickly is more important than improving scalability for requirements that are still unknown.
  1. Data must be layered.
  2. Data must be stored in the Object Storage Service (OSS).
  3. A unified analysis engine unifies the analysis portals and provides a unified query language.

Structured Big Data Storage

Positioning

Key Requirements

  • Large-scale Data Storage: The structured big data storage is positioned as centralized storage. As a summary of online databases (in large wide table mode) or the input and output of offline computing, structured big data storage must support petabytes of data storage.
  • High-throughput Writing Capability: Data flows from online storage to offline storage through ETL tools, in either T+1 or real-time synchronization mode. The structured big data storage needs to support the import of data from multiple online databases as well as the export of large result sets from the big data computing engine. Therefore, structured big data storage must support high-throughput data writing, and a write-optimized storage engine is generally used.
  • Rich Data Query Capabilities: As the secondary storage in the data derivation system, the structured big data storage needs to be optimized for efficient online queries. Common query optimizations include high-speed caching, high-concurrency and low-latency random queries, complex combined-field queries, and data retrieval. The technical means for these optimizations are caching and indexing, where index support is diversified and different types of indexes are provided for different query scenarios, for example, B+Tree-based secondary indexes for queries with fixed field combinations, R-Tree or BKD-Tree-based spatial indexes for location queries, and inverted indexes for multi-field queries and full-text retrieval.
  • Separation of Storage and Computing Costs: Storage and computing separation is currently a popular architectural approach. General applications find it difficult to reap the advantages of this architecture, but in cloud-based big data systems, storage and computing separation gives full play to them. The biggest advantage of storage and computing separation in a distributed architecture is flexible management of storage and computing resources, which greatly improves scalability on both sides. For cost management, storage and computing costs can be separated only in products built on a storage and computing-separated architecture. The advantage is especially clear in a big data system: for example, the amount of data in the structured big data storage grows as data accumulates, while the data writing rate is relatively stable, so storage needs to be constantly expanded, whereas the computing resources required to support data writing or temporary data analysis are relatively fixed and can be provisioned on demand.
  • Data Derivation Capability: Multiple storage components must co-exist in a complete data system architecture. According to different requirements for query and analysis capabilities, the secondary storage needs to be dynamically expanded in the data derivation system. Therefore, the structured big data storage also needs the derivation capability to expand secondary storage and thereby expand its data processing capability. Whether a storage component has a good data derivation capability depends on whether it has mature CDC technology.
  • Computing Ecosystem: Data value needs to be dug by computing. Currently, computing is divided into batch computing and stream computing. The requirements for structured big data storage are as follows:
  1. It must be able to connect to mainstream computing engines, such as Apache Spark and Flink, as input or output.
  2. It must have the data derivation capability to convert its data into analysis-oriented, columnar-format data and store it in the data lake system (see the PySpark sketch after this list).
  3. It must provide interactive analysis capabilities to discover data value more quickly.
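
To make the computing-ecosystem requirements concrete, the following PySpark sketch shows the typical shape of a batch job that reads a table from a structured big data store through a Spark connector and lands an analysis-oriented, columnar copy in the data lake. The connector format name my-bigdata-store, its options, the dt partition column, and the OSS path are hypothetical placeholders for whatever connector and schema the storage actually provides.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-store-export").getOrCreate()

# Read the online table through the store's Spark connector. The format name
# and options below are hypothetical; substitute the real connector (for
# example, an HBase or Tablestore data source) and its settings.
orders = (
    spark.read.format("my-bigdata-store")
    .option("table", "orders")
    .option("endpoint", "https://example-endpoint")
    .load()
)

# Derive an analysis-oriented, columnar copy in the data lake (requirement 2),
# partitioned by a date column (assumed here to be "dt") so offline engines
# can prune partitions efficiently.
(
    orders.write.mode("overwrite")
    .partitionBy("dt")
    .parquet("oss://my-bucket/warehouse/orders/")   # hypothetical OSS path
)
```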

Open-source Products

  • Storage and Computing Separation Architecture: HBase is based on HDFS at the underlying layer. The separated architecture supports elastic scaling of storage and computing and shares computing resources with computing engines such as Spark to reduce costs.
  • LSM Storage Engine: It is designed for writing optimization and provides high-throughput data writing.
  • Mature Developer Ecosystem and Access to Mainstream Computing Engines: HBase is an open-source product that has been developed for many years. The developer community is mature and connects to several mainstream computing engines.
  • Weak Query Capability: HBase provides efficient single-row random queries and range-based scans. Queries with complex condition combinations must use scan+filter, which, absent a suitable row-key design, degenerates into a full table scan and is extremely inefficient (see the scan+filter sketch after this list). Phoenix on HBase provides secondary indexes to optimize queries. However, like a secondary index in MySQL, an HBase secondary index optimizes a query only when the query criteria meet the leftmost matching principle, so the range of query conditions that can be optimized is limited.
  • Weak Data Derivation Capability: As mentioned above, CDC technology is the core technology that supports a data derivation system, and HBase does not implement it. HBase Replication provides a CDC-like capability, but it is only a mechanism for data synchronization between primary and secondary clusters within HBase. Some open-source components, such as the Lily Indexer for synchronization with Solr, try to extend HBase's built-in replication into a CDC mechanism. Unfortunately, analysis of their theory and mechanisms shows that these components do not meet core CDC requirements such as data ordering preservation and eventual consistency guarantees.
  • High Costs: As mentioned above, one of the key requirements for structured big data storage is the separation of storage and computing costs. The cost of HBase consists of the CPU core cost for computing and the disk cost for storage. In a deployment mode based on a fixed ratio of physical resources, the ratio of CPU to storage resources cannot be reduced below a minimum; that is, the CPU core cost grows with the storage space rather than with the computing resources actually needed. Only a cloud-based serverless service mode achieves complete separation of storage and computing costs.
  • Complex O&M: HBase is a standard Hadoop component, with Zookeeper and HDFS as core dependencies. Operating and maintaining HBase without a professional O&M team is very difficult.
  • Poor Capability of Processing Hotspot Issues: An HBase table is partitioned in range partitioning mode. Compared to hash partitioning, range partitioning's biggest defect is serious hotspot issues. HBase provides many best practices to guide developers to avoid hotspots when designing table row keys, such as hashed keys or salted tables. However, these two techniques ensure an even data distribution but not an even popularity of data access; access popularity depends on the business. An automatic mechanism that splits or moves a region based on popularity is required.
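
To illustrate the weak query capability described above, the sketch below runs a multi-condition query against HBase with scan plus filters, the access pattern that degenerates into a full table scan when the conditions do not align with the row-key design. It is a minimal sketch assuming the happybase Thrift client; the host, table, column family, and row-key scheme are hypothetical.

```python
import happybase   # assumed: HBase Thrift client for Python

connection = happybase.Connection("hbase-thrift-host")   # hypothetical host
table = connection.table("orders")                       # hypothetical table

# A multi-condition query expressed in HBase's filter language. With no index
# on these columns, the filters are evaluated row by row, so the scan still
# touches the whole table (or the whole row-key range).
filter_expr = (
    "SingleColumnValueFilter('cf', 'status', =, 'binary:PAID') AND "
    "SingleColumnValueFilter('cf', 'city', =, 'binary:Hangzhou')"
)
for row_key, columns in table.scan(filter=filter_expr):
    print(row_key, columns)

# By contrast, the efficient HBase access pattern is a row-key range scan,
# which only works when the query matches the row-key design:
for row_key, columns in table.scan(row_start=b"user123#", row_stop=b"user123$"):
    print(row_key, columns)
```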

Alibaba Cloud Tablestore

Design Principle

  • Storage and Computing-separated Architecture: The storage and computing-separated architecture is used, with the underlying layer based on Apsara Distributed File System, which is the basis for separating storage and computing costs.
  • LSM Storage Engine: LSM and B+Tree are the two mainstream storage engines. LSM is specially optimized for high-throughput data writing and effectively supports hot and cold data separation.
  • Serverless Product Form: The most critical factor for cost separation on top of the storage and computing-separated architecture is the serverless service model; only serverless services achieve the separation of storage and computing costs. In a big data system, the structured big data storage usually handles regular large-scale data imports from online databases or offline computing engines. At those times it requires sufficient computing capability for high-throughput writing, whereas only a small computing capability is needed in normal operation, so computing resources must be elastic. In addition, in the data derivation system, the primary and secondary storage are usually heterogeneous engines with different read and write capabilities, and in some scenarios the ratio of primary to secondary storage must be adjusted flexibly. In this case, storage and computing resources must be elastically adjustable.
  • Index-based Query Capabilities: The LSM engine has shortcomings in query capabilities and requires indexes for query optimization. Different query scenarios require different types of indexes. Therefore, Tablestore provides diverse indexes to meet data query requirements in different types of scenarios.
  • CDC Technology: The CDC technology of Tablestore is called Tunnel Service. It supports real-time subscription to full and incremental data and integrates seamlessly with the Flink stream computing engine for real-time stream computing over table data (a simplified sketch of consuming such a change stream follows this list).
  • Open-source Computing Ecosystem: Tablestore connects not only to computing engines developed by Alibaba Cloud, such as MaxCompute and Data Lake Analytics (DLA), but also to the mainstream computing engines Flink and Spark, without the need for data migration.
  • Stream-batch Computing Integration: Tablestore connects to Spark. It allows Spark to perform batch computing on full data in a table and uses the CDC technology to interconnect with Flink for stream computing on new data in the table, implementing the integration of batch and stream computing.
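
The sketch below illustrates the shape of consuming a CDC change stream such as Tunnel Service: the consumer receives ordered change records for a table and applies them to a derived view, first over the full data and then over incremental changes. The ChangeRecord type, the in-memory view, and the consume loop are hypothetical simplifications, not the actual Tablestore SDK; they only illustrate the full-plus-incremental subscription pattern described above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class ChangeRecord:
    """Hypothetical, simplified change record emitted by a CDC/Tunnel-style stream."""
    action: str          # "PUT", "UPDATE", or "DELETE"
    primary_key: str
    columns: dict

def apply_to_derived_view(view: dict, record: ChangeRecord) -> None:
    """Apply one ordered change record to a derived (secondary) view."""
    if record.action == "DELETE":
        view.pop(record.primary_key, None)
    else:  # PUT / UPDATE land the latest column values
        view.setdefault(record.primary_key, {}).update(record.columns)

def consume(change_stream: Iterable[ChangeRecord],
            handler: Callable[[ChangeRecord], None]) -> None:
    """Sequential consumption preserves the ordering that a CDC pipeline relies on."""
    for record in change_stream:
        handler(record)

# Usage: full data first, then incremental changes, through the same handler.
derived_view: dict = {}
full_plus_incremental = [
    ChangeRecord("PUT", "order-1", {"status": "CREATED"}),
    ChangeRecord("UPDATE", "order-1", {"status": "PAID"}),
]
consume(full_plus_incremental, lambda r: apply_to_derived_view(derived_view, r))
print(derived_view)   # {'order-1': {'status': 'PAID'}}
```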

Tablestore Features

Big Data Processing Architecture

  • In the classic Lambda architecture, write immutable data in parallel to the batch and stream processing systems in append mode.
  • Implement the same computational logic separately in the stream and batch computing systems.
  • Merge the stream and batch computing views at query time and present them together (see the sketch after this list).
  • In the Lambda Plus architecture, data only needs to be written to Tablestore. The Blink stream computing framework reads real-time table updates directly through the Tunnel Service API, without double writes to queues or self-implemented data synchronization.
  • In terms of storage, the Lambda Plus architecture directly uses Tablestore as the master dataset. Tablestore supports low-latency reads, writes, and updates for online systems and provides indexing functions for efficient data queries and retrieval, resulting in high data utilization.
  • In terms of computing, Lambda Plus architecture uses the Blink stream-batch integrated computing engine to unify stream-batch code.
  • At the presentation layer, Tablestore provides diversified indexes, allowing multiple indexes to be freely combined to meet query requirements in different scenarios.
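
To make the merge step of the classic Lambda architecture concrete, the sketch below combines a batch view and a real-time (stream) view at query time, preferring the stream view for keys it has already updated. The view structures and keys are hypothetical; the point of Lambda Plus, as described above, is that keeping one master dataset in Tablestore removes the need for this double bookkeeping.

```python
# Hypothetical batch and real-time views keyed by entity id. In a classic
# Lambda deployment the batch view is recomputed over the full dataset and
# the real-time view is maintained by the stream computing layer.
batch_view = {"order-1": {"status": "CREATED"}, "order-2": {"status": "PAID"}}
realtime_view = {"order-1": {"status": "PAID"}}   # fresher results from stream computing

def query(order_id: str) -> dict | None:
    """Merge phase: prefer the fresher stream view, fall back to the batch view."""
    if order_id in realtime_view:
        return realtime_view[order_id]
    return batch_view.get(order_id)

print(query("order-1"))   # {'status': 'PAID'} -- served from the stream view
print(query("order-2"))   # {'status': 'PAID'} -- served from the batch view
```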

Summary
