Pravega, the Answer of Storage Layer in Flink Ecosystem
By Yu Teng, R&D Director from Dell EMC, Collated by Haikai Zhao, Dell EMC intern
This article introduces the evolution of big data architectures, Pravega, advanced features of Pravega, and scenarios of the Internet of Vehicles (IoV.) It focuses on why Dell EMC developed Pravega, challenges that Pravega tackled for big data processing platforms, and benefits from the combination of Pravega and Flink.
Evolution of Big Data Architectures
Challenges for Lambda Architecture
Effectively extracting and providing data is crucial to the success of big data architecture. Data extraction requires two kinds of policies due to differences in processing speed and frequency. The preceding figure shows a typical Lambda architecture, which divides the big data architecture into two separate sets of computing infrastructure: batch processing and real-time stream processing.
For real-time processing, data from sensors, mobile devices, or application logs are usually written to a message queuing system, such as Kafka. Message queuing provides temporary data buffers for stream processing applications. Spark Streaming reads the data from Kafka for real-time stream computing, but Kafka does not store historical data permanently. If your business logic is to analyze both historical data and real-time data, this pipeline cannot work. To compensate for this, we need to develop an additional pipeline for batch processing. This is the Batch part in the figure.
The batch processing pipeline aggregates many open-source big data components, such as ElasticSearch, Cassandra, Spark, and the Hadoop Distributed File System (HDFS.) This pipeline uses Spark to implement large-scale MapReduce operations. The results are more accurate because all historical data is computed and analyzed, but the latency is relatively large.
This classic big data architecture has the following three problems:
- A large latency difference exists between the two pipelines. They cannot be combined to facilitate quick aggregations and integrated processing of historical data and real-time data leads to low performance.
- Data storage costs are high. In the architecture in the preceding figure, the same data has one or more replicas in multiple storage components. Undoubtedly, data redundancy greatly increases the costs for enterprise customers. Fault tolerance and persistent reliability of open-source storage data are still questionable. For enterprise users sensitive to data security, we must ensure that no data is lost.
- Repeated development is involved. The same processing flow is performed twice by two pipelines and the same data is computed once respectively in different frameworks due to different processing time. This will undoubtedly cause repeated development to data developers.
Features of Streaming Storage
Let’s take a look at some features of streaming storage before we introduce Pravega. Mixed storage is required to obtain big data architecture that unifies stream processing and batch processing.
- High-throughput read performance (catch-up reads) is required for historical data from the older part of the sequence.
- Low-latency append-only tail writes and tail reads are required for real-time data from the newer part of the sequence.
Reconstructed Streaming Storage Architecture
For distributed storage components such as Kafka and Cassandra, their storage architectures from top to bottom adopt the pattern from dedicated log data storage to local file storage to distributed storage on clusters.
The Pravega team tried to reconstruct the streaming storage architecture by introducing Pravega Stream (Stream) as the basic unit of streaming storage. A Stream is a named, durable, append-only, and unbounded sequence of bytes.
As shown in the preceding figure, the bottom layer of the storage architecture is the scalable distributed cloud storage. At the middle layer, log data is stored as Streams that are shared storage primitives. Then, based on the Streams, different features are provided upward, such as message queuing, NoSQL, full-text streaming data search, and Flink-based real-time and batch analytics. In other words, Stream primitives avoid data redundancy caused by movements of the original data in multiple open-source storage and search products in the existing big data architecture. Stream primitives enable a unified data lake at the storage layer.
Reconstructed Big Data Architecture
Our big data architecture uses Flink as the computing engine and unifies batch processing and stream processing by using a unified model and APIs. Pravega used as the storage engine provides a unified abstraction for streaming storage, which allows consistent access to historical data and real-time data. This unification leads to a closed loop from storage to computing, allowing you to process both high-throughput historical data and low-latency real-time data. Meanwhile, the Pravega team also developed the Flink-Pravega Connector to provide the exactly-once semantics for the entire pipeline of computing and storage.
Introduction to Pravega
Pravega is designed to provide real-time streaming storage. Applications store data permanently to Pravega, where Streams can store an unbound volume of data for an unlimited period. The same Reader API supports tail reads and catch-up reads and effectively unifies offline and real-time computing.
Let’s introduce the basic concepts of Pravega based on the preceding figure:
Pravega organizes data into Streams. A Stream is a named, durable, append-only, and unbound sequence of bytes.
- Stream Segments
A Stream is split into one or more Stream Segments. A Stream Segment is a data shard in a Stream, which is an append-only data block. Pravega implements auto scaling based on Stream Segments. The number of Stream Segments is automatically and continuously updated based on data traffic.
Pravega’s client API allows applications to write data to and read data from Pravega in the form of Events. An Event is represented as a collection of bytes within a Stream. For example, writing a temperature record from an IoT sensor to Pravega is an Event.
- Routing Key
Every Event has a Routing Key. A Routing Key is a string used by developers to group similar Events. Events associated with the same Routing Key are written to the same Stream Segment. Pravega provides read and write semantics based on Routing Keys.
- Reader Group
It is used to load balance reads. You can dynamically increase or decrease the number of Readers in a Reader Group to change the concurrency of data reads. For more information, see the official documentation of Pravega by clicking this link.
At the control plane, Controller instances act as the primary nodes of a Pravega cluster to manage Segment Stores at the data plane. This provides the functionality to create, update, and delete Streams. In addition, it monitors the cluster status in real-time, acquires streaming data, and gathers metrics. A cluster usually has three Controller instances to ensure its high availability.
At the data plane, the Segment Store provides the API for reading data from and writing data to a Stream. In Pravega, data is stored in tiers.
- Tier 1 Storage
Tier 1 Storage is typically deployed within the Pravega cluster and used for low-latency and short-time hot data. Each Segment Store has the Cache to speed up data reading. Pravega uses Apache BookKeeper to implements Tier 1 Storage to provide low-latency log storage.
- Tier 2 Storage
Tier 2 Storage provides long-term storage. It is normally deployed outside the Pravega cluster and used for streaming data (cold data.) Pravega uses HDFS, Network File System (NFS), and enterprise-grade storage products, such as Dell EMC’s Isilon and Elastic Cloud Storage (ECS) to implement Tier 2 Storage.
Advanced Features of Pravega
When data is written to Tier 1 Storage, the data is stored in all relevant Segment Stores through BookKeeper, which ensures that the data is written successfully.
Read/write splitting helps optimize the read and write performance because it allows reading only from the Cache in Tier 1 Storage and Tier 2 Storage instead of reading from BookKeeper in Tier 1.
When you initiate a request to read data from Pravega, Pravega will decide whether to perform a low-latency tail read from the Cache in Tier 1 or a high-throughput catch-up read from Tier 2 Storage, such as object storage or NFS. If the data is not stored in the Cache, data is loaded to the Cache as needed. Read operations are transparent to you.
The BookKeeper in Tier 1 never reads but only writes data in the case of zero cluster failure.
The number of Stream Segments in a Stream can automatically grow and shrink over time based on the I/O load it receives. This feature is known as auto scaling. Let’s take the preceding figure as an example:
- A Stream starts at time t0. Based on Routing Keys, the data is routed to Segment 0 and Segment 1. If the rate of data written to the Stream is constant, the number of Stream Segments will not change.
- Segment 1 is split into Segment 2 and Segment 3 when the system senses the increase in the data writing rate of Segment 1 at time t1. At this point, Segment 1 is sealed and stops accepting writes. Data will be redirected to Segment 2 and Segment 3 according to Routing Keys.
- In addition to Scale-up, the system also allows Scale-down when the data writing rate slows. When less data is written to Segment 2 and Segment 5 at time t3, their ranges are merged to Segment 6.
End-to-End Auto Scaling
Pravega uses Kubernetes Operator to deploy stateful applications for components in a cluster, which makes auto scaling of applications more flexible and convenient. Currently, Pravega also works with Ververica on implementing Kubernetes pod-level auto scaling in Pravega and auto scaling by rescaling the number of tasks in Flink.
Pravega also supports transactional writes. Before a transaction is committed, data is written to different Transaction Segments based on Routing Keys. At this time, Transaction Segments are invisible to the Reader. After a transaction is committed, Transaction Segments are appended respectively to the tail of Stream Segments. At this time, Transaction Segments are visible to the Reader. The support for writing transactions is also the key to implementing the end-to-end exactly-once semantics with Flink.
Pravega vs. Kafka
First, the key difference lies in the positioning of Kafka and Pravega; the former acts as a message queuing system, whereas the latter serves as a storage engine. Pravega focuses more on storage characteristics such as dynamic scaling, security, and integrity.
In stream processing, data should be deemed continuous and infinite. As message-oriented middleware based on the local file system, Kafka appends a published message to the end of the log file and tracks its content (offset mechanism.) By using this method, Kafka simulates infinite data streams but is inevitably subject to the upper limit of file descriptors and the disk capacity of the local file system. Therefore, it cannot ensure infinity.
The comparison between Pravega and Kafka is detailed in the figure.
Pravega Flink Connector
To facilitate the use with Flink, we provide the Pravega-Flink Connector. The Pravega team also plans to contribute this connector to the Flink community. The connector has the following features:
- Provides the exactly-once semantics for both Readers and Writers to ensure the end-to-end exactly-once for the entire pipeline
- Couples seamlessly with the checkpoints and savepoints of Flink
- Supports concurrent data reading and writing with a high throughput and low latency
- Uses the Table API to unify stream processing and batch processing of Streams
Use in Internet of Vehicles (IoV)
Let’s take connected autonomous vehicles that can generate petabyte-level data as an example. This scenario has the following needs:
- Processes the vehicle and road data in real-time quickly for microscopic prediction and planning for routes.
- Runs machine learning algorithms on long-term driving data for macroscopic prediction and planning for routes. This belongs to the scope of batch processing.
- Combines real-time and batch processing and uses the machine learning model and real-time data feedback generated by historical data to optimize the detection results.
The key indicators that concern customers are:
- How can efficient end-to-end processing be ensured
- How can training time of the machine learning model be minimized
- How can the consumption of saved data and the cost be minimized
Here is a comparison of solutions before and after the introduction of Pravega.
Pravega greatly simplifies the big data architecture:
- As an abstract storage interface, Pravega implements a data lake in the Pravega layer. Batch processing, real-time processing, and full-text search only need data from Pravega. In the first solution, you must save data in Kafka, ElasticSearch, and long-term storage, respectively. Now, data is stored only in Pravega, which greatly reduces the data storage cost for enterprise users.
- Pravega can automatically tier down data without introducing Flume or other components for additional extract-transform-load (ETL) development.
- Components are simplified from Kafka, Flume, HDFS, ElasticSearch, Kibana, Spark, and Spark Streaming to Pravega, Flink, Kibana, and HDFS. This simplification reduces the pressure on the O&M personnel.
- Flink can unify stream processing and batch processing without providing two separate sets of processing code for the same data.
Flink has become a shining star among stream computing engines, but there is still a gap in the streaming storage field. Pravega is designed to fill this gap for the big data architecture. “All problems in the computer field can be solved by adding an extra middle layer.” Essentially, Pravega acts as a decoupling layer between the computing engine and the underlying storage and aims to resolve the challenges for the new generation of big data platforms at the storage layer.