Pravega, the Answer to the Storage Layer in the Flink Ecosystem

Evolution of Big Data Architectures

Challenges for Lambda Architecture

  • Data storage costs are high. In the architecture in the preceding figure, the same data is stored as one or more replicas in several storage components, and this redundancy greatly increases costs for enterprise customers. The fault tolerance and durability of open-source storage are also still open to question, yet enterprise users who are sensitive to data security need a guarantee that no data is lost.
  • Development is duplicated. The same processing flow is implemented twice, in two pipelines: because the processing times differ, the same data is computed once in each of two different frameworks. Data developers therefore have to write and maintain the same logic twice.

Features of Streaming Storage

Before introducing Pravega, let's look at what streaming storage requires. A big data architecture that unifies stream processing and batch processing needs hybrid storage.

  • Low-latency, append-only tail writes and tail reads are required for real-time data, which comes from the newer part of the sequence.

Reconstructed Streaming Storage Architecture

Reconstructed Big Data Architecture

Introduction to Pravega

Pravega is designed to provide real-time streaming storage. Applications store data durably in Pravega, where a Stream can hold an unbounded volume of data for an unlimited period. The same Reader API supports both tail reads and catch-up reads, which effectively unifies offline and real-time computing.
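
The idea of one reader interface serving both historical and real-time data can be sketched with a toy in-memory stream. This is only an illustration: ToyStream and ToyReader are hypothetical names invented here and are not part of the Pravega client API. The point is that a catch-up read and a tail read differ only in the offset at which the reader starts.

```python
class ToyStream:
    """In-memory stand-in for an append-only stream (NOT the Pravega API)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Tail write: events are only ever added at the end."""
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def tail_offset(self):
        """Offset at which the next event will be written."""
        return len(self._events)

    def read_from(self, offset):
        """Return all events at or after the given offset."""
        return self._events[offset:]


class ToyReader:
    """One reader type for both catch-up reads and tail reads."""

    def __init__(self, stream, start_offset=0):
        self._stream = stream
        self._offset = start_offset

    def read_available(self):
        """Return every event written since this reader's last call."""
        events = self._stream.read_from(self._offset)
        self._offset += len(events)
        return events


stream = ToyStream()
for reading in ("t0", "t1", "t2"):
    stream.append(reading)

# Same class, different starting offsets.
catch_up = ToyReader(stream, start_offset=0)                  # replays history
tail = ToyReader(stream, start_offset=stream.tail_offset())   # new data only

history = catch_up.read_available()        # the three historical events
stream.append("t3")                        # a new real-time event arrives
tail_events = tail.read_available()        # the tail reader sees only "t3"
caught_up_events = catch_up.read_available()  # the catch-up reader now tails too
```

Once the catch-up reader has consumed the history, it behaves exactly like the tail reader, which is why a single processing job can cover both the offline and the real-time case.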

Basic Concepts

Architecture

Advanced Features of Pravega

Read/Write Splitting

Auto Scaling

  • Segment 1 is split into Segment 2 and Segment 3 when the system detects, at time t1, that the write rate of Segment 1 has increased. At this point, Segment 1 is sealed and stops accepting writes. New data is redirected to Segment 2 or Segment 3 according to its Routing Key.
  • In addition to scaling up, the system also scales down when the write rate slows. When less data is written to Segment 2 and Segment 5 at time t3, their ranges are merged into Segment 6.
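
The split and merge steps above can be sketched as a small simulation over a routing-key space. This is a toy model under stated assumptions, not Pravega's implementation: the class ToyScalingStream, the [0.0, 1.0) key space, the SHA-256 based key_position function, the split at the range midpoint, and the two initial segments are all hypothetical choices made for illustration. The intermediate split of Segment 0 is also assumed here, so that a Segment 5 exists to be merged.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Segment:
    seg_id: int
    low: float            # inclusive lower bound of the key range
    high: float           # exclusive upper bound of the key range
    sealed: bool = False  # a sealed segment accepts no more writes


def key_position(routing_key: str) -> float:
    """Map a routing key to a stable position in [0.0, 1.0) (assumed scheme)."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class ToyScalingStream:
    """Toy model of segment split and merge (NOT the Pravega API)."""

    def __init__(self):
        # Assumed starting layout: two segments covering the whole key space.
        self.segments = [Segment(0, 0.0, 0.5), Segment(1, 0.5, 1.0)]

    def _get(self, seg_id):
        return next(s for s in self.segments if s.seg_id == seg_id)

    def _new(self, low, high):
        seg = Segment(len(self.segments), low, high)
        self.segments.append(seg)
        return seg

    def split(self, seg_id):
        """Scale up: seal one segment and replace it with two halves."""
        old = self._get(seg_id)
        if old.sealed:
            raise ValueError("segment is already sealed")
        old.sealed = True
        mid = (old.low + old.high) / 2
        return self._new(old.low, mid), self._new(mid, old.high)

    def merge(self, id_a, id_b):
        """Scale down: seal two adjacent segments and replace them with one."""
        a, b = sorted((self._get(id_a), self._get(id_b)), key=lambda s: s.low)
        if a.sealed or b.sealed or a.high != b.low:
            raise ValueError("only open, adjacent segments can be merged")
        a.sealed = b.sealed = True
        return self._new(a.low, b.high)

    def route(self, routing_key):
        """Return the open segment whose key range contains the routing key."""
        pos = key_position(routing_key)
        return next(s for s in self.segments
                    if not s.sealed and s.low <= pos < s.high)


stream = ToyScalingStream()
stream.split(1)              # t1: Segment 1 -> Segments 2 and 3
stream.split(0)              # assumed step: Segment 0 -> Segments 4 and 5
merged = stream.merge(2, 5)  # t3: Segments 2 and 5 -> Segment 6

open_ranges = sorted((s.low, s.high) for s in stream.segments if not s.sealed)
target = stream.route("vehicle-42")  # hypothetical routing key
```

Because sealed segments are kept rather than deleted, older data stays readable in its original segment, while every new write lands in exactly one open segment chosen by its routing key.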

End-to-End Auto Scaling

Transactional Writing

Pravega vs. Kafka

Pravega Flink Connector

To make Pravega easier to use with Flink, we provide the Pravega Flink Connector. The Pravega team also plans to contribute this connector to the Flink community. The connector has the following features:

  • Integrates seamlessly with Flink checkpoints and savepoints
  • Supports concurrent reads and writes with high throughput and low latency
  • Uses the Table API to unify stream processing and batch processing of Streams

Use in Internet of Vehicles (IoV)

  • Runs machine learning algorithms on long-term driving data for macroscopic route prediction and planning. This is a batch processing workload.
  • Combines real-time and batch processing, using the machine learning model trained on historical data together with real-time data feedback to optimize the detection results.
  • How can the training time of the machine learning model be minimized?
  • How can the storage consumed by saved data, and its cost, be minimized?

Solution Comparison

  • Pravega can automatically tier down data without introducing Flume or other components for additional extract-transform-load (ETL) development.
  • The component stack shrinks from Kafka, Flume, HDFS, Elasticsearch, Kibana, Spark, and Spark Streaming to Pravega, Flink, Kibana, and HDFS. This simplification reduces the burden on operations and maintenance (O&M) personnel.
  • Flink can unify stream processing and batch processing without providing two separate sets of processing code for the same data.

Summary

Flink has become a shining star among stream computing engines, but there is still a gap in the streaming storage field. Pravega is designed to fill this gap in the big data architecture. “All problems in the computer field can be solved by adding an extra middle layer.” Essentially, Pravega acts as a decoupling layer between the computing engine and the underlying storage, and aims to solve, at the storage layer, the challenges facing the new generation of big data platforms.
