Pravega, the Storage-Layer Answer in the Flink Ecosystem

Evolution of Big Data Architectures

Challenges for Lambda Architecture

  • A large latency gap exists between the two pipelines. They cannot be combined for quick aggregations, and processing historical data and real-time data separately leads to low performance.
  • Data storage costs are high. In this architecture, the same data keeps one or more replicas in multiple storage components. This redundancy greatly increases costs for enterprise customers. Moreover, the fault tolerance and durability of open-source storage components remain questionable, and for enterprise users sensitive to data security, no data loss can be tolerated.
  • Development is duplicated. The same processing logic is implemented twice, once for each pipeline, and the same data is computed separately in two different frameworks because they process it at different times. This undoubtedly forces data developers to duplicate their work.

Features of Streaming Storage

  • High-throughput read performance (catch-up reads) is required for historical data from the older part of the sequence.
  • Low-latency append-only tail writes and tail reads are required for real-time data from the newer part of the sequence.

Reconstructed Streaming Storage Architecture

Reconstructed Big Data Architecture

Introduction to Pravega

Basic Concepts

  • Stream
  • Stream Segments
  • Event
  • Routing Key
  • Reader Group
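These concepts fit together as follows: every Event carries a Routing Key, and the Stream hashes that key into a fixed key space to pick the owning Stream Segment, which is what gives per-key ordering. The sketch below is a simplified Python illustration of that placement rule, not Pravega's actual implementation; the segment names, ranges, and hash scheme are assumptions.

```python
import hashlib

def owning_segment(routing_key, segments):
    """Map a routing key to the stream segment that owns its slice of the
    key space: hash the key into [0, 1) and find the covering range.
    (Simplified stand-in for Pravega's placement logic.)"""
    h = int.from_bytes(hashlib.sha256(routing_key.encode()).digest()[:8],
                       "big") / 2**64
    for name, (low, high) in segments.items():
        if low <= h < high:
            return name
    raise ValueError("segment ranges must cover [0, 1)")

# Two segments splitting the key space evenly (hypothetical layout).
segments = {"segment-0": (0.0, 0.5), "segment-1": (0.5, 1.0)}

# All events with the same routing key land in the same segment, which
# is what yields per-key ordering guarantees.
a = owning_segment("vehicle-42", segments)
b = owning_segment("vehicle-42", segments)
print(a == b)  # True -- same key, same segment
```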

Architecture

  • Tier 1 Storage
  • Tier 2 Storage
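A minimal sketch of the two-tier idea, assuming a simple size threshold as the tiering trigger (the real system's triggers and storage backends differ): fresh appends stay in a fast Tier 1 log for low-latency tail writes and reads, older data is tiered down to cheap Tier 2 long-term storage for high-throughput catch-up reads, and a reader sees one continuous stream across both tiers.

```python
class TieredLog:
    """Toy model of Pravega's two-tier design. Capacity and class names
    are illustrative assumptions, not Pravega internals."""
    def __init__(self, tier1_capacity):
        self.tier1, self.tier2 = [], []
        self.capacity = tier1_capacity

    def append(self, event):
        self.tier1.append(event)  # low-latency tail write
        if len(self.tier1) > self.capacity:
            # Tier down the oldest data to long-term storage.
            self.tier2.append(self.tier1.pop(0))

    def read_all(self):
        """A reader sees one continuous stream across both tiers."""
        return self.tier2 + self.tier1

log = TieredLog(tier1_capacity=2)
for e in ["e1", "e2", "e3", "e4"]:
    log.append(e)
print(log.tier1, log.tier2)      # ['e3', 'e4'] ['e1', 'e2']
print(log.read_all())            # ['e1', 'e2', 'e3', 'e4']
```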

Advanced Features of Pravega

Read/Write Splitting

Auto Scaling

  • A Stream starts at time t0. Based on Routing Keys, the data is routed to Segment 0 and Segment 1. If the rate of data written to the Stream is constant, the number of Stream Segments will not change.
  • When the system detects an increase in the data writing rate of Segment 1 at time t1, Segment 1 is split into Segment 2 and Segment 3. At this point, Segment 1 is sealed and stops accepting writes, and new data is redirected to Segment 2 and Segment 3 according to its Routing Keys.
  • In addition to scale-up, the system also supports scale-down when the data writing rate slows. When less data is written to Segment 2 and Segment 5 at time t3, their key ranges are merged into Segment 6.
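The split and merge steps above can be sketched as a small simulation: segments own half-open slices of the routing-key space; scale-up seals a hot segment and replaces it with two successors, and scale-down seals two adjacent cold segments and replaces them with one. The intermediate split of Segment 0 into Segments 4 and 5 is an assumption that the walkthrough implies but does not show.

```python
class Segment:
    """A stream segment owning a half-open routing-key range [low, high)."""
    def __init__(self, number, low, high):
        self.number, self.low, self.high = number, low, high
        self.sealed = False

def split(seg, next_number):
    """Scale-up: seal a hot segment and replace it with two successors,
    each owning half of its key range."""
    seg.sealed = True
    mid = (seg.low + seg.high) / 2
    return (Segment(next_number, seg.low, mid),
            Segment(next_number + 1, mid, seg.high))

def merge(a, b, next_number):
    """Scale-down: seal two adjacent cold segments and replace them with
    one successor owning the combined range."""
    assert a.high == b.low, "segments must be adjacent in key space"
    a.sealed = b.sealed = True
    return Segment(next_number, a.low, b.high)

# t0: Segments 0 and 1 cover the whole key space.
s0, s1 = Segment(0, 0.0, 0.5), Segment(1, 0.5, 1.0)
# t1: Segment 1 runs hot, so it is sealed and split into Segments 2 and 3.
s2, s3 = split(s1, 2)
# (Assumed, not shown in the walkthrough: Segment 0 splits into 4 and 5.)
s4, s5 = split(s0, 4)
# t3: Segments 5 and 2 cool down and are merged into Segment 6.
s6 = merge(s5, s2, 6)
print(s1.sealed, s6.low, s6.high)  # True 0.25 0.75
```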

End-to-End Auto Scaling

Transactional Writing
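The idea behind transactional writing can be sketched with a toy model: events written inside a transaction are staged in transaction segments and become visible to readers only when the transaction commits, all or nothing. The class and method names below are illustrative assumptions, not the real Pravega client API.

```python
class TransactionalStream:
    """Toy model of Pravega transactional writes."""
    def __init__(self):
        self.events = []  # committed, reader-visible events

    def begin_txn(self):
        return _Txn(self)

class _Txn:
    def __init__(self, stream):
        self._stream, self._buffer, self.open = stream, [], True

    def write_event(self, event):
        assert self.open, "transaction already closed"
        self._buffer.append(event)  # staged, not yet visible to readers

    def commit(self):
        self._stream.events.extend(self._buffer)  # atomic append
        self.open = False

    def abort(self):
        self._buffer.clear()
        self.open = False

stream = TransactionalStream()
txn = stream.begin_txn()
txn.write_event("speed=88")
txn.write_event("lat=31.2")
print(stream.events)  # [] -- nothing visible before commit
txn.commit()
print(stream.events)  # ['speed=88', 'lat=31.2']
```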

Pravega vs. Kafka

Pravega Flink Connector

  • Provides exactly-once semantics for both Readers and Writers, ensuring end-to-end exactly-once processing for the entire pipeline
  • Couples seamlessly with the checkpoints and savepoints of Flink
  • Supports concurrent data reading and writing with a high throughput and low latency
  • Uses the Table API to unify stream processing and batch processing of Streams
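The exactly-once guarantee above rests on coupling transactional commits to Flink checkpoints, in the style of a two-phase-commit sink: each checkpoint interval writes into its own transaction, a completed checkpoint commits it, and a failure aborts the in-flight transaction before the interval is replayed, so replays cannot produce duplicates. The sketch below is a simplified stand-in for that protocol, not the connector's actual code.

```python
class TwoPhaseCommitSink:
    """Toy model: stage events per checkpoint interval; only a completed
    checkpoint makes them durable, and a failure discards the in-flight
    interval before it is replayed."""
    def __init__(self):
        self.stream = []  # committed, reader-visible events
        self._txn = []    # in-flight (uncommitted) events

    def invoke(self, event):
        self._txn.append(event)

    def on_checkpoint_complete(self):
        self.stream.extend(self._txn)  # atomic commit of the interval
        self._txn = []

    def on_failure_and_restore(self):
        self._txn = []  # abort: drop in-flight events before replay

sink = TwoPhaseCommitSink()
sink.invoke("a"); sink.invoke("b")
sink.on_checkpoint_complete()
sink.invoke("c")
sink.on_failure_and_restore()  # crash before the next checkpoint
sink.invoke("c")               # "c" replayed from the last checkpoint
sink.on_checkpoint_complete()
print(sink.stream)             # ['a', 'b', 'c'] -- no duplicate 'c'
```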

Use in Internet of Vehicles (IoV)

  • Processes vehicle and road data in real time for microscopic route prediction and planning.
  • Runs machine learning algorithms on long-term driving data for macroscopic route prediction and planning. This falls within the scope of batch processing.
  • Combines real-time and batch processing, using the machine learning model generated from historical data and real-time data feedback to optimize detection results.
  • How can efficient end-to-end processing be ensured?
  • How can the training time of the machine learning model be minimized?
  • How can storage consumption, and therefore cost, be minimized?

Solution Comparison

  • Pravega, as an abstracted storage interface, implements a data lake in the Pravega layer. Batch processing, real-time processing, and full-text search all read their data from Pravega alone. In the first solution, you must save the same data in Kafka, Elasticsearch, and long-term storage respectively; now, data is stored only in Pravega, which greatly reduces the data storage cost for enterprise users.
  • Pravega can automatically tier down data without introducing Flume or other components for additional extract-transform-load (ETL) development.
  • The component stack is simplified from Kafka, Flume, HDFS, Elasticsearch, Kibana, Spark, and Spark Streaming to Pravega, Flink, Kibana, and HDFS. This simplification reduces the pressure on O&M personnel.
  • Flink can unify stream processing and batch processing without providing two separate sets of processing code for the same data.

Summary

Original Source:

Alibaba Cloud

Follow me to keep abreast of the latest technology news, industry insights, and developer trends. Alibaba Cloud website: https://www.alibabacloud.com