Apache Flink Fundamentals: Basic Concepts

Definition, Architecture, and Principles of Flink

Flink Application

  • Streams: Streams are divided into two types: bounded streams and unbounded streams. Unbounded streams have a start but no defined end. Bounded streams have a defined start and an end. Unbounded stream data continuously increases over time and the computations are continuous with no end. Contrary to unbounded streams, bounded stream data has a fixed size and an end when you complete the computations.
  • State: State is data information generated during computations and plays a significant role in fault tolerance, failure recovery, and checkpoints. In essence, stream computing is incremental processing. Therefore, stream computing requires keeping and querying a state continuously. Also, to ensure the exactly-once semantics, data needs to be written into the state. Persistent storage ensures exactly-once when the entire distributed system fails or crashes. This is another state role.
  • Time: Time in Flink includes Event time, Ingestion time, and Processing time. Because unbounded data processed in Flink is a continuous process, time is an important basis to judge upon whether the state experiences latency and whether data is processed on time.
  • API: APIs are usually divided into three layers` from top to bottom: SQL/Table API, DataStream API, and ProcessFunction. APIs show powerful expressiveness and abstraction. However, the closer it is to the SQL layer, the weaker the expressiveness. On the contrary, APIs in the ProcessFunction layer has strong expressiveness and support various flexible operations, but the abstraction capability is smaller.

Flink Architecture

  • Unified Data Processing Framework: Flink has a unified framework for processing bounded and unbounded data streams.
  • Flexible Deployment: The underlying layer of Flink supports many resource schedulers, including YARN and Kubernetes. The built-in Flink standalone scheduler also supports flexible deployment.
  • High Scalability: Scalability is very important for distributed systems. During Alibaba’s Double 11 events, Flink helps to process large amounts of data and delivers good performance while processing up to 1.7 billion data entries per second.
  • Excellent Stream Processing Performance: The biggest advantage of Flink over is that it completely abstracts state semantics into the framework and supports reading state locally. This avoids a large amount of network I/O and significantly improves the performance of reading and saving state.

Flink Operation

  • Thanks to the consistent checkpoints in the Flink implementation, Flink has a 24/7 available Service Oriented Architecture (SOA). Checkpoints are the core mechanism to handle fault tolerance in Flink, which regularly records the state of Operators during computations and generates snapshots for persistent storage. When a Flink job fails and crashes, Flink allows you to selectively restore the job from checkpoints to ensure computational consistency.
  • Flink provides functions and interfaces for monitoring, O&M, as well as built-in Web UI. It also provides DAG graphs and various metrics for running jobs and helping you manage job state.

Flink Scenarios

Flink Scenario: Data Pipeline

  • Real-time Data Warehouse
  • Search Engine

Flink Scenario: Data Analytics

Flink Scenario: Data-Driven

Stateful Stream Processing

Traditional Batch Processing

Features of Ideal Solutions

  • State: An ideal solution ensures that the engine is capable of accumulating and maintaining state. The cumulative state represents all events that have been received in the past and affects the output.
  • Time: With time, the engine has a mechanism to control data integrity. After receiving data, computational results are returned.
  • Results Available in Real-time: Producing real-time results is an ideal solution. More importantly, a new model for processing continuous data is needed to process real-time data to fully comply with the characteristics of continuous data.

Stream Processing

Distributed Stream Processing

Stateful Distributed Stream Processing

Advantages of Apache Flink

State and Fault Tolerance

  • How to ensure that the state has the exactly-once fault-tolerance guarantee?
  • How to produce global consistent snapshots for multiple operators with the local state in distributed scenarios?
  • How to create a snapshot without interrupting the operation?

Exactly-Once Fault Tolerance Guarantee in Simple Scenarios

Distributed State and Fault Tolerance

  • Global Consistency Snapshot
  • Failure Recovery

Distributed Snapshots

State Maintenance

  • JVM Heap State Backend: This backend is suitable for maintaining a small number of states. The JVM Heap state backend performs Java object reads/writes each time an operator value reads the state, without having high overhead. There is a need for serialization when a checkpoint needs to put the local state of individual operation values into distributed snapshots.
  • RocksDB State Backend: It is an out-of-core state backend. When a user reads the state by using the local state backend of the runtime, you can perform and maintain the state in the disk. Accordingly, you require serialization and deserialization each time the state is read. Upon snapshotting, you only need to serialize the application. Serialized data is transferred to the central shared DFS.

Event Time

Different Types of Time

Event Time Processing

Watermarks

Save and Migrate State

  • When changing application logic or fixing bugs, how to move the state of the previous execution to a new execution?
  • How to redefine the parallelism level?
  • How to upgrade the version number of the operation cluster?

Summary

Original Source:

--

--

--

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Templates in Golang -Part 2

Implementing Competing Consumers Pattern with WSO2 Micro Integrator and RabbitMQ

The Benefits of GitHub

The process behind search a website, what an amazing trip!

Cisco DNA Center Release 2.2.2.6 with ISE

Notification: argoCD to Slack

Build Your Own Microservice

Adobe Experience Manager (AEM): One Place For Your CMS And DAM

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

More from Medium

Automatic Semver Versioning Using Github Actions and deploying to AWS Pipeline

How to setup an NGINX reverse proxy on Google Cloud ☁️

Data processing leveraging Event driven, Micro-services and Apache Kafka

event driven, microservices, kafka based data processing

Parsing CVS file from Google Storage using Cloud Function, DataProc and java application using…