Apache Flink Fundamentals: Basic Concepts

Definition, Architecture, and Principles of Flink

Apache Flink is a distributed processing engine for big data that performs stateful or stateless computations over both bound and unbound data streams. It’s also deployed in various cluster environments for fast computations over data of different sizes.

Flink Application

To better understand Flink, you need to know basic Flink processing semantics such as Streams, State, and Time as well as APIs that provide significant flexibility and convenience.

  • Streams: Streams are divided into two types: bounded streams and unbounded streams. Unbounded streams have a start but no defined end. Bounded streams have a defined start and an end. Unbounded stream data continuously increases over time and the computations are continuous with no end. Contrary to unbounded streams, bounded stream data has a fixed size and an end when you complete the computations.
  • State: State is data information generated during computations and plays a significant role in fault tolerance, failure recovery, and checkpoints. In essence, stream computing is incremental processing. Therefore, stream computing requires keeping and querying a state continuously. Also, to ensure the exactly-once semantics, data needs to be written into the state. Persistent storage ensures exactly-once when the entire distributed system fails or crashes. This is another state role.
  • Time: Time in Flink includes Event time, Ingestion time, and Processing time. Because unbounded data processed in Flink is a continuous process, time is an important basis to judge upon whether the state experiences latency and whether data is processed on time.
  • API: APIs are usually divided into three layers` from top to bottom: SQL/Table API, DataStream API, and ProcessFunction. APIs show powerful expressiveness and abstraction. However, the closer it is to the SQL layer, the weaker the expressiveness. On the contrary, APIs in the ProcessFunction layer has strong expressiveness and support various flexible operations, but the abstraction capability is smaller.

Flink Architecture

The Flink architecture has the four following features:

  • Unified Data Processing Framework: Flink has a unified framework for processing bounded and unbounded data streams.
  • Flexible Deployment: The underlying layer of Flink supports many resource schedulers, including YARN and Kubernetes. The built-in Flink standalone scheduler also supports flexible deployment.
  • High Scalability: Scalability is very important for distributed systems. During Alibaba’s Double 11 events, Flink helps to process large amounts of data and delivers good performance while processing up to 1.7 billion data entries per second.
  • Excellent Stream Processing Performance: The biggest advantage of Flink over is that it completely abstracts state semantics into the framework and supports reading state locally. This avoids a large amount of network I/O and significantly improves the performance of reading and saving state.

Flink Operation

We will provide special courses on this topic later. This section only describes operations and maintenance (O&M) and service monitoring in Flink.

  • Thanks to the consistent checkpoints in the Flink implementation, Flink has a 24/7 available Service Oriented Architecture (SOA). Checkpoints are the core mechanism to handle fault tolerance in Flink, which regularly records the state of Operators during computations and generates snapshots for persistent storage. When a Flink job fails and crashes, Flink allows you to selectively restore the job from checkpoints to ensure computational consistency.
  • Flink provides functions and interfaces for monitoring, O&M, as well as built-in Web UI. It also provides DAG graphs and various metrics for running jobs and helping you manage job state.

Flink Scenarios

Flink Scenario: Data Pipeline

  • Real-time Data Warehouse
  • Search Engine

Flink Scenario: Data Analytics

Flink Scenario: Data-Driven

Stateful Stream Processing

Traditional Batch Processing

Features of Ideal Solutions

  • State: An ideal solution ensures that the engine is capable of accumulating and maintaining state. The cumulative state represents all events that have been received in the past and affects the output.
  • Time: With time, the engine has a mechanism to control data integrity. After receiving data, computational results are returned.
  • Results Available in Real-time: Producing real-time results is an ideal solution. More importantly, a new model for processing continuous data is needed to process real-time data to fully comply with the characteristics of continuous data.

Stream Processing

Distributed Stream Processing

Stateful Distributed Stream Processing

Advantages of Apache Flink

State and Fault Tolerance

When we consider fault tolerance, we may think of exactly-once fault tolerance. Whether it is state accumulated, when applications perform computations, each input event reflects state or state changes. If there are multiple modifications, results generated from the data engine may be not reliable.

  • How to ensure that the state has the exactly-once fault-tolerance guarantee?
  • How to produce global consistent snapshots for multiple operators with the local state in distributed scenarios?
  • How to create a snapshot without interrupting the operation?

Exactly-Once Fault Tolerance Guarantee in Simple Scenarios

For example, if the number of occurrences of a user is not accurately calculated and does not have the exactly-once guarantee, the result is not reliable. Before considering precise fault-tolerance guarantees, let’s consider some very simple use cases.

Distributed State and Fault Tolerance

As a distributed processing engine, Flink generates only one global consistent snapshot when performing multiple operations on the local state. To create a global consistent snapshot without interrupting the operations, use distributed state and fault tolerance.

  • Global Consistency Snapshot
  • Failure Recovery

Distributed Snapshots

State Maintenance

State maintenance means to maintain the state value by using code. A local state backend is required to support a very large state value.

  • JVM Heap State Backend: This backend is suitable for maintaining a small number of states. The JVM Heap state backend performs Java object reads/writes each time an operator value reads the state, without having high overhead. There is a need for serialization when a checkpoint needs to put the local state of individual operation values into distributed snapshots.
  • RocksDB State Backend: It is an out-of-core state backend. When a user reads the state by using the local state backend of the runtime, you can perform and maintain the state in the disk. Accordingly, you require serialization and deserialization each time the state is read. Upon snapshotting, you only need to serialize the application. Serialized data is transferred to the central shared DFS.

Event Time

Different Types of Time

Before the emergence of Flink and other advanced stream processing engines, processing engines for big data only supported processing time. Assume there’s a defined hourly event time window. For example, if running on processing time, data engines perform operations on the data received between 3 o’clock and 4 o’clock. When making reports or analyzing results, we want to know the output results of the data generated between 3 o’clock and 4 o’clock. To do this, use event time.

Event Time Processing

Watermarks

Save and Migrate State

Since stream processing applications run all the time, we need to consider several important O&M factors.

  • When changing application logic or fixing bugs, how to move the state of the previous execution to a new execution?
  • How to redefine the parallelism level?
  • How to upgrade the version number of the operation cluster?

Summary

This article covers the definition, architecture, and basic principles of Apache Flink and explains some basic concepts about stream computing for big data. Additionally, it provides a recap of the evolution of big data processing methods and principles of stateful stream processing.

Original Source:

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com