Flink Checkpoints Principles and Practices: Flink Advanced Tutorials
By Tang Yun (Chagan)
Relationship between Checkpoints and States
A checkpoint in Apache Flink is a global operation that is triggered by the source nodes to all downstream nodes. As shown in the red box in the following figure, a total of 569,027 checkpoints are triggered and all are completed successfully.
A state is the data for persistent backup made by a checkpoint. As shown within the red box in the following figure, the state size of Checkpoint 569116 is 9.17 KB.
What Is a State?
Let’s understand the definition of a state. The following figure shows a typical word count code, which is used to monitor the data of the local port 9000 and count the word frequency of network port input. If we execute Netcat locally and enter “hello world” on the command-line interface (CLI), what kind of output will we get?
The output includes
(hello, 1) and
If we enter “hello world” on the CLI again, what kind of output will we get?
The output includes
(hello, 2) and
(world, 2). When "hello world" is entered for the second time, Flink determines that these two words already appeared once each by checking a type of state called keyed state, which stores previous word frequency statistics.
Let’s review the word count code. When the keyBy interface is called, a KeyedStream is created to divide the keys, which is a prerequisite for using a keyed state. Then, the sum method calls the built-in StreamGroupedReduce to implement the keyed state.
What Is a Keyed State?
A keyed state has two features:
- It is used in functions and operators on a KeyedStream, such as keyed user-defined functions (UDFs) and window state.
- A keyed state is already partitioned or sharded, and each key belongs to only one keyed state.
It will be better to understand the concept of partitioning by looking at the keyBy semantics. The following figure shows three concurrencies on the left and three concurrencies on the right. The keyBy interface distributes incoming words, as shown in the left part of the figure. For example, when we enter “hello world”, the word “hello” is always distributed to the concurrent task in the lower-right corner by the hash algorithm.
What Is an Operator State?
- An operator state, also called a non-keyed state, is bound to only one operator instance.
- A common type of operator state is the source state, which is used to record the current source’s offset.
The following figure shows the word count code that uses an operator state.
fromElements calls the
FromElementsFunction class. An operator state of the list state type is used. The following figure shows a classification of states by type.
The classification of states is also based on whether they are directly managed by Flink.
- A managed state is a type of state which is managed by Flink. The states mentioned earlier are all managed states.
- The raw state only stores data from Flink-provided streams. Flink views the raw state as a series of bytes.
We recommend using managed states in actual production environments. It is the focus of this article.
How to Use States in Flink
The following figure shows how to use the keyed state in code. Consider the
StreamGroupedReduce class used by the sum method in the preceding word count code as an example.
The following figure shows a detailed description of the
FromElementsFunction class and illustrates how to use the operator state in code.
Before introducing the checkpoint execution mechanism, we need to understand the storage of states because states are essential for the persistent backup of checkpoints.
Classification of State Backends
The following figure shows three built-in state backends in Flink.
FsStateBackend are stored in Java heap during runtime.
FsStateBackend persistently saves data in the form of files to remote storage only when a checkpoint is executed.
RocksDBStateBackend uses RocksDB (an LSM database and combines memory and disks) to store states.
Following are the ways to execute
- Support for asynchronous checkpoints (default): The storage format is CopyOnWriteStateMap.
- Support for synchronous checkpoints only: The storage format is NestedStateMap.
HeapKeyedStateBackend is used in MemoryStateBackend, by default, the maximum data volume is 5 MB during checkpoint-based data serialization.
RocksDBKeyedStateBackend, each state is stored in a separate column family. The keyGroup, Key, and Namespace are serialized and stored in the database in the form of keys.
Detailed Explanation of Checkpoint Execution
This section explains the step by step checkpoint execution process. As shown in the following figure, a checkpoint coordinator is located on the left, a Flink job that consists of two source nodes and one sink node is in the middle, and persistent storage is located on the right, which is provided by the Hadoop Distributed File System (HDFS) in most scenarios.
Step 1) The checkpoint coordinator triggers a checkpoint to all source nodes.
Step 2) The source nodes broadcast a barrier downstream. The barrier is at the core of the Chandy-Lamport distributed snapshot algorithm. The downstream tasks execute the checkpoint only after receiving the barriers of all inputs.
Step 3) After completing state backup, the tasks send the backup data address, namely, the state handle, to the checkpoint coordinator.
Step 4) After receiving the barriers of the two upstream inputs, the downstream sink node executes the local snapshot. The following figure shows the process of executing the RocksDB incremental checkpoint. RocksDB flushes full data to the disk, as shown by the red triangle. Then, Flink implements persistent backup for the non-uploaded files, as shown by the purple triangle.
Step 5) After executing its checkpoint, the sink node returns the state handle to the checkpoint coordinator.
Step 6) After receiving the state handles of all tasks, the checkpoint coordinator determines that the checkpoint is globally completed and then backs up a checkpoint metafile to the persistent storage.
EXACTLY_ONCE Semantics of Checkpoints
To implement the EXACTLY ONCE semantics, Flink uses an input buffer to cache data during alignment and processes the data after alignment is completed. To implement the AT LEAST ONCE semantics, Flink directly processes the collected data without caching it. In this case, data may be processed multiple times when being restored. The following figure shows the schematic diagram of checkpoint alignment, which is sourced from the official Flink documentation.
The checkpoint mechanism only ensures EXACTLY ONCE for Flink computing. End-to-end EXACTLY ONCE requires the support of source and sink nodes.
Differences between Savepoints and Checkpoints
Savepoints and checkpoints are used to resume jobs. The following table lists the differences between them.