Analysis of Flink Job Problems and Optimization Practices

  1. Mechanism Explanation
  2. Performance Orientation
  3. Classic Scenario Optimization
  4. Memory Tuning

Checkpoint Mechanism

1. What Is a Checkpoint?

2. Case Analysis

Analysis Process of a Checkpoint

3. Backpressure Generation and Flink’s Backpressure Processing

  1. First, deduct the cutoff part, which is 25% by default. For an 8 GB (8,192 MB) container, the available memory is 8,192 MB × 0.75 = 6,144 MB.
  2. Network buffers occupy 10% of the available memory: 6,144 MB × 0.1 = 614.4 MB.
  3. The on-heap or off-heap managed memory is the available memory minus the network buffers, multiplied by 0.8: (6,144 MB − 614.4 MB) × 0.8 = 4,423.68 MB.
  4. The memory available to user code is the remaining 20%: (6,144 MB − 614.4 MB) × 0.2 = 1,105.92 MB.
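The four steps above can be checked with a small calculation. This is a language-neutral sketch in Python, not a Flink API; the fractions 0.25, 0.1, and 0.8 are the default ratios quoted above:

```python
def taskmanager_memory_breakdown(total_mb, cutoff=0.25,
                                 network_fraction=0.1, managed_fraction=0.8):
    """Sketch of the TaskManager memory math described in steps 1-4."""
    available = total_mb * (1 - cutoff)      # step 1: deduct the 25% cutoff
    network = available * network_fraction   # step 2: network buffers
    remaining = available - network
    managed = remaining * managed_fraction   # step 3: on-heap/off-heap managed memory
    user = remaining - managed               # step 4: remaining 20% for user code
    return available, network, managed, user

available, network, managed, user = taskmanager_memory_breakdown(8 * 1024)
```

For an 8 GB container this reproduces the numbers above: 6,144 MB available, 614.4 MB of network buffers, 4,423.68 MB of managed memory, and 1,105.92 MB left for user code.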

Troubleshooting of Flink Jobs

1. Problem Locating Formula

  • Backpressure: Typically, the subtask immediately downstream of the last subtask under high backpressure is one of the job bottlenecks.
  • Checkpoint duration: The checkpoint duration affects the overall throughput of the job to some extent.
  • Core metrics: The metrics are the basis for accurately determining the performance of a task. Among these metrics, latency and throughput are the most critical.
  • Resource utilization: The ultimate goal of optimization is to improve resource utilization.
  1. Performance problems caused by data serialization and deserialization are often overlooked when backpressure is the main concern.
  2. For some data structures, such as HashMap and HashSet, keys require hash calculations. If a keyBy operation is performed on such data structures and their data volume is large, performance is significantly affected.
  3. Data skew is a classic problem, which will be discussed later.
  4. If the downstream sink is MySQL or HBase, buffer the records and write them in batches when certain conditions are met, such as a size or time threshold. This reduces interactions with the external system and cuts network overhead.
  5. Frequent GC, whether CMS or G1, pauses the running job. In addition, long GC pauses can prevent the TaskManager from sending heartbeats to the JobManager on time. In this case, the JobManager considers the connection to the TaskManager lost and starts a new TaskManager.
  6. A window function is used to divide infinite data into blocks. For example, with sliding windows, a window size of five minutes is not large by itself, but a slide of 1 second means a window fires every second, so each record belongs to many overlapping windows. This results in a high data overlap and a large data volume.
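The batching idea in item 4 can be sketched as follows. This is plain Python, not the Flink sink API; the flush threshold and the `flush_fn` callback are illustrative assumptions:

```python
class BufferingSink:
    """Buffers records and flushes them in batches to reduce round trips
    to an external store such as MySQL or HBase (illustrative only)."""

    def __init__(self, flush_fn, max_batch=100):
        self.flush_fn = flush_fn    # e.g. a function executing one batched INSERT
        self.max_batch = max_batch
        self.buffer = []

    def invoke(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:  # size threshold reached
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one interaction for the whole batch
            self.buffer = []

# Usage: with max_batch=2, three records trigger one flush of two records,
# leaving one record buffered until the next threshold or an explicit flush.
batches = []
sink = BufferingSink(batches.append, max_batch=2)
for record in range(3):
    sink.invoke(record)
```

In production you would also add a time-based trigger so that a slow stream does not leave records buffered indefinitely, and flush the buffer on checkpoint to avoid losing data on failure.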
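The overlap described in item 6 is easy to quantify: with a fixed slide, each record is assigned to window size divided by slide windows. A quick check of the five-minute/one-second example:

```python
window_size_s = 5 * 60  # 5-minute sliding window
slide_s = 1             # 1-second slide
# Each incoming record is assigned to this many overlapping windows,
# multiplying the effective data volume accordingly.
windows_per_record = window_size_s // slide_s
```

Here every record is processed 300 times, which is why a small window with a very fine slide can still produce a large processing load.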

2. Flink Job Optimization

