Deploy Apache Flink Natively on YARN or Kubernetes?

Apache Flink Standalone Clusters

  • Isolation: When multiple jobs run in one cluster, tasks of different jobs may be executed on the same TM, causing the resources used by the threads (CPU/MEM) cannot be controlled and affect each other, and even one task can cause the entire TM to be out of memory and affects the jobs on it. The scheduling of multiple jobs is also in the same JM, which also has the problem of being affected by the problematic job.
  • Multi-Tenant Resource Usage (Quota) Management: The total amount of Job resources used by users cannot be controlled, and the coordinated management of resources among tenants is lacking.
  • Cluster Availability: Although the JM can be deployed with Standby machines to support high availability, the lack of maintenance of JM and TM processes inevitably causes too many processes to be down and the entire cluster is unavailable, due to the above isolation problems.
  • Cluster O&M: For version upgrade, capacity scaling, and other, complex O&M operations are required.

Native Integration of Flink and YARN

Apache Flink Standalone Clusters on YARN

  • Multiple YARN applications can be started for multiple jobs. Each app is a standalone cluster and runs independently. Moreover, through isolation, such as with cgroups, supported by YARN, the mutual influence between multiple tasks is avoided, and the isolation problem is solved.
  • Apps of different users can also be run in different YARN scheduling queues to solve the multi-tenant problem through the Queue Quota management capability.
  • At the same time, the YARN policy for restarting, retrying and rescheduling the app process, can be used to make the Flink Standalone cluster have high availability.
  • With the simple modification for parameters and the configuration file, Flink jars can be distributed through the Distributed Cache of YARN, thus facilitating upgrade and capacity scaling.
  • The Cluster scale (the number of TMs) is statically specified by the parameters when the YARN app is started. The compilation optimization of Flink itself makes it difficult to estimate the resource demand before running, so it is difficult to rationalize the number of TMs. If the number of TMs is large, resources will be wasted. Otherwise, if it is small, the job execution speed will be affected or may not even be able to run.
  • The resource size owned by each TM is also statically specified by parameters, and it is also difficult to estimate the actual needs. Different TM sizes cannot be applied dynamically for different task resource requirements, and only TM with the same specification size can be set, which makes it difficult to place an exact integer number of tasks, and waste the remaining resources.
  • App startup (1. Submit YARN App) and Flink job submission (7. Submit Job) needs to be completed in two stages, which will make the submission efficiency of each task low, and reduce the resource flow rate of the cluster.

FLIP-6: Deployment and Process Model

  • Dispatcher: It is responsible for communicating with the Client to receive job submission and generate JobManager. The lifecycle can span across jobs.
  • ResourceManager: It integrates with different resource scheduling systems to implement resource scheduling (application/release), and manage Container/TaskManager. Similarly, the lifecycle can span across jobs.
  • JobManager: It is responsible for scheduling and execution of the computing logic of the job. Each job has one instance.
  • TaskManager: It registers with RM to report the status of resources. It receives task execution from JM, and reports status.

Native Integration of Apache Flink and YARN



Resource Profile

Native Integration of Flink and Kubernetes

Apache Flink Standalone Clusters on Kubernetes

Native Integration of Apache Flink and Kubernetes


Original Source




Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium


Derailed Productivity: A Case Study in Triaging Engineering Interruptions

Passwords are like underwear — Keep them hidden using Azure MSI

Classes in Python: Fundamentals for Data Scientists

Hash Data Structures Quiz

Wen Fixed-Rate ETH 2.0 Staking?

Fighting Coronavirus: Freshippo Reveals 12 Key Technologies to Achieve 0 Faults over a Year (Part…

Intelligent O&M Platforms on the Cloud Help Enterprises in Innovation and Iteration

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:

More from Medium

Set up Multi-Datacenter Cassandra Clusters in GKE with K8ssandra and Cloud DNS

Service Mesh, Istio and Why Do We Need It

What is an Operator in K8s and why FPGAs need one in Data Centers

Scaling Airflow Workers in EKS