Kernel Optimization: How to Improve the Scheduling Performance of Elasticsearch Master Nodes by 40x

Image for post
Image for post

By Xingfeng, a Senior Development Engineer, Alibaba Cloud Elasticsearch Team

Released by ELK Geek



Before the introduction of pr#39499, the engine of data nodes with disabled indexes would be shut down, so it would no longer provide the query and write services. As a result, shard relocation and data replication were not allowed. When a data node is deprecated, the shard data for which the index has been disabled is lost.

After the introduction of pr#39499, the master nodes retain the disabled index in the cluster state and schedule the shards of the index. The data node uses NoOpEngine to open the index and does not support the query and write services. In this case, the overhead is much less than that of a common engine. When there are many shards in the cluster state, master node scheduling is very slow.


  • Elasticsearch version: v7.4.0
  • Dedicated master nodes: 3× 16-core 64 GB
  • Data nodes: 2× 16-core 64 GB

Created 5,000 indexes. Each index with 5 primary shards and 0 replica shards for a total of 25,000 shards. During the test, we found that it took 58s to create an index. The CPU utilization of the master node cluster was always 100%. We ran the top -Hp $ES_PID command to get IDs of busy threads. By using jstack to call stack information for master nodes, we found that this problem occurs because the masterServices thread keeps calling shardsWithState.

"elasticsearch[iZ2ze1ymtwjqspsn3jco0tZ][masterService#updateTask][T#1]" #39 daemon prio=5 os_prio=0 cpu=150732651.74ms elapsed=258053.43s tid=0x00007f7c98012000 nid=0x3006 runnable  [0x00007f7ca28f8000]  java.lang.Thread.State: RUNNABLE
at java.util.Collections$UnmodifiableCollection$1.hasNext(java.base@13/
at org.elasticsearch.cluster.routing.RoutingNode.shardsWithState(
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.getDiskUsage(
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canRemain(
at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canRemain(
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.decideMove(
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.moveShards(
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(
at org.elasticsearch.cluster.metadata.MetaDataIndexStateService$1$1.execute(
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(
at org.elasticsearch.cluster.service.MasterService.executeTasks(
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(
at org.elasticsearch.cluster.service.MasterService.runTasks(
at org.elasticsearch.cluster.service.MasterService.access$000(
at org.elasticsearch.cluster.service.MasterService$
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(
at org.elasticsearch.cluster.service.TaskBatcher$
at org.elasticsearch.common.util.concurrent.ThreadContext$
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@13/
at java.util.concurrent.ThreadPoolExecutor$


The complexity of shard traversal for the outermost layer is O(n) while that for the inner layer to loop shards on each node is O(n/m), where n indicates the total number of shards in the cluster and m indicates the number of nodes. The overall complexity is O(n²/m). For a cluster with a fixed number of nodes, m can be considered as a constant, and the scheduling complexity of each reroute operation is O(n²).

Each time, all shards need to be traversed to find the initializing and relocating shards. Therefore, the calculation can be performed once during initialization, and then a simple update can be performed upon each shard status change, which reduces the overall complexity to O(n). To address this problem, we made a simple modification to the Elasticsearch 7.4.0 code and packaged it for testing. The response time for requests that trigger reroute, such as index creation requests, was reduced to 1.2s from 58s. Both the hot threads API and jstack of Elasticsearch show that shardsWithState is no longer a hot spot. Therefore, performance is improved significantly.

Image for post
Image for post

Temporary Solutions

When you encounter this problem in your Elasticsearch cluster, set cluster.routing.allocation.disk.include_relocations to false so that disk usage of relocating shards is ignored during master node scheduling. However, this may result in an incorrect estimation of disk usage and the risk of repeatedly triggering relocation.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store