Analyzing Elasticsearch Performance with Lucene

Apr 8, 2019

By Yizheng

Elasticsearch is a very popular distributed search engine that provides powerful and easy-to-use query and analysis features, including full-text search, fuzzy query, multi-condition combination query, and geo location query. It also features analysis and aggregation capabilities. Analyzing query performance in a broad sense is very complex due to the wide range of query scenarios and many other factors such as machine models, parameter configuration, and cluster size. This article analyzes the query overhead of several main query scenarios from the perspective of query principles, and provides rough performance figures for your reference.

Lucene Query Principle

This section mainly introduces some background knowledge about Lucene. You can skip this section if you are already familiar with it.

Data Structure and Query Principle of Lucene

Lucene is the underlying layer of Elasticsearch, and Lucene performance determines the Elasticsearch query performance.

Lucene's behavior is largely determined by a handful of data structures, which govern how data is retrieved. Let’s take a brief look at these data structures:

FST: FSTs save the term dictionary and support single-term, term range, term prefix, and wildcard queries.
Postings list: postings lists save a list of docIDs corresponding to each term, using the skipList structure for increased performance.
BKD-Tree: BKD-Tree is a data structure for saving multi-dimensional points and quickly searching numeric data types (including spatial points).
DocValues: DocValues is a docID-based columnar storage structure. Thanks to the characteristics of columnar storage, it significantly improves sorting and aggregation performance.

Merge Results for Combined Conditions

After getting familiar with the data structures in Lucene and the basic query principles, we know the following:

For a single-term query, Lucene reads the postings list for this term, which contains a list of ordered docIDs.
For string range/prefix/wildcard queries, Lucene gets all terms that meet the conditions from the FST, then finds the postings list for each of these terms, and finally finds the matching docs.
For a numeric range lookup, Lucene uses BKD-Tree to find a collection of unordered docIDs that meet the conditions.
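The first two lookups above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a sorted list of terms stands in for the FST, the toy `POSTINGS` data and the function names are hypothetical, and none of this is Lucene's actual API.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy index: term -> sorted docIDs (the postings list).
POSTINGS = {
    "apple": [1, 4, 9],
    "apricot": [2, 4],
    "banana": [3, 9],
    "cherry": [5],
}
# Sorted term dictionary; a plain sorted list plays the role of the FST here.
TERMS = sorted(POSTINGS)


def term_query(term):
    """Single-term query: read the postings list for the term."""
    return POSTINGS.get(term, [])


def range_query(low, high):
    """String range query: enumerate terms in [low, high], then merge postings."""
    start = bisect_left(TERMS, low)
    end = bisect_right(TERMS, high)
    docs = set()
    for term in TERMS[start:end]:
        docs.update(POSTINGS[term])
    return sorted(docs)
```

Note that the range query has to visit every matching term before it can return any docID, which is why its cost grows with the number of terms in the range.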

The question is, if specific combined query conditions are given, how does Lucene combine results for each condition to get the final results? Simply put, how can we find the union and intersection of two sets?

1. Find the Intersection of N Postings Lists

As described in the Lucene principle analysis article above, we can use skipList to skip invalid docs and find the intersections of N postings lists.
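A minimal sketch of this kind of intersection follows. It assumes each postings list is a sorted Python list, and it uses binary search (`bisect_left`) as a stand-in for the skipList, since both avoid visiting every docID between the current position and the target. The function names are illustrative and do not come from Lucene.

```python
from bisect import bisect_left


def advance(postings, pos, target):
    """Return the first index >= pos whose docID is >= target."""
    return bisect_left(postings, target, pos)


def intersect(lists):
    """Intersect N sorted docID lists by leapfrogging over non-matching docs."""
    if not lists or any(not lst for lst in lists):
        return []
    positions = [0] * len(lists)
    result = []
    candidate = lists[0][0]
    while True:
        agreed = True
        for i, postings in enumerate(lists):
            positions[i] = advance(postings, positions[i], candidate)
            if positions[i] == len(postings):
                return result          # one list is exhausted, nothing more can match
            doc = postings[positions[i]]
            if doc != candidate:
                candidate = doc        # raise the candidate and start over
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate += 1             # look for the next match
```

For example, `intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9, 10], [0, 3, 9]])` returns `[3, 9]`.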

2. Find the Union of N Postings Lists

Approach 1: Keep the N ordered lists as they are and place the head of each list into a priority queue (a min-heap). The union can then be iterated lazily: pop the smallest docID from the heap and push the next docID from the same list. Each sub-list can still use its skipList to skip ahead. This approach suits scenarios where the number of postings lists (N) is relatively small.
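As a rough illustration of Approach 1, the sketch below merges sorted Python lists through the standard `heapq` module. The function name `union` is hypothetical, and the skipping within each sub-list is left out for brevity.

```python
import heapq


def union(lists):
    """Lazily yield the union of N sorted docID lists in ascending order."""
    heap = []
    for idx, postings in enumerate(lists):
        if postings:
            heap.append((postings[0], idx, 0))   # (docID, list index, position)
    heapq.heapify(heap)
    last = None
    while heap:
        doc, idx, pos = heapq.heappop(heap)
        if doc != last:                          # drop docIDs seen in another list
            yield doc
            last = doc
        if pos + 1 < len(lists[idx]):
            heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
```

Every step costs O(log N) heap work, which is cheap for a handful of lists but adds up when N is large.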

Approach 2: If there are too many postings lists (N is relatively big), the first approach is not cost-effective. In this situation, we can directly merge results into an ordered docID array.

Approach 3: In the second approach, the original docIDs are stored directly, so memory usage grows linearly with the number of docIDs. Once the number of matching docs exceeds a certain threshold, constructing a BitSet reduces memory usage and makes finding unions and intersections more efficient. A docID stored as an integer takes 32 bits but only 1 bit in a BitSet, while the size of the BitSet depends on the total number of docs in the segment. The cost-effectiveness of a BitSet can therefore be evaluated from the total number of docs in the segment and the number of matching docs.
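The trade-off between Approach 2 and Approach 3 comes down to simple arithmetic, sketched below. It only compares memory footprints using the 32-bit and 1-bit figures from the text. The function names are illustrative, and the break-even point it implies (matching docs above 1/32 of the segment) is not necessarily the threshold Lucene itself uses.

```python
def array_bytes(num_matching_docs):
    """Sorted int array: one 32-bit docID per matching doc."""
    return 4 * num_matching_docs


def bitset_bytes(max_doc):
    """BitSet: one bit per doc in the segment, matching or not."""
    return (max_doc + 7) // 8


def prefer_bitset(num_matching_docs, max_doc):
    """True when the BitSet is the smaller of the two representations."""
    return bitset_bytes(max_doc) < array_bytes(num_matching_docs)
```

For a segment of 10 million docs the BitSet always takes 1,250,000 bytes, so it wins for 5 million matching docs (20,000,000 bytes as an array) and loses for 10,000 matching docs (40,000 bytes as an array).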

3. How BKD-Tree Results Are Combined with Other Results

Since docIDs found through BKD-Tree are unordered, we should either convert them into ordered docID arrays or construct BitSet before merging them with other results.
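A small sketch of the BitSet route is shown below. A Python integer serves as the bit set, and the helper names are hypothetical. Once the unordered docIDs are packed into bits, their original order no longer matters and they can be intersected with any sorted postings list.

```python
def to_bitset(doc_ids):
    """Pack unordered docIDs into a Python int used as a bit set."""
    bits = 0
    for doc in doc_ids:
        bits |= 1 << doc
    return bits


def intersect_with_postings(bits, postings):
    """Keep the docIDs of a sorted postings list whose bit is set."""
    return [doc for doc in postings if (bits >> doc) & 1]
```

For example, `intersect_with_postings(to_bitset([42, 7, 19, 3]), [3, 5, 7, 40, 42])` returns `[3, 7, 42]`.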

Query Order Optimization

If a lookup contains several conditions, it is optimal to query by low-cost conditions first and then iterate over small result collections. Lucene has made a lot of optimizations in this regard. Before running queries, Lucene will first evaluate the cost of each query, and then decide an appropriate query order accordingly.
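The effect of ordering can be seen in a toy model. The sketch below assumes each condition is a Python set of docIDs and uses the set size as the cost estimate, which is a simplification of whatever Lucene really estimates. `conjunction` is a hypothetical helper that iterates the cheapest condition and only probes the others.

```python
def conjunction(clauses):
    """Intersect clauses, iterating the cheapest one and probing the rest.

    Each clause is a set of docIDs; len() serves as the cost estimate.
    Returns (matching docIDs, number of membership probes performed).
    """
    if not clauses:
        return [], 0
    ordered = sorted(clauses, key=len)      # cheapest clause first
    lead, others = ordered[0], ordered[1:]
    probes = 0
    matches = []
    for doc in sorted(lead):
        ok = True
        for other in others:
            probes += 1
            if doc not in other:
                ok = False
                break
        if ok:
            matches.append(doc)
    return matches, probes
```

With one clause of 1,000 docs and another of 3 docs, the work is bounded by the 3-doc clause: only 3 probes are needed instead of 1,000.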

Result Sorting

By default, Lucene sorts by score (calculated score values). If other sort fields are specified, Lucene will sort results in the specified order. Does sorting significantly impact performance? Sorting does not order all the docs found. Instead, Lucene maintains a heap and only guarantees that the first (Offset+Size) docs are ordered. Sorting performance therefore depends on (Offset+Size), the number of docs found, and the overhead of reading docValues. Since (Offset+Size) is usually not very large and reading docValues is very efficient, sorting usually doesn’t impact performance very much.
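The bounded-heap idea can be sketched as follows. This is an illustration with Python's `heapq`, not Lucene's collector code; `top_hits` and its `(docID, score)` input format are assumptions made for the example.

```python
import heapq


def top_hits(scored_docs, offset, size):
    """Return the page [offset, offset + size) of docIDs by descending score.

    At most offset + size entries are kept in the heap at any time, so the
    cost is about O(n log(offset + size)) instead of a full sort.
    """
    k = offset + size
    if k <= 0:
        return []
    heap = []  # min-heap of (score, docID); the root is the weakest kept hit
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            heapq.heapreplace(heap, (score, doc_id))
    ranked = sorted(heap, reverse=True)
    return [doc_id for _, doc_id in ranked[offset:]]
```

Only the small heap is ever sorted, so a query that matches millions of docs still orders just (Offset+Size) of them.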

Performance Analysis for Various Query Scenarios

The previous section explained some query-related theories. In this section, we will combine theories with practice and analyze query performance for specific scenarios based on some test values. In the test, we will use a single-shard 64-core machine with SSDs and analyze the computing overhead for several scenarios, ignoring any influences from the OS cache. The test results are for your reference only.

Single-Term Query

Create an index and a shard in ES. No replica is present. Prepare 10 million rows of data, each row containing only a few tags and a unique ID. Write all the data into the created index. Tag1 only has two values: a and b. Now, try to find entries with Tag1=a from the 10 million data rows (about 5 million entries). How long does it take to run the following query?
Request:
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "Tag1": "a"
        }
      }
    }
  },
  "size": 1
}
Response:
{"took":233,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5184867,"max_score":1.0,"hits":...}

This request takes 233 ms and returns a total of 5,184,867 matching data entries.

We know that the query condition Tag1=”a” intends to search the postings list of Tag1=”a”. The length of this postings list is 5,184,867, which is very long. Most of the time spent is scanning this postings list. In this example, the benefit of scanning the postings list is to get the total records that meet the condition. Because the constant_score is set in the condition, it only needs to return one matching record, without having to calculate scores. In scenarios where score calculation is required, Lucene will calculate scores based on how often the term appears in a doc and return sorted scores.

So a postings list of more than 5 million docIDs can be scanned in 233 ms. Since a single request is executed on a single thread, a conservative estimate is that one CPU core can scan roughly 10 million docs in inverted indexes per second.
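The scan rate implied by this response can be checked with a one-line calculation. The sketch assumes the whole `took` value is spent scanning, which overstates the scan time slightly because `took` also includes fixed per-request overhead. `docs_per_second` is a helper defined only for this example.

```python
def docs_per_second(hits, took_ms):
    """Scan rate implied by a response's hit count and `took` value."""
    return hits / (took_ms / 1000.0)


# Figures from the Tag1 = "a" response above.
long_list_rate = docs_per_second(5_184_867, 233)
print(f"{long_list_rate / 1e6:.1f} million docIDs per second")
```

The measured rate comes out at a little over 22 million docIDs per second, so 10 million per second is a cautious lower bound.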

Now let’s switch to a shorter postings list that has a total length of 10,000 and takes 3 ms to scan.

{"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":10478,"max_score":1.0,"hits":...}

Term Combination Query

Let’s try to find the intersection of two term queries first:

Consider a term combination query that includes two postings lists with a length of 10,000 and 5,000,000 respectively and has 5,000 matching data entries after the merge. How is the query performance?
Request:
{
  "size": 1,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "Tag1": "a"  //length of postings list 5,000,000
              }
            },
            {
              "term": {
                "Tag2": "0" // length of postings list 10,000
              }
            }
          ]
        }
      }
    }
  }
}
Response:
{"took":21,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5266,"max_score":2.0,"hits":...}

This request takes 21 ms, and the main action is to find the intersection of the two postings lists. Therefore, our analysis focuses on the skipList performance.

In this example, the postings lists have lengths of 10,000 and 5,000,000 respectively, and over 5,000 docs still match the condition after the merge. For the postings list with a length of 10,000, skipping is almost unnecessary because about half of its docs meet the condition; for the postings list with a length of 5,000,000, each skip passes over about 1,000 docs on average. The minimum storage unit for postings lists is a block. A block generally contains 128 docIDs, and the skip operation is not performed inside a block. Therefore, even if it’s possible to skip to a specific block, the docIDs in that block still need to be scanned in sequence. In this example, roughly tens of thousands of docIDs are actually scanned, so around 20 ms is within the expected range.
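The interaction between skipping and blocks can be illustrated with a toy model. In the sketch below, whole blocks are skipped using each block's largest docID, and the block that may hold the target is then scanned sequentially. The 128-docID block size comes from the text; everything else (the function names, the stateless lookup, binary search over block maxima) is an assumption for illustration and not Lucene's on-disk format.

```python
from bisect import bisect_left

BLOCK_SIZE = 128  # docIDs per block, as stated above


def make_blocks(postings):
    """Split a sorted postings list into fixed-size blocks."""
    return [postings[i:i + BLOCK_SIZE] for i in range(0, len(postings), BLOCK_SIZE)]


def contains(blocks, block_max, target):
    """Check whether target is in the list; return (found, docIDs scanned)."""
    b = bisect_left(block_max, target)   # first block whose max docID >= target
    if b == len(blocks):
        return False, 0                  # target is beyond the last block
    scanned = 0
    for doc in blocks[b]:                # no skipping inside a block
        scanned += 1
        if doc >= target:
            return doc == target, scanned
    return False, scanned


# Hypothetical postings list holding the 5,000 even docIDs below 10,000.
blocks = make_blocks(list(range(0, 10000, 2)))
block_max = [block[-1] for block in blocks]
```

Here `contains(blocks, block_max, 600)` returns `(True, 45)`: the lookup jumps straight to the third block but still reads 45 docIDs inside it before reaching 600.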

Now let’s find the union of the term queries. Replace “must” in the preceding bool query with “should”, and here is the query result:

{"took":393,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5190079,"max_score":1.0,"hits":...}

It takes 393 ms to complete the operation. Therefore, it takes longer to find the union than to run a single query.

String Range Query

Consider 10 million data entries. Each RecordID is a UUID, and each doc has a unique UUID. Find UUIDs that begin with 0–7. There are probably over 5 million results. Let's have a look at the query performance in this scenario.
Request:
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "RecordID": {
            "gte": "0",
            "lte": "8"
          }
        }
      }
    }
  },
  "size": 1
}
Response:
{"took":3001,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5185663,"max_score":1.0,"hits":...}

Assume that we are going to query UUIDs beginning with "a". We may get around 600,000 results. How about the performance?
Request:
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "RecordID": {
            "gte": "a",
            "lte": "b"
          }
        }
      }
    }
  },
  "size": 1
}
Response:
{"took":379,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":648556,"max_score":1.0,"hits":...}

For this query, we will mainly analyze the FST query performance. Based on previous results, we can see that FST queries perform much worse than scanning postings lists. When scanning postings lists, it takes less than 300 ms to scan 5 million data entries. However, it takes 3 seconds to scan the same amount of data when using FST scans, almost 10 times slower. For UUID strings, FST range scanning can search about 1 million entries per second.
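The comparison can be made concrete by dividing each `took` value by its hit count. The sketch below uses only the figures from the responses above and assumes, as a simplification, that all of the elapsed time is attributable to the scan. `micros_per_entry` is a helper defined just for this example.

```python
def micros_per_entry(hits, took_ms):
    """Average cost per matching entry, in microseconds."""
    return took_ms * 1000.0 / hits


postings_cost = micros_per_entry(5_184_867, 233)    # single-term query
fst_wide_cost = micros_per_entry(5_185_663, 3001)   # RecordID range "0".."8"
fst_narrow_cost = micros_per_entry(648_556, 379)    # RecordID range "a".."b"

slowdown = fst_wide_cost / postings_cost
print(f"postings: {postings_cost:.3f} us, FST: {fst_wide_cost:.3f} us, "
      f"slowdown: {slowdown:.1f}x")
```

Both range queries cost close to 0.58 microseconds per entry, which suggests the cost scales with the number of matching terms, and the wide range query is about 13 times slower than the postings list scan over a similar number of docs.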

String Range Query Plus Term Query

Consider a string range query (5 million matching entries) and two term queries (5,000 matching entries). A total of 2,600 entries meet the conditions. Let's test the performance.
Request:
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "RecordID": {
                  "gte": "0",
                  "lte": "8"
                }
              }
            },
            {
              "term": {
                "Tag1": "a"
              }
            },
            {
              "term": {
                "Tag2": "0"
              }
            }
          ]
        }
      }
    }
  },
  "size": 1
}
Response:
{"took":2849,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":2638,"max_score":1.0,"hits":...}

In this example, most of the query time is spent scanning the FST. First, the terms that match the range condition are obtained from the FST. Next, the docID list of each term is read to construct a BitSet. Finally, Lucene finds the intersection of this BitSet and the postings lists of the two term queries.

Numeric Range Query

For the numeric type, we also search 10 million data entries for 5 million targets and see how it performs.
Request:
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "Number": {
            "gte": 100000000,
            "lte": 150000000
          }
        }
      }
    }
  },
  "size": 1
}
Response:
{"took":567,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5183183,"max_score":1.0,"hits":...}

In this scenario, we mainly test the BKD-Tree performance. We can see that the BKD-Tree query performance is pretty good. It takes around 500 ms to find 5 million docs, about twice the time needed to scan inverted indexes. Compared with FST, BKD-Tree has much higher performance. Geo location queries are also implemented by BKD-Tree, and have high performance.

Numeric Range Query Plus Term Query

Now, we'll cover a complex query scenario: the numeric range includes 5 million data entries, and another two term conditions are also added to the query, with over 2,600 final entries that match the conditions. Let's evaluate the performance for this scenario.
Request:
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "Number": {
                  "gte": 100000000,
                  "lte": 150000000
                }
              }
            },
            {
              "term": {
                "Tag1": "a"
              }
            },
            {
              "term": {
                "Tag2": "0"
              }
            }
          ]
        }
      }
    }
  },
  "size": 1
}
Response:
{"took":27,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":2638,"max_score":1.0,"hits":...}

The results are actually unexpected. This query operation only takes 27 ms! In the previous example, the numeric range query took more than 500 ms. Although we have added another two conditions, the query time is as short as 27 ms. Why is this?

Actually, Lucene has an optimization: in the underlying layer is a query called IndexOrDocValuesQuery, which automatically determines whether Index (BKD-Tree) or DocValues should be queried. In this example, Lucene first finds the intersection of the two term queries, and as a result, gets a little over 5,000 docIDs. Then it reads the docValues of these 5,000+ docIDs and searches for data entries that match the numeric range values. Since it only needs to read the docValues of around 5,000 docs, it doesn’t take so long.
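The decision can be sketched as a choice between two execution paths. The code below is a simplified model, not the logic of Lucene's IndexOrDocValuesQuery: the function name, the parameters, and the plain size comparison used to pick a path are all assumptions made for illustration.

```python
def range_in_conjunction(doc_values, lead_docs, low, high, index_cost):
    """Pick the cheaper way to apply a numeric range inside a conjunction.

    doc_values : list where doc_values[docID] is the doc's numeric value
    lead_docs  : sorted docIDs already matched by the other conditions
    index_cost : estimated number of docs the range matches in the index
    """
    if len(lead_docs) < index_cost:
        # DocValues path: verify only the candidate docs, one lookup each.
        strategy = "doc_values"
        matches = [d for d in lead_docs if low <= doc_values[d] <= high]
    else:
        # Index path: collect every doc in the range, then intersect.
        strategy = "index"
        in_range = {d for d, v in enumerate(doc_values) if low <= v <= high}
        matches = [d for d in lead_docs if d in in_range]
    return strategy, matches
```

In the scenario above, the other conditions leave about 5,000 candidates while the range alone matches about 5 million docs, so checking 5,000 docValues is far cheaper than collecting 5 million docIDs from the BKD-Tree.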

Conclusion

Generally speaking, the more docs that need to be scanned, the poorer the performance is.
A single inverted index scan can find 10 million entries per second, indicating very high performance. If a Term query needs to be performed on a numeric type, it is recommended to convert the numeric type to the string type.
When a skipList is used to merge inverted indexes, the performance depends on the number of scans on the shortest index and the overhead of each skip (for example, the sequential scan in a block).
FST-related string queries are much slower than inverted index queries (Wildcard queries have a more significant impact on performance, and are not described in this article).
BKD-Tree-based numeric queries have good performance. However, because the docIDs returned by a BKD-Tree are unordered, it isn’t possible to skip ahead through them the way a skipList allows with postings lists. Lucene therefore needs to construct a BitSet to get the intersection with other conditions, and that is a time-consuming step. Lucene mitigates this in some cases with IndexOrDocValuesQuery.

I will end this article with a question: since more data to scan means poorer performance, is it possible to stop a query after enough data has been obtained?

Reference: https://www.alibabacloud.com/blog/analyzing-elasticsearch-performance-with-lucene_594636?spm=a2c41.12734722.0.0