Analysis of Lucene — Basic Concepts


  1. Scalable, High-Performance Indexing
  2. Powerful, Accurate and Efficient Search Algorithms

Basic Concepts




Term and Term Dictionary


Sequence Number

  1. A DocId is not unique across the whole Index; it is unique only within a Segment. Lucene does this mainly to optimize writing and compression. How, then, can a Doc be uniquely identified at the Index level? The Segments are ordered, so the DocIds of each Segment can be offset by the number of docs in the Segments before it. For example, suppose an Index has two Segments with 100 docs each. Within each Segment the DocIds are 0–99, but at the Index level the DocIds of the second Segment are converted to the range 100–199.
  2. DocIds are unique within a Segment and are assigned incrementally from zero. This does not mean that the DocIds are always contiguous: when a Doc is deleted, it leaves a gap.
  3. The DocId corresponding to a document can change, usually when Segments are merged.
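
The conversion from Segment-level to Index-level DocIds can be sketched in plain Java. This is a toy model, not Lucene code: the class DocIdMapping and its methods are hypothetical names chosen for illustration, and every Segment is assumed to be non-empty. Lucene itself exposes the per-Segment offset as the docBase field of LeafReaderContext.

```java
import java.util.Arrays;

/** Toy model of Segment-local to Index-level DocId conversion (hypothetical names). */
final class DocIdMapping {

    /**
     * Computes the base offset of each segment. segmentSizes[i] is the number
     * of DocId slots in segment i, counting slots left behind by deleted docs.
     */
    static int[] docBases(int[] segmentSizes) {
        int[] bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;          // first Index-level DocId of segment i
            next += segmentSizes[i];
        }
        return bases;
    }

    /** Index-level DocId = base of the segment + DocId inside the segment. */
    static int toGlobal(int[] bases, int segment, int localDocId) {
        return bases[segment] + localDocId;
    }

    /** Finds the segment that owns an Index-level DocId (last base <= globalDocId). */
    static int segmentOf(int[] bases, int globalDocId) {
        int pos = Arrays.binarySearch(bases, globalDocId);
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        int[] bases = docBases(new int[] {100, 100});
        System.out.println(toGlobal(bases, 1, 0));   // prints 100
        System.out.println(toGlobal(bases, 1, 99));  // prints 199
        System.out.println(segmentOf(bases, 150));   // prints 1
    }
}
```

Because the bases depend on the order and size of the Segments, a merge that rewrites Segments also changes the Index-level DocIds, which is why a DocId should not be treated as a stable identifier.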

Index type

  • stored: indicates whether the original value of the field is saved. If false, Lucene does not store the value of the field, and the documents returned in query results contain only the stored fields.
  • tokenized: indicates whether the field value is tokenized. In Lucene, only the TextField field needs to be tokenized.
  • termVector: Simply put, a term vector saves all the information related to the terms of a document, including the term values, frequencies and positions. It is a per-document inverted index that makes it possible to look up all the term information of a document by docid. Enabling term vectors is not recommended for short fields, because all the term information can be obtained by tokenizing the field again. It is recommended for longer fields or for fields that are expensive to tokenize. Term vectors have two main uses: keyword highlighting and similarity matching between documents (more-like-this).
  • omitNorms: Norms stands for normalization. Lucene lets every field in every document save a normalization factor, a coefficient used for scoring during searches. A norm takes only one byte, but a separate norm is saved for every field of every document, and all of the norms data is loaded into memory, so enabling norms consumes additional storage space and memory. If you disable norms, however, you cannot use index-time boosting (Elasticsearch officially recommends query-time boosting instead) or length normalization.
  • indexOptions: Lucene provides five optional parameters for inverted indexes (NONE, DOCS, DOCS_AND_FREQS, DOCS_AND_FREQS_AND_POSITIONS, DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS), which are used to select whether the field needs to be indexed, and what content to index.
  • docValuesType: DocValues is a forward index introduced in Lucene 4.0 (a column store that maps docid to field value), which greatly improves the efficiency of sorting, faceting and aggregation. DocValues is a storage structure with a strong schema, so a field with DocValues enabled must use exactly the same type in every document. Currently Lucene provides only five types: NUMERIC, BINARY, SORTED, SORTED_NUMERIC and SORTED_SET.
  • dimension: Lucene supports indexing of multi-dimensional data, employing special indexing to optimize queries of multi-dimensional data. The most typical usage scenario of this type of data is an index of geographical locations. This is the indexing method generally used for latitude and longitude data.
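
To make the difference between these structures concrete, here is a minimal plain-Java sketch. It is not Lucene's implementation and uses no Lucene API; the class ToyFieldIndex and all of its methods are hypothetical, and whitespace splitting stands in for a real tokenizer. It keeps three structures side by side: an inverted index (term to docids, frequencies and positions, roughly what the richer indexOptions levels record), a term vector (docid to terms) and a DocValues-like column (docid to a numeric value). In Lucene itself, the corresponding switches are configured on FieldType.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy single-field index (hypothetical names, for illustration only). */
final class ToyFieldIndex {
    // Inverted index: term -> (docId -> positions of the term in that doc).
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();
    // Term vectors: docId -> (term -> positions), a per-document inverted index.
    private final Map<Integer, Map<String, List<Integer>>> termVectors = new HashMap<>();
    // DocValues-like forward column: docId -> numeric value.
    private final Map<Integer, Long> numericColumn = new HashMap<>();

    void add(int docId, String text, long value) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        // Whitespace splitting is a stand-in for a real analyzer.
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
            vector.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        termVectors.put(docId, vector);
        numericColumn.put(docId, value);
    }

    /** What a DOCS-level posting list answers: which docs contain the term. */
    Set<Integer> docs(String term) {
        Set<Integer> result = new TreeSet<>();
        TreeMap<Integer, List<Integer>> p = postings.get(term);
        if (p != null) {
            result.addAll(p.keySet());
        }
        return result;
    }

    /** Positions of the term in one doc; an empty list if it does not occur. */
    List<Integer> positions(String term, int docId) {
        TreeMap<Integer, List<Integer>> p = postings.get(term);
        if (p == null || !p.containsKey(docId)) {
            return new ArrayList<>();
        }
        return p.get(docId);
    }

    /** Term frequency: how often the term occurs in one doc. */
    int freq(String term, int docId) {
        return positions(term, docId).size();
    }

    /** Term-vector lookup: all terms of one doc, found by docId without re-tokenizing. */
    Set<String> termsOf(int docId) {
        Set<String> result = new TreeSet<>();
        Map<String, List<Integer>> vector = termVectors.get(docId);
        if (vector != null) {
            result.addAll(vector.keySet());
        }
        return result;
    }

    /** Forward lookup by docId, the access pattern used for sorting and aggregation. */
    Long value(int docId) {
        return numericColumn.get(docId);
    }

    public static void main(String[] args) {
        ToyFieldIndex idx = new ToyFieldIndex();
        idx.add(0, "lucene is fast lucene", 10L);
        idx.add(1, "search is fast", 20L);
        System.out.println(idx.docs("fast"));
        System.out.println(idx.positions("lucene", 0));
        System.out.println(idx.termsOf(1));
        System.out.println(idx.value(1));
    }
}
```

The inverted index answers "which docs contain this term", while the term vector and the numeric column both answer "what does this doc contain". That is why the latter two suit highlighting, more-like-this, sorting and aggregation.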

Elasticsearch Data Types





Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:
