Lucene IndexWriter: An In-Depth Introduction

By Zhaofeng Zhou (Muluo)

Preface

In the previous article, we presented a basic overview of Lucene. In this article, we will delve deeper into IndexWriter, one of Lucene’s core classes, to explore the whole data writing and indexing process in Lucene.

IndexWriter

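import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;

// initialization
Directory index = new NIOFSDirectory(Paths.get("/index"));
IndexWriterConfig config = new IndexWriterConfig(); // defaults to StandardAnalyzer
IndexWriter writer = new IndexWriter(index, config);
// create a document
Document doc = new Document();
doc.add(new TextField("title", "Lucene - IndexWriter", Field.Store.YES));
doc.add(new StringField("author", "aliyun", Field.Store.YES));
// index the document, then commit so it becomes searchable
writer.addDocument(doc);
writer.commit();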

  1. Document construction: In Lucene, a document is represented by Document and is made up of Fields. Lucene provides different types of Field, and the FieldType determines which index modes a field supports; custom field types are also possible. See the previous article for details.
  2. Writing a document: A document is written with the addDocument function, which builds different indexes depending on each FieldType. A newly written document is not searchable until IndexWriter’s commit is called. Once commit completes, Lucene guarantees that the document is persistent and searchable.

IndexWriterConfig

IndexWriterConfig provides some core parameters for advanced users to optimize performance and customize functionality. We will look at a few examples:

  • Similarity: Relevance is at the core of search, and Similarity is the abstraction for scoring algorithms. Lucene ships TF-IDF and BM25 implementations, with BM25 the default since Lucene 6. Relevance is scored both when data is written and when it is searched. Scoring at write time is called index-time boosting: normalization values are calculated and written to the index. Scoring at search time is called query-time boosting.
  • MergePolicy: Lucene’s internal data writing generates many Segments. A query has to search every segment and merge the results, so the number of Segments affects query efficiency to a certain extent, which is why segments are merged. The merging procedure is called Merge, and MergePolicy determines when a Merge is triggered.
  • MergeScheduler: After MergePolicy triggers Merge, MergeScheduler takes care of executing Merge. The merge process usually puts a heavy load on the CPU and I/O. MergeScheduler enables the customization and management of the process of merging.
  • Codec: Codecs are a core component of Lucene defining Lucene’s internal encoders and decoders for all types of indexes. Lucene configures codecs at the Config level. The main purpose is to enable processing of different data versions. Most Lucene users have little need to customize this layer and it is mostly advanced users who configure codecs.
  • IndexerThreadPool: Manages IndexWriter’s internal index thread pool (DocumentsWriterPerThread). This is also part of Lucene’s internal resource management customization.
  • FlushPolicy: FlushPolicy determines when in-memory buffers are flushed. By default the timing depends on the amount of RAM used and the number of documents buffered. FlushPolicy is invoked to make a decision on every document add, update, or delete.
  • MaxBufferedDocs: The maximum number of documents DocumentsWriterPerThread is permitted to buffer under FlushByRamOrCountsPolicy, the default FlushPolicy provided by Lucene. Exceeding this value triggers a flush.
  • RAMBufferSizeMB: The maximum amount of memory that buffered documents are allowed to use under FlushByRamOrCountsPolicy, the default FlushPolicy provided by Lucene. Exceeding this value triggers a flush.
  • RAMPerThreadHardLimitMB: In addition to the flushes decided by FlushPolicy, Lucene enforces a hard limit on the amount of memory used by a single DocumentsWriterPerThread. Exceeding the threshold forces a flush.
  • Analyzer: Splits text into tokens. This is the most frequently customized component, especially for different languages.
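
As a rough illustration of how the flush thresholds above interact, here is a toy model in plain Java. It is not Lucene’s FlushByRamOrCountsPolicy: the class ToyFlushPolicy, its constructor arguments, and the sample values are hypothetical, and for simplicity it treats the RAM budget as if it applied to a single buffer.

```java
// Toy model only: a simplified stand-in, not Lucene source code.
final class ToyFlushPolicy {
    private final int maxBufferedDocs;          // flush once this many docs are buffered
    private final long ramBufferBytes;          // flush once the buffer uses this much memory
    private final long perThreadHardLimitBytes; // forced flush regardless of the policy

    ToyFlushPolicy(int maxBufferedDocs, double ramBufferSizeMB, int perThreadHardLimitMB) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.ramBufferBytes = (long) (ramBufferSizeMB * 1024 * 1024);
        this.perThreadHardLimitBytes = perThreadHardLimitMB * 1024L * 1024L;
    }

    /** Meant to be consulted after every add, update, or delete. */
    boolean shouldFlush(int bufferedDocs, long bufferedBytes) {
        if (bufferedBytes >= perThreadHardLimitBytes) {
            return true; // hard limit: flush no matter what the policy says
        }
        if (bufferedDocs >= maxBufferedDocs) {
            return true; // document-count trigger
        }
        return bufferedBytes >= ramBufferBytes; // memory trigger
    }
}
```

With hypothetical settings such as new ToyFlushPolicy(1000, 16.0, 256), a buffer holding 10 small documents is left alone, while one holding 1000 documents or 16 MB of data is flushed.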

Core operations

IndexWriter provides several simple operation interfaces. This chapter gives a simple explanation of their features and uses while the next chapter gives a detailed breakdown of their internals. The core APIs provided by IndexWriter are as follows:

  • updateDocument: Updates a document, but not in the way a database does. A database queries for a record and then updates it, whereas Lucene queries for the document, deletes it, and adds it again. The process is delete by term followed by add document. However, this does not have the same effect as calling delete and then add yourself: only update guarantees that the delete and the add are atomic within the thread. Details of the process will be discussed in the next chapter.
  • deleteDocuments: Deletes documents. It supports two types of delete: by term and by query. The two types follow different processes inside IndexWriter. Details will be discussed in the next chapter.
  • flush: Triggers a hard flush so that every thread’s in-memory buffer is flushed to a segment file. This action frees up memory and forces data to become persistent.
  • prepareCommit/commit/rollback: Data is only searchable after a commit. Commit is a two-stage operation: prepareCommit performs the first stage, and commit is then called to complete the second. Rollback rolls back to the last commit.
  • maybeMerge/forceMerge: maybeMerge asks MergePolicy to decide whether a merge is needed, while forceMerge forces a merge.
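
The commit semantics in the list above can be sketched with a small toy model in plain Java. This is only an illustration of visibility rules, not how Lucene stores data: ToyCommitModel and its methods are hypothetical names, and the real rollback also closes the writer, which the toy ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of commit visibility; a simplification, not Lucene source code.
final class ToyCommitModel {
    private List<String> committed = new ArrayList<>();     // visible to searches
    private final List<String> pending = new ArrayList<>(); // written but not committed
    private List<String> prepared = null;                   // staged by prepareCommit

    void addDocument(String doc) {
        pending.add(doc);
    }

    /** Stage one: build the next committed state without publishing it. */
    void prepareCommit() {
        prepared = new ArrayList<>(committed);
        prepared.addAll(pending);
        pending.clear();
    }

    /** Stage two: publish the staged state, preparing it first if necessary. */
    void commit() {
        if (prepared == null) {
            prepareCommit();
        }
        committed = prepared;
        prepared = null;
    }

    /** Discard everything written since the last commit. */
    void rollback() {
        prepared = null;
        pending.clear();
    }

    /** What a searcher opened right now would be able to see. */
    List<String> searchable() {
        return new ArrayList<>(committed);
    }
}
```

In this model a document added with addDocument stays invisible to searchable() until commit runs, and a rollback between the two leaves the previously committed state untouched.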

Data Path

The last few chapters presented the basic processes, configuration and core interfaces of IndexWriter. These are very simple and easy to understand. In this chapter we take a closer look at IndexWriter internals to explore the kernel implementation.


Concurrency Model

The core interfaces provided by IndexWriter are thread-safe and involve some internal concurrency tuning to optimize multi-thread writing. IndexWriter opens up an independent space for every thread to write. These spaces are controlled by DocumentsWriterPerThread. The whole multi-thread, data-processing process is:

  1. Every thread processes data in its own independent DocumentsWriterPerThread space, including tokenizing, relevance scoring and indexing.
  2. After data processing is finished, some post-processing is carried out at the DocumentsWriter level, such as triggering a FlushPolicy decision.
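
The two steps above can be sketched as a toy writer in plain Java. All names here (ToyDocumentsWriter, Buffer, runDemo) are hypothetical, and the sketch only models the idea that each thread works in a private buffer before a short shared post-processing step; it makes no claim about how Lucene actually pools its DocumentsWriterPerThread instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of per-thread write buffers; not Lucene source code.
final class ToyDocumentsWriter {
    /** A private write space: only one thread owns it at any moment. */
    static final class Buffer {
        final List<String> docs = new ArrayList<>();
    }

    private final ConcurrentLinkedQueue<Buffer> idle = new ConcurrentLinkedQueue<>();
    private final List<Buffer> all = new ArrayList<>(); // guarded by synchronized (this)
    private final AtomicInteger totalDocs = new AtomicInteger();

    void addDocument(String doc) {
        Buffer buf = idle.poll();        // check out an idle buffer, if any
        if (buf == null) {
            buf = new Buffer();
            synchronized (this) {
                all.add(buf);
            }
        }
        buf.docs.add(doc);               // step 1: uncontended work in the private buffer
        idle.add(buf);                   // hand the buffer back for reuse
        totalDocs.incrementAndGet();     // step 2: shared post-processing
    }

    int totalDocs() {
        return totalDocs.get();
    }

    synchronized int bufferedDocs() {
        int n = 0;
        for (Buffer b : all) {
            n += b.docs.size();
        }
        return n;
    }

    /** Runs several writer threads to completion and returns the writer. */
    static ToyDocumentsWriter runDemo(int threads, int docsPerThread) {
        ToyDocumentsWriter writer = new ToyDocumentsWriter();
        List<Thread> started = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread th = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    writer.addDocument("doc");
                }
            });
            started.add(th);
            th.start();
        }
        for (Thread th : started) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return writer;
    }
}
```

Because a buffer is removed from the idle queue while a thread writes to it, no two threads ever modify the same buffer at once, which is why the per-document work needs no lock.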

Add & Update

The add interface adds a new document and the update interface updates an existing one. Lucene updates differently from a database, though: where a database queries for a record and then updates it, Lucene queries for the document, deletes it, and adds it again. Lucene does not support updating the internal sorting of a document. The process is delete by term followed by add document.

long updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
final Term delTerm) throws IOException, AbortingException
  1. Execute delete in DWPT (DocumentsWriterPerThread)
  2. Execute add in DWPT
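
To make the delete-then-add behavior concrete, here is a toy model in plain Java. It is a sketch under simplifying assumptions, not Lucene’s implementation: ToyUpdateModel and Doc are hypothetical, a document is reduced to a key and a body, and atomicity is modeled with a single lock.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of update = delete by term + add; not Lucene source code.
final class ToyUpdateModel {
    /** A document reduced to an identifying key (the "term") and a body. */
    static final class Doc {
        final String key;
        final String body;

        Doc(String key, String body) {
            this.key = key;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    synchronized void addDocument(Doc doc) {
        docs.add(doc);
    }

    /** Removes every document whose key matches; returns how many were removed. */
    synchronized int deleteByTerm(String key) {
        int before = docs.size();
        docs.removeIf(d -> d.key.equals(key));
        return before - docs.size();
    }

    /** Delete by term and add, performed as one step under a single lock. */
    synchronized void updateDocument(String key, Doc replacement) {
        docs.removeIf(d -> d.key.equals(key));
        docs.add(replacement);
    }

    synchronized List<String> bodies(String key) {
        List<String> result = new ArrayList<>();
        for (Doc d : docs) {
            if (d.key.equals(key)) {
                result.add(d.body);
            }
        }
        return result;
    }

    synchronized int size() {
        return docs.size();
    }
}
```

Calling deleteByTerm and addDocument separately would release the lock in between, so another thread could observe the state where the old version is gone and the new one has not arrived yet; updateDocument avoids that window in this model.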

Delete

Compared to add and update, delete follows a completely different data path. Even though update and delete both delete data internally, they take different paths. A delete has no direct effect on documents in the in-memory buffer, so another mechanism is needed to remove them.

  • When update deletes, it first acts on DWPT then on Global. After that it synchronizes the other DWPTs with Global.
  • The delete interface first acts at the Global level, then asynchronously synchronizes the changes down to the DWPT level.

Flush

A flush writes the in-memory buffers in a DWPT out as persistent files. After a new document is added, a flush is triggered automatically if FlushPolicy decides one is due, and a flush can also be triggered manually through IndexWriter’s flush interface.

Commit

A commit triggers a forced data flush. Data flushed earlier becomes searchable only after a commit. A commit also generates a file called a commit point, which is managed by the IndexDeletionPolicy. Lucene’s default policy retains only the last commit point, and Lucene provides other policies to choose from.
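
The retention rule for commit points can be sketched as follows. This is a toy model in plain Java rather than Lucene’s IndexDeletionPolicy API: the interface shape, the class ToyCommitPoints, and the "commit-N" naming are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of commit point retention; not Lucene source code.
final class ToyCommitPoints {
    /** Decides which commit points survive after each commit. */
    interface DeletionPolicy {
        /** Receives all commit points, oldest first; returns the ones to keep. */
        List<String> onCommit(List<String> commits);
    }

    /** Mirrors the default behavior described above: keep only the newest. */
    static final DeletionPolicy KEEP_ONLY_LAST =
        commits -> new ArrayList<>(commits.subList(commits.size() - 1, commits.size()));

    /** A hypothetical alternative policy: keep the newest n commit points. */
    static DeletionPolicy keepLastN(int n) {
        return commits ->
            new ArrayList<>(commits.subList(Math.max(0, commits.size() - n), commits.size()));
    }

    private final DeletionPolicy policy;
    private List<String> commits = new ArrayList<>();
    private int generation = 0;

    ToyCommitPoints(DeletionPolicy policy) {
        this.policy = policy;
    }

    /** Creates a new commit point, then lets the policy prune older ones. */
    void commit() {
        generation++;
        commits.add("commit-" + generation);
        commits = policy.onCommit(commits);
    }

    List<String> commitPoints() {
        return new ArrayList<>(commits);
    }
}
```

Keeping more than one commit point, as keepLastN does here, is what makes it possible to go back to an older state of the index, at the cost of holding on to the files those commits reference.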

Merge

A merge combines segment files. Merging improves query efficiency and allows the space taken by deleted documents to be reclaimed. Whenever a flush produces a new segment file, MergePolicy is consulted to decide whether a merge should be triggered automatically; a merge can also be forced through IndexWriter’s forceMerge.
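
The following toy model in plain Java shows the two effects of a merge: fewer segments and reclaimed deletes. The trigger rule and selection strategy are deliberately naive assumptions (merge the smallest segments once a segment count is exceeded) and are not the logic of any real Lucene MergePolicy; ToyMerge and Segment are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of segment merging; not Lucene source code.
final class ToyMerge {
    /** A segment reduced to two counters. */
    static final class Segment {
        final int liveDocs;
        final int deletedDocs;

        Segment(int liveDocs, int deletedDocs) {
            this.liveDocs = liveDocs;
            this.deletedDocs = deletedDocs;
        }
    }

    /** Hypothetical trigger: merge once more than maxSegments segments exist. */
    static boolean shouldMerge(List<Segment> segments, int maxSegments) {
        return segments.size() > maxSegments;
    }

    /** Merges the `count` smallest segments into one, dropping deleted docs. */
    static List<Segment> mergeSmallest(List<Segment> segments, int count) {
        List<Segment> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingInt((Segment s) -> s.liveDocs + s.deletedDocs));
        int n = Math.min(count, sorted.size());
        if (n < 2) {
            return sorted; // nothing worth merging
        }
        int live = 0;
        for (Segment s : sorted.subList(0, n)) {
            live += s.liveDocs; // only live documents are copied over
        }
        List<Segment> result = new ArrayList<>(sorted.subList(n, sorted.size()));
        result.add(new Segment(live, 0)); // the merged segment carries no deletes
        return result;
    }
}
```

Given four segments, merging the three smallest leaves two segments for a query to visit, and the documents that were only marked as deleted in the inputs no longer occupy space in the merged output.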

IndexingChain

In the previous few chapters we looked at the processes of several key operations. This section looks at how DWPT, one of the most central parts of Lucene, implements indexing internally. The key concept in Lucene’s internal indexing is IndexingChain, which, as its name suggests, is chain-like indexing. Why is it a chain? The answer lies in the overall structure of Lucene’s indexing system. Lucene provides different types of indexes, such as inverted indexes, forward indexes (column storage), stored fields, and DocValues. Each type of index has its own indexing algorithm, data structure, and file storage. Some of them are at column-level, some are file-level and some are document level. So when a document is written, it is processed by many different types of index in turn; some of them share memory buffers and some are completely independent. In theory, Lucene can be extended with other types of index based on this framework, something advanced users can try.

  • CompressingTermVectorsWriter: writer for the term vectors index. At the lowest level, data is stored in compressed blocks.
  • CompressingStoredFieldsWriter: writer for the stored fields index. At the lowest level, data is stored in compressed blocks.
  • Lucene70DocValuesConsumer: writer for doc values index.
  • Lucene60PointsWriter: writer for point values index.
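
The chain idea can be illustrated with a toy model in plain Java in which one document passes through two consumers in turn, an inverted index and a stored fields store. This is a sketch only: ToyIndexingChain and Field are hypothetical, the tokenization is a naive split on non-word characters, and none of it reflects the data structures of the writers listed above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of chain-like indexing; not Lucene source code.
final class ToyIndexingChain {
    /** A field with flags standing in for what FieldType would configure. */
    static final class Field {
        final String name;
        final String value;
        final boolean indexed;
        final boolean stored;

        Field(String name, String value, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    /** Inverted index: "field:term" -> ids of the documents containing it. */
    final Map<String, List<Integer>> postings = new TreeMap<>();
    /** Stored fields: position in the list is the document id. */
    final List<Map<String, String>> storedFields = new ArrayList<>();

    /** Passes one document through every consumer; returns its document id. */
    int processDocument(List<Field> doc) {
        int docId = storedFields.size();
        Map<String, String> stored = new LinkedHashMap<>();
        for (Field f : doc) {
            if (f.indexed) { // consumer 1: inverted index
                for (String token : f.value.toLowerCase().split("\\W+")) {
                    if (token.isEmpty()) {
                        continue;
                    }
                    List<Integer> ids =
                        postings.computeIfAbsent(f.name + ":" + token, k -> new ArrayList<>());
                    if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                        ids.add(docId); // record each document once per term
                    }
                }
            }
            if (f.stored) { // consumer 2: stored fields
                stored.put(f.name, f.value);
            }
        }
        storedFields.add(stored);
        return docId;
    }
}
```

Each consumer keeps its own structure and decides from the field’s flags whether it has anything to do, which is the property that would let further index types be added to the chain without touching the existing ones.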

Summary

This article mainly looks at IndexWriter from a global perspective, explaining its configuration, interfaces, and concurrency models, as well as data paths of core operations and index chains. The next article takes a deeper look at the indexing process of different types of indexes, exploring the implementation of memory buffers, indexing algorithms and data storage formats.
