Lucene IndexWriter: An In-Depth Introduction

Preface

IndexWriter

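import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;

// initialization: open the directory and create the writer with a default config
Directory index = new NIOFSDirectory(Paths.get("/index"));
IndexWriterConfig config = new IndexWriterConfig();
IndexWriter writer = new IndexWriter(index, config);
// create a document with an analyzed field and an exact-match field
Document doc = new Document();
doc.add(new TextField("title", "Lucene - IndexWriter", Field.Store.YES));
doc.add(new StringField("author", "aliyun", Field.Store.YES));
// index the document, then commit so that it becomes persistent and searchable
writer.addDocument(doc);
writer.commit();
writer.close();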
  1. Initialization: IndexWriter needs two things to be initialized: a Directory and an IndexWriterConfig. Directory is the abstract interface to Lucene's data persistence layer. Different persistence layers can be implemented behind this interface, for example local file systems, network file systems, databases, or distributed file systems. IndexWriterConfig holds many configurable parameters that let advanced users optimize performance and customize functionality. Several key parameters are described in detail later on.
  2. Document construction: Lucene represents a document as a Document, which is made up of Fields. Lucene provides several types of Field, and each field's FieldType determines which index structures are built for it. Custom field types are also supported. See the previous article for details.
  3. Writing a document: A document is written using the addDocument function, and at the same time, different indexes are created depending on FieldType. The newly written document is not searchable until IndexWriter’s commit is called. Once commit has completed, Lucene ensures that the document is persistent and searchable.

IndexWriterConfig

  • IndexDeletionPolicy: Lucene lets you manage commit points in order to implement functionality such as snapshots. The default policy retains only the most recent commit point.
  • Similarity: Relevance is at the core of searching. Similarity is the abstract interface for scoring algorithms. Lucene ships with TF-IDF and BM25 implementations, and BM25 has been the default since Lucene 6. Relevance is computed both when data is written and when it is searched. At write time, normalization factors are calculated and written to the index; this is called index-time boosting. Scoring during a search is called query-time boosting.
  • MergePolicy: Lucene’s internal data writing generates many Segments. When a query is made, many segments are queried and the results merged. As the number of Segments affects query efficiency to a certain extent, segments are merged. The merging procedure is called Merge, and MergePolicy determines when Merge is triggered.
  • MergeScheduler: After MergePolicy triggers Merge, MergeScheduler takes care of executing Merge. The merge process usually puts a heavy load on the CPU and I/O. MergeScheduler enables the customization and management of the process of merging.
  • Codec: Codecs are a core component of Lucene defining Lucene’s internal encoders and decoders for all types of indexes. Lucene configures codecs at the Config level. The main purpose is to enable processing of different data versions. Most Lucene users have little need to customize this layer and it is mostly advanced users who configure codecs.
  • IndexerThreadPool: Manages IndexWriter’s internal index thread pool (DocumentsWriterPerThread). This is also part of Lucene’s internal resource management customization.
  • FlushPolicy: FlushPolicy determines when in-memory buffers are flushed. By default the decision depends on the amount of RAM used and the number of buffered documents. FlushPolicy is invoked to make a decision on every document add, update, or delete.
  • MaxBufferedDocs: The maximum number of documents that DocumentsWriterPerThread is permitted to buffer under FlushByRamOrCountsPolicy, the default FlushPolicy implementation provided by Lucene. Exceeding this value triggers a flush.
  • RAMBufferSizeMB: The maximum amount of memory that may be used for buffering under FlushByRamOrCountsPolicy, the default FlushPolicy implementation provided by Lucene. Exceeding this value triggers a flush.
  • RAMPerThreadHardLimitMB: As well as FlushPolicy deciding on flushes, Lucene also has a metric to forcibly limit the amount of memory used by DocumentsWriterPerThread. Exceeding the threshold forces a flush.
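  • Analyzer: Breaks field text into tokens, typically with a tokenizer followed by optional token filters. This is the most frequently customized component, especially to support different languages.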
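To make the three flush-related limits concrete, here is a minimal sketch of the decision they imply. It is a toy model in plain Java, not Lucene's FlushByRamOrCountsPolicy: the class FlushDecisionSketch, the Trigger enum, the DISABLED marker, and the decide method are all hypothetical names invented for illustration, and the model ignores how Lucene accounts for memory across threads. In Lucene itself the limits are set on IndexWriterConfig with setMaxBufferedDocs, setRAMBufferSizeMB, and setRAMPerThreadHardLimitMB.

```java
public class FlushDecisionSketch {
    /** Which limit, if any, asks for a flush (hypothetical enum for this sketch). */
    public enum Trigger { NONE, MAX_BUFFERED_DOCS, RAM_BUFFER_SIZE, RAM_PER_THREAD_HARD_LIMIT }

    /** Marker for a limit that is switched off (assumption made for this sketch). */
    public static final int DISABLED = -1;

    private static final long BYTES_PER_MB = 1024L * 1024L;

    /**
     * Decides whether a buffer holding bufferedDocs documents in bufferedBytes
     * bytes should be flushed, given the three configured limits.
     */
    public static Trigger decide(long bufferedBytes, int bufferedDocs,
                                 double ramBufferSizeMB, int maxBufferedDocs,
                                 int ramPerThreadHardLimitMB) {
        // The hard limit is independent of the flush policy, so check it first.
        if (bufferedBytes >= ramPerThreadHardLimitMB * BYTES_PER_MB) {
            return Trigger.RAM_PER_THREAD_HARD_LIMIT;
        }
        // MaxBufferedDocs is a document count.
        if (maxBufferedDocs != DISABLED && bufferedDocs >= maxBufferedDocs) {
            return Trigger.MAX_BUFFERED_DOCS;
        }
        // RAMBufferSizeMB is a memory budget.
        if (ramBufferSizeMB != DISABLED && bufferedBytes >= ramBufferSizeMB * BYTES_PER_MB) {
            return Trigger.RAM_BUFFER_SIZE;
        }
        return Trigger.NONE;
    }

    public static void main(String[] args) {
        // Limit values below are chosen purely for illustration.
        long tenMb = 10 * BYTES_PER_MB;
        System.out.println(decide(tenMb, 500, 16.0, 1000, 1945));
        System.out.println(decide(tenMb, 1000, 16.0, 1000, 1945));
        System.out.println(decide(16 * BYTES_PER_MB, 10, 16.0, DISABLED, 1945));
    }
}
```

The point of the sketch is only the distinction between the limits: one counts documents, one measures memory, and the hard limit applies no matter what the policy decides.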

Core operations

  • addDocument: a straightforward API to add a document to Lucene. Lucene does not have a primary key index internally. Every newly added document is treated as a new document and assigned an independent docId.
  • updateDocument: This updates a document, but not in the way a database does. A database looks up the record and then modifies it, whereas Lucene looks up the document, deletes it, and adds it again. Internally the sequence is delete by term, then add document. This is not equivalent to calling delete and then add yourself: only an update ensures that the delete and the add are atomic within the thread. Details of the process will be discussed in the next chapter.
  • deleteDocuments: deletes documents. It supports two types of delete: by term and by query. The two types follow different paths inside IndexWriter. Details will be discussed in the next chapter.
  • flush: triggers a hard flush so that the in-memory buffer of every thread is flushed to a segment file. This frees up memory and writes the data to the Directory. Unlike commit, a flush does not sync the files to stable storage, so call commit when durability is required.
  • prepareCommit/commit/rollback: Data is only searchable after a commit. Commit is a two-phase operation: prepareCommit performs the first phase, and a subsequent call to commit completes the second. Rollback discards all changes made since the last commit.
  • maybeMerge/forceMerge: maybeMerge triggers a MergePolicy decision while forceMerge forces a merge.
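The semantics of add, update, delete, commit, and rollback described above can be illustrated with a small in-memory model. This is a sketch under heavy simplification, not Lucene code: ToyIndexWriter, its doc helper, and searchableDocs are hypothetical names, documents are plain maps, and a single lock stands in for the atomicity that updateDocument provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyIndexWriter {
    // State visible to searches: only changes as of the last commit.
    private final List<Map<String, String>> committed = new ArrayList<>();
    // Working state: includes every change made since the last commit.
    private final List<Map<String, String>> pending = new ArrayList<>();

    /** Builds a document from alternating field names and values. */
    public static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    /** No primary key: every call appends a new document, duplicates included. */
    public synchronized void addDocument(Map<String, String> d) {
        pending.add(d);
    }

    /** Delete by term: removes every document whose field has exactly this value. */
    public synchronized void deleteDocuments(String field, String value) {
        pending.removeIf(d -> value.equals(d.get(field)));
    }

    /** Delete by term, then add, under one lock so no caller observes the gap. */
    public synchronized void updateDocument(String field, String value, Map<String, String> d) {
        deleteDocuments(field, value);
        addDocument(d);
    }

    /** Publishes the working state to searches. */
    public synchronized void commit() {
        committed.clear();
        committed.addAll(pending);
    }

    /** Discards everything changed since the last commit. */
    public synchronized void rollback() {
        pending.clear();
        pending.addAll(committed);
    }

    public synchronized int searchableDocs() {
        return committed.size();
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = new ToyIndexWriter();
        writer.addDocument(doc("id", "1", "title", "first"));
        writer.addDocument(doc("id", "1", "title", "second")); // same id, still a new document
        writer.commit();
        System.out.println("after two adds: " + writer.searchableDocs());
        writer.updateDocument("id", "1", doc("id", "1", "title", "third"));
        writer.commit();
        System.out.println("after update: " + writer.searchableDocs());
    }
}
```

Adding the same id twice leaves two documents because nothing enforces uniqueness, while updateDocument replaces every document matching the term with the new one. Nothing is counted by searchableDocs until commit is called.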

Data Path

Concurrency Model

  1. Multiple threads can call IndexWriter's methods concurrently. IndexWriter delegates each request to DocumentsWriter. Before DocumentsWriter processes the request, it allocates a DocumentsWriterPerThread (DWPT) based on the thread currently executing the operation.
  2. Every thread processes data in its own independent DocumentsWriterPerThread space, including tokenizing, relevance scoring and indexing.
  3. After data processing is finished, some post-processing is carried out at the DocumentsWriter level, such as triggering a FlushPolicy decision.
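The three steps above can be sketched with plain Java threads. This is a simplified model, not Lucene's implementation: PerThreadBufferSketch and all of its methods are hypothetical, a list of tokens stands in for a DocumentsWriterPerThread, and a shared counter stands in for the FlushPolicy decision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PerThreadBufferSketch {
    // One private buffer per indexing thread, standing in for a DWPT.
    private final Map<Thread, List<String>> buffers = new ConcurrentHashMap<>();
    // Shared state touched after each document, standing in for the FlushPolicy check.
    private final AtomicInteger policyChecks = new AtomicInteger();

    public void addDocument(String text) {
        // Step 1: allocate (or reuse) the buffer that belongs to the calling thread.
        List<String> buffer = buffers.computeIfAbsent(Thread.currentThread(), t -> new ArrayList<>());
        // Step 2: tokenize and buffer in the thread's own space; only this thread
        // writes to the list, so this step needs no locking.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            buffer.add(token);
        }
        // Step 3: post-processing at the shared level.
        policyChecks.incrementAndGet();
    }

    public int bufferCount() {
        return buffers.size();
    }

    public int bufferedTokens() {
        int total = 0;
        for (List<String> buffer : buffers.values()) {
            total += buffer.size();
        }
        return total;
    }

    public int policyCheckCount() {
        return policyChecks.get();
    }

    /** Starts the given number of threads, each adding docsPerThread documents. */
    public static PerThreadBufferSketch indexConcurrently(int threads, int docsPerThread) {
        PerThreadBufferSketch writer = new PerThreadBufferSketch();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    writer.addDocument("Lucene IndexWriter");
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // after join, the worker's buffer is safe to read
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return writer;
    }

    public static void main(String[] args) {
        PerThreadBufferSketch writer = indexConcurrently(4, 100);
        System.out.println("buffers: " + writer.bufferCount());
        System.out.println("tokens: " + writer.bufferedTokens());
        System.out.println("policy checks: " + writer.policyCheckCount());
    }
}
```

The design point is that the expensive work in step 2 touches only thread-private state, so threads contend only on the brief shared step at the end.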

Add & Update

long updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
final Term delTerm) throws IOException, AbortingException
  1. Allocate DWPT depending on Thread
  2. Execute delete in DWPT
  3. Execute add in DWPT

Delete

  • The update interface only deletes documents by term while delete supports deletion by term and by query.
  • When update deletes, it first acts on DWPT then on Global. After that it synchronizes the other DWPTs with Global.
  • The delete interface first acts at the Global level, then asynchronously synchronizes the changes down to the DWPT level.

Flush

Commit

Merge

IndexingChain

  • BlockTreeTermsWriter: codec for the inverted index. The postings lists are written by Lucene50PostingsWriter (block-written postings lists) and Lucene50SkipWriter (skip list index over the blocks), while the term dictionary uses an FST (block-level dictionary lookup for the inverted index).
  • CompressingTermVectorsWriter: writer for the term vector index. The underlying storage is a compressed block format.
  • CompressingStoredFieldsWriter: writer for the stored fields index. The underlying storage is a compressed block format.
  • Lucene70DocValuesConsumer: writer for doc values index.
  • Lucene60PointsWriter: writer for point values index.

Summary

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com
