Analysis of Lucene — Basic Concepts

Preface

  1. Scalable, High-Performance Indexing
  2. Powerful, Accurate and Efficient Search Algorithms

Basic Concepts

Index

Document

Field

Term and Term Dictionary

Segment

Sequence Number

  1. A DocId is unique within a Segment, not across the whole Index. Lucene does this mainly to optimize writing and compression. How, then, can a Doc be uniquely identified at the Index level? The solution is simple: the Segments are ordered, so each Segment gets a base offset equal to the number of docs in the Segments before it. For example, if an Index has two Segments with 100 docs each, the DocIds inside each Segment are 0–99, and at the Index level the DocIds of the second Segment are converted to 100–199.
  2. DocIds within a Segment are assigned incrementally from zero, but this does not mean that the DocIds in use are contiguous: when a Doc is deleted, it leaves a gap.
  3. The DocId corresponding to a document can change, usually when Segments are merged.
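
The conversion described in point 1 can be sketched in a few lines. This is a minimal illustration in plain Python, not Lucene code: the function names (doc_bases, to_index_doc_id, to_segment_doc_id) are hypothetical, and the only assumption is that each Segment's base is the total doc count of the Segments before it.

```python
from bisect import bisect_right
from itertools import accumulate


def doc_bases(segment_sizes):
    """Return the Index-level starting DocId (base) of each Segment."""
    # The base of a Segment is the total doc count of all Segments before it.
    return [0] + list(accumulate(segment_sizes))[:-1]


def to_index_doc_id(bases, segment, local_doc_id):
    """Convert a Segment-local DocId to an Index-level DocId."""
    return bases[segment] + local_doc_id


def to_segment_doc_id(bases, index_doc_id):
    """Convert an Index-level DocId back to (segment, local DocId)."""
    # Find the last Segment whose base is <= the Index-level DocId.
    segment = bisect_right(bases, index_doc_id) - 1
    return segment, index_doc_id - bases[segment]


# Two Segments with 100 docs each, as in the example above.
bases = doc_bases([100, 100])
first_of_second = to_index_doc_id(bases, 1, 0)
last_of_second = to_index_doc_id(bases, 1, 99)
```

With these two Segments, the second Segment's local DocIds 0–99 map to Index-level DocIds 100–199, and to_segment_doc_id(bases, 150) resolves back to DocId 50 of the second Segment.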

Index type

  • stored: indicates whether the original value of the field is saved. If false, Lucene does not store the value, and the documents returned in query results contain only the stored fields.
  • tokenized: indicates whether the field value is tokenized. In Lucene, only TextField needs to be tokenized.
  • termVector: Simply put, a term vector saves all the information related to the terms of one document, including the term values, frequencies and positions. It is a per-document inverted index that makes it possible to look up all the term information of a document by docid. Enabling term vectors is not recommended for short fields, because the same information can be obtained by tokenizing the field again. It is recommended for longer fields or fields that are expensive to tokenize. Term vectors have two main uses: keyword highlighting and similarity matching between documents (more-like-this).
  • omitNorms: Norms stands for normalization. Lucene can save a normalization factor for every field of every document; it is a coefficient used in scoring at search time. A single norm takes only one byte, but one is saved for every field of every document and all of the norms data is loaded into memory, so enabling norms consumes additional storage space and memory. If you disable norms, however, you cannot use index-time boosting (Elasticsearch officially recommends query-time boosting instead) or length normalization.
  • indexOptions: Lucene provides five options for the inverted index (NONE, DOCS, DOCS_AND_FREQS, DOCS_AND_FREQS_AND_POSITIONS, DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS), which select whether the field is indexed at all and, if so, what content is indexed.
  • docValuesType: DocValues is a forward index (a column store mapping docid to field value) introduced in Lucene 4.0, which greatly improves the efficiency of sorting, faceting and aggregation. DocValues is a storage structure with a strong schema, so a field with DocValues enabled must use exactly the same type in every document. Lucene currently provides five types: NUMERIC, BINARY, SORTED, SORTED_NUMERIC and SORTED_SET.
  • dimension: Lucene supports indexing multi-dimensional data and uses a special index structure to optimize queries on it. The most typical use case is indexing geographical locations; this is the indexing method generally used for latitude and longitude data.
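
To make the difference between the inverted index, term vectors and DocValues more concrete, here is a toy model built from plain Python dictionaries. It is only a sketch of the lookup direction of each structure, not Lucene's API or on-disk format; the sample documents, the whitespace tokenizer and all variable names are assumptions made for illustration.

```python
from collections import defaultdict

# Two hypothetical documents with a tokenized "title" and a numeric "price".
docs = {
    0: {"title": "lucene index basics", "price": 30},
    1: {"title": "lucene segment merge", "price": 10},
}

# Inverted index: term -> {docid: [positions]} (search direction).
inverted = defaultdict(dict)
# Term vectors: docid -> {term: [positions]} (per-document inverted index).
term_vectors = {}
# DocValues: docid -> value, one column per field (forward index).
price_doc_values = {}

for doc_id, doc in docs.items():
    vector = defaultdict(list)
    # Naive whitespace tokenization stands in for a real analyzer.
    for position, term in enumerate(doc["title"].split()):
        vector[term].append(position)
    term_vectors[doc_id] = dict(vector)
    for term, positions in vector.items():
        inverted[term][doc_id] = positions
    price_doc_values[doc_id] = doc["price"]

# Search: the inverted index answers "which docs contain this term?".
matching = sorted(inverted["lucene"])
# Sort: DocValues answer "what is this doc's value?" without loading the doc.
by_price = sorted(matching, key=price_doc_values.__getitem__)
# Highlighting: the term vector answers "where do terms occur in this doc?".
merge_positions = term_vectors[1]["merge"]
```

In this model a search starts from a term and ends at docids, while sorting and highlighting start from a docid, which is why they are served by separate forward structures rather than by the inverted index.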

Elasticsearch Data Types

Summary

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com
