14 Supplementary Knowledge Points of ES Principles and Overall Structure #


14.1 Distributed Search Engine #



14.2 High-Performance Retrieval #


14.3 Real-Time Search and Document Updates #


14.4 High Availability and Fault Tolerance #


14.5 Distributed Data Storage and Querying #


14.6 RESTful API #

ES exposes a RESTful API, so data can be manipulated and queried over HTTP. With it, indexing, querying, updating, and deleting documents each take only a simple URL and a request body.
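As a rough illustration of "a simple URL and a request body", the sketch below builds (but does not send) one HTTP request for each operation. The base URL `http://localhost:9200`, the index name `articles`, and the document fields are assumptions made up for this example, and the paths follow the `_doc` / `_update` style used by ES 7 and later; older versions use different paths.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local cluster on the default port
INDEX = "articles"                  # hypothetical index name


def es_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the ES REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


requests_to_send = [
    # Index (create or overwrite) document 1
    es_request("PUT", f"/{INDEX}/_doc/1", {"title": "Hello ES", "views": 1}),
    # Fetch document 1 by id
    es_request("GET", f"/{INDEX}/_doc/1"),
    # Full-text query on the title field
    es_request("POST", f"/{INDEX}/_search", {"query": {"match": {"title": "hello"}}}),
    # Partial update of document 1
    es_request("POST", f"/{INDEX}/_update/1", {"doc": {"views": 2}}),
    # Delete document 1
    es_request("DELETE", f"/{INDEX}/_doc/1"),
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

To run these against a real cluster, each request could be passed to `urllib.request.urlopen`; the sketch stops short of that so it works without a server.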

14.7 Summary #

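In this chapter we took a closer look at the principles behind ES. Its distributed search engine architecture, inverted index, high-performance retrieval, real-time search and document updates, high availability and fault tolerance, distributed data storage and querying, and RESTful API make it a powerful and flexible search engine. In later chapters, we will learn how to use ES for advanced search and aggregation analysis.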

Overall Structure of ElasticSearch #

In the previous section, we discussed the principles of Elasticsearch through illustrations. Now, let’s summarize the overall structure of ES.


  • In cluster mode, an ES index is spread across multiple nodes. Each node is one running instance of ES.
  • Each node holds multiple shards. In the diagram, P1 and P2 are primary shards, while R1 and R2 are replica shards.
  • Each shard corresponds to a Lucene index, the set of underlying index files.
  • “Lucene index” is a collective term that covers:
    • Multiple segments (segment files, each of which is an inverted index) that store the documents.
    • A commit point, which records information about all segments.
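The hierarchy in the list above (node, shard, segments, commit point) can be pictured with a small data model. This is only a toy sketch: the class names, the node names, the segment names such as `_0`, the document counts, and the placement of shards on nodes are all hypothetical, and a real commit point holds more than a list of segment names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One segment file: a small inverted index over some documents."""
    name: str
    doc_count: int


@dataclass
class Shard:
    """One shard corresponds to one Lucene index."""
    name: str          # e.g. "P1" (primary) or "R1" (replica)
    primary: bool
    segments: List[Segment] = field(default_factory=list)

    def commit_point(self) -> List[str]:
        # Toy stand-in: the commit point records every segment of the shard.
        return [s.name for s in self.segments]


@dataclass
class Node:
    """One running ES instance."""
    name: str
    shards: List[Shard] = field(default_factory=list)


# Hypothetical two-node layout with two primaries and two replicas.
cluster = [
    Node("node-1", [Shard("P1", True, [Segment("_0", 120), Segment("_1", 30)]),
                    Shard("R2", False)]),
    Node("node-2", [Shard("P2", True, [Segment("_0", 95)]),
                    Shard("R1", False)]),
]

for node in cluster:
    for shard in node.shards:
        role = "primary" if shard.primary else "replica"
        print(node.name, shard.name, role, shard.commit_point())
```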

Supplementary: Lucene Index Structure #

What files are present in the Lucene index structure shown in the above diagram?


(For more file types, refer to this link)


The relationship between files is as follows:


Supplementary: Lucene Processing Flow #

To follow the earlier diagrams, you need to understand Lucene’s processing flow. Knowing it will also help you index and search documents more effectively.


Indexing process:

  • Prepare the original documents to be indexed, which can come from files, databases, or the internet.
  • Use the tokenization component to process the content of the documents, forming a series of terms.
  • The indexing component processes the documents and terms, forming a dictionary and an inverted index.

Search process:

  • Tokenize the query string to produce a series of terms.
  • Use the inverted index to find the documents that contain each term, then merge the results into the set of matching documents.
  • Score each matching document’s relevance to the query and return the documents in descending order of score.
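The indexing and search steps above can be sketched with a toy inverted index. This is a minimal illustration, not how Lucene is implemented: the tokenizer is a plain regular expression, the three sample documents are made up, and the "relevance score" is a crude term-frequency count rather than Lucene's actual scoring.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Tokenization step: lowercase the text and split it into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing step: map each term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, docs, query):
    """Search step: tokenize the query, merge posting lists, rank by score."""
    terms = tokenize(query)
    matching = set()
    for term in terms:
        matching |= index.get(term, set())

    def score(doc_id):
        # Crude stand-in for relevance: how often the query terms occur.
        doc_terms = tokenize(docs[doc_id])
        return sum(doc_terms.count(t) for t in terms)

    # Highest score first; ties broken by doc id for a stable order.
    return sorted(matching, key=lambda d: (-score(d), d))


docs = {
    1: "the quick brown fox",
    2: "quick quick dog",
    3: "lazy brown dog",
}
index = build_index(docs)
print(sorted(index["quick"]))            # [1, 2]
print(search(index, docs, "quick dog"))  # [2, 1, 3]
```

Document 2 ranks first because it contains both query terms and repeats "quick"; documents 1 and 3 each match a single term once.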

Supplementary: Elasticsearch Analyzer #

An important step in the diagram above is syntax analysis and language processing, so this section fills in the background on Elasticsearch analyzers.

Analysis consists of two steps:

  • First, split a piece of text into individual terms suitable for an inverted index.
  • Then, normalize these terms into a standard form to improve their “searchability”, or recall.

Analyzers perform the above tasks. An analyzer actually encapsulates three functions into one package:

  • Character filters: The string passes through each character filter in sequence. Their task is to prepare the string before tokenization. A character filter can be used to remove HTML, or convert “&” to “and”.
  • Tokenizer: Next, the string is divided into individual terms by the tokenizer. A simple tokenizer may split the text into terms when encountering spaces and punctuation.
  • Token filters: Finally, the terms pass through each token filter in sequence. This process may change terms (e.g., lowercasing “Quick”), remove terms (e.g., stopwords such as “a”, “and”, “the”), or add terms (e.g., synonyms such as “jump” and “leap”).
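The three stages can be sketched as a small pipeline of plain functions. This is an illustrative toy, not Elasticsearch code: every function name, the stopword set, and the synonym table below are made up for the example.

```python
import re


# --- Character filters: prepare the raw string before tokenization ---
def strip_html(text):
    return re.sub(r"<[^>]+>", "", text)


def ampersand_to_and(text):
    return text.replace("&", " and ")


# --- Tokenizer: split the string into individual terms ---
def simple_tokenizer(text):
    return re.findall(r"\w+", text)


# --- Token filters: change, remove, or add terms ---
def lowercase(tokens):
    return [t.lower() for t in tokens]


STOPWORDS = {"a", "and", "the"}  # hypothetical stopword list


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


SYNONYMS = {"jump": ["leap"]}  # hypothetical synonym table


def add_synonyms(tokens):
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out


def analyze(text, char_filters, tokenizer, token_filters):
    """Run the character filters, then the tokenizer, then the token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "<p>The Quick fox & a dog jump</p>",
    char_filters=[strip_html, ampersand_to_and],
    tokenizer=simple_tokenizer,
    token_filters=[lowercase, remove_stopwords, add_synonyms],
)
print(terms)  # ['quick', 'fox', 'dog', 'jump', 'leap']
```

Order matters here: lowercasing runs before stopword removal so that “The” is recognized as the stopword “the”.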

Elasticsearch provides out-of-the-box character filters, tokenizers, and token filters. These can be combined to create custom analyzers for different purposes.
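As a hedged sketch of what such a combination might look like, the snippet below builds the JSON body that could be sent when creating an index with a custom analyzer. The analyzer name `my_custom_analyzer` is hypothetical; `html_strip`, `standard`, `lowercase`, and `stop` are built-in component names, but the exact settings accepted depend on the Elasticsearch version in use.

```python
import json

# Request body for creating an index (e.g. PUT /<index-name>) with one
# custom analyzer assembled from built-in components.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {               # hypothetical name
                    "type": "custom",
                    "char_filter": ["html_strip"],    # character filter stage
                    "tokenizer": "standard",          # tokenizer stage
                    "filter": ["lowercase", "stop"],  # token filter stage
                }
            }
        }
    }
}

print(json.dumps(index_body, indent=2))
```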

Built-in Analyzers #

Elasticsearch also comes with pre-packaged analyzers that can be used directly. Next, we will list the most important analyzers. To demonstrate their differences, let’s see which terms each analyzer will generate from the following string:

"Set the shape to semi-transparent by calling set_trans(5)"

  • Standard Analyzer

The standard analyzer is the default analyzer used by Elasticsearch. It is the best general choice for text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium, removes most punctuation, and converts the terms to lowercase. It generates the following terms:

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

  • Simple Analyzer

The simple analyzer splits the text at any character that is not a letter and converts the terms to lowercase. It generates the following terms:

set, the, shape, to, semi, transparent, by, calling, set, trans

  • Whitespace Analyzer

The whitespace analyzer divides the text at whitespace. It generates the following terms:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

  • Language Analyzers

Language-specific analyzers are available for many languages. They can take into account the specific characteristics of the specified language. For example, the English analyzer comes with a list of English stopwords (common words like “and” or “the” that have little impact on relevance) and removes them. Because it understands the rules of English grammar, this analyzer can also reduce English words to their stems.

The English analyzer will generate the following terms:

set, shape, semi, transpar, call, set_tran, 5

Note that “transparent”, “calling”, and “set_trans” have been reduced to their stem forms.
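The simple and whitespace analyzers are easy to approximate in a few lines, which makes the differences in their output concrete. The two functions below are rough emulations written for this example, not Elasticsearch code; the standard and English analyzers are left out because Unicode word segmentation and stemming are not one-liners.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def simple_analyzer(text):
    """Split at anything that is not a letter, then lowercase."""
    # [^\W\d_] matches word characters except digits and underscore: letters.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def whitespace_analyzer(text):
    """Split at whitespace only; case and punctuation are kept."""
    return text.split()


print(simple_analyzer(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(whitespace_analyzer(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
```

On a running cluster, the `_analyze` API can be used to see the terms a real analyzer produces, by sending a body that names the `analyzer` and the `text` to analyze.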

When to Use Analyzers #

When we index a document, its full-text fields are analyzed into terms to create inverted indexes. However, when we search within these full-text fields, we need to apply the same analysis process to the query string to ensure that the format of the search terms matches the format of the terms in the index.

Full-text queries understand how each field is defined so that they can do the right thing:

  • When you query a full-text field, the query string is passed through the same analyzer to generate the correct list of search terms.
  • When you query an exact value field, the query string is not analyzed; the search is for the exact value you specify.


Suppose the index holds 12 tweets, one per day, and we run the following queries:

GET /_search?q=2014              # 12 results
GET /_search?q=2014-09-15        # 12 results !
GET /_search?q=date:2014-09-15   # 1  result
GET /_search?q=date:2014         # 0  results !

Why do we get those results?

  • The “date” field contains an exact value: the single term “2014-09-15”.
  • The “_all” field is a full-text field, so the tokenization process converts the date into three terms: “2014”, “09”, and “15”.

When we query in the “_all” field for “2014”, it matches all 12 tweets because they all contain “2014”:

GET /_search?q=2014              # 12 results

When we query in the “_all” field for “2014-09-15”, it first analyzes the query string and generates a query that matches any term among “2014”, “09”, or “15”. This also matches all 12 tweets because they all contain “2014”:

GET /_search?q=2014-09-15        # 12 results !

When we query in the “date” field for “2014-09-15”, it looks for the exact date and finds only one tweet:

GET /_search?q=date:2014-09-15   # 1  result

When we query in the “date” field for “2014”, it doesn’t find any documents because no document contains this exact date:

GET /_search?q=date:2014         # 0  results !
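The four result counts can be reproduced with a toy model of the two fields. Everything here is an assumption made for illustration: the 12 tweets are given consecutive dates starting on 2014-09-13, the “_all” field is modeled as an inverted index over the analyzed date text only, and the “date” field is modeled as a plain exact-value lookup, which is a simplification of how Elasticsearch stores dates.

```python
import re
from datetime import date, timedelta

# Hypothetical data: 12 tweets on consecutive days starting 2014-09-13.
tweets = [
    {"date": (date(2014, 9, 13) + timedelta(days=i)).isoformat()}
    for i in range(12)
]


def analyze(text):
    """Full-text analysis: "2014-09-15" becomes ["2014", "09", "15"]."""
    return re.findall(r"\w+", text.lower())


all_index = {}   # full-text field: term -> set of doc ids
date_index = {}  # exact-value field: whole value -> set of doc ids
for doc_id, tweet in enumerate(tweets):
    for term in analyze(tweet["date"]):
        all_index.setdefault(term, set()).add(doc_id)
    date_index.setdefault(tweet["date"], set()).add(doc_id)


def search_all(query):
    """Query the full-text field: analyze the query, match ANY term."""
    hits = set()
    for term in analyze(query):
        hits |= all_index.get(term, set())
    return len(hits)


def search_date(query):
    """Query the exact-value field: no analysis, exact match only."""
    return len(date_index.get(query, set()))


print(search_all("2014"))         # 12
print(search_all("2014-09-15"))   # 12
print(search_date("2014-09-15"))  # 1
print(search_date("2014"))        # 0
```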
