14 Supplementary Knowledge Points of ES Principles and Overall Structure #

In the previous chapters we introduced the basic concepts and usage of Elasticsearch (ES). This chapter continues with supplementary knowledge points on ES principles, to build a deeper understanding of how it works and of its overall structure.

14.1 Distributed Search Engine #

ES is a distributed search engine. Its core idea is to store data in shards spread across multiple nodes and to use an inverted index for fast search. The inverted index is the core data structure of ES: it maps each term to the documents in which that term appears.

ES adopts a distributed architecture in which nodes communicate with each other over the network. Each node can hold multiple shards, and each shard stores a small portion of the overall data set. This distributed storage allows ES to handle large-scale data sets while providing fault tolerance and high availability.
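
As a minimal sketch of how sharding is configured (my_index is a hypothetical index name, and the request follows Elasticsearch 7.x REST conventions), the number of primary shards and replicas is set when the index is created:

PUT /my_index                    # my_index is a hypothetical index name
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

Here number_of_shards controls how the data is split across nodes, and number_of_replicas controls how many copies of each primary shard are kept.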

14.2 High-Performance Retrieval #

The high retrieval performance of ES relies on the inverted index and on the parallelism of distributed search. The inverted index quickly locates the documents that contain a given term and ranks them by relevance. ES distributes a search request to multiple shards for parallel processing and then merges the results before returning them, which makes searches efficient.
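
As an illustrative sketch (the index name my_index and the title field are hypothetical), a full-text query like the following is fanned out to all shards of the index; each shard returns its best matches ranked by relevance, and the coordinating node merges them into the final result:

GET /my_index/_search            # my_index and title are hypothetical
{
  "query": {
    "match": { "title": "distributed search" }
  }
}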

14.3 Real-Time Search and Document Updates #

ES supports real-time search and document updates. When a new document is added to the cluster, ES writes it to the corresponding shard right away and updates the inverted index. This keeps search results fresh and supports near-real-time document updates.
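
A small sketch of this behaviour (the index, document, and field are hypothetical): the refresh parameter makes a newly written document searchable right away, at the cost of some indexing throughput:

PUT /my_index/_doc/1?refresh=wait_for    # wait until a refresh makes the document searchable
{
  "title": "hello elasticsearch"
}

GET /my_index/_search?q=title:hello      # the new document is now visible to searches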

14.4 High Availability and Fault Tolerance #

ES achieves high availability and fault tolerance through data replication and failover. Each shard can have multiple replicas, and the replicas can be distributed across different nodes. When a node fails, ES automatically fails over to the shard copies on other nodes, so the data remains available and the failure is recovered.
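
For example (my_index is a hypothetical index name), the number of replicas can be changed on a live index, and the cluster health API reports whether every primary and replica shard is assigned:

PUT /my_index/_settings          # my_index is a hypothetical index name
{
  "number_of_replicas": 2
}

GET /_cluster/health             # status green means all primaries and replicas are assigned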

14.5 Distributed Data Storage and Queries #

The distributed storage and query capabilities of ES allow it to handle large-scale data sets. Data is stored in shards across multiple nodes and can be scaled out horizontally as the data volume grows. Query requests are distributed to multiple shards for parallel processing, and the results are then merged and returned.
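
As a quick illustration (hypothetical index name), the _search_shards API shows which shards, on which nodes, a search against the index would be routed to:

GET /my_index/_search_shards     # my_index is a hypothetical index name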

14.6 RESTful API #

ES provides a RESTful API through which data can be manipulated and queried over HTTP. With the RESTful API, we can index, query, update, and delete documents using simple URLs and request bodies.
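
A minimal sketch of the basic operations over the RESTful API (the index name, document ID, and fields are hypothetical; the URLs follow Elasticsearch 7.x conventions):

PUT /my_index/_doc/1             # index (create or overwrite) a document
{ "user": "alice", "message": "hello" }

GET /my_index/_doc/1             # retrieve the document by ID

POST /my_index/_update/1         # apply a partial update
{ "doc": { "message": "hello again" } }

DELETE /my_index/_doc/1          # delete the document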

14.7 Summary #

Through this chapter we have gained a deeper understanding of the principles behind ES. Its distributed search-engine architecture, inverted index, high-performance retrieval, real-time search and document updates, high availability and fault tolerance, distributed data storage and querying, and RESTful API make it a powerful and flexible search engine. In the following chapters we will learn how to use ES for advanced search and aggregation analysis.

Overall Structure of ElasticSearch #

In the previous section, we discussed the principles of ElasticSearch through illustrations. Now let's summarize the overall structure of ES.

(Figure: overall structure of an ES cluster: Nodes, primary/replica shards, and the underlying Lucene indexes)

  • In cluster mode, an ES Index is spread across multiple Nodes; each Node is an instance of ES.
  • Each Node holds multiple shards. P1 and P2 are primary shards, while R1 and R2 are replica shards (see the example after this list).
  • Each shard corresponds to a Lucene Index, which is the underlying index file structure.
  • Lucene Index is an umbrella term that covers:
    • multiple Segments (segment files, which are inverted indexes) that store the Doc documents;
    • a Commit point that records information about all segments.
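
As a quick way to see this layout on a running cluster (my_index is a hypothetical index name), the _cat/shards API lists every shard of the index, whether it is a primary (p) or a replica (r), and the node it is allocated to:

GET /_cat/shards/my_index?v      # the prirep column shows p for primary, r for replica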

Supplementary: Lucene Index Structure #

What files are present in the Lucene index structure shown in the above diagram?

(Figure: files contained in a Lucene index)

(For more file types, refer to this link)

(Figure: additional Lucene file types)

The relationship between files is as follows:

(Figure: relationships between the Lucene index files)

Additional: Lucene Processing Flow #

Building on the earlier diagrams, it is necessary to understand the Lucene processing flow, which will help you index and search documents more effectively.

(Figure: Lucene indexing and search processing flow)

Indexing process:

  • Prepare the original documents to be indexed, which can come from files, databases, or the internet.
  • Use the tokenization component to process the content of the documents, forming a series of terms.
  • The indexing component processes the documents and terms, forming a dictionary and an inverted index.

Search process:

  • Process the query statement by tokenization, forming a series of terms.
  • Use the inverted index table to find the documents that contain the terms and merge them to form the set of matching documents.
  • Score each matching document for relevance against the query, and return the documents in descending order of score.
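
To observe the relevance scoring described above, a search can be run with explain enabled (the index and field names are hypothetical); each hit then carries its score together with an explanation of how that score was computed:

GET /my_index/_search            # my_index and title are hypothetical
{
  "explain": true,
  "query": {
    "match": { "title": "lucene index" }
  }
}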

Supplement: ElasticSearch Analyzer #

One important aspect of the diagram above is syntax analysis/language processing, so we need to supplement our knowledge of ElasticSearch analyzers.

Analysis consists of the following processes:

  • First, divide a piece of text into independent terms suitable for building an inverted index.
  • Then, normalize these terms into a standard form to improve their "searchability", or recall.

Analyzers perform the above tasks. An analyzer actually encapsulates three functions into one package:

  • Character filters: The string passes through each character filter in sequence. Their task is to prepare the string before tokenization. A character filter can be used to remove HTML, or convert “&” to “and”.
  • Tokenizer: Next, the string is divided into individual terms by the tokenizer. A simple tokenizer may split the text into terms when encountering spaces and punctuation.
  • Token filters: Finally, the terms pass through each token filter in sequence. This process may change the terms (e.g., lowercase “Quick”), delete terms (e.g., useless words like “a”, “and”, “the”), or add terms (e.g., synonyms like “jump” and “leap”).

Elasticsearch provides out-of-the-box character filters, tokenizers, and token filters. These can be combined to create custom analyzers for different purposes.
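
The _analyze API is a convenient way to watch these three stages working together. A small sketch that combines a built-in character filter, tokenizer, and token filter (the sample text is made up):

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<p>The QUICK &amp; brown fox</p>"
}

This should produce terms along the lines of the, quick, brown, fox: the HTML markup is stripped, the text is split into words, and every term is lowercased.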

Built-in Analyzers #

Elasticsearch also comes with pre-packaged analyzers that can be used directly. Next, we will list the most important analyzers. To demonstrate their differences, let’s see which terms each analyzer will generate from the following string:

"Set the shape to semi-transparent by calling set_trans(5)"
  • Standard Analyzer

The standard analyzer is the default analyzer used by Elasticsearch. It is the most common choice for analyzing text in various languages. It divides the text based on word boundaries defined by the Unicode Consortium. It removes most punctuation and converts the terms to lowercase. It generates the following terms:

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

  • Simple Analyzer

The simple analyzer separates the text at any non-alphabet characters and converts the terms to lowercase. It generates the following terms:

set, the, shape, to, semi, transparent, by, calling, set, trans

  • Whitespace Analyzer

The whitespace analyzer divides the text at whitespace. It generates the following terms:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

  • Language Analyzers

Language-specific analyzers are available for many languages. They can take into account the specific characteristics of the specified language. For example, the English analyzer comes with a list of English stopwords (common words like "and" or "the" that have little impact on relevance), which are removed. Because it understands the rules of English grammar, this analyzer can also reduce English words to their stems.

The English analyzer will generate the following terms:

set, shape, semi, transpar, call, set_tran, 5

Note that "transparent", "calling", and "set_trans" have been reduced to their stem forms.
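
These outputs can be reproduced with the _analyze API. For instance, the English analyzer result above can be checked as follows (english is a built-in analyzer name; the text is the sample string from this section):

GET /_analyze
{
  "analyzer": "english",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}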

When to Use Analyzers #

When we index a document, its full-text fields are analyzed into terms to create inverted indexes. However, when we search within these full-text fields, we need to apply the same analysis process to the query string to ensure that the format of the search terms matches the format of the terms in the index.

Full-text queries understand how each field is defined so that they can do the right thing:

  • When you query a full-text field, the query string is passed through the same analyzer to generate the correct list of search terms.
  • When you query an exact value field, the query string is not analyzed; the search is for the exact value you specify.

Example

Suppose we have indexed 12 tweets in Elasticsearch, each containing a date field, and we query them in the following ways:

GET /_search?q=2014              # 12 results
GET /_search?q=2014-09-15        # 12 results !
GET /_search?q=date:2014-09-15   # 1  result
GET /_search?q=date:2014         # 0  results !

Why do we get those results?

  • The “date” field contains an exact value: the single term “2014-09-15”.
  • The “_all” field is a full-text field, so the tokenization process converts the date into three terms: “2014”, “09”, and “15”.

When we query in the “_all” field for “2014”, it matches all 12 tweets because they all contain “2014”:

GET /_search?q=2014              # 12 results

When we query in the “_all” field for “2014-09-15”, it first analyzes the query string and generates a query that matches any term among “2014”, “09”, or “15”. This also matches all 12 tweets because they all contain “2014”:

GET /_search?q=2014-09-15        # 12 results !

When we query in the “date” field for “2014-09-15”, it looks for the exact date and finds only one tweet:

GET /_search?q=date:2014-09-15   # 1  result

When we query in the “date” field for “2014”, it doesn’t find any documents because no document contains this exact date:

GET /_search?q=date:2014         # 0  results !
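
The behaviour above follows from how the fields are mapped. A hedged sketch of what such a mapping could look like (the index and field names are illustrative; note that the _all field has been removed in recent Elasticsearch versions, where copy_to fields or multi-field queries play a similar role):

PUT /tweets                      # tweets is a hypothetical index name
{
  "mappings": {
    "properties": {
      "date":  { "type": "date" },
      "tweet": { "type": "text" }
    }
  }
}

The date field holds exact values and is not analyzed, while the tweet field is a full-text field whose contents pass through an analyzer at both index and search time.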
