13 Understanding the Principles of the ES Architecture Through Diagrams #

Diagrams are a very effective way to learn the principles of ES: they make the way ES works and its core concepts much more intuitive. This article explains the principles of ES through a series of illustrations, to help you understand its basic concepts and components.

13.1 Basic Principles of ES #

ES stands for Elasticsearch, an open-source search engine built on Lucene. It has a distributed, decentralized architecture and offers high reliability, high performance, and ease of use.

The basic principles of ES are as follows:

  1. Cluster: ES consists of multiple nodes, which can be organized into a cluster. The nodes in a cluster communicate and coordinate with each other over the network and work together to handle search and analysis requests.

  2. Index: An index is one of the most important components of ES; it is similar to a table in a database. An index consists of a set of documents, each of which represents one record. Every document has a unique identifier (ID) and data organized into fields.

  3. Document: A document is the basic unit in ES, equivalent to a record in a database. A document consists of fields, each with a name and a value. A document can hold data of any kind and is typically expressed as JSON.

  4. Shard: ES splits an index into multiple shards, each of which stores part of the index's data. Shards can be scaled out horizontally and stored in a distributed fashion, increasing both the capacity and the performance of ES.

  5. Replica: ES can create replicas for each shard; a replica holds a complete copy of its shard. Replicas provide data redundancy and fault tolerance, making the system more reliable. (A minimal sketch of creating an index with shards and replicas follows this list.)
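To make the shard and replica settings concrete, here is a minimal sketch using Python and the `requests` library against a local single-node cluster; the node address and the index name `articles` are assumptions made purely for illustration.

```python
import requests

ES = "http://localhost:9200"  # assumed local node address

# Create an index with 3 primary shards, each having 1 replica.
settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    }
}
resp = requests.put(f"{ES}/articles", json=settings)
print(resp.json())  # {'acknowledged': True} on success
```

With these settings the index consists of six shard copies (three primaries plus three replicas), which the cluster can spread across its nodes.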

13.2 The ES Workflow #

The workflow of ES is shown in the figure below:

ES Workflow

  1. Indexing data: First, the data to be indexed is sent to ES, together with the target index name, index type, document ID, and other information.

  2. Analysis and building the inverted index: Next, ES analyzes each document, splitting the text into individual terms, and builds an inverted index. An inverted index is a structure that maps terms back to the documents that contain them, so documents containing a particular term can be located quickly.

  3. Search and query: When a user sends a search request, ES matches the query terms against the index, finds the documents that contain them, and returns the corresponding results. ES uses a technique known as a "distributed inverted index" to make searching efficient.

  4. Scoring and sorting: ES scores the results, sorts them by relevance and match quality, and finally returns the highest-ranked documents. (A minimal sketch of this index-then-search round trip follows this list.)
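Below is a minimal sketch of that round trip with the `requests` library; the index name, type, document fields, and query text are invented for illustration.

```python
import requests

ES = "http://localhost:9200"  # assumed local node address

# 1. Index a document; refresh=true makes it searchable immediately instead
#    of waiting for the periodic refresh (once per second by default).
doc = {"title": "Elasticsearch from the bottom up", "author": "alice"}
requests.put(f"{ES}/articles/article/1", params={"refresh": "true"}, json=doc)

# 2. Search: the query text is analyzed into terms, matched against the
#    inverted index, and hits come back scored and sorted by relevance.
query = {"query": {"match": {"title": "bottom up"}}}
resp = requests.post(f"{ES}/articles/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```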

The above is a diagram-based explanation of the basic principles and workflow of ES. These illustrations should make it easier to understand how ES works and what its core concepts are. To study the principles and usage of ES in more depth, refer to the official ES documentation and other related material.

Introduction #

This article provides an overview of the underlying principles of ElasticSearch, from top to bottom and then bottom to top. It aims to answer the following questions:

  • Why does my search *foo-bar* fail to match foo-bar?
  • Why does adding more documents compress the index?
  • Why does ElasticSearch consume a large amount of memory?

Version

elasticsearch version: elasticsearch-2.2.0

Illustration of ElasticSearch #

  • Cluster in the cloud

img

  • Boxes in the cluster

Each white square box inside the cloud represents a node (Node).

img

  • Between nodes

One or more of these small green blocks, possibly spread across several nodes, together make up an ElasticSearch index.

img

  • Small blocks in the index

Within a single index, the small green blocks distributed across multiple nodes are called shards (Shard).

img

  • Shard = Lucene Index

An ElasticSearch shard is essentially a Lucene Index.

img

Lucene is a full-text search library (one of many kinds of search library), and ElasticSearch is built on top of it. Most of what follows is really about how ElasticSearch works on top of Lucene.

Understanding Lucene #

Segment #

  • Mini Index - Segment

Inside Lucene there are many small segments, each of which can be seen as a mini-index in its own right.

img

  • Internal Structure of a Segment

A segment consists of several data structures:

  • Inverted Index
  • Stored Fields
  • Document Values
  • Cache

img

Inverted Index #

The most important of these structures is the Inverted Index.

img

The Inverted Index consists of two main parts:

  • An ordered term dictionary (Dictionary), which stores the terms together with their document frequencies.
  • A postings list for each term, i.e. the list of documents that contain that term.

When a search runs, the query is first broken into terms; each term is then looked up in the dictionary, and its postings list gives the documents that contain it.

img
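As a minimal pure-Python sketch of those two parts (a whitespace split stands in for a real analyzer), the dictionary is a sorted list of terms with their document frequencies, and each term carries a postings set of document IDs:

```python
from collections import defaultdict

docs = {
    1: "the fury of the storm",
    2: "the quiet fury",
    3: "calm before the storm",
}

# Postings: term -> set of document IDs containing the term.
postings = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():          # trivial whitespace "analysis"
        postings[term].add(doc_id)

# Dictionary: terms kept in sorted order, with their document frequency.
dictionary = [(term, len(ids)) for term, ids in sorted(postings.items())]

# Searching "the fury": look up both terms, then intersect their postings.
result = postings["the"] & postings["fury"]
print(dictionary)
print(sorted(result))   # -> [1, 2]
```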

  • Searching for “the fury”

img

  • Auto Completion - Prefix

Because the dictionary is ordered, searching for terms that start with the letter “c” only needs a binary search, which quickly locates terms such as “choice” or “coming”.

img

  • Expensive Lookup

If instead you want to find all terms that contain “our” anywhere inside them, the sort order does not help: the system has to scan the entire term dictionary, which can be very expensive.

img
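A small sketch of why the sorted dictionary helps: a prefix lookup needs only two binary searches (here via Python's `bisect`), while a substring lookup such as “our” has to scan every term.

```python
import bisect

# A sorted term dictionary.
terms = ["and", "choice", "coming", "fury", "hours", "our", "storm", "the"]

def prefix_terms(sorted_terms, prefix):
    """Find all terms starting with `prefix` using two binary searches."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")  # just past the prefix range
    return sorted_terms[lo:hi]

print(prefix_terms(terms, "c"))          # -> ['choice', 'coming']

# A substring query cannot exploit the sort order; it must scan everything.
print([t for t in terms if "our" in t])  # -> ['hours', 'our'] (full dictionary scan)
```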

In cases like this, if we want to optimize, the problem becomes how to generate suitable terms at index time.

  • Problem Transformation

img

For such problems, there are several feasible approaches:

  1. * suffix -> xiffus *

If we want to search by suffix, we can also index the reversed term; a suffix query then becomes a cheap prefix query on the reversed term.

  2. (60.6384, 6.5017) -> u4u8gyykk

Geo coordinates can be converted into a GEO Hash string and indexed as a term.

  3. 123 -> {1-hundreds, 12-tens, 123}

For simple numbers, several forms of the term can be generated, such as the number itself plus its decimal prefixes (sketched below).
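A sketch of generating such extra terms at index time; the geohash conversion would normally come from a geohash library, so it is only hinted at in a comment.

```python
def suffix_terms(term):
    # To support suffix (*-pattern) queries, also index the reversed term;
    # the suffix query then becomes a prefix query on the reversed field.
    return [term, term[::-1]]

def number_terms(number):
    # Index every decimal prefix of the number so "hundreds", "tens" and
    # exact lookups can all be answered with exact terms.
    digits = str(number)
    return [digits[:i] for i in range(1, len(digits) + 1)]

print(suffix_terms("suffix"))   # -> ['suffix', 'xiffus']
print(number_terms(123))        # -> ['1', '12', '123']

# Geo points such as (60.6384, 6.5017) would similarly be indexed as a
# geohash string (via a geohash library), yielding a term like 'u4u8gyykk'.
```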

  • Handling Spelling Errors

For spelling errors, one Python library generates a tree-shaped state machine that encodes likely misspellings of each word, so queries containing spelling mistakes can still be matched.

img
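The library itself is not reproduced here; instead, here is a minimal sketch of the underlying idea: precompute terms that are one edit away from a correct word and map them back to it, so a misspelled query can still reach the right term.

```python
import string

def one_edit_variants(term):
    """Generate deletions, substitutions and insertions one edit away."""
    letters = string.ascii_lowercase
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | substitutes | inserts

# Map every misspelling back to its intended term.
corrections = {variant: "fury" for variant in one_edit_variants("fury")}
print(corrections.get("furry"))   # -> 'fury'
```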

Stored Fields Lookup #

The Inverted Index answers the question “which documents contain this term”, but it cannot efficiently return the actual content of a document, for example its title. For this, Lucene provides another data structure, Stored Fields. A stored field is essentially a simple key-value pair, and by default ElasticSearch stores the entire JSON source of a document as a stored field.

img

Document Values for Sorting and Aggregations #

Even with the structures above, sorting, aggregations, and facets remain awkward, because they would require reading a lot of information we do not actually need. Yet another data structure solves this problem: Document Values. It is essentially columnar storage, which optimizes how values of the same type are laid out.

img

To improve efficiency further, ElasticSearch can load all of the Document Values for an index into memory. This greatly improves access speed, but it also consumes a large amount of memory.
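A small sketch of the difference in layout: Stored Fields keep everything about one document together (good for returning the hit), while Document Values keep one field across all documents together (good for sorting and aggregating). The field names and values are invented for illustration.

```python
from collections import Counter

# Stored Fields: document ID -> the original source, good for returning hits.
stored_fields = {
    1: {"title": "the fury", "year": 2012},
    2: {"title": "the storm", "year": 2015},
    3: {"title": "calm", "year": 2012},
}

# Document Values: one column per field, good for sorting and aggregating.
doc_values_year = {1: 2012, 2: 2015, 3: 2012}

# Aggregating by year only touches this single column, not whole documents.
print(Counter(doc_values_year.values()))   # -> Counter({2012: 2, 2015: 1})
```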

In summary, these data structures - Inverted Index, Stored Fields, Document Values, and their caches - are all located within the segment.

When a search occurs #

During a search, Lucene searches every segment, collects each segment's results, and finally merges them and returns them to the client.

A few properties of Lucene make this process work the way it does:

  • Segments are immutable.
    • Delete? When a document is deleted, Lucene only marks it as deleted; the document stays where it is inside its segment, and the segment file is never modified (see the sketch after this list).
    • Update? An update is therefore really a delete followed by re-indexing the new version of the document.
  • Compression everywhere.
    • Lucene is very good at compressing data, and basically all compression methods in textbooks can be found in Lucene.
  • Caching everything.
    • Lucene also caches all information, which greatly improves its query efficiency.
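A minimal sketch of what this means at search time: every segment is searched, deleted documents are only masked out by a marker rather than removed, and the per-segment results are merged at the end.

```python
segments = [
    {"postings": {"storm": [1, 3]}, "deleted": {3}},   # doc 3 marked deleted, not removed
    {"postings": {"storm": [7]},    "deleted": set()},
]

def search(term):
    hits = []
    for seg in segments:
        for doc_id in seg["postings"].get(term, []):
            if doc_id not in seg["deleted"]:   # deletions are just a mask
                hits.append(doc_id)
    return hits   # merged results from all segments

print(search("storm"))   # -> [1, 7]
```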

The story of caching #

When ElasticSearch indexes a document, the new data first lives in in-memory structures and caches; these are refreshed at a regular interval (once per second by default), and only after a refresh does the document become searchable.

img

As time goes by, we will have many segments.

img

So ElasticSearch merges these segments, and in the process the old segments are eventually deleted.

img

This is why adding documents may reduce the space occupied by the index: it triggers merges, which can result in more compression.
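A conceptual sketch of a merge: the live documents of two old segments are rewritten into one new segment, and the documents that were only marked as deleted finally disappear, which is how the total size can shrink.

```python
def merge(seg_a, seg_b):
    """Combine two segments, dropping documents that were marked deleted."""
    merged_docs = {}
    for seg in (seg_a, seg_b):
        for doc_id, doc in seg["docs"].items():
            if doc_id not in seg["deleted"]:
                merged_docs[doc_id] = doc
    return {"docs": merged_docs, "deleted": set()}

old_a = {"docs": {1: "foo", 2: "bar"}, "deleted": {2}}
old_b = {"docs": {3: "baz"}, "deleted": set()}
print(merge(old_a, old_b))   # -> {'docs': {1: 'foo', 3: 'baz'}, 'deleted': set()}
```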

  • Example

Two segments will be merged.

img

These two segments are merged into a new segment, and the old segments are then deleted.

img

At this point, the new segment is cold (not yet in the cache), while most of the other segments are unchanged and remain warm (already cached).

This scenario frequently occurs in the internal operations of the Lucene Index.

img

Searching in the shard #

The process of searching from a shard in ElasticSearch is similar to searching from a Lucene segment.

img

The difference from searching within a Lucene segment is that the shards of an index may be spread across different nodes, so when searching and returning results, all of the information has to travel over the network.

Note that:

Searching for 2 shards in 1 search = 2 separate shard searches.

img
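A sketch of that scatter/gather: each shard is searched independently and returns its own scored hits, and the coordinating node merges them into one globally ranked list. The document IDs and scores below are invented.

```python
import heapq

# Each shard returns its own scored hits as (doc_id, score) pairs.
shard_results = [
    [("doc-11", 2.1), ("doc-17", 0.4)],   # hits from shard 0
    [("doc-23", 1.7), ("doc-42", 1.5)],   # hits from shard 1
]

# The coordinating node merges them and keeps the global top hits.
top = heapq.nlargest(3, (hit for hits in shard_results for hit in hits),
                     key=lambda hit: hit[1])
print(top)   # -> [('doc-11', 2.1), ('doc-23', 1.7), ('doc-42', 1.5)]
```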

  • Processing log files

When we want to search for logs generated on a specific date, partitioning and indexing the logs by timestamp (for example, one index per day) can greatly improve search efficiency.

It also makes deleting old data very convenient: just delete the old index.

img

In this case, each index has two shards.
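A minimal sketch of this time-based pattern with the `requests` library; the index naming scheme, type name, and node address are assumptions.

```python
import datetime
import requests

ES = "http://localhost:9200"  # assumed local node address

# Write today's log lines into an index named after the date, e.g. logs-2016.03.07.
today = datetime.date.today().strftime("logs-%Y.%m.%d")
requests.post(f"{ES}/{today}/log", json={"message": "user logged in"})

# Searching a single day only touches that day's shards.
requests.post(f"{ES}/{today}/_search", json={"query": {"match_all": {}}})

# Retiring old data is just deleting the old index, no per-document deletes.
requests.delete(f"{ES}/logs-2016.01.01")
```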

  • How to Scale

img

A shard itself is never split further, but it can be moved to a different node.

img

So if the load on the cluster nodes reaches a certain level, we may consider adding new nodes. But because the number of shards in an index is fixed when the index is created, scaling beyond that number would require re-indexing all of the data, which is something we want to avoid. We therefore have to plan carefully and strike a balance between having too few shards and having too many.

  • Node allocation and shard optimization
    • Place the indexes of more important data on higher-performance machines.
    • Make sure every shard has a replica.

img

  • Routing

Each node maintains a routing table, so when a request arrives at any node, ElasticSearch can forward it to the right shard on the right node for processing.

img
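Conceptually, the routing decision is a modulo over a hash of the routing value (the document ID by default). The sketch below uses Python's built-in `hash` purely as a stand-in; the real implementation hashes with Murmur3.

```python
def pick_shard(routing_value, number_of_primary_shards):
    # ElasticSearch computes: shard = hash(routing) % number_of_primary_shards.
    # Python's hash() is only a stand-in for the real Murmur3 hash.
    return hash(routing_value) % number_of_primary_shards

# With 3 primary shards, every document ID lands deterministically on one shard,
# so any node holding the routing table can forward the request correctly.
for doc_id in ("1", "2", "42"):
    print(doc_id, "->", pick_shard(doc_id, 3))
```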

A Real Request #

img

  • Query

img

The query is a filtered query that wraps a multi_match query.
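A sketch of what such a request body might look like in the 2.x query DSL; the search text, field names, and filter value are invented for illustration.

```python
query_body = {
    "query": {
        "filtered": {                       # 2.x-style filtered query
            "query": {
                "multi_match": {            # scored part, matched against several fields
                    "query": "elasticsearch internals",
                    "fields": ["title", "body"],
                }
            },
            "filter": {                     # non-scoring, cacheable part
                "term": {"published": True}
            },
        }
    }
}
```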

  • Aggregation

img

Aggregate by author to get the top 10 authors and, for each of them, their top 10 hits.
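A sketch of an aggregation with that shape: a `terms` aggregation on the author field with a nested `top_hits` aggregation; the field name is an assumption.

```python
aggregation_body = {
    "size": 0,                                           # only the aggregation is needed
    "aggs": {
        "top_authors": {
            "terms": {"field": "author", "size": 10},    # top 10 authors by document count
            "aggs": {
                "best_hits": {
                    "top_hits": {"size": 10}             # top 10 hits for each author
                }
            },
        }
    },
}
```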

  • Request Distribution

This request can be distributed to any node in the cluster.

img

  • Coordinating Node

img

At this point, this node becomes the coordinator for the current request. Based on the index information it decides: a) which nodes and shards the request should be routed to, and b) which replicas are available, and so on.

  • Routing

img

  • Before the Actual Search

ElasticSearch converts the Query into a Lucene Query.

img

Then it performs calculations in all the segments.

img

There is also caching for filter conditions.

img

However, queries are not cached, so if the same query is executed repeatedly, the application itself needs to handle caching.

img

So:

a) Use filters whenever you can. b) Use queries only when scoring is needed.

  • Results

After the search is complete, the results are returned up the path.

img


References #

SlideShare: Elasticsearch From the Bottom Up

Youtube: Elasticsearch from the bottom up

Wiki: Document-term matrix

Wiki: Search engine indexing

Skip list

Stanford Edu: Faster postings list intersection via skip pointers

StackOverflow: how an search index works when querying many words?

StackOverflow: how does lucene calculate intersection of documents so fast?

Lucene and its magical indexes

misspellings 2.0c: A tool to detect misspellings