17 Optimization: Detailed Explanation of Elasticsearch Performance Optimization #

Elasticsearch is a powerful distributed search engine, but getting the best performance out of it requires some optimization work. This article explains Elasticsearch performance optimization methods in detail.

17.1 Hardware and Operating System Optimization #

The choice of hardware and operating system is critical to Elasticsearch performance. Here are some optimization suggestions:

  1. Choose high-performance hardware: use machines with sufficient compute power, memory, and storage.
  2. Use solid-state drives (SSDs): SSDs read and write much faster and are well suited to heavy data read/write workloads.
  3. Optimize the file system: on Linux, use ext4 or XFS, and tune it according to the type of data and logs.
  4. Disable unnecessary services to free up system resources.

17.2 Index Design Optimization #

The index is the core concept in Elasticsearch, and good index design improves search performance. Here are some suggestions:

  1. Choose appropriate field types: pick the type best suited to the stored data, such as text, numeric, or date types.
  2. Use fewer fields: keep only the necessary fields to reduce storage space and search time.
  3. Set shards and replicas sensibly: adjust the number of shards and replicas based on cluster size and load.
  4. Optimize mappings and analyzers: configure field mappings and analyzers appropriately to improve search accuracy and efficiency.

17.3 Query Performance Optimization #

Querying is one of the main operations in Elasticsearch. Here are some ways to optimize query performance:

  1. Use filters: filters are an efficient way to query and can narrow the search scope.
  2. Use caching wisely: take advantage of Elasticsearch's caching mechanisms to cache hot data and avoid repeated computation.
  3. Avoid deeply nested queries: deeply nested queries increase query complexity and execution time.
  4. Use indexes appropriately: choose the right index type for the query requirements.

17.4 Cluster Performance Optimization #

When facing large data volumes and high-concurrency requests, cluster performance optimization becomes especially important. Here are some suggestions:

  1. Scale horizontally: add nodes to expand the cluster and improve throughput and availability.
  2. Balance the load: adjust the distribution of shards and replicas to balance load across the cluster.
  3. Monitor and tune: use tools to monitor cluster status and performance, and tune based on the results.

17.5 Data Backup and Recovery #

Backing up data regularly is an important measure to protect data safety. Here are some suggestions:

  1. Use snapshot and restore: Elasticsearch provides snapshot and restore functionality that makes it easy to back up and restore data (a minimal sketch follows this list).
  2. Set a reasonable backup strategy: choose an appropriate backup frequency and retention period based on business needs and how the data changes.
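As a minimal sketch of the snapshot workflow, assuming a filesystem repository (the repository name my_backup and the path are placeholders, and the path must be whitelisted via path.repo in elasticsearch.yml):

PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": { "location": "/mnt/backups/my_backup" }
}

PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

POST /_snapshot/my_backup/snapshot_1/_restore

The first request registers the repository, the second takes a snapshot of all open indexes, and the third restores that snapshot.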

With the above optimization measures, you can improve the performance and reliability of Elasticsearch while meeting business requirements.

Hardware Configuration Optimization #

Upgrading hardware device configuration has always been the fastest and most effective way to improve service capabilities. There are generally three factors that can affect application performance at the system level: CPU, memory, and IO. ES performance optimization can be performed from these three aspects.

CPU Configuration #

In general, the reasons for CPU being busy are as follows:

  • Infinite or empty loops, busy-waiting non-blocking operations, regular-expression matching, or pure computation in threads;
  • Frequent garbage collection (GC);
  • Context switching in multi-threaded environments;

Most Elasticsearch deployments place only modest demands on the CPU, so compared with the other resources the exact CPU configuration matters less. Choose a modern processor with multiple cores; common cluster machines have 2 to 8 cores. If you have to choose between a faster CPU and more cores, choose more cores: the extra concurrency that multiple cores provide far outweighs a slightly higher clock frequency.

Memory Configuration #

If there is one resource that gets exhausted first, it is likely to be memory. Sorting and aggregations consume a lot of memory, so enough heap space is needed to handle them. At the same time, a comparatively small heap is not wasted: the remaining memory goes to the operating system's file cache, and because many of the data structures Lucene uses are disk-based formats, Elasticsearch benefits greatly from that cache.

Machines with 64 GB of memory are ideal, but 32 GB and 16 GB machines are also common. Less than 8 GB tends to be counterproductive (you end up needing many, many small machines), and machines with more than 64 GB bring their own problems.

Elasticsearch is built on Lucene, and one of Lucene's strengths is that it can use operating system memory to cache index data for fast queries. Lucene's index files (segments) are immutable once written, so the operating system can cache them very effectively for fast access. It is therefore advisable to leave half of the physical memory to Lucene (the OS file cache) and give the other half to Elasticsearch (the JVM heap).

Memory Allocation #

When the machine memory is less than 64G, follow the general principle of allocating 50% to ES and 50% to lucene.

When the machine memory is greater than 64G, follow the following principles:

  • If the main use case is full-text search, it is recommended to allocate 4~32 GB of memory to the ES heap; the remaining memory is left to the operating system for Lucene (the segment cache) to provide faster query performance.
  • If the main use case is aggregations or sorting on numerics, dates, geo_points, and not_analyzed (keyword) strings, the recommendation is the same: allocate 4~32 GB to the ES heap and leave the rest to the operating system for Lucene, which provides fast aggregation and sorting performance on doc values.
  • If the use case is aggregations or sorting on analyzed string data, more heap is required. In that case it is recommended to run multiple ES instances on the machine; keep the combined ES heaps within 50% of the machine's memory (and each heap below 32 GB, since below roughly 32 GB the JVM can use compressed object pointers to save space), and leave more than 50% of the memory for Lucene.

Disable swap #

Disabling swap can prevent fatal performance problems caused by memory swapping with disk. You can achieve this by setting bootstrap.memory_lock: true in elasticsearch.yml to lock the JVM memory and ensure ES performance.
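A minimal sketch of the relevant configuration (the host must also allow memory locking, e.g. via ulimit or the service manager's LimitMEMLOCK setting):

# elasticsearch.yml
bootstrap.memory_lock: true

After startup you can verify that the lock took effect:

GET _nodes?filter_path=**.mlockall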

GC settings #

Older versions of the official documentation recommended the default collector, Concurrent Mark Sweep (CMS), because G1 still had many bugs at the time.

The reason was a known issue in early builds of the HotSpot JVM shipped with JDK 8: versions older than HotSpot 8u40 could corrupt the index when the G1GC collector was used. Source: official explanation

In practice, if you are on a recent JDK 8 build or on JDK 9+, I recommend using G1 GC. Our current project uses G1 GC and it performs well, especially with large heaps. Modify the jvm.options file and change the following lines:

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

to

-XX:+UseG1GC
-XX:MaxGCPauseMillis=50

Here, -XX:MaxGCPauseMillis sets the target maximum GC pause time, 200 ms by default. If the online business is very sensitive to GC pauses, you can lower it; however, setting it too low may drive CPU consumption up.

Under normal cluster operation, G1 GC is effective at reducing the impact of GC pauses on service latency. However, if GC is actually causing the cluster to hang, switching to G1 will most likely not fix the root cause; usually the cluster's data model or queries need to be optimized.

Disk #

Disk is important for all clusters, especially those with a large amount of writes (such as those storing log data). Disk is the slowest subsystem on a server, which means that clusters with high write loads can easily saturate the disk, making it a bottleneck for the cluster.

If economically feasible, use solid-state drives (SSDs). SSDs provide significant improvement in IO performance compared to any rotating media (mechanical hard drives, magnetic tapes, etc.), whether for random writes or sequential writes.

  1. If you are using SSDs, make sure your system's I/O scheduler is configured correctly. When you write data, the I/O scheduler decides when that data is actually sent to the disk. On most default *nix distributions the scheduler is cfq (completely fair queuing).
  2. cfq allocates time slices to each process and optimizes the order in which the many queued requests reach the disk. It is optimized for rotating media: the physical layout of a mechanical hard drive makes carefully ordered writes more efficient.
  3. For SSDs this is inefficient, since there are no moving parts involved. Use the deadline or noop scheduler instead: deadline optimizes based on how long writes have been waiting, while noop is just a simple FIFO queue.

This simple change can have a significant impact. Just by using the correct scheduler, we have seen a 500-fold increase in write capacity.
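For example, on a Linux host you can check and change the scheduler for a given block device (sda is a placeholder; on recent kernels with the multi-queue block layer the equivalents are mq-deadline or none):

# show the currently active scheduler for the device
cat /sys/block/sda/queue/scheduler

# switch to noop (or deadline) for this device
echo noop > /sys/block/sda/queue/scheduler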

If you are using rotating media (such as mechanical hard drives), try to obtain the fastest possible disk (high-performance server hard drives, 15k RPM drives).

Using RAID 0 is an effective way to improve disk speed, for both mechanical drives and SSDs. There is no need for mirroring or other RAID variants, because Elasticsearch already provides redundancy at its own level through replicas; disk-level mirroring is therefore unnecessary, and it also noticeably reduces write speed.

Finally, avoid using network-attached storage (NAS). People often claim that their NAS solution is faster and more reliable than local drives. Despite these claims, we have never seen a NAS live up to its hype. NAS is often slow, exhibits greater latency and wider average latency variation, and is a single point of failure.

Index Optimization Settings #

Index optimization mainly targets the write (ingest) path in Elasticsearch. Elasticsearch's indexing speed is itself quite fast; for concrete numbers you can refer to the official benchmark data. We can optimize indexing according to different needs.

Bulk Submission #

When there is a large amount of data to submit, it is recommended to use bulk submission (the Bulk API). Also, keep each bulk request to no more than a few tens of megabytes; oversized requests cause excessive memory usage.
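A minimal sketch of a bulk request (index name and documents are illustrative; on versions before 7.x a _type would also be required in the metadata lines):

POST /_bulk
{ "index": { "_index": "logs-2023.01.01" } }
{ "message": "first log line", "level": "INFO" }
{ "index": { "_index": "logs-2023.01.01" } }
{ "message": "second log line", "level": "WARN" }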

For example, in the ELK process, when Logstash indexer submits data to Elasticsearch, the batch size can be used as an optimization feature. However, the optimal size needs to be determined based on the document size and server performance.

If the document size exceeds 20MB when submitted in Logstash, Logstash will split a batch request into multiple batch requests.

If an EsRejectedExecutionException exception is encountered during the submission process, it means that the indexing performance of the cluster has reached its limit. In this case, you can either increase the resources of the server cluster or reduce the data collection speed based on business rules, such as only collecting logs with severity levels of Warn and Error.

Increase Refresh Interval #

To improve indexing performance, Elasticsearch adopts a delayed-write strategy: data is first written to an in-memory buffer, and when the refresh interval elapses (index.refresh_interval, 1 second by default) a refresh turns the buffered data into a searchable segment. Only then can the data be found by searches, which is why Elasticsearch provides near-real-time search rather than real-time search.

If the system does not require very low data latency, extending the refresh interval effectively reduces segment-merge pressure and improves indexing speed. For example, in our full-link distributed tracing system we set index.refresh_interval to 30s to reduce the number of refreshes. Similarly, when performing a full (re)index, you can temporarily turn refresh off by setting index.refresh_interval to -1, and switch back to a normal value such as 30s after the import succeeds.

When loading a large amount of data, you can temporarily disable refresh and replicas by setting index.refresh_interval to -1 and index.number_of_replicas to 0.
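A minimal sketch of toggling these settings around a bulk load (my_index is a placeholder):

PUT /my_index/_settings
{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 } }

After the import finishes, restore normal values:

PUT /my_index/_settings
{ "index": { "refresh_interval": "30s", "number_of_replicas": 1 } }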

For more details, please refer to [Principle: Detailed Explanation of the Indexing Document Process in ES].

Modify index_buffer_size Setting #

The index buffer setting can control how much memory is allocated to the indexing process. This is a global configuration that applies to different shards on a single node.

indices.memory.index_buffer_size: 10%
indices.memory.min_index_buffer_size: 48mb

indices.memory.index_buffer_size accepts a percentage or a value representing the size in bytes. The default is 10%, which means that 10% of the total memory allocated to the node is used for the index buffer size. This value is distributed to different shards. If a percentage is set, you can also set min_index_buffer_size (default 48mb) and max_index_buffer_size (no default upper limit).

Modify translog-related Settings #

First, control how often data is synced from memory to disk in order to reduce disk IO. You can increase the sync interval; the default is 5s.

index.translog.sync_interval: 5s

You can also control the size threshold of the translog; only when the threshold is reached is a flush to the Lucene index files triggered. The default is 512mb.

index.translog.flush_threshold_size: 512mb

We also mentioned translog in [Principle: Detailed Explanation of the Indexing Document Process in ES].

Pay Attention to the Use of the _id Field #

When it comes to the _id field, avoid specifying custom _id values, so that you do not have to manage versioning around those IDs yourself. It is recommended to use the default ID generation strategy provided by Elasticsearch, or a numeric ID, as the primary key.

Pay Attention to the Use of the _all Field and _source Field #

When using the _all field and the _source field, pay attention to the scenario and the actual need. The _all field concatenates all indexed fields to make full-text search convenient; if there is no such requirement, it can be disabled (it is deprecated in 6.x and removed in 7.x). The _source field stores the original document content. If you do not need to retrieve the full original document, you can use the includes and excludes properties to define which fields the _source keeps.

Configure the index Attribute Appropriately #

Configure the index attribute of each field appropriately, i.e., analyzed vs. not_analyzed (text vs. keyword in 5.x and later), based on business requirements, to control whether a field is tokenized. For fields only used for group-by style requirements, set them as not_analyzed / keyword to improve query and aggregation efficiency.
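A hedged sketch of such a mapping in the newer text / keyword form (index and field names are illustrative):

PUT /my_index
{
  "mappings": {
    "properties": {
      "title":  { "type": "text" },
      "status": { "type": "keyword" }
    }
  }
}

Here title is analyzed for full-text search, while status is stored as a single token and is cheap to filter, sort, and aggregate on.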

Reduce the Number of Replicas #

Each index has one replica per primary shard by default (and the number is often raised, for example to 3). Replicas improve cluster availability and increase search concurrency, but they also reduce indexing efficiency.

During indexing, each document also has to be sent to the replica shards, and the request returns only after the replicas have acknowledged it. When using Elasticsearch for business-facing search, it is recommended to set the number of replicas to 3. However, for internal ELK log systems and distributed tracing systems, the number of replicas can be set to 1.

Query Optimization #

When using Elasticsearch for near real-time query of business searches, optimizing query efficiency is particularly important.

Routing Optimization #

When we query documents, how does Elasticsearch know which shard a document should be stored in? It actually calculates it using the following formula:

shard = hash(routing) % number_of_primary_shards

By default, routing is the document’s ID, but it can also use a custom value, such as user ID.

Query without Routing #

When querying without routing, there are 2 steps involved:

  • Distribution: After the request reaches the coordinating node, the coordinating node distributes the query request to each shard.
  • Aggregation: The coordinating node collects the query results from each shard, sorts them, and then returns the results to the user.

Query with Routing #

When querying with routing, you can directly locate a certain shard based on the routing information, without having to query all shards and sort the results through the coordinating node.

For example, in a per-user query scenario, if routing is set to the user ID, queries can go directly to the shard holding that user's data, which significantly improves efficiency.
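A hedged sketch of indexing and querying with a custom routing value (index, IDs, and field names are illustrative; the _doc endpoint is the 7.x form):

PUT /orders/_doc/1?routing=user123
{ "user_id": "user123", "amount": 59.9 }

GET /orders/_search?routing=user123
{
  "query": { "term": { "user_id": "user123" } }
}

Because the routing value is supplied, the search runs only on the shard that hash(routing) maps to, instead of on every shard.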

Filter VS Query #

Prefer using filter context (Filter) instead of query context (Query) as much as possible.

  • Query: How well does this document match this query clause?
  • Filter: Does this document match the query clause?

A filter clause in Elasticsearch only needs to answer "yes" or "no"; unlike query clauses, it does not calculate relevance scores, and its results can be cached.
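For example, exact-match conditions can be moved into the filter clause of a bool query so that Elasticsearch skips scoring for them and can cache the results (field names are illustrative):

GET /my_index/_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "title": "elasticsearch" } } ],
      "filter": [
        { "term":  { "status": "published" } },
        { "range": { "publish_date": { "gte": "2023-01-01" } } }
      ]
    }
  }
}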

Deep Paging #

When using Elasticsearch, it’s best to avoid large-scale paging.

Normally, a paged query skips the first from documents and returns size documents. To serve it, each shard has to compute its top from + size documents by score, and the coordinating node collects from + size documents from every shard, i.e. N × (from + size) documents in total (N being the number of shards), sorts them, and finally returns the documents from position from to from + size. The larger from or size is, the more documents take part in the sort and the more CPU is consumed.

This problem can be efficiently solved by using Elasticsearch scroll and scroll-scan.
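A minimal scroll sketch (index name, keep-alive, and page size are illustrative; scroll-scan was folded into the regular scroll API in later versions):

GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id returned by the previous response>"
}

Each follow-up call returns the next batch and renews the scroll context for another minute.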

Alternatively, based on the specific characteristics of the business, if the document ID and the document creation time are in a consistent order, you can use the document ID as the offset for paging and include it as a condition in the paging query.

Proper Use of Scripts #

Scripts can be used in three forms: inline dynamically compiled scripts, stored scripts (kept in the cluster's script store, historically the .scripts index), and file-based scripts. Scripts are generally used for coarse ranking. It is recommended to store the scripts first so that they are compiled in advance, and then reference them by script ID together with params. This separates the model logic from the data and also makes the script module easier to extend and maintain.
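A hedged sketch of storing a script and then referencing it by ID (script ID, field, and params are illustrative; the script_score query shown is the 7.x form, and older versions would wrap the script in a function_score query instead):

POST /_scripts/rank_by_ctr
{
  "script": {
    "lang": "painless",
    "source": "doc['ctr'].value * params.weight"
  }
}

GET /my_index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": { "id": "rank_by_ctr", "params": { "weight": 2 } }
    }
  }
}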

Setting and Usage of Cache #

  • Query Cache: When performing a filter query in ES, the query cache is used. If there are many filter queries in the business scenario, it is recommended to increase the size of the query cache to improve query speed.

indices.queries.cache.size: 10% (default). You can set it as a percentage or a specific value, such as 256mb.

Of course, you can also disable the query cache (enabled by default) by setting index.queries.cache.enabled: false.

  • FieldData Cache: the field data cache is used heavily for aggregations and sorting. In scenarios with many aggregation or sorting operations it is therefore worth sizing it, e.g. indices.fielddata.cache.size: 30% or a fixed value such as 10GB. However, if the data or the access pattern changes frequently, caching may not pay off, because the cost of loading the cache is itself significant.
  • Shard Request Cache: After a query request is initiated, each shard returns the results to the coordinating node, which then consolidates the results. If needed, you can enable the shard request cache by setting index.requests.cache.enable: true. However, the shard request cache only caches the data of types like hits.total, aggregations, and suggestions, and does not cache the content of hits. You can also control the cache space size by setting indices.requests.cache.size: 1% (default).

More Query Optimization Experience #

  • The more query fields used for query_string or multi_match, the slower the query will be. To improve performance, you can use the copy_to property in the mapping phase to index the values of multiple fields into a new field. Then, when using multi_match, query using the new field.
  • When querying date fields, especially when using now, there is no caching. So from a business perspective, consider whether it is necessary to use now, as utilizing query cache can greatly improve query efficiency.
  • The size of the query result set should not be set to an outrageous value. For example, query.setSize should not be set to Integer.MAX_VALUE because Elasticsearch needs to create a data structure to hold the specified size of the result set.
  • Avoid using deeply nested aggregation queries as they can consume a lot of memory and CPU. It is recommended to assemble business logic in the service layer through programming, or optimize using pipelines.
  • Reuse pre-indexed data to improve aggregation performance:

For example, instead of using range aggregations to group by age, where the groups are “Youth” (below 14 years old), “Young” (14-28), “Middle-aged” (29-50), and “Elderly” (51 and above), you can set an age_group field during indexing to classify the data in advance. Then, there is no need to use range aggregations based on age, and you can use the age_group field instead.
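A hedged sketch of that approach, assuming age_group is mapped as a keyword (not_analyzed) field (index, field, and bucket names are illustrative):

PUT /users/_doc/1
{ "name": "Alice", "age": 23, "age_group": "Young" }

GET /users/_search
{
  "size": 0,
  "aggs": {
    "by_age_group": { "terms": { "field": "age_group" } }
  }
}

The classification cost is paid once at write time, and the query-time range aggregation over age is replaced by a cheap terms aggregation over the pre-computed field.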

Locating Slow Queries by Enabling Slow Query Logging #

Whether it is a database or a search engine, enabling slow query logging is necessary for troubleshooting. There are several ways to enable slow query logging in Elasticsearch, but the most common way is to use the template API for global settings:

PUT /_template/{TEMPLATE_NAME}
{
  "template": "{INDEX_PATTERN}",
  "settings": {
    "index.indexing.slowlog.level": "INFO",
    "index.indexing.slowlog.threshold.index.warn": "10s",
    "index.indexing.slowlog.threshold.index.info": "5s",
    "index.indexing.slowlog.threshold.index.debug": "2s",
    "index.indexing.slowlog.threshold.index.trace": "500ms",
    "index.indexing.slowlog.source": "1000",
    "index.search.slowlog.level": "INFO",
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.query.debug": "2s",
    "index.search.slowlog.threshold.query.trace": "500ms",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.threshold.fetch.info": "800ms",
    "index.search.slowlog.threshold.fetch.debug": "500ms",
    "index.search.slowlog.threshold.fetch.trace": "200ms"
  },
  "version": 1
}

Alternatively, the same settings can be applied directly to existing indexes that match the pattern:

PUT {INDEX_PATTERN}/_settings
{
  "index.indexing.slowlog.level": "INFO",
  "index.indexing.slowlog.threshold.index.warn": "10s",
  "index.indexing.slowlog.threshold.index.info": "5s",
  "index.indexing.slowlog.threshold.index.debug": "2s",
  "index.indexing.slowlog.threshold.index.trace": "500ms",
  "index.indexing.slowlog.source": "1000",
  "index.search.slowlog.level": "INFO",
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "800ms",
  "index.search.slowlog.threshold.fetch.debug": "500ms",
  "index.search.slowlog.threshold.fetch.trace": "200ms"
}

By doing this, slow queries will be recorded, with the information needed for troubleshooting, in the following log files:

{CLUSTER_NAME}_index_indexing_slowlog.log
{CLUSTER_NAME}_index_search_slowlog.log

Data Structure Optimization #

Based on the use case of Elasticsearch, the document data structure should be integrated with the use case, removing unnecessary and unreasonable data.

Minimize unnecessary fields #

If Elasticsearch is used for business search services, it is best not to store fields that are not required for search in ES. This not only saves space but also improves search performance with the same amount of data.

Avoid using dynamic values (such as timestamps or IDs) as field names; an ever-growing mapping can destabilize or even crash the cluster. Likewise, keep the number of fields under control: fields that the business never queries should not be indexed. Controlling the number of indexed fields, the mapping depth, and the field types is crucial for optimizing ES performance.

The following are some default settings of ES regarding field numbers and mapping depth:

index.mapping.nested_objects.limit: 10000
index.mapping.total_fields.limit: 1000
index.mapping.depth.limit: 20

Nested Object vs Parent/Child #

Try to avoid using nested or parent/child fields if possible. Nested queries are slow, and parent/child queries are even slower, hundreds of times slower than nested queries. Therefore, if something can be accomplished during the mapping design phase (e.g., designing a denormalized table or using a more efficient data structure), avoid using parent/child relationships in the mapping.

If it is necessary to use nested fields, make sure that the number of nested fields is not too high. Currently, ES limits the number of nested fields to 50. This is because for each document, for every nested field, an independent document is generated, which significantly increases the number of documents and affects query efficiency, especially for JOIN operations.

index.mapping.nested_fields.limit: 50

| Comparison | Nested Object | Parent/Child |
| --- | --- | --- |
| Advantages | Documents are stored together, so read performance is higher | Parent and child documents can be updated independently without affecting each other |
| Disadvantages | Updating a parent or child document requires reindexing the whole document | Extra memory is needed to maintain the join relationship, and read performance is poorer |
| Use Case | Child documents are updated occasionally and queried frequently | Child documents are updated frequently |

Choose static mapping and disable dynamic mapping when unnecessary #

Try to avoid using dynamic mapping, as it may cause cluster crashes. In addition, dynamic mapping may introduce uncontrollable data types, which can lead to related exceptions in query processing and impact business operations.
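A hedged sketch of disabling dynamic mapping for an index (the 7.x typeless mapping form is shown; "strict" rejects documents containing unknown fields, while false would silently ignore them):

PUT /my_index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" },
      "ctr":   { "type": "float" }
    }
  }
}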

Furthermore, as Elasticsearch primarily serves as a search engine for query matching and sorting, the storage types of data are divided into two categories based on the purposes of these two functionalities. One category includes fields used for matching, which are used to establish inverted indexes for query matching. The other category includes feature fields used for rough sorting, such as CTR, click counts, comment counts, etc.

Document model design #

In MySQL, we often have complex join queries. How should we handle this in Elasticsearch? It is best to avoid using complex join queries in Elasticsearch, as their performance is generally not good.

Ideally, the associations are already done in the Java system and the associated data is directly written into Elasticsearch. When searching, there is no need to rely on Elasticsearch’s search syntax to perform join-like association searches.

Document model design is crucial. Do not plan to perform all kinds of complex operations at search time: Elasticsearch supports only a limited set of operations well, so do not use it for things it is not good at. If such operations are needed, try to complete them during document model design and at write time. Complex operations such as join, nested, and parent-child searches should likewise be avoided, as they all perform poorly.

Cluster Architecture Design #

Proper deployment of Elasticsearch helps improve overall service availability.

Separation of Master, Data, and Coordination Nodes #

When designing the architecture of the Elasticsearch cluster, it is recommended to separate the master nodes, data nodes, and load balancing nodes. Starting from version 5.x, data nodes can be further divided into a “Hot-Warm” architecture.

There are two parameters in the Elasticsearch configuration file, node.master and node.data. Used in combination, they determine the role a node server plays.

Master Nodes #

By configuring node.master: true and node.data: false, a node becomes a dedicated master node that stores no index data. It is recommended to run three dedicated master nodes per cluster for better resilience; with this configuration you also need to set discovery.zen.minimum_master_nodes to 2 to avoid split-brain. These three dedicated master nodes manage the cluster and improve its overall stability. Because they hold no data and take no part in actual search and indexing operations, their JVMs do not run heavy indexing or resource-consuming searches, so the CPU, memory, and disk configuration of master nodes can be much smaller than that of data nodes.
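A hedged sketch of the corresponding elasticsearch.yml entries for a dedicated master node (the legacy node.master / node.data flags are shown; newer releases express the same thing through node.roles, and discovery.zen.minimum_master_nodes is no longer needed from 7.x on):

# elasticsearch.yml on a dedicated master-eligible node
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2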

Data Nodes #

By configuring node.master:false and node.data:true, the node server is designated as a data node that is used solely for storing index data. This ensures that the server has a single function of data storage and query, reducing resource consumption rates.

Starting from Elasticsearch version 5.x, data nodes can be further divided into a “Hot-Warm” architecture, which includes hot nodes and warm nodes.

Hot Nodes:

Hot nodes are primarily indexing nodes (write nodes) and also store recently queried indexes. Because indexing operations are CPU and IO intensive, it is recommended to use SSD disks to maintain good write performance. It is recommended to deploy a minimum of three hot nodes to ensure high availability. Depending on the amount of data that needs to be collected and queried recently, you can increase the number of servers to achieve the desired performance.

To configure a node as a hot type, the elasticsearch.yml file should include the following settings:

node.attr.box_type: hot

If this should apply to a specific index, you can use the index setting index.routing.allocation.require.box_type: hot so that the index is written to hot nodes.

Warm Nodes:

These nodes are designed to handle a large volume of infrequently accessed read-only indexes. Since these indexes are read-only, warm nodes tend to mount a large number of regular disks instead of SSDs. The memory and CPU configuration can remain the same as the hot nodes, and the number of nodes is generally equal to or greater than three. To set a node as a warm type, the elasticsearch.yml configuration should include the following:

node.attr.box_type: warm

Additionally, you can also set index.codec: best_compression in elasticsearch.yml to ensure compression configuration for warm nodes.

When an index is no longer frequently queried, you can mark it as warm with index.routing.allocation.require.box_type: warm. This ensures the index's data is no longer kept on hot nodes, so the SSD resources there can be used more effectively. Once this property is set, Elasticsearch automatically migrates the index's shards onto warm nodes.
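A minimal sketch of that step (the index name is illustrative, and box_type is simply the custom node attribute defined above):

PUT /logs-2023.01.01/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}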

Coordinating Nodes #

Coordinating nodes coordinate distributed operations and combine the data returned by the various shards or nodes. They are not elected as master and store no index data; they are primarily used for query load balancing. A query often needs data from several node servers: the coordinating node distributes the request to the relevant nodes, aggregates the results each of them returns, and sends the combined result back to the client. In an Elasticsearch cluster every node can act as a coordinating node, but you can create dedicated coordinating nodes by setting node.master, node.data, and node.ingest all to false. Such nodes need good CPUs and plenty of memory.

  • node.master:false and node.data:true: This node server only serves as a data node, storing index data and handling data storage and queries, reducing resource consumption.
  • node.master:true and node.data:false: This node server serves as a master node but does not store any index data. It utilizes its own available resources to coordinate various index creation or query requests and distribute them to relevant node servers.
  • node.master:false and node.data:false: This node server is neither selected as a master node nor stores any index data. It is primarily used for query load balancing. During a query, it often involves querying data from multiple node servers and distributing the requests to multiple specified node servers. The results returned by each node server are then aggregated and returned to the client.

Disabling HTTP functionality on data nodes #

For all data nodes in the Elasticsearch cluster, there is no need to enable the HTTP service. This can be achieved by setting the configuration parameter http.enabled:false. Additionally, avoid installing monitoring plugins such as Head, BigDesk, Marvel, etc. on these data node servers to ensure that data node servers only handle operations like creating/updating/deleting/querying index data.

HTTP functionality can be enabled on non-data node servers, and the aforementioned monitoring plugins can be installed on these servers to monitor the Elasticsearch cluster status and other data information. This approach is taken for data security and service performance reasons.

A single physical server can start multiple node server instances (by setting different startup ports), but considering that the CPU, memory, disk, and other resources on a server are limited, it is not recommended to start multiple node servers on one server from a performance perspective.

Cluster shard settings #

Once an index has been created in Elasticsearch, the number of primary shards cannot be adjusted. In Elasticsearch a shard corresponds to a Lucene index, and reading from and writing to a Lucene index consumes significant system resources, so the number of shards must not be set too high. It is therefore crucial to configure the number of shards properly when creating an index. In general we follow these principles:

  • Keep the disk capacity occupied by each shard below the maximum heap size of Elasticsearch (usually no more than 32 GB, per the JVM memory guidance above). So if the total size of an index is around 500 GB, roughly 16 shards are recommended; ideally, also take the second principle into account.
  • Consider the number of nodes. A node is usually a single physical machine. If the number of shards greatly exceeds the number of nodes, a single node is likely to host many shards, and if that node fails, even with more than one replica, data loss or an unrecoverable cluster can result. It is therefore usually recommended that the number of shards not exceed three times the number of nodes.