01 Understand the Basic Concepts of ElasticSearch #

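ElasticSearch is an open-source distributed search engine for efficiently searching, analyzing, and storing large data sets. It is built on top of the Apache Lucene search library and exposes an easy-to-use RESTful API that makes it simple for developers to interact with it.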

The following are some of the basic concepts of ElasticSearch:

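  1. Index: An index is a collection of documents with similar characteristics. It is similar to a table in a relational database. Each index has a unique name that identifies it within ElasticSearch.

  2. Document: A document is the smallest unit in an index. It is a unit of indexed data and can hold any structured or unstructured data. A document consists of multiple fields, each with a name and a corresponding value.

  3. Type: A type is a way of logically grouping documents within an index. Originally an index could contain one or more types. Unlike a table in a relational database, a type does not enforce a strict schema; it is more flexible and allows different document types to have different fields. Types have been deprecated since version 6.0.0, from which an index holds only one type.

  4. Mapping: A mapping defines the fields and properties of a document. It specifies the data type of each field and how the field is indexed and queried. Mappings can be created automatically or defined manually by the user.

  5. Shards and Replicas: To handle large data sets, ElasticSearch splits an index into multiple shards, each of which is a complete and independent index. Shards can be distributed across multiple nodes, which provides horizontal scalability. Each shard can have multiple replicas, which improve data redundancy and availability.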
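As a rough illustration of how mappings, shards, and replicas come together, the sketch below builds the JSON body for a create-index request (sent as `PUT /<index>` in recent, typeless versions). The field names `title`, `author`, and `published` and the helper `build_create_index_body` are hypothetical, chosen only for this example; nothing is sent over the network.

```python
import json


def build_create_index_body(num_shards: int, num_replicas: int) -> dict:
    """Build the JSON body for a create-index request (PUT /<index>)."""
    return {
        # Shard and replica counts for the new index.
        "settings": {
            "number_of_shards": num_shards,
            "number_of_replicas": num_replicas,
        },
        # Explicit mapping: each field gets a data type (hypothetical fields).
        "mappings": {
            "properties": {
                "title": {"type": "text"},       # analyzed for full-text search
                "author": {"type": "keyword"},   # exact-match value
                "published": {"type": "date"},
            }
        },
    }


body = build_create_index_body(num_shards=3, num_replicas=1)
print(json.dumps(body, indent=2))
```

If the `mappings` section is left out, ElasticSearch infers field types automatically when the first documents arrive, which matches the "created automatically" case described above.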

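These basic concepts are essential for understanding and using ElasticSearch correctly. In later tutorials we will explore each concept in depth and show how to use ElasticSearch for various operations and queries.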

Why do we need to learn ElasticSearch #

According to the DB-Engines ranking, ElasticSearch is the most popular enterprise search engine.

In the screenshot below, the red checkmarks indicate the databases we have previously discussed, and you can see that the search engine ElasticSearch also appears in the top ten:

img

So why do we need to learn ElasticSearch?

  1. In the current software industry, search is a fundamental feature of software systems and platforms. Learning ElasticSearch allows us to build a good search experience into the software we work on.

  2. Furthermore, ElasticSearch has very strong capabilities for big data analysis. Hadoop can also be used for big data analysis, but a Hadoop job can sometimes take a long time to produce a result, whereas ElasticSearch can return analysis results in near real time.

  3. ElasticSearch is easy to use and can be installed on a personal laptop as well as scaled horizontally in a production environment.

  4. Many large internet companies in China, such as Xiaomi, Didi, and Ctrip, are using ElasticSearch. Managed ElasticSearch products are also available on Tencent Cloud and Alibaba Cloud.

  5. In the era of big data, mastering near real-time search and analysis is essential for staying competitive.

What is ElasticSearch #

ElasticSearch is a powerful, open-source search and analytics engine based on Lucene. It is a real-time distributed search and analytics engine that allows you to explore your data at unprecedented speed and scale.

It is used for full-text search, structured search, analysis, and combinations of these three functionalities:

  • Wikipedia uses Elasticsearch to provide full-text search with highlighted snippets, search-as-you-type, and did-you-mean suggestions.
  • The Guardian combines social network data with visitor logs using Elasticsearch to provide real-time feedback from the public on new articles.
  • Stack Overflow incorporates geolocation queries into full-text search and uses the more-like-this interface to find related questions and answers.
  • GitHub uses Elasticsearch to query 130 billion lines of code.

In addition to search, Elasticsearch combined with the open-source products Kibana, Logstash, and Beats forms the Elastic Stack (often called ELK), which is widely used for near real-time big data analytics, including log analysis, metric monitoring, information security, and more. It can help you explore massive structured and unstructured data, create visual reports on demand, set alert thresholds for monitoring data, and automatically identify abnormal conditions using machine learning.

ElasticSearch exposes a RESTful web API and is developed in Java. It is built on the Lucene search library and was originally released as open source under the Apache License 2.0. It is a popular enterprise search engine, and clients are available in many languages such as Java, C#, PHP, and Python.
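Because the interface is plain HTTP plus JSON, any language with an HTTP client can talk to it. The sketch below, using only the Python standard library, builds (but does not send) a full-text search request. It assumes a node listening on `http://localhost:9200` (the default HTTP port) and a hypothetical index named `blog` with a `title` field; the helper `build_search_request` is illustrative and not part of any official client.

```python
import json
import urllib.request

# Assumption: a local node listening on the default HTTP port.
BASE_URL = "http://localhost:9200"


def build_search_request(index: str, field: str, text: str) -> urllib.request.Request:
    """Build a POST /<index>/_search request carrying a simple match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("blog", "title", "distributed search")
print(request.get_method(), request.full_url)
# To run it against a live node: urllib.request.urlopen(request)
```

The official language clients wrap this same request/response cycle, so the JSON body shown here carries over to them largely unchanged.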

Origins of ElasticSearch #

The story behind ElasticSearch

Many years ago, a recently married unemployed developer named Shay Banon went to London with his wife, who was studying to become a chef. While looking for a job that would bring in money, he started using an early version of Lucene to create a recipe search engine for his wife.

Using Lucene directly was difficult, so Shay started building an abstraction layer that would allow Java developers to easily add search functionality to their programs. He released his first open-source project, Compass.

Later, Shay got a job focused on high-performance, distributed, in-memory data grids. This job highlighted the need for a high-performance, real-time, distributed search engine. Shay decided to rewrite Compass as a standalone service and named it Elasticsearch.

The first public version was released in February 2010. Since then, Elasticsearch has become one of the most active projects on GitHub, with over 300 contributors (736 at the time of writing). A company has started offering commercial services around Elasticsearch and developing new features. However, Elasticsearch will always remain open source and available to everyone.

It is said that Shay’s wife is still waiting for her recipe search engine…

Why not just use Lucene directly #

Elasticsearch is based on Lucene, so why not just use Lucene directly?

Lucene can be considered the most advanced, high-performance, full-featured search engine library available today.

However, Lucene is just a library. To fully leverage its capabilities, you need to use Java and integrate Lucene directly into your application. Even worse, you might need a degree in information retrieval to understand how it works. Lucene is very complex.

Elasticsearch is also written in Java, and internally, it uses Lucene for indexing and searching. But its goal is to make full-text search simple by hiding the complexity of Lucene and providing a simple and consistent RESTful API.

However, Elasticsearch is not just Lucene, and it is not just a full-text search engine. It can be accurately described as:

  • A distributed real-time document store where each field can be indexed and searched
  • A distributed real-time analytics search engine
  • Capable of scaling to hundreds of servers and petabytes of structured or unstructured data

Key Features and Use Cases of ElasticSearch #

In what scenarios can ES be used?

  • Key Features:
  1. Distributed storage and cluster management of massive data, achieving high availability and horizontal scalability of services and data;

  2. Exceptional near real-time search performance across structured, full-text, and geospatial data;

  3. Near real-time analysis of massive data (aggregation feature).

  • Use Cases:
  1. Website search, vertical search, code search;

  2. Log management and analysis, security and metric monitoring, application performance monitoring, web crawling and sentiment analysis.
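To make the aggregation feature concrete for the log-analysis use case, here is a minimal sketch of a search body that counts recent log entries per status code. The field names `@timestamp` and `status` and the helper `build_status_count_query` are assumptions for illustration; a real index would need matching fields, with `status` mapped as a keyword or numeric type.

```python
import json


def build_status_count_query(since: str) -> dict:
    """Build a search body that buckets log entries by status since a given time."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {
            # Restrict to recent entries; `since` may use date math such as "now-15m".
            "range": {"@timestamp": {"gte": since}}
        },
        "aggs": {
            # One bucket per distinct status value, each with a document count.
            "by_status": {"terms": {"field": "status"}}
        },
    }


query = build_status_count_query("now-15m")
print(json.dumps(query, indent=2))
```

This body would be posted to the `_search` endpoint of the log index. Setting `size` to 0 keeps the response small when only the counts matter.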

Basic Concepts of ElasticSearch #

Before we start learning, let's go through the core terms of ElasticSearch to lay the foundation for our study.

  • Near Realtime (NRT): Near real-time. After data is indexed, it becomes searchable after a short delay (about one second with the default refresh interval) rather than instantly.
  • Cluster: A cluster is identified by a unique name, “elasticsearch” by default. The cluster name is important. Nodes with the same cluster name will form a cluster. The cluster name can be specified in the configuration file.
  • Node: Stores the data of a cluster and participates in indexing and searching operations. Like a cluster, a node also has a name. By using the cluster name, nodes can discover peers to form a cluster. A single node can also form a cluster on its own.
  • Index: A collection of documents (similar to a collection in Solr). Each index has a unique name that is used to manipulate it. A cluster can have any number of indexes.
  • Type: Within an index, different types of documents could be indexed, such as user data or blog data. Types have been deprecated since version 6.0.0, from which an index can store only one type of data.
  • Document: A single piece of data that is indexed. It represents the basic information unit in an index and is represented in JSON format.
  • Shard: When creating an index, the number of shards can be specified. Each shard is a fully functional and independent “index” that can be placed on any node in the cluster.
  • Replica: A shard can have one or more copies (replicas), which provide redundancy and can also serve search requests.
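The terms above can be tied together with a small sketch: a document is a JSON object addressed by its index name and ID, and the number of shard copies a cluster must place on its nodes is the number of primaries times one plus the replicas per primary. The index name `blog`, the sample document, and both helper functions are hypothetical; the `/<index>/_doc/<id>` path is the typeless form used in recent versions.

```python
import json
from typing import Tuple


def index_document_request(index: str, doc_id: str, document: dict) -> Tuple[str, str, str]:
    """Return the HTTP method, path, and JSON payload for indexing one document."""
    return "PUT", f"/{index}/_doc/{doc_id}", json.dumps(document)


def total_shard_copies(primaries: int, replicas_per_primary: int) -> int:
    """Count every shard copy (primaries plus their replicas) of an index."""
    return primaries * (1 + replicas_per_primary)


method, path, payload = index_document_request("blog", "1", {"title": "Hello", "views": 10})
print(method, path, payload)

# 3 primary shards with 1 replica each means 6 shard copies spread over the nodes.
print(total_shard_copies(primaries=3, replicas_per_primary=1))
```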

To facilitate understanding, let’s make a comparison between ElasticSearch and a database:

img
