04 Basic Usage of Introductory Queries and Aggregations

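This chapter introduces the basics of queries and aggregations. A query retrieves specific data from the database. An aggregation computes statistics over the data, such as counts, sums, and averages.

In Elasticsearch, queries are written in Query DSL (a domain-specific query language). Query DSL is a rich language in which queries can be built from a wide range of conditions and parameters.

This chapter covers the following:

  • Basic queries: how to build basic queries, including match, range, and bool queries.
  • Aggregations: how to use aggregations to compute statistics over the data.
  • Query optimization: how to improve query performance, including index optimization and query caching.

After this chapter, you will know the basics of querying and aggregating in Elasticsearch, which lays the groundwork for deeper data analysis and search tasks. Let's get started!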

Getting Started: Starting with Indexing Documents #

  • Index a Document

    PUT /customer/_doc/1
    {
      "name": "John Doe"
    }

For convenience, we will use Kibana's Dev Tools to run the requests in this chapter.

To query the document that was just inserted, send a GET request to /customer/_doc/1.
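
Outside Kibana, each console line is an ordinary HTTP request. The following is a minimal Python sketch of that mapping, using only the standard library. It assumes an unsecured cluster at localhost:9200 (the address used by the curl command later in this chapter); the helper names console_request and send are made up for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without authentication


def console_request(method, path, body=None):
    """Build the HTTP request that a Kibana console line stands for.

    Nothing is sent here; the request object is only constructed.
    """
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(ES_URL + path, data=data, headers=headers, method=method)


def send(request):
    """Send the request and decode the JSON response (needs a running cluster)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# PUT /customer/_doc/1 { "name": "John Doe" }
index_req = console_request("PUT", "/customer/_doc/1", {"name": "John Doe"})
# GET /customer/_doc/1
get_req = console_request("GET", "/customer/_doc/1")

if __name__ == "__main__":
    print(index_req.get_method(), index_req.full_url)
    print(get_req.get_method(), get_req.full_url)
    # With a cluster running, call send(index_req) and then send(get_req).
```

Building and sending are kept separate so the request can be inspected without a running cluster.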

Study Preparation: Bulk Indexing Documents #

Elasticsearch also provides a bulk API. We will use it to load some sample data for the rest of this chapter.

Processing documents in bulk is much faster than submitting them one request at a time because it reduces network round trips.
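
The _bulk endpoint expects newline-delimited JSON: each document is preceded by an action line, and the body must end with a newline. The sketch below shows one way to build such a body in Python. The function name build_bulk_body and the second sample account are hypothetical, and using account_number as the document ID is an assumption made for illustration.

```python
import json


def build_bulk_body(docs, id_field=None):
    """Return an NDJSON body for the _bulk API: one action line per document,
    followed by the document itself, with a trailing newline at the end."""
    lines = []
    for doc in docs:
        action = {"index": {}}
        if id_field is not None:
            # Assumption: reuse a field of the document as its _id.
            action["index"]["_id"] = str(doc[id_field])
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Illustrative documents only; the second one is invented.
docs = [
    {"account_number": 0, "balance": 16623, "state": "CO"},
    {"account_number": 1, "balance": 20000, "state": "TX"},
]
bulk_body = build_bulk_body(docs, id_field="account_number")

if __name__ == "__main__":
    print(bulk_body, end="")
```

A body like this can be posted to /bank/_bulk with the Content-Type: application/json header, in the same way as the curl command shown later.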

  • Download test data

The test data is the file accounts.json, which we will load into an index named bank. If you cannot download it, you can clone the official Elasticsearch repository and take the file from /docs/src/test/resources/accounts.json.

The format of the data is as follows:

{
  "account_number": 0,
  "balance": 16623,
  "firstname": "Bradshaw",
  "lastname": "Mckenzie",
  "age": 29,
  "gender": "F",
  "address": "244 Columbus Place",
  "employer": "Euron",
  "email": "[email protected]",
  "city": "Hobucken",
  "state": "CO"
}

  • Bulk Insert Data

Copy accounts.json to a directory on the Elasticsearch host (I put it under /opt/ here), then run:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@/opt/accounts.json"

  • Check status

[elasticsearch@VM-0-14-centos root]$ curl "localhost:9200/_cat/indices?v=true" | grep bank
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1524  100  1524    0     0   119k      0 --:--:-- --:--:-- --:--:--  124k
yellow open   bank                            yq3eSlAWRMO2Td0Sl769rQ   1   1       1000            0    379.2kb        379.2kb
[elasticsearch@VM-0-14-centos root]$

Query Data #

We use Kibana for query testing.

Query All #

match_all matches every document in the index, and sort orders the results by the given field.

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}

The main fields in the response are:

  • took - The time it took Elasticsearch to execute the query (in milliseconds).
  • timed_out - Whether the search request timed out.
  • _shards - The breakdown of how many shards were searched and how many shards succeeded, failed, or were skipped.
  • max_score - The score of the most relevant document found.
  • hits.total.value - The number of matching documents found.
  • hits.sort - The position of the documents in the sort order (when not sorting by relevance score).
  • hits._score - The relevance score of the documents (not applicable when using match_all).
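
To make the field paths concrete, here is a small Python sketch that reads them from a response. The sample_response value is a hand-written, heavily trimmed illustration, not captured output: the took value and the _id are invented, and only the first account from the sample data is included.

```python
# Hand-written illustration of the response shape, not real output.
sample_response = {
    "took": 63,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 1000, "relation": "eq"},
        "max_score": None,  # null when sorting by a field instead of by score
        "hits": [
            {
                "_index": "bank",
                "_id": "0",
                "_score": None,
                "_source": {"account_number": 0, "balance": 16623},
                "sort": [0],
            }
        ],
    },
}


def summarize(response):
    """Pull out the fields described in the list above."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "failed_shards": response["_shards"]["failed"],
        "total": hits["total"]["value"],
        "max_score": hits["max_score"],
        # One row per hit: sort values, score, and the stored document.
        "rows": [(hit["sort"], hit["_score"], hit["_source"]) for hit in hits["hits"]],
    }


summary = summarize(sample_response)

if __name__ == "__main__":
    print(summary)
```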

Pagination (from+size) #

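Pagination is done with the from and size parameters: from is the zero-based offset of the first hit to return, and size is the number of hits to return. The following request skips the first 10 hits and returns the next 10.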

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "from": 10,
  "size": 10
}
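
Applications usually think in page numbers rather than offsets. The sketch below converts a 1-based page number into from and size. The function name page_request is hypothetical. The limit check reflects the default value of the index.max_result_window setting (10,000), which from + size may not exceed unless the setting is changed.

```python
MAX_RESULT_WINDOW = 10_000  # default of index.max_result_window


def page_request(page, page_size=10):
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    start = (page - 1) * page_size
    if start + page_size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds the default result window")
    return {
        "query": {"match_all": {}},
        "sort": [{"account_number": "asc"}],
        "from": start,
        "size": page_size,
    }


if __name__ == "__main__":
    # Page 2 with 10 hits per page yields the same from/size as the request above.
    print(page_request(2))
```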

Field-based Query: match #

To search for specific words in a field, use match. The following request returns documents whose address field contains mill or lane.

GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } }
}

(The match query analyzes the search text into the terms mill and lane and, by default, combines them with OR. That is why a document matches if its address field contains either term.)

Phrase-based Query: match_phrase #

To search for an exact phrase in a field, use match_phrase. The following request returns only documents whose address field contains the phrase mill lane, with both terms adjacent and in that order.

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}
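
The difference between match and match_phrase can be mimicked locally. The Python sketch below is a deliberately simplified model, not how Elasticsearch is implemented: the analyze function only lowercases and splits on non-alphanumeric characters as a rough stand-in for the standard analyzer, scoring is ignored, and the sample addresses are made up.

```python
import re


def analyze(text):
    """Rough stand-in for analysis: lowercase and split into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, field_value):
    """match with the default OR operator: any query term occurs in the field."""
    field_terms = set(analyze(field_value))
    return any(term in field_terms for term in analyze(query))


def matches_phrase(query, field_value):
    """match_phrase: all query terms occur adjacently and in the same order."""
    query_terms = analyze(query)
    field_terms = analyze(field_value)
    n = len(query_terms)
    return n > 0 and any(
        field_terms[i:i + n] == query_terms
        for i in range(len(field_terms) - n + 1)
    )


if __name__ == "__main__":
    # Illustrative addresses only.
    for address in ["198 Mill Lane", "990 Mill Road", "244 Columbus Place"]:
        print(address, matches("mill lane", address), matches_phrase("mill lane", address))
```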

Multi-condition Query: bool #

If you want to construct more complex queries, you can use the bool query to combine multiple query conditions.

For example, the following request searches for accounts of customers who are 40 years old in the bank index, excluding anyone who lives in Idaho (ID).

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
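
In plain terms, a document is returned only if every must condition holds and no must_not condition holds. The Python sketch below applies that logic to a few invented accounts. It uses exact equality as a simplification, whereas the real match query analyzes text, and the helper name bool_match is hypothetical.

```python
def bool_match(doc, must=(), must_not=()):
    """Return True if all must pairs hold and no must_not pair holds.

    Each condition is a (field, value) pair compared with exact equality.
    """
    all_required = all(doc.get(field) == value for field, value in must)
    any_excluded = any(doc.get(field) == value for field, value in must_not)
    return all_required and not any_excluded


# Invented accounts for illustration.
accounts = [
    {"account_number": 1, "age": 40, "state": "ID"},
    {"account_number": 2, "age": 40, "state": "TX"},
    {"account_number": 3, "age": 29, "state": "CO"},
]

selected = [
    account["account_number"]
    for account in accounts
    if bool_match(account, must=[("age", 40)], must_not=[("state", "ID")])
]

if __name__ == "__main__":
    print(selected)
```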

must, should, must_not, and filter are all clauses of the bool query. So how does filter differ from the other clauses?

Query Condition: query or filter #

Now let’s look at the following query, where the bool query has both query/must and filter clauses.

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "state": "ND"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "age": "40"
          }
        },
        {
          "range": {
            "balance": {
              "gte": 20000,
              "lte": 30000
            }
          }
        }
      ]
    }
  }
}

Both must and filter hold query conditions, and their syntax is similar. The difference is that conditions in must run in query context and contribute to each document's relevance score, where a higher score means a better match. Conditions in filter run in filter context and have only two outcomes, match or no match, and documents that do not match are filtered out.
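
A practical consequence is that full-text conditions, where ranking matters, belong in must, while exact-value and range conditions belong in filter. The Python sketch below builds the request body above from parameters along those lines. The function name account_search and its parameters are hypothetical.

```python
import json


def account_search(state, age, min_balance, max_balance):
    """Build a bool query: the text condition is scored, the rest only filter."""
    return {
        "query": {
            "bool": {
                # Query context: contributes to the relevance score.
                "must": [{"match": {"state": state}}],
                # Filter context: yes/no conditions, no effect on the score.
                "filter": [
                    {"term": {"age": age}},
                    {"range": {"balance": {"gte": min_balance, "lte": max_balance}}},
                ],
            }
        }
    }


body = account_search("ND", 40, 20000, 30000)

if __name__ == "__main__":
    print(json.dumps(body, indent=2))
```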

So, let’s further look at a query that only contains filters:

GET /bank/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "age": "40"
          }
        },
        {
          "range": {
            "balance": {
              "gte": 20000,
              "lte": 30000
            }
          }
        }
      ]
    }
  }
}

In the result, every hit has a _score of 0, because filter clauses do not contribute to scoring.

Aggregation #

SQL has GROUP BY. The Elasticsearch counterpart is the aggregation, which groups documents into buckets and computes statistics over them.

Simple Aggregation #

For example, to count the accounts in each state, we can use the aggs keyword with a terms aggregation on the state field. A terms aggregation needs the whole, untokenized field value, so we aggregate on state.keyword rather than on the analyzed text field state.

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}

Because we only want the aggregation results and not the matching documents, we set size to 0, so the hits array in the response is empty.

doc_count is the number of documents (accounts) in each state bucket.
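
What the terms aggregation computes can be reproduced locally with a counter. The Python sketch below groups a few invented accounts by state and returns buckets shaped like those in the response (key and doc_count). The function name terms_buckets is hypothetical; the size default of 10 mirrors the default number of buckets a terms aggregation returns, and breaking ties by key is a choice made here for deterministic output.

```python
from collections import Counter


def terms_buckets(docs, field, size=10):
    """Count documents per distinct field value, largest buckets first."""
    counts = Counter(doc[field] for doc in docs)
    # Sort by count descending, then by key for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


# Invented accounts for illustration.
accounts = [
    {"account_number": 0, "state": "CO"},
    {"account_number": 1, "state": "TX"},
    {"account_number": 2, "state": "TX"},
    {"account_number": 3, "state": "ID"},
]

buckets = terms_buckets(accounts, "state")

if __name__ == "__main__":
    print(buckets)
```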

Nested Aggregation #

Aggregations can also be nested inside one another.

For example, building on the previous example, let's calculate the average balance in each state by nesting an avg aggregation on balance inside the group_by_state terms aggregation:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
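
The nested aggregation corresponds to grouping first and then computing a metric inside each group. The Python sketch below does this for a few invented accounts and returns buckets shaped like the response, where the nested result appears under the name average_balance with a value field. The function name avg_balance_by_state is hypothetical.

```python
from collections import defaultdict


def avg_balance_by_state(docs):
    """Group documents by state, then average the balance within each group."""
    balances = defaultdict(list)
    for doc in docs:
        balances[doc["state"]].append(doc["balance"])
    buckets = [
        {
            "key": state,
            "doc_count": len(values),
            "average_balance": {"value": sum(values) / len(values)},
        }
        for state, values in balances.items()
    ]
    # Largest buckets first, ties broken by key for a stable order.
    return sorted(buckets, key=lambda bucket: (-bucket["doc_count"], bucket["key"]))


# Invented accounts for illustration.
accounts = [
    {"state": "CO", "balance": 16623},
    {"state": "TX", "balance": 10000},
    {"state": "TX", "balance": 30000},
]

state_buckets = avg_balance_by_state(accounts)

if __name__ == "__main__":
    print(state_buckets)
```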

Sorting Aggregation Results #

The buckets of a terms aggregation can be sorted by the result of a nested aggregation, using the order parameter.

For example, continuing from the previous example, we sort the state buckets by average_balance, the nested avg(balance) result, in descending order:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
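
The effect of the order parameter can be pictured as sorting the buckets by the nested metric instead of by doc_count. The Python sketch below does that on hand-written buckets with invented values. The function name order_buckets is hypothetical, and unlike this sketch, Elasticsearch applies the ordering while it selects the buckets to return.

```python
def order_buckets(buckets, metric, direction="desc"):
    """Sort buckets by the value of a nested single-value metric."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    return sorted(
        buckets,
        key=lambda bucket: bucket[metric]["value"],
        reverse=(direction == "desc"),
    )


# Hand-written buckets for illustration.
sample_buckets = [
    {"key": "CO", "doc_count": 1, "average_balance": {"value": 16623.0}},
    {"key": "TX", "doc_count": 2, "average_balance": {"value": 20000.0}},
    {"key": "ID", "doc_count": 1, "average_balance": {"value": 5000.0}},
]

ordered_keys = [b["key"] for b in order_buckets(sample_buckets, "average_balance")]

if __name__ == "__main__":
    print(ordered_keys)
```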
