07 Detailed Explanation of Query DSL Compound Queries

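In Elasticsearch, a compound query combines multiple queries into one. The sub-queries can be combined with Boolean clauses (such as must, must_not, and should) in a bool query, or with other wrapper queries. This structure is flexible enough to express complex query requirements.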

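1. Bool Query #

The bool query is the most basic and most commonly used compound query in Elasticsearch. It combines multiple queries through Boolean clauses. Its main clauses are:

  • must: conditions that must match, roughly a logical “AND”.
  • must_not: conditions that must not match, roughly a logical “NOT”.
  • should: conditions that should match but are not mandatory, roughly a logical “OR”.

Example bool query: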

{
  "query": {
    "bool": {
      "must": [
        { "term": { "field1": "value1" } },
        { "term": { "field2": "value2" } }
      ],
      "must_not": [
        { "term": { "field3": "value3" } }
      ],
      "should": [
        { "term": { "field4": "value4" } },
        { "term": { "field5": "value5" } }
      ]
    }
  }
}

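2. Constant Score Query #

The constant_score query wraps a filter query and gives every matching document the same fixed score, equal to the boost value. It suits cases where relevance does not matter, such as pure filtering or exact matches on a specific field.

Example constant_score query: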

{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "field": "value" }
      },
      "boost": 1.2
    }
  }
}

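3. Dis Max Query #

The dis_max query returns documents that match any of its sub-queries (a logical “OR”). A document's score is the score of its best-matching sub-query, plus tie_breaker times the scores of the other matching sub-queries. It is useful when the same terms are searched across several fields and the best single-field match should drive relevance.

Example dis_max query: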

{
  "query": {
    "dis_max": {
      "queries": [
        { "term": { "field1": "value1" } },
        { "term": { "field2": "value2" } }
      ],
      "tie_breaker": 0.2
    }
  }
}

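The flexibility of compound queries makes finer and more complex searches possible in Elasticsearch. By combining different queries and clauses sensibly, you can cover the query requirements of most practical scenarios.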

Introduction to Compound Queries #

In the previous section on Multi-condition Queries - bool, we used the bool query to combine multiple query conditions.

For example, consider the statement introduced earlier:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}

This type of query is called a compound query, and the bool query is just one type of compound query.

bool query (Boolean Query) #

Combine smaller queries into larger queries using Boolean logic.

Concept #

Boolean query syntax has the following characteristics:

  • Sub-queries can appear in any order.
  • Multiple queries, including bool queries, can be nested.
  • If a bool query has should clauses but no must or filter clause, at least one should clause must match for a document to be returned.

The bool query has four clause types: must, should, must_not, and filter. Each accepts a single query or an array of queries.

  • must: The clause must match. Contributes to the score.
  • must_not: The clause must not match. It runs in filter context, so it does not contribute to the score.
  • should: The clause is optional. It becomes required only when there is no must or filter clause, or when minimum_should_match demands it. Contributes to the score.
  • filter: The clause must match. It runs in filter context, so it does not contribute to the score.
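
The matching rules of the four clause types can be sketched in plain Python. This is only a rough model, not how Elasticsearch implements them: the bool_matches helper, the predicate lambdas, and the sample documents are hypothetical, scoring is ignored, and the default for minimum_should_match is assumed to be 1 when there is no must or filter clause and 0 otherwise.

```python
from typing import Callable, Dict, List, Optional

Doc = Dict[str, object]
Clause = Callable[[Doc], bool]  # hypothetical stand-in for a sub-query


def bool_matches(
    doc: Doc,
    must: Optional[List[Clause]] = None,
    filter_: Optional[List[Clause]] = None,
    should: Optional[List[Clause]] = None,
    must_not: Optional[List[Clause]] = None,
    minimum_should_match: Optional[int] = None,
) -> bool:
    """Return True if doc would be selected (matching only, no scoring)."""
    must, filter_ = must or [], filter_ or []
    should, must_not = should or [], must_not or []

    # Every must and filter clause is required.
    if not all(clause(doc) for clause in must + filter_):
        return False
    # Any matching must_not clause excludes the document.
    if any(clause(doc) for clause in must_not):
        return False
    # Assumed default: 1 if there are should clauses but no must/filter, else 0.
    if minimum_should_match is None:
        minimum_should_match = 1 if should and not (must or filter_) else 0
    matched_should = sum(1 for clause in should if clause(doc))
    return matched_should >= minimum_should_match


# Hypothetical sample documents and clauses.
docs = [
    {"id": 1, "age": 40, "state": "ID"},
    {"id": 2, "age": 40, "state": "CA"},
    {"id": 3, "age": 25, "state": "CA"},
]
age_is_40 = lambda d: d["age"] == 40
state_is_id = lambda d: d["state"] == "ID"

selected = [
    d["id"] for d in docs
    if bool_matches(d, must=[age_is_40], must_not=[state_is_id])
]
print(selected)
```

Only document 2 is selected here: document 1 is excluded by must_not and document 3 fails the must clause.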

Examples #

Let’s take a look at some official examples.

  • Example 1

    POST _search
    {
      "query": {
        "bool": {
          "must": {
            "term": { "user.id": "kimchy" }
          },
          "filter": {
            "term": { "tags": "production" }
          },
          "must_not": {
            "range": {
              "age": { "gte": 10, "lte": 20 }
            }
          },
          "should": [
            { "term": { "tags": "env1" } },
            { "term": { "tags": "deployed" } }
          ],
          "minimum_should_match": 1,
          "boost": 1.0
        }
      }
    }
    

    Clauses under the filter element do not affect the score. In this example the score comes only from the must and should clauses.

  • Example 2

    GET _search
    {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "status": "active"
            }
          }
        }
      }
    }
    

    This example assigns a score of 0 to all documents because no scoring queries are specified.

  • Example 3

    GET _search
    {
      "query": {
        "bool": {
          "must": {
            "match_all": {}
          },
          "filter": {
            "term": {
              "status": "active"
            }
          }
        }
      }
    }
    

    This bool query has a match_all query, which assigns a score of 1.0 to all documents.

  • Example 4

    GET /_search
    {
      "query": {
        "bool": {
          "should": [
            { "match": { "name.first": { "query": "shay", "_name": "first" } } },
            { "match": { "name.last": { "query": "banon", "_name": "last" } } }
          ],
          "filter": {
            "terms": {
              "name.last": [ "banon", "kimchy" ],
              "_name": "test"
            }
          }
        }
      }
    }
    

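    Each query clause can carry a _name parameter. Each hit in the response then includes a matched_queries array listing the names of the clauses it matched, so you can track which condition a document satisfied.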

Boosting Query #

In a bool query, a document that fails a required condition does not appear in the search results at all. The boosting query instead keeps such documents in the results but lowers their score, and therefore their ranking.

Concept #

For example, suppose the search logic is name = 'apple' and type = 'fruit'. Documents that satisfy only part of the conditions are not hidden. Their score is reduced instead, so they rank lower.

Example #

First, create the data:

POST /test-dsl-boosting/_bulk
{ "index": { "_id": 1 }}
{ "content":"Apple Mac" }
{ "index": { "_id": 2 }}
{ "content":"Apple Fruit" }
{ "index": { "_id": 3 }}
{ "content":"Apple employee like Apple Pie and Apple Juice" }

Demote the documents whose content contains pie:

GET /test-dsl-boosting/_search
{
  "query": {
    "boosting": {
      "positive": {
        "term": {
          "content": "apple"
        }
      },
      "negative": {
        "term": {
          "content": "pie"
        }
      },
      "negative_boost": 0.5
    }
  }
}

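In the result, all three documents are returned because they all match the positive query. Document 3 also matches the negative query, so its score is multiplied by negative_boost (0.5).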
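
The scoring rule of the boosting query can be sketched as follows. This is a minimal illustration under stated assumptions: boosting_score is a hypothetical helper, and the positive scores used below are made-up numbers, not values returned by Elasticsearch.

```python
def boosting_score(positive_score: float, matches_negative: bool,
                   negative_boost: float) -> float:
    """Score of a document that already matched the positive query."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost is expected to be between 0 and 1")
    # A document matching the negative query is kept, but its score is scaled down.
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


# Made-up positive scores for two documents; only the second contains "pie".
plain = boosting_score(0.8, matches_negative=False, negative_boost=0.5)
demoted = boosting_score(0.8, matches_negative=True, negative_boost=0.5)
print(plain, demoted)
```

With equal positive scores, the document that matches the negative query ends up with half the score of the other one.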

constant_score Query #

The constant_score query wraps a filter and gives every matching document the same fixed score, equal to the boost parameter (1.0 if omitted). It is the natural choice when you do not need a relevance score, because the wrapped condition runs in filter context, where scores are not calculated.

Example #

First, let’s create some data:

POST /test-dsl-constant/_bulk
{ "index": { "_id": 1 }}
{ "content":"Apple Mac" }
{ "index": { "_id": 2 }}
{ "content":"Apple Fruit" }

Now, let’s search for “apple”:

GET /test-dsl-constant/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "content": "apple" }
      },
      "boost": 1.2
    }
  }
}

Both documents match, and each one gets a _score of 1.2, the boost value, regardless of how well it matches.

dis_max (Best Matching Query) #

The dis_max (Disjunction Max) query returns any document that matches at least one of its sub-queries, and uses the score of the best-matching sub-query as the score of the document.

Example #

Let’s say we have a website that lets users search blog content, and it holds the following two blog documents:

POST /test-dsl-dis-max/_bulk
{ "index": { "_id": 1 }}
{"title": "Quick brown rabbits","body":  "Brown rabbits are commonly seen."}
{ "index": { "_id": 2 }}
{"title": "Keeping pets healthy","body":  "My quick brown fox eats rabbits on a regular basis."}

Suppose the user enters the phrase “Brown fox” and clicks the search button. We do not know in advance whether the search terms will be found in the title or the body field, but the user most likely wants documents that contain the whole phrase. To the human eye, document 2 is the better match because its body contains both search terms.

Now, run the following bool query:

GET /test-dsl-dis-max/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

This query ranks document 1 above document 2, which is not what we expected. To understand why, let’s look at how the score is calculated.

  • Scoring of should condition

To calculate the score of the bool query above, we first need the score of each match clause.

  1. Score of brown in the first match (title)

    Score of doc 1 = 0.6931471

    Score of doc 2 = 0 (its title does not contain brown)

  2. Neither title contains fox, so the score of brown fox in the first match is the score of brown + 0

    Score of doc 1 = 0.6931471 + 0 = 0.6931471

    Score of doc 2 = 0

  3. Score of brown in the second match (body)

    Score of doc 1 = 0.21110919

    Score of doc 2 = 0.160443

  4. Score of fox in the second match

    Score of doc 1 = 0

    Score of doc 2 = 0.60996956

  5. So the score of brown fox in the second match is the score of brown + the score of fox

    Score of doc 1 = 0.21110919 + 0 = 0.21110919

    Score of doc 2 = 0.160443 + 0.60996956 = 0.77041256

  6. Therefore, the score of the whole should clause is the score of the first match + the score of the second match

    Score of doc 1 = 0.6931471 + 0.21110919 ≈ 0.90425634

    Score of doc 2 = 0 + 0.77041256 = 0.77041256

Doc 1 ends up with the higher total even though doc 2 contains both terms in one field.
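
The arithmetic of the steps above can be reproduced with a few lines of Python. This is only a sketch of the summation, not of how Elasticsearch computes the per-clause scores: the should_score helper is hypothetical, and the per-clause numbers are copied from the walkthrough.

```python
from typing import Dict

# Per-clause scores taken from the walkthrough (0.0 means the clause did not match).
clause_scores: Dict[str, Dict[str, float]] = {
    "doc1": {"title": 0.6931471, "body": 0.21110919},
    "doc2": {"title": 0.0, "body": 0.77041256},
}


def should_score(scores: Dict[str, float]) -> float:
    """bool/should: add up the scores of all matching clauses."""
    return sum(score for score in scores.values() if score > 0)


totals = {doc_id: should_score(scores) for doc_id, scores in clause_scores.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)
print(ranking)
```

Because the scores are summed, doc 1 wins by matching a single term weakly in two fields.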

  • Introduction of dis_max

Instead of using a bool query, you can use dis_max, which stands for Disjunction Max Query. Disjunction means “or”, which is the opposite of conjunction, which can be understood as “and”. The Disjunction Max Query refers to returning any document that matches any of the queries, but only the score of the best match is returned as the score of the query:

GET /test-dsl-dis-max/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ],
            "tie_breaker": 0
        }
    }
}

With this query, doc 2 ranks first with a score of 0.77041256. The following explanation shows how that score is calculated.

  • Scoring of dis_max condition

Score = score of the best-matching clause + tie_breaker * (sum of the scores of the other matching clauses)

Score of doc 1 = 0.6931471 + 0 * 0.21110919 = 0.6931471

Score of doc 2 = 0.77041256 + 0 * 0 = 0.77041256

This explains why dis_max ranks doc 2 higher. If tie_breaker is omitted, it defaults to 0. You can set it to a value between 0 and 1 to control how much the other matching clauses contribute. When the value is 1, the score is the sum of all matching clauses, which is the same result as the should query.
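
The effect of tie_breaker can be checked with a small Python sketch. It models only the combination step described above: dis_max_score is a hypothetical helper, and the per-clause scores are the ones from the walkthrough.

```python
from typing import Sequence


def dis_max_score(clause_scores: Sequence[float], tie_breaker: float = 0.0) -> float:
    """Best clause score plus tie_breaker times the other matching clause scores."""
    matching = [score for score in clause_scores if score > 0]
    if not matching:
        return 0.0  # no clause matched, so the document would not be returned
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# [title score, body score] for each document, from the walkthrough.
doc1 = [0.6931471, 0.21110919]
doc2 = [0.0, 0.77041256]

for tie_breaker in (0.0, 0.3, 1.0):
    print(tie_breaker, dis_max_score(doc1, tie_breaker), dis_max_score(doc2, tie_breaker))
```

With tie_breaker at 0, doc 2 ranks first. With tie_breaker at 1, the scores equal the should totals and doc 1 ranks first again.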

function_score (Function Query) #

In short, function_score lets you modify the _score of the documents returned by a query using one or more functions.

Elasticsearch provides the following functions:

  • script_score: Calculates the score with a custom script, giving complete control over the scoring logic. Use it when the predefined functions below are not enough.
  • weight: Multiplies the score by a fixed number without normalization. For example, when the weight is 2, the result is 2 * _score.
  • random_score: Generates a random score for each document. With a fixed seed the ordering is reproducible, so results can be ordered differently for each user while staying the same for a given user.
  • field_value_factor: Modifies the _score based on the value of a field in the document, such as a popularity or vote count.
  • Decay functions (linear, exp, gauss): Reduce the score as the value of a numeric, date, or geo field moves further away from a given origin.
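
To give a feel for the three decay functions, here is a Python sketch for a numeric field. The formulas follow the shapes described in the official documentation as best recalled here, so treat them as an approximation to verify. The function names are hypothetical, and the parameter names origin, scale, offset, and decay mirror the query parameters. In all three, a value at distance scale (beyond offset) from origin scores decay.

```python
import math


def _distance(value: float, origin: float, offset: float) -> float:
    # Values within `offset` of the origin are not penalized at all.
    return max(0.0, abs(value - origin) - offset)


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(_distance(value, origin, offset) ** 2) / (2.0 * sigma_sq))


def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * _distance(value, origin, offset))


def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1.0 - decay)
    return max(0.0, (s - _distance(value, origin, offset)) / s)


# Hypothetical use: prefer prices near 100, with scores halved 20 units away.
for price in (100, 110, 120, 160):
    print(
        price,
        round(gauss_decay(price, origin=100, scale=20), 4),
        round(exp_decay(price, origin=100, scale=20), 4),
        round(linear_decay(price, origin=100, scale=20), 4),
    )
```

The three curves differ mainly in how fast they fall off: linear reaches 0 at a finite distance, while exp and gauss only approach 0.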

Example #

Let’s take the simplest example of random_score:

GET /_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost": "5",
      "random_score": {},
      "boost_mode": "multiply"
    }
  }
}

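Furthermore, you can combine several functions in the functions array. Each function can have its own filter, so that it applies only to matching documents. score_mode decides how the function results are combined with each other, boost_mode decides how that combined result is merged with the query score, max_boost caps the combined function result, and min_score excludes documents whose final score is lower: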

GET /_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost": "5",
      "functions": [
        {
          "filter": { "match": { "test": "bar" } },
          "random_score": {},
          "weight": 23
        },
        {
          "filter": { "match": { "test": "cat" } },
          "weight": 42
        }
      ],
      "max_boost": 42,
      "score_mode": "max",
      "boost_mode": "multiply",
      "min_score": 42
    }
  }
}
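
How score_mode, max_boost, boost_mode, and min_score interact can be sketched in Python. This is a simplified model, not the Elasticsearch implementation: the function_score helper and the input numbers are hypothetical, the avg score mode is left out, and the model assumes a neutral factor of 1.0 when no function applies to a document.

```python
from functools import reduce
from typing import List

SCORE_MODES = {
    "multiply": lambda xs: reduce(lambda a, b: a * b, xs, 1.0),
    "sum": sum,
    "max": max,
    "min": min,
    "first": lambda xs: xs[0],
}
BOOST_MODES = {
    "multiply": lambda q, f: q * f,
    "replace": lambda q, f: f,
    "sum": lambda q, f: q + f,
    "avg": lambda q, f: (q + f) / 2.0,
    "max": max,
    "min": min,
}


def function_score(query_score: float, function_scores: List[float],
                   score_mode: str = "multiply", boost_mode: str = "multiply",
                   max_boost: float = float("inf")) -> float:
    # Assumption: with no applicable function the combined factor is neutral.
    combined = SCORE_MODES[score_mode](function_scores) if function_scores else 1.0
    combined = min(combined, max_boost)  # max_boost caps the function result
    return BOOST_MODES[boost_mode](query_score, combined)


# Hypothetical document matching both filters: a random value of 0.5 times
# weight 23 gives 11.5 for the first function, weight 42 for the second.
score = function_score(2.0, [0.5 * 23, 42.0], score_mode="max", max_boost=42)
kept = score >= 42  # min_score drops documents below the threshold
print(score, kept)
```

With score_mode set to max, the larger function result (42) is used and then multiplied by the query score.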

You can use script_score as follows:

GET /_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "message": "elasticsearch" }
      },
      "script_score": {
        "script": {
          "source": "Math.log(2 + doc['my-int'].value)"
        }
      }
    }
  }
}

For more information, refer to the official documentation. Once you understand the concepts, you can look up the details there when you need a specific function.
