08 Detailed Explanation of Query DSL - Full-text Search #

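Full-text search is a technique for finding matching words in text. In the Query DSL, full-text queries let us run advanced text searches. Typical use cases include:

Searching articles or blog posts for specific keywords or phrases.
Searching user comments or feedback for a specific issue that users mention.
Running full-text search over large files or databases.

Before running a full-text search, we need to create a full-text index. The Query DSL offers two main kinds of full-text search: term search and phrase search.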

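Term Search #

Term search is the simplest kind of full-text search. It finds documents whose text contains the given term, and can be written with the following query: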

{
  "query": {
    "match": {
      "text": "keyword"
    }
  }
}

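In the query above, text is the text field to search and keyword is the keyword to look for. The query returns all documents that contain the keyword.

Phrase Search #

Phrase search finds documents whose text contains a specific phrase. It can be written with the following query: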

{
  "query": {
    "match_phrase": {
      "text": "search phrase"
    }
  }
}

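In the query above, text is the text field to search and search phrase is the phrase to look for. The query returns all documents that contain an exact match of the phrase.

Full-text search is one of the most powerful and commonly used features of the Query DSL. With it, we can easily find and filter text data to meet a wide range of search needs. In practice, the relevance and ordering of the results can also be tuned further.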

Introduction: How to Learn from the Official Website #

Tips

Many readers fall into a trap when studying the official documentation. Take full-text queries in the DSL as an example: the material is extensive, and if you read it without selecting and prioritizing, you either spend a lot of time or end up confused. So here I will focus on sharing how I approach it. @pdai

Some thoughts:

First point: keep a global perspective, i.e., where does the content we are currently learning fit into the whole system?

The following diagram can help you build this kind of overview easily:

Second point: categorize, i.e., understand the material from a higher level rather than item by item.

For example, for full-text queries we only need to group all the query types into three major categories (match types, query string types, and the interval type), and your grasp of the whole system improves greatly:

Third point: is it a concept or an API? API details can be looked up when needed, so you only need to know roughly what each one does.

Match Type #

Type 1: Match Type

Steps of Match Query #

We have already introduced the match query in the Specifying Fields in a Query section.

Preparing the Data

Here, we prepare some data to demonstrate the steps of the match query.

PUT /test-dsl-match
{ "settings": { "number_of_shards": 1 }} 

POST /test-dsl-match/_bulk
{ "index": { "_id": 1 }}
{ "title": "The quick brown fox" }
{ "index": { "_id": 2 }}
{ "title": "The quick brown fox jumps over the lazy dog" }
{ "index": { "_id": 3 }}
{ "title": "The quick brown fox jumps over the quick dog" }
{ "index": { "_id": 4 }}
{ "title": "Brown fox brown dog" }

Querying the Data

GET /test-dsl-match/_search
{
    "query": {
        "match": {
            "title": "QUICK!"
        }
    }
}

The steps Elasticsearch takes to execute the above match query are as follows:

Checking the Field Type.

The title field is a full-text field (type text; called an analyzed string in older Elasticsearch versions), which means the query string itself should also be analyzed.

Analyzing the Query String.

The query string QUICK! is passed to the standard analyzer, which outputs a single term quick. Since there is only one term, the match query performs a single underlying term query.

Finding Matching Documents.

The term query searches for quick in the inverted index and retrieves a set of documents that include this term. In this case, the result is documents 1, 2, and 3.

Scoring Each Document.

The term query calculates a relevance score (_score) for each matching document. The calculation combines term frequency (how often the term quick appears in the title field of that document: the more often, the more relevant), inverse document frequency (how often quick appears in the title field across all documents in the index: the more often, the less relevant), and the length of the field (the shorter the field, the more relevant).
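The steps above can be mimicked with a toy in-memory model. The sketch below is only an illustration under stated assumptions: `analyze` is a hypothetical stand-in for the standard analyzer (lowercase, split on anything that is not a letter or digit), the inverted index is a plain dict, and scoring is left out entirely. It is not how Elasticsearch is implemented.

```python
import re
from collections import defaultdict

# Same sample documents as in the _bulk request above.
DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}


def analyze(text):
    """Hypothetical stand-in for the standard analyzer: lowercase the
    text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids whose title contains it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in analyze(title):
            index[term].add(doc_id)
    return index


INDEX = build_inverted_index(DOCS)


def match_single_term(query):
    """Analyze the query string, then look the single term up in the index."""
    terms = analyze(query)
    if len(terms) != 1:
        raise ValueError("this sketch only handles one-term queries")
    return sorted(INDEX.get(terms[0], set()))


print(analyze("QUICK!"))            # ['quick']
print(match_single_term("QUICK!"))  # [1, 2, 3]
```

The query string and the indexed titles go through the same `analyze` function, which is why the upper-case QUICK! with punctuation still finds the lower-case term quick.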

Validating the Results

Exploring Match with Multiple Terms #

In the previous section on compound queries, we have already used match with multiple terms, such as “Quick pets”. Here, we provide an example to help you understand match with multiple terms in more depth.

The Essence of Match with Multiple Terms

Querying with multiple terms “BROWN DOG!”

GET /test-dsl-match/_search
{
    "query": {
        "match": {
            "title": "BROWN DOG"
        }
    }
}

Since the match query needs to look up two terms (["brown", "dog"]), it internally runs two term queries and merges their results into the final result. To do this, it wraps the two term queries in a bool query, so the query above is equivalent to the following:

GET /test-dsl-match/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "title": "brown"
          }
        },
        {
          "term": {
            "title": "dog"
          }
        }
      ]
    }
  }
}

The Logic of Match with Multiple Terms

The two term queries are combined with should (a document matches if any one clause is satisfied) because match has an operator parameter that defaults to “or”, and “or” corresponds to should.

The query above is therefore also equivalent to:

GET /test-dsl-match/_search
{
  "query": {
    "match": {
      "title": {
        "query": "BROWN DOG",
        "operator": "or"
      }
    }
  }
}

What if we need an “and” operator, which means both terms need to be satisfied simultaneously?

GET /test-dsl-match/_search
{
  "query": {
    "match": {
      "title": {
        "query": "BROWN DOG",
        "operator": "and"
      }
    }
  }
}

This is equivalent to:

GET /test-dsl-match/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "title": "brown"
          }
        },
        {
          "term": {
            "title": "dog"
          }
        }
      ]
    }
  }
}


![img](../images/es-dsl-full-text-7.png)
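The or/and behavior described above can be checked against the sample titles with a small simulation. This is a minimal sketch, not Elasticsearch code: `analyze` and `match` are hypothetical helpers, the analyzer is approximated by lowercasing and splitting on non-alphanumeric characters, and relevance scoring is ignored.

```python
import re

# Same sample documents as in the _bulk request above.
DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}


def analyze(text):
    """Hypothetical stand-in for the standard analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match(query, operator="or"):
    """Toy model of a multi-term match: one term lookup per query term,
    combined like bool/should ("or") or bool/must ("and")."""
    terms = analyze(query)
    if not terms:
        return []
    hits = []
    for doc_id, title in DOCS.items():
        doc_terms = set(analyze(title))
        found = [term in doc_terms for term in terms]
        if (all(found) if operator == "and" else any(found)):
            hits.append(doc_id)
    return hits


print(match("BROWN DOG"))                  # [1, 2, 3, 4]
print(match("BROWN DOG", operator="and"))  # [2, 3, 4]
```

Document 1 contains brown but not dog, so it satisfies the "or" form and drops out of the "and" form.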

### Controlling the Precision of the match Query

If a user provides 3 query terms and wants documents that contain at least 2 of them, what do we do? Setting the operator parameter to either and or or does not fit: and is too strict, or is too loose.

The match query supports the minimum_should_match parameter, which specifies how many terms must match for a document to be considered relevant. It can be set to an absolute number or, more commonly, to a percentage, because we cannot control how many words the user will type:

```plaintext
GET /test-dsl-match/_search
{
  "query": {
    "match": {
      "title": {
        "query": "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}
```

When given a percentage, minimum_should_match does the right thing: in the three-term example above, 75% is automatically rounded down to 66.6%, that is, two out of three terms. Whatever value you set, at least one term must match for a document to be considered a match.

The query above is also equivalent to the following:

GET /test-dsl-match/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "quick" }},
        { "match": { "title": "brown" }},
        { "match": { "title": "dog" }}
      ],
      "minimum_should_match": 2
    }
  }
}
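The rounding described above (75% of three terms becomes two terms) can be worked through in a few lines. The sketch below assumes only the simple positive-percentage form of minimum_should_match; `required_matches` and `match_min` are hypothetical helper names, and the analyzer is again approximated by lowercasing and splitting on non-alphanumeric characters.

```python
import re

# Same sample documents as in the _bulk request above.
DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}


def analyze(text):
    """Hypothetical stand-in for the standard analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def required_matches(num_terms, percent):
    """Turn a positive percentage into a term count, rounding down.
    At least one term must always match, as stated in the text."""
    return max(1, (num_terms * percent) // 100)


def match_min(query, percent):
    """Return ids of titles that contain enough of the query terms."""
    terms = analyze(query)
    needed = required_matches(len(terms), percent)
    hits = []
    for doc_id, title in DOCS.items():
        doc_terms = set(analyze(title))
        matched = sum(1 for term in terms if term in doc_terms)
        if matched >= needed:
            hits.append(doc_id)
    return hits


print(required_matches(3, 75))            # 2
print(match_min("quick brown dog", 75))   # [1, 2, 3, 4]
print(match_min("quick brown dog", 100))  # [2, 3]
```

With 75%, every sample title contains at least two of the three terms, so all four documents qualify; raising the value to 100% keeps only the titles that contain all three.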

Other match Types #

match_phrase

We have already learned about match_phrase in the previous section. Let’s look at another example.

GET /test-dsl-match/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick brown"
      }
    }
  }
}

Many people still misunderstand it, as in the following example:

GET /test-dsl-match/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick brown f"
      }
    }
  }
}

This query returns no data. As we saw earlier, match is essentially a combination of term queries, and match_phrase additionally requires the terms to appear as one continuous sequence. Here “f” is treated as a complete term, and no title contains the token f, so the term lookup fails and nothing is returned.

match_phrase_prefix

Is there a way to make “quick brown f” match? On top of match_phrase, Elasticsearch provides match_phrase_prefix, which treats the last term as a prefix, so “quick brown f” can be found.

GET /test-dsl-match/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "quick brown f"
      }
    }
  }
}

(Note: The prefix does not mean matching the beginning of the entire text, but only the prefix of the last term.)
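The difference between an exact phrase and a phrase whose last term is a prefix can be illustrated with a positional check over the sample titles. This is a rough sketch under simplifying assumptions: `analyze` and `phrase_hits` are hypothetical helpers, the analyzer is approximated by lowercasing and splitting on non-alphanumeric characters, and the real query expands the prefix against indexed terms rather than scanning every title.

```python
import re

# Same sample documents as in the _bulk request above.
DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}


def analyze(text):
    """Hypothetical stand-in for the standard analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_hits(query, prefix_last=False):
    """Return ids of titles that contain the query terms at consecutive
    positions. With prefix_last=True the final query term only has to be
    a prefix of the token found at that position."""
    terms = analyze(query)
    if not terms:
        return []
    hits = []
    for doc_id, title in DOCS.items():
        tokens = analyze(title)
        for start in range(len(tokens) - len(terms) + 1):
            window = tokens[start:start + len(terms)]
            head_ok = window[:-1] == terms[:-1]
            if prefix_last:
                last_ok = window[-1].startswith(terms[-1])
            else:
                last_ok = window[-1] == terms[-1]
            if head_ok and last_ok:
                hits.append(doc_id)
                break
    return hits


print(phrase_hits("quick brown"))                      # [1, 2, 3]
print(phrase_hits("quick brown f"))                    # []
print(phrase_hits("quick brown f", prefix_last=True))  # [1, 2, 3]
```

Only the last term is relaxed: quick and brown still have to appear as whole tokens, directly before a token that starts with f.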

match_bool_prefix

In addition to match_phrase_prefix, Elasticsearch also provides the match_bool_prefix query.

GET /test-dsl-match/_search
{
  "query": {
    "match_bool_prefix": {
      "title": {
        "query": "quick brown f"
      }
    }
  }
}

The two queries behave differently. The match_bool_prefix query above is equivalent to:

GET /test-dsl-match/_search
{
  "query": {
    "bool" : {
      "should": [
        { "term": { "title": "quick" }},
        { "term": { "title": "brown" }},
        { "prefix": { "title": "f"}}
      ]
    }
  }
}

As the bool query shows, “quick”, “brown”, and “f” in a match_bool_prefix query do not have to appear in any particular order; only the last term is treated as a prefix.
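The order-independent behavior of the bool form can be checked against the sample titles. The sketch below is a simplified model, not Elasticsearch code: `analyze` and `bool_prefix_hits` are hypothetical helpers, every clause is treated as an optional should clause, and scoring is ignored.

```python
import re

# Same sample documents as in the _bulk request above.
DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}


def analyze(text):
    """Hypothetical stand-in for the standard analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bool_prefix_hits(query):
    """Toy model of the bool form: every term except the last is a term
    lookup, the last one is a prefix lookup, and any hit is enough."""
    terms = analyze(query)
    if not terms:
        return []
    *full_terms, last = terms
    hits = []
    for doc_id, title in DOCS.items():
        tokens = set(analyze(title))
        term_hit = any(term in tokens for term in full_terms)
        prefix_hit = any(token.startswith(last) for token in tokens)
        if term_hit or prefix_hit:
            hits.append(doc_id)
    return hits


print(bool_prefix_hits("quick brown f"))  # [1, 2, 3, 4]
print(bool_prefix_hits("brown quick f"))  # [1, 2, 3, 4]
```

Document 4 never contains the sequence quick brown, yet it is returned here because brown and a token starting with f (fox) each satisfy a should clause on their own.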

multi_match

What if we want to search multiple fields at once? Elasticsearch provides the multi_match query for this purpose.

{
  "query": {
    "multi_match" : {
      "query":    "Will Smith",
      "fields": [ "title", "*_name" ] 
    }
  }
}

The “*” is a wildcard in the field name: “*_name” matches every field whose name ends in _name.
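How a field pattern such as “*_name” selects fields can be sketched with Python's shell-style matching. This is an approximation for illustration: the field list is hypothetical (only title and the “*_name” pattern come from the example above), `expand_fields` is an assumed helper name, and `fnmatchcase` also gives special meaning to ? and [ ], which this sketch does not claim Elasticsearch shares.

```python
from fnmatch import fnmatchcase

# Hypothetical field names of an index, used only for illustration.
MAPPED_FIELDS = ["title", "first_name", "last_name", "name_suffix", "body"]


def expand_fields(patterns, mapped_fields):
    """Return the mapped fields selected by exact names or * wildcards."""
    return [
        field
        for field in mapped_fields
        if any(fnmatchcase(field, pattern) for pattern in patterns)
    ]


print(expand_fields(["title", "*_name"], MAPPED_FIELDS))
# ['title', 'first_name', 'last_name']
```

The field name_suffix is not selected, because the wildcard stands for whatever comes before _name, not after it.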

## query string types

> Type 2: query string types

### query_string

This query uses a syntax based on operators like AND or NOT to parse and split the provided query string. The query then independently analyzes each split text before returning the matching documents.

The query_string can be used to create complex searches that include wildcards, searches across multiple fields, and more. Although versatile, the query is strict and will return an error if the query string contains any invalid syntax.

For example:
    

    GET /test-dsl-match/_search
    {
      "query": {
        "query_string": {
          "query": "(lazy dog) OR (brown dog)",
          "default_field": "title"
        }
      }
    }
    

As the results show, the query string is effectively broken into four terms combined with OR (lazy, dog, brown, dog), so docs 3 and 4 are returned even though neither contains lazy.

![img](../images/es-dsl-full-text-15.png)

This is enough for building up your overall picture, but query_string has many more parameters and usages. For more information, please refer to the [official documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html).

### simple_query_string

This query uses a simpler syntax to parse the provided query string and split it into terms based on special operators. The query then independently analyzes each term before returning the matching documents.

Although its syntax is more restricted compared to the query_string query, the **simple_query_string query does not return an error for invalid syntax. Instead, it will ignore any invalid parts of the query string**.

For example:
    

    GET /test-dsl-match/_search
    {
      "query": {
        "simple_query_string" : {
            "query": "\"over the\" + (lazy | quick) + dog",
            "fields": ["title"],
            "default_operator": "and"
        }
      }
    }
    

![img](../images/es-dsl-full-text-16.png)

For more information, please refer to the [official documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html).

## Interval Type

> Type 3: Interval Type

Here, intervals are spans of token positions in the text (not time intervals); in essence, the query matches multiple rules in sequence.

For example:

```json
GET /test-dsl-match/_search
{
  "query": {
    "intervals" : {
      "title" : {
        "all_of" : {
          "ordered" : true,
          "intervals" : [
            {
              "match" : {
                "query" : "quick",
                "max_gaps" : 0,
                "ordered" : true
              }
            },
            {
              "any_of" : {
                "intervals" : [
                  { "match" : { "query" : "jump over" } },
                  { "match" : { "query" : "quick dog" } }
                ]
              }
            }
          ]
        }
      }
    }
  }
}
```

Because intervals can be combined, they can be quite complex. For more information, please refer to the official website.

Reference Articles #

https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html#full-text-queries

https://www.elastic.co/guide/cn/elasticsearch/guide/current/match-multi-word.html