10 Detailed Explanation of Aggregation - Bucket Aggregations #

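In aggregation queries, the bucket aggregation is a powerful aggregation type: it divides documents into buckets according to specified criteria and then runs aggregate computations over the documents in each bucket. It can be used to group, classify, and statistically analyze data.

The general syntax of a bucket aggregation is as follows: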

{
  "aggs": {
    "bucket_name": {
      "bucket_type": {
        "field": "field_name"
      }
    }
  }
}

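Here, aggs introduces the aggregation request, bucket_name is the name you give the bucket aggregation, bucket_type is the type of bucket aggregation, and field_name is the field used for bucketing.

Commonly used bucket aggregation types include:

  • terms: buckets documents by the distinct values of the specified field.
  • date_histogram: buckets documents into time intervals based on the specified date field.
  • range: buckets documents according to the specified range conditions.
  • nested: buckets documents based on a nested field.

Bucket aggregations can be chosen and combined flexibly according to actual needs to produce the required aggregation results.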

Introduction to Aggregation #

In SQL, we often write queries like the following:

SELECT COUNT(color) 
FROM table
GROUP BY color

In Elasticsearch, buckets are conceptually similar to the GROUP BY clause in SQL, while metrics are similar to statistical methods such as COUNT(), SUM(), MAX(), etc.

This introduces two concepts:

  • Buckets: A collection of documents that meet specific conditions
  • Metrics: Statistical calculations performed on the documents within a bucket

Accordingly, Elasticsearch provides three types of aggregations:

  • Bucket Aggregation: explained in detail in this article
  • Metric Aggregation: explained in the following article
  • Pipeline Aggregation: covered in the article after that; it uses the result of one aggregation as the input of the next aggregation

(PS: In many cases, metric aggregations and bucket aggregations are used together. A bucket aggregation can be viewed as a special kind of metric aggregation whose metric is the document count.)

Understanding Bucket Aggregations #

If you go straight to the documentation, you will find dozens of bucket aggregation types:

img

Either you spend a lot of time learning all of them, or you get lost among the individual knowledge points.

Instead, take the designer’s perspective for a moment. It is then not hard to see that the bucket aggregations fall roughly into three categories (some are a combination of the second and third categories).

img

(The figure does not list every aggregation, because its purpose is only to convey the classification. With this way of thinking, you will absorb the material much more efficiently.)

Learning Aggregations by Knowledge Point #

Let’s start by working through the key points of aggregations using an example from the official Definitive Guide.

Preparing the Data #

Let’s start with an example. We will create some useful aggregations for car dealerships, with data about car transactions: model, manufacturer, price, and when it was sold.

First, let’s bulk index some data:

POST /test-agg-cars/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

Simple Aggregations #

With the data in place, let’s start building our first aggregation. Car dealers might want to know which color of cars sells the best, and we can easily get the result with the terms bucket operation:

GET /test-agg-cars/_search
{
    "size" : 0,
    "aggs" : { 
        "popular_colors" : { 
            "terms" : { 
              "field" : "color.keyword"
            }
        }
    }
}
  1. The aggregation operation is placed under the top-level parameter aggs (or aggregations in its full form).
  2. Then, we can specify a name for the aggregation we want, in this case, popular_colors.
  3. Finally, the type of bucket is defined as terms.

The result is as follows:

img

  1. Since we set size to 0, no search hits are returned.
  2. The popular_colors aggregation is returned as part of the aggregations field.
  3. Each bucket’s key corresponds to a unique term found in the color field. Each bucket always includes a doc_count field, which tells us the number of documents containing that term.
  4. The doc_count of each bucket is the number of documents with that color.
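
To make the bucket structure concrete, here is a minimal Python sketch (not Elasticsearch code) that mimics what a terms bucket computes over the colors of the sample data. The helper name terms_buckets and the tie-breaking rule are assumptions of this sketch.

```python
from collections import Counter

# Color of each car in the sample data, in indexing order.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]


def terms_buckets(values, size=10):
    """Group values into buckets and return the top `size` buckets by doc_count."""
    counts = Counter(values)
    # Order by doc_count descending; ties are broken by key (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]


popular_colors = terms_buckets(colors)
for bucket in popular_colors:
    print(bucket["key"], bucket["doc_count"])
```

Each unique color becomes one bucket, and doc_count is simply the number of documents that fell into it.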

Multiple Aggregations #

Several aggregations can be computed in a single request. Here we bucket the same documents in two different ways: by color and by make.

GET /test-agg-cars/_search
{
    "size" : 0,
    "aggs" : { 
        "popular_colors" : { 
            "terms" : { 
              "field" : "color.keyword"
            }
        },
        "make_by" : { 
            "terms" : { 
              "field" : "make.keyword"
            }
        }
    }
}

The result is as follows:

img

Nested Aggregation #

This new aggregation layer allows us to nest an avg metric inside a terms bucket. In fact, it generates an average price for each color.

GET /test-agg-cars/_search
{
   "size" : 0,
   "aggs": {
      "colors": {
         "terms": {
            "field": "color.keyword"
         },
         "aggs": { 
            "avg_price": { 
               "avg": {
                  "field": "price" 
               }
            }
         }
      }
   }
}

The result is as follows:

img

Just like the example with color, we need to give a name to the metric (avg_price) so that we can retrieve its value later. Finally, we specify the metric itself (avg) and the field we want to calculate the average value for (price).
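
The nesting can be pictured as "group first, then compute a metric per group". The following Python sketch (an illustration only, not how Elasticsearch executes the request) reproduces that logic on the sample data; the variable names are chosen for this example.

```python
from collections import defaultdict

# (color, price) pairs taken from the sample data.
cars = [
    ("red", 10000), ("red", 20000), ("green", 30000), ("blue", 15000),
    ("green", 12000), ("red", 20000), ("red", 80000), ("blue", 25000),
]

# Step 1: the terms bucket groups documents by color.
prices_by_color = defaultdict(list)
for color, price in cars:
    prices_by_color[color].append(price)

# Step 2: the nested avg metric runs once inside every bucket.
colors = {
    color: {"doc_count": len(prices), "avg_price": sum(prices) / len(prices)}
    for color, prices in prices_by_color.items()
}

for color, bucket in colors.items():
    print(color, bucket)
```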

Aggregations with Dynamic Scripts #

This example shows that Elasticsearch also supports complex dynamic aggregations based on scripts (which generate runtime fields).

GET /test-agg-cars/_search
{
  "runtime_mappings": {
    "make.length": {
      "type": "long",
      "script": "emit(doc['make.keyword'].value.length())"
    }
  },
  "size" : 0,
  "aggs": {
    "make_length": {
      "histogram": {
        "interval": 1,
        "field": "make.length"
      }
    }
  }
}

The result is as follows:

img

The histogram aggregation itself is covered later in this article.
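
As a rough illustration of what the request above computes, the following Python sketch derives the length of each make (the role played by the runtime field) and then buckets those lengths with an interval of 1. It is a simplified model under these assumptions, not Elasticsearch code.

```python
from collections import Counter

# Make of each car in the sample data, in indexing order.
makes = ["honda", "honda", "ford", "toyota", "toyota", "honda", "bmw", "ford"]

interval = 1

# The "runtime field": a value computed per document at query time.
lengths = [len(make) for make in makes]

# The histogram: each value is rounded down to the nearest multiple of the interval.
counts = Counter((length // interval) * interval for length in lengths)
make_length = [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]

for bucket in make_length:
    print(bucket)
```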

Learning Bucket Aggregations by Category #

There is no need to study every single aggregation. Based on the classification in the diagram above, we only need to spend 20% of the time learning the most commonly used 80% of the functionality, and can refer to the documentation for the rest.

Filter by Precondition: filter #

Defines a single bucket containing all documents in the current document set context that match the specified filter. Typically, this is used to narrow down the current aggregation context to a specific set of documents.

GET /test-agg-cars/_search
{
  "size": 0,
  "aggs": {
    "make_by": {
      "filter": { "term": { "make": "honda" } },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}

The result is as follows:

img

Aggregating the Filters: filters #

Let’s set up a new example. In a logging system, each log entry is a piece of text that contains a level marker such as warning or info.

PUT /test-agg-logs/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }
{ "index" : { "_id" : 4 } }
{ "body" : "info: hello pdai" }

To group the log entries by the kind of message they contain, we can use the filters aggregation:

GET /test-agg-logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "other_bucket_key": "other_messages",
        "filters" : {
          "infos" :   { "match" : { "body" : "info"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }
    }
  }
}

The result is as follows:

img
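
The bucketing logic of the filters request can be sketched in Python as follows. This is a simplified model: the tokens helper is a hypothetical, rough stand-in for text analysis (lowercasing and splitting on non-alphanumeric characters), and the match query is reduced to a single-term lookup.

```python
import re

# Bodies of the four sample log documents.
logs = [
    "warning: page could not be rendered",
    "authentication error",
    "warning: connection timed out",
    "info: hello pdai",
]

# Named filters, reduced here to the single term each one searches for.
filters = {"infos": "info", "warnings": "warning"}


def tokens(text):
    """Hypothetical, rough approximation of analyzing a text field."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


buckets = {name: 0 for name in filters}
buckets["other_messages"] = 0  # plays the role of other_bucket_key

for body in logs:
    matched = [name for name, term in filters.items() if term in tokens(body)]
    for name in matched:  # a document may fall into more than one bucket
        buckets[name] += 1
    if not matched:  # documents matching no filter go to the "other" bucket
        buckets["other_messages"] += 1

print(buckets)
```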

Aggregating Number Types: Range #

A multi-bucket, value-source-based aggregation that lets the user define a set of ranges, each of which represents a bucket. During aggregation, the value extracted from each document is checked against every range, and the document is placed into each bucket whose range it matches. Note that each range includes its from value but excludes its to value.

GET /test-agg-cars/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 20000 },
          { "from": 20000, "to": 40000 },
          { "from": 40000 }
        ]
      }
    }
  }
}

The result is as follows:

img
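
The "from inclusive, to exclusive" rule is easy to get wrong at the boundaries, so here is a small Python sketch (an illustration, not Elasticsearch code) that applies the same three ranges to the sample prices. The in_range helper and the tuple representation of ranges are assumptions of this example.

```python
# Prices of the sample cars, in indexing order.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]

# (from, to) pairs; None means the range is open on that side.
ranges = [(None, 20000), (20000, 40000), (40000, None)]


def in_range(value, lower, upper):
    """Half-open interval check: lower <= value < upper."""
    return (lower is None or value >= lower) and (upper is None or value < upper)


price_ranges = [
    {"from": lower, "to": upper,
     "doc_count": sum(1 for p in prices if in_range(p, lower, upper))}
    for lower, upper in ranges
]

for bucket in price_ranges:
    print(bucket)
```

A car priced at exactly 20000 lands in the second bucket, not the first, because to is exclusive.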

Aggregating IP Types: IP Range #

A range aggregation dedicated to IP-typed values.

GET /ip_addresses/_search
{
  "size": 10,
  "aggs": {
    "ip_ranges": {
      "ip_range": {
        "field": "ip",
        "ranges": [
          { "to": "10.0.0.5" },
          { "from": "10.0.0.5" }
        ]
      }
    }
  }
}

Returns

{
  ...

  "aggregations": {
    "ip_ranges": {
      "buckets": [
        {
          "key": "*-10.0.0.5",
          "to": "10.0.0.5",
          "doc_count": 10
        },
        {
          "key": "10.0.0.5-*",
          "from": "10.0.0.5",
          "doc_count": 260
        }
      ]
    }
  }
}
  • Grouping by CIDR Mask

You can also group by CIDR mask.

GET /ip_addresses/_search
{
  "size": 0,
  "aggs": {
    "ip_ranges": {
      "ip_range": {
        "field": "ip",
        "ranges": [
          { "mask": "10.0.0.0/25" },
          { "mask": "10.0.0.127/25" }
        ]
      }
    }
  }
}

Returns

{
  ...

  "aggregations": {
    "ip_ranges": {
      "buckets": [
        {
          "key": "10.0.0.0/25",
          "from": "10.0.0.0",
          "to": "10.0.0.128",
          "doc_count": 128
        },
        {
          "key": "10.0.0.127/25",
          "from": "10.0.0.0",
          "to": "10.0.0.128",
          "doc_count": 128
        }
      ]
    }
  }
}
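
Both masks above produce the same from/to pair because a CIDR mask is normalized to its network address before the range is derived. The following Python sketch uses the standard ipaddress module to show this; the mask_to_range helper is hypothetical and only illustrates the arithmetic.

```python
import ipaddress


def mask_to_range(mask):
    """Return (from, to) for a CIDR mask, with `to` exclusive."""
    # strict=False accepts host bits in the address, e.g. 10.0.0.127/25.
    network = ipaddress.ip_network(mask, strict=False)
    start = network.network_address
    end_exclusive = network.broadcast_address + 1
    return str(start), str(end_exclusive)


first = mask_to_range("10.0.0.0/25")
second = mask_to_range("10.0.0.127/25")
print(first, second)
```

Since 10.0.0.127 lies inside 10.0.0.0/25, the two masks describe the same 128 addresses, which is why both buckets report the same doc_count.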
  • Adding Key Display

Setting keyed to true returns the buckets as an object keyed by range rather than as an array.

GET /ip_addresses/_search
{
  "size": 0,
  "aggs": {
    "ip_ranges": {
      "ip_range": {
        "field": "ip",
        "ranges": [
          { "to": "10.0.0.5" },
          { "from": "10.0.0.5" }
        ],
        "keyed": true
      }
    }
  }
}

Returns

{
  ...

  "aggregations": {
    "ip_ranges": {
      "buckets": {
        "*-10.0.0.5": {
          "to": "10.0.0.5",
          "doc_count": 10
        },
        "10.0.0.5-*": {
          "from": "10.0.0.5",
          "doc_count": 260
        }
      }
    }
  }
}
  • Customizing Key Display

GET /ip_addresses/_search
{
  "size": 0,
  "aggs": {
    "ip_ranges": {
      "ip_range": {
        "field": "ip",
        "ranges": [
          { "key": "infinity", "to": "10.0.0.5" },
          { "key": "and-beyond", "from": "10.0.0.5" }
        ],
        "keyed": true
      }
    }
  }
}

Returns

{
  ...

  "aggregations": {
    "ip_ranges": {
      "buckets": {
        "infinity": {
          "to": "10.0.0.5",
          "doc_count": 10
        },
        "and-beyond": {
          "from": "10.0.0.5",
          "doc_count": 260
        }
      }
    }
  }
}

Aggregating Date Types: Date Range #

A range aggregation dedicated to date values.

GET /test-agg-cars/_search
{
  "size": 0,
  "aggs": {
    "range": {
      "date_range": {
        "field": "sold",
        "format": "yyyy-MM",
        "ranges": [
          { "from": "2014-01-01" },
          { "to": "2014-12-31" }
        ]
      }
    }
  }
}

The result is as follows:

img

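The main difference between this aggregation and the range aggregation is that the from and to values can be expressed as Date Math expressions, and that a date format can be specified, which determines how the from and to fields are returned in the response. Note that each range includes its from value but excludes its to value.

Aggregating for Bar Charts: Histogram #

The histogram aggregation is essentially designed for bar charts.

Creating a histogram requires an interval. To build a histogram of sale prices, we can set the interval to 20,000. This creates a new bucket for every $20,000 band, and each document is placed into the corresponding bucket.

For a dashboard, we want to know how many cars were sold in each price band. We also want to know the revenue generated in each price band, which is obtained by summing the sale prices of the cars sold in that band.

A histogram with a nested sum metric gives us the answer: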

GET /test-agg-cars/_search
{
   "size" : 0,
   "aggs":{
      "price":{
         "histogram":{ 
            "field": "price",
            "interval": 20000
         },
         "aggs":{
            "revenue": {
               "sum": { 
                 "field" : "price"
               }
             }
         }
      }
   }
}
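  1. The histogram bucket requires two parameters: a numeric field and an interval that defines the bucket size.
  2. The sum metric is nested inside each price band to show the total revenue of that band.

As we can see, the query is built around the price aggregation, which contains a histogram bucket. It requires the field to be numeric and an interval to be set. An interval of 20,000 means we get ranges such as [0-19999, 20000-39999, …].

Next, we define a nested metric inside the histogram: a sum metric that adds up the price field of every document falling into a given price band. This gives us the revenue per price band, so we can see whether ordinary family cars or luxury cars bring in more money.

The response is as follows: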

img

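The result is easy to understand, but note that the histogram keys are the lower bounds of the ranges. The key 0 represents the range 0-19,999, the key 20000 represents the range 20,000-39,999, and so on.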
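
The way a price is mapped to its bucket key can be sketched in Python as follows. This is a simplified model for non-negative integer prices, not Elasticsearch code; filling the empty buckets between the smallest and largest key is an assumption that mirrors a histogram returning zero-count buckets.

```python
# Prices of the sample cars, in indexing order.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]
interval = 20000

buckets = {}
for price in prices:
    # The key is the lower bound of the band the price falls into.
    key = (price // interval) * interval
    bucket = buckets.setdefault(key, {"doc_count": 0, "revenue": 0})
    bucket["doc_count"] += 1
    bucket["revenue"] += price  # the nested sum metric

# Fill the gaps so that empty bands between the extremes are also present.
for key in range(min(buckets), max(buckets) + interval, interval):
    buckets.setdefault(key, {"doc_count": 0, "revenue": 0})

for key in sorted(buckets):
    print(key, buckets[key])
```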

img

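Of course, we can build bar charts from the categories and statistics produced by any aggregation, not just the histogram bucket. Let’s build a bar chart of the 10 most popular car makes together with their average price and standard deviation. We will use a terms bucket and an extended_stats metric: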

GET /test-agg-cars/_search
{
  "size" : 0,
  "aggs": {
    "makes": {
      "terms": {
        "field": "make.keyword",
        "size": 10
      },
      "aggs": {
        "stats": {
          "extended_stats": {
            "field": "price"
          }
        }
      }
    }
  }
}

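The request above returns the list of makes ordered by popularity, along with statistics for each. We are particularly interested in stats.avg, stats.count, and stats.std_deviation, and use them to compute the standard error:

std_err = std_deviation / sqrt(count)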
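
As a worked example, the following Python sketch computes these statistics per make from the sample data. It assumes the usual definition of the standard error of the mean (standard deviation divided by the square root of the count) and uses the population standard deviation, on the assumption that this matches what extended_stats reports as std_deviation.

```python
import math
import statistics

# Sale prices of the sample cars, grouped by make.
prices_by_make = {
    "honda": [10000, 20000, 20000],
    "ford": [30000, 25000],
    "toyota": [15000, 12000],
    "bmw": [80000],
}

std_errors = {}
for make, prices in prices_by_make.items():
    count = len(prices)
    avg = statistics.fmean(prices)
    # Population standard deviation (assumed to match stats.std_deviation).
    std_deviation = statistics.pstdev(prices)
    # Standard error of the mean, under the assumption stated above.
    std_errors[make] = std_deviation / math.sqrt(count)
    print(make, count, round(avg, 2), round(std_errors[make], 2))
```

A make with a single sale, such as bmw here, has a standard deviation of 0 and therefore a standard error of 0, which says nothing about how reliable its average is.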

img

The corresponding chart:

img

References #

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html

https://www.elastic.co/guide/cn/elasticsearch/guide/current/_aggregation_test_drive.html