11 Detailed Explanation of Aggregation - Metric Aggregations #

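In Elasticsearch, an aggregation is a computation performed over a set of documents to extract summary information from the data. Aggregations are defined in the search request, and the request body is written in JSON.

Metric aggregations are one kind of aggregation. They compute statistics, mostly over numeric fields, such as the maximum, minimum, average, or sum of a field.

Some of the available metric aggregations are:

  • avg: computes the average of a field.
  • max: computes the maximum of a field.
  • min: computes the minimum of a field.
  • sum: computes the sum of a field.
  • stats: computes several statistics of a field at once: count, minimum, maximum, average, and sum.
  • extended_stats: like stats, but with more detail, such as variance and standard deviation.
  • value_count: counts the number of values a field has across the matched documents (documents without the field contribute nothing).

To use a metric aggregation, specify the aggregation type and the field to compute on. For example: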

{
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    },
    "max_price": {
      "max": {
        "field": "price"
      }
    },
    "value_count_price": {
      "value_count": {
        "field": "price"
      }
    }
  }
}

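The example above computes the average, the maximum, and the number of values of the price field.

Metric aggregations return statistics about (mostly numeric) fields as part of a search request, which helps you understand the distribution and characteristics of your data. Several metric aggregations can be combined in one request, and they can be nested inside bucket aggregations to compute per-bucket statistics for more complex needs.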
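Since every metric aggregation in the request above follows the same shape (a name, a type, and a field), the body can also be generated programmatically. The following is a minimal sketch, not an official client API: the helper name `metric_aggs` and the `<kind>_<field>` naming scheme are assumptions made for illustration.

```python
import json


def metric_aggs(field, kinds):
    """Build an "aggs" request body with one metric aggregation per kind.

    Each aggregation is named "<kind>_<field>"; this naming scheme is an
    arbitrary choice for the sketch, not an Elasticsearch requirement.
    """
    return {"aggs": {f"{kind}_{field}": {kind: {"field": field}} for kind in kinds}}


# Rebuild the request body shown above for the price field.
body = metric_aggs("price", ["avg", "max", "value_count"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request, for example with `size=0` when only the aggregation results are needed.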

Understanding Metric Aggregations #

In the previous article on [bucket aggregations], I drew a diagram to help you build a mental framework. Now let’s turn to metric aggregations.

If you go straight to the official documentation, you will find more than a dozen metric aggregation types listed.

So, how do we understand metric aggregations? I think it can be understood from two perspectives:

  • From a classification perspective: Metric aggregations can be categorized into single-value analysis and multi-value analysis.
  • From a functional perspective: Specific analysis APIs are designed based on different application scenarios, such as geographic location, percentages, etc.

By combining the above two aspects, we can outline a rough mind map:

  • Single-value analysis: outputs only one analysis result.

* Standard statistical type:
  * `avg` - Average
  * `max` - Maximum
  * `min` - Minimum
  * `sum` - Sum
  * `value_count` - Count of values
* Other types:
  * `cardinality` - Cardinality (approximate count of distinct values)
  * `weighted_avg` - Weighted average
  * `median_absolute_deviation` - Median absolute deviation

  • Multi-value analysis: outputs more than one analysis result.

* Stats type:
  * `stats` - Includes avg, max, min, sum, and count
  * `matrix_stats` - Statistics over a set of fields (for example covariance and correlation)
  * `extended_stats` - stats plus variance, standard deviation, and related values
  * `string_stats` - Designed for strings
* Percentile type:
  * `percentiles` - Percentiles
  * `percentile_ranks` - Percentile ranks
* Geographic location type:
  * `geo_bounds` - Geo bounds
  * `geo_centroid` - Geo centroid
  * `geo_line` - Geo line
* Top type:
  * `top_hits` - Top hits after bucketing
  * `top_metrics` - Metrics from the document with the largest or smallest sort value

With the list above (I won’t draw a diagram this time), the framework we build is organized by classification and functionality rather than by individual items (such as avg, percentiles…). This is a different level of understanding: individual items are fragments, while classification and functionality form the framework you need to build. @pdai

Single-value Analysis: Standard Statistical Type #

avg Average #

Calculate the average grade of the class

POST /exams/_search?size=0
{
  "aggs": {
    "avg_grade": { "avg": { "field": "grade" } }
  }
}

Response

{
  ...
  "aggregations": {
    "avg_grade": {
      "value": 75.0
    }
  }
}

max Maximum #

Calculate the maximum sale price

POST /sales/_search?size=0
{
  "aggs": {
    "max_price": { "max": { "field": "price" } }
  }
}

Response

{
  ...
  "aggregations": {
    "max_price": {
      "value": 200.0
    }
  }
}

min Minimum #

Calculate the minimum sale price

POST /sales/_search?size=0
{
  "aggs": {
    "min_price": { "min": { "field": "price" } }
  }
}

Response

{
  ...
  "aggregations": {
    "min_price": {
      "value": 10.0
    }
  }
}

sum Sum #

Calculate the total price of all hats sold (the query restricts the documents to type hat before summing)

POST /sales/_search?size=0
{
  "query": {
    "constant_score": {
      "filter": {
        "match": { "type": "hat" }
      }
    }
  },
  "aggs": {
    "hat_prices": { "sum": { "field": "price" } }
  }
}

Response

{
  ...
  "aggregations": {
    "hat_prices": {
      "value": 450.0
    }
  }
}

value_count Count #

Count the number of values in the type field across all sales

POST /sales/_search?size=0
{
  "aggs" : {
    "types_count" : { "value_count" : { "field" : "type" } }
  }
}

Response

{
  ...
  "aggregations": {
    "types_count": {
      "value": 7
    }
  }
}
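To make the five standard metrics concrete, here is a plain-Python sketch of what they compute. The documents are hypothetical, not the data behind the responses above. It assumes the default behavior in which documents without the field are skipped.

```python
# Hypothetical documents; the second one has no "price" value.
docs = [{"price": 10.0}, {}, {"price": 200.0}, {"price": 90.0}]

# Only documents that actually contain the field contribute values.
prices = [doc["price"] for doc in docs if "price" in doc]

metrics = {
    "avg": sum(prices) / len(prices),
    "max": max(prices),
    "min": min(prices),
    "sum": sum(prices),
    "value_count": len(prices),  # number of values, not number of documents
}
print(metrics)
```

Note that `value_count` is 3 here even though there are 4 documents, because it counts values rather than documents.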

Single-value Analysis: Other Types #

weighted_avg Weighted Average #

POST /exams/_search
{
  "size": 0,
  "aggs": {
    "weighted_grade": {
      "weighted_avg": {
        "value": {
          "field": "grade"
        },
        "weight": {
          "field": "weight"
        }
      }
    }
  }
}

Response

{
  ...
  "aggregations": {
    "weighted_grade": {
      "value": 70.0
    }
  }
}
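A weighted average is the sum of each value multiplied by its weight, divided by the sum of the weights. The sketch below shows that arithmetic in plain Python. The grades and weights are hypothetical, chosen so that the result matches the 70.0 in the response above.

```python
def weighted_avg(values, weights):
    """Return sum(value * weight) / sum(weight)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical exam data: (50 * 3 + 100 * 2) / (3 + 2)
grades = [50, 100]
weights = [3, 2]
result = weighted_avg(grades, weights)
print(result)
```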

cardinality Cardinality (Distinct Count) #

POST /sales/_search?size=0
{
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "type"
      }
    }
  }
}

Response

{
  ...
  "aggregations": {
    "type_count": {
      "value": 3
    }
  }
}
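The difference between `value_count` (7 in the earlier example) and `cardinality` (3 here) is the difference between counting all values and counting distinct values. The sketch below uses type values assumed from the bucket counts in the `top_hits` response later in this article (3 hats, 3 t-shirts, 1 bag). Elasticsearch computes cardinality approximately with HyperLogLog++, whereas this sketch counts exactly.

```python
# Assumed type values for the sales index: 3 hats, 3 t-shirts, 1 bag.
types = ["hat"] * 3 + ["t-shirt"] * 3 + ["bag"]

value_count = len(types)       # every value counts
cardinality = len(set(types))  # only distinct values count (exact here)

print(value_count, cardinality)
```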

median_absolute_deviation Median Absolute Deviation #

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_average": {
      "avg": {
        "field": "rating"
      }
    },
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating" 
      }
    }
  }
}

Response

{
  ...
  "aggregations": {
    "review_average": {
      "value": 3.0
    },
    "review_variability": {
      "value": 2.0
    }
  }
}
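The median absolute deviation is the median of the absolute distances between each value and the median of all values. It is less sensitive to outliers than the standard deviation. Below is a minimal exact sketch. The ratings are hypothetical, chosen to reproduce the 3.0 and 2.0 above. Elasticsearch approximates this metric, so results on real data can differ slightly.

```python
from statistics import mean, median


def median_absolute_deviation(values):
    """Return median(|x - median(values)|) computed exactly."""
    center = median(values)
    return median(abs(v - center) for v in values)


# Hypothetical ratings: median is 3, distances are [2, 2, 0, 2, 2].
ratings = [1, 1, 3, 5, 5]
review_average = mean(ratings)
review_variability = median_absolute_deviation(ratings)
print(review_average, review_variability)
```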

Multi-value Analysis: Stats Type #

stats Aggregation includes avg, max, min, sum, and count #

POST /exams/_search?size=0
{
  "aggs": {
    "grades_stats": { "stats": { "field": "grade" } }
  }
}

Response:

{
  ...
  "aggregations": {
    "grades_stats": {
      "count": 2,
      "min": 50.0,
      "max": 100.0,
      "avg": 75.0,
      "sum": 150.0
    }
  }
}

matrix_stats Aggregation for Matrix Models #

The following example demonstrates the use of matrix statistics to describe the relationship between income and poverty.

GET /_search
{
  "aggs": {
    "statistics": {
      "matrix_stats": {
        "fields": [ "poverty", "income" ]
      }
    }
  }
}

Response:

{
  ...
  "aggregations": {
    "statistics": {
      "doc_count": 50,
      "fields": [
        {
          "name": "income",
          "count": 50,
          "mean": 51985.1,
          "variance": 7.383377037755103E7,
          "skewness": 0.5595114003506483,
          "kurtosis": 2.5692365287787124,
          "covariance": {
            "income": 7.383377037755103E7,
            "poverty": -21093.65836734694
          },
          "correlation": {
            "income": 1.0,
            "poverty": -0.8352655256272504
          }
        },
        {
          "name": "poverty",
          "count": 50,
          "mean": 12.732000000000001,
          "variance": 8.637730612244896,
          "skewness": 0.4516049811903419,
          "kurtosis": 2.8615929677997767,
          "covariance": {
            "income": -21093.65836734694,
            "poverty": 8.637730612244896
          },
          "correlation": {
            "income": -0.8352655256272504,
            "poverty": 1.0
          }
        }
      ]
    }
  }
}
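The fields in this response are related to each other. Assuming the usual Pearson definition, the correlation between two fields is their covariance divided by the square root of the product of their variances. The sketch below checks this against the numbers in the response above.

```python
import math

# Values copied from the matrix_stats response above.
covariance = -21093.65836734694
variance_income = 7.383377037755103e7
variance_poverty = 8.637730612244896

# Pearson correlation: cov(x, y) / sqrt(var(x) * var(y))
correlation = covariance / math.sqrt(variance_income * variance_poverty)
print(correlation)
```

The result agrees with the reported correlation of about -0.835, a strong negative relationship between income and poverty.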

extended_stats Aggregation #

Computes extended statistics over numeric values extracted from the aggregated documents.

GET /exams/_search
{
  "size": 0,
  "aggs": {
    "grades_stats": { "extended_stats": { "field": "grade" } }
  }
}

The above aggregation calculates statistics on the grade field for all documents. The aggregation type is extended_stats and the field setting defines the numeric field of the documents on which the statistics are computed.

{
  ...
  "aggregations": {
    "grades_stats": {
      "count": 2,
      "min": 50.0,
      "max": 100.0,
      "avg": 75.0,
      "sum": 150.0,
      "sum_of_squares": 12500.0,
      "variance": 625.0,
      "variance_population": 625.0,
      "variance_sampling": 1250.0,
      "std_deviation": 25.0,
      "std_deviation_population": 25.0,
      "std_deviation_sampling": 35.35533905932738,
      "std_deviation_bounds": {
        "upper": 125.0,
        "lower": 25.0,
        "upper_population": 125.0,
        "lower_population": 25.0,
        "upper_sampling": 145.71067811865476,
        "lower_sampling": 4.289321881345245
      }
    }
  }
}
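The values in this response can be reproduced by hand. The sketch below assumes the two grades are 50 and 100, which follows from count, min, and max above. It also assumes the default of two standard deviations for `std_deviation_bounds`.

```python
import math

grades = [50.0, 100.0]
n = len(grades)
avg = sum(grades) / n
sum_of_squares = sum(g * g for g in grades)

# Population variance divides by n; sampling variance divides by n - 1.
variance_population = sum_of_squares / n - avg ** 2
variance_sampling = sum((g - avg) ** 2 for g in grades) / (n - 1)

std_population = math.sqrt(variance_population)
std_sampling = math.sqrt(variance_sampling)

sigma = 2  # assumed default width of std_deviation_bounds
bounds = {
    "upper_population": avg + sigma * std_population,
    "lower_population": avg - sigma * std_population,
    "upper_sampling": avg + sigma * std_sampling,
    "lower_sampling": avg - sigma * std_sampling,
}
print(variance_population, variance_sampling, bounds)
```

In the response, the unqualified `variance` and `std_deviation` fields carry the same values as their `_population` counterparts.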

string_stats Aggregation for Strings #

Computes statistics over string values extracted from the aggregated documents. The values are typically read from a keyword field.

POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": { "string_stats": { "field": "message.keyword" } }
  }
}

Response:

{
  ...
  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791
    }
  }
}

Multi-value Analysis: Percentile Type #

percentiles Percentile Ranges #

Calculate one or multiple percentiles for values extracted from aggregated documents.

GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time" 
      }
    }
  }
}

By default, percentile metrics generate a range of percentiles: [1, 5, 25, 50, 75, 95, 99].

{
  ...
  "aggregations": {
    "load_time_outlier": {
      "values": {
        "1.0": 5.0,
        "5.0": 25.0,
        "25.0": 165.0,
        "50.0": 445.0,
        "75.0": 725.0,
        "95.0": 945.0,
        "99.0": 985.0
      }
    }
  }
}
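A percentile answers the question "below which value do p percent of the observations fall?". The sketch below uses the simple nearest-rank definition on hypothetical load times. Elasticsearch approximates percentiles (with t-digest by default) and interpolates, so its numbers will generally differ from this exact calculation.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the smallest value with at least p percent of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical load times in milliseconds: 100, 200, ..., 1000.
load_times = list(range(100, 1001, 100))
default_percents = [1, 5, 25, 50, 75, 95, 99]
percentiles = {p: percentile_nearest_rank(load_times, p) for p in default_percents}
print(percentiles)
```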

percentile_ranks Percentile Ranks #

Calculate one or multiple percentile ranks for values extracted from aggregated documents.

GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",   
        "values": [ 500, 600 ]
      }
    }
  }
}

Returns:

{
  ...
  "aggregations": {
    "load_time_ranks": {
      "values": {
        "500.0": 90.01,
        "600.0": 100.0
      }
    }
  }
}

The above results indicate that 90.01% of page loads are completed within 500ms, while 100% of page loads are completed within 600ms.
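A percentile rank is the inverse question: what percentage of observations are at or below a given value? The following is a minimal exact sketch on hypothetical load times. Elasticsearch computes an approximation, so real results may differ.

```python
def percentile_rank(values, threshold):
    """Return the percentage of values that are less than or equal to threshold."""
    return 100.0 * sum(1 for v in values if v <= threshold) / len(values)


# Hypothetical load times in milliseconds: 100, 200, ..., 1000.
load_times = list(range(100, 1001, 100))
ranks = {t: percentile_rank(load_times, t) for t in (500, 600)}
print(ranks)
```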

Multi-value Analysis: Geographic Location Type #

geo_bounds Geo bounds #

PUT /museums
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}

POST /museums/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}

POST /museums/_search?size=0
{
  "query": {
    "match": { "name": "musée" }
  },
  "aggs": {
    "viewport": {
      "geo_bounds": {
        "field": "location",    
        "wrap_longitude": true  
      }
    }
  }
}

The above aggregation calculates the bounding box of the location field for all documents whose name matches “musée”, which here are the two museums in Paris.

{
  ...
  "aggregations": {
    "viewport": {
      "bounds": {
        "top_left": {
          "lat": 48.86111099738628,
          "lon": 2.3269999679178
        },
        "bottom_right": {
          "lat": 48.85999997612089,
          "lon": 2.3363889567553997
        }
      }
    }
  }
}
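The bounding box is the northernmost latitude paired with the westernmost longitude (top left), and the southernmost latitude paired with the easternmost longitude (bottom right). The sketch below applies that to the two Paris museums. It deliberately ignores boxes that cross the international date line, which is the case `wrap_longitude` handles. The response above shows slightly different digits (for example 48.86111099738628), presumably because of how geo_point values are stored.

```python
def geo_bounds(points):
    """Return the bounding box of (lat, lon) pairs, ignoring date-line wrapping."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


# (lat, lon) of Musée du Louvre and Musée d'Orsay from the bulk request above.
paris_museums = [(48.861111, 2.336389), (48.860000, 2.327000)]
bounds = geo_bounds(paris_museums)
print(bounds)
```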

geo_centroid Geo-centroid #

PUT /museums
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}

POST /museums/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "city": "Amsterdam", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "city": "Amsterdam", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "city": "Amsterdam", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "city": "Antwerp", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "city": "Paris", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "city": "Paris", "name": "Musée d'Orsay"}

POST /museums/_search?size=0
{
  "aggs": {
    "centroid": {
      "geo_centroid": {
        "field": "location" 
      }
    }
  }
}

The above aggregation calculates the centroid of the location field across all documents in the museums index (there is no query, so all six museums are included).

{
  ...
  "aggregations": {
    "centroid": {
      "location": {
        "lat": 51.00982965203002,
        "lon": 3.9662131341174245
      },
      "count": 6
    }
  }
}
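For this data set, the reported centroid agrees with the plain arithmetic mean of the latitudes and longitudes. The sketch below checks that. Treating the centroid as a simple coordinate mean is an assumption that fits these numbers, and the tiny differences in the trailing digits presumably come from the stored precision of geo_point values.

```python
# (lat, lon) pairs from the bulk request above.
museums = [
    (52.374081, 4.912350),
    (52.369219, 4.901618),
    (52.371667, 4.914722),
    (51.222900, 4.405200),
    (48.861111, 2.336389),
    (48.860000, 2.327000),
]

count = len(museums)
centroid = {
    "lat": sum(lat for lat, _ in museums) / count,
    "lon": sum(lon for _, lon in museums) / count,
}
print(centroid, count)
```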

geo_line Geo-Line #

PUT test
{
    "mappings": {
        "dynamic": "strict",
        "_source": {
            "enabled": false
        },
        "properties": {
            "my_location": {
                "type": "geo_point"
            },
            "group": {
                "type": "keyword"
            },
            "@timestamp": {
                "type": "date"
            }
        }
    }
}

POST /test/_bulk?refresh
{"index": {}}
{"my_location": {"lat":37.3450570, "lon": -122.0499820}, "@timestamp": "2013-09-06T16:00:36Z"}
{"index": {}}
{"my_location": {"lat": 37.3451320, "lon": -122.0499820}, "@timestamp": "2013-09-06T16:00:37Z"}
{"index": {}}
{"my_location": {"lat": 37.349283, "lon": -122.0505010}, "@timestamp": "2013-09-06T16:00:37Z"}

POST /test/_search?filter_path=aggregations
{
  "aggs": {
    "line": {
      "geo_line": {
        "point": {"field": "my_location"},
        "sort": {"field": "@timestamp"}
      }
    }
  }
}

This aggregates all geo_point values in the bucket into a LineString, ordered by the chosen sort field.

{
  "aggregations": {
    "line": {
      "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [
            -122.049982,
            37.345057
          ],
          [
            -122.050501,
            37.349283
          ],
          [
            -122.049982,
            37.345132
          ]
        ]
      },
      "properties": {
        "complete": true
      }
    }
  }
}
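Conceptually, the aggregation sorts the points by the sort field and emits them as GeoJSON coordinates, which are written as [lon, lat] rather than [lat, lon]. The sketch below shows this under the assumption that all timestamps use the same UTC string format, so that comparing them as strings orders them chronologically. Two of the points share a timestamp, and the relative order of such ties should not be relied upon.

```python
def geo_line(points):
    """Sort points by timestamp and return a GeoJSON LineString geometry."""
    ordered = sorted(points, key=lambda p: p["ts"])
    return {
        "type": "LineString",
        # GeoJSON coordinate order is [longitude, latitude].
        "coordinates": [[p["lon"], p["lat"]] for p in ordered],
    }


points = [
    {"lat": 37.345132, "lon": -122.049982, "ts": "2013-09-06T16:00:37Z"},
    {"lat": 37.345057, "lon": -122.049982, "ts": "2013-09-06T16:00:36Z"},
    {"lat": 37.349283, "lon": -122.050501, "ts": "2013-09-06T16:00:37Z"},
]
line = geo_line(points)
print(line)
```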

## Multi-value Analysis: Top Type

### `top_hits` Top Hits after Bucketing
    
    
    POST /sales/_search?size=0
    {
      "aggs": {
        "top_tags": {
          "terms": {
            "field": "type",
            "size": 3
          },
          "aggs": {
            "top_sales_hits": {
              "top_hits": {
                "sort": [
                  {
                    "date": {
                      "order": "desc"
                    }
                  }
                ],
                "_source": {
                  "includes": [ "date", "price" ]
                },
                "size": 1
              }
            }
          }
        }
      }
    }
    

Response
    
    
    {
      ...
      "aggregations": {
        "top_tags": {
           "doc_count_error_upper_bound": 0,
           "sum_other_doc_count": 0,
           "buckets": [
              {
                 "key": "hat",
                 "doc_count": 3,
                 "top_sales_hits": {
                    "hits": {
                       "total" : {
                           "value": 3,
                           "relation": "eq"
                       },
                       "max_score": null,
                       "hits": [
                          {
                             "_index": "sales",
                             "_type": "_doc",
                             "_id": "AVnNBmauCQpcRyxw6ChK",
                             "_source": {
                                "date": "2015/03/01 00:00:00",
                                "price": 200
                             },
                             "sort": [
                                1425168000000
                             ],
                             "_score": null
                          }
                       ]
                    }
                 }
              },
              {
                 "key": "t-shirt",
                 "doc_count": 3,
                 "top_sales_hits": {
                    "hits": {
                       "total" : {
                           "value": 3,
                           "relation": "eq"
                       },
                       "max_score": null,
                       "hits": [
                          {
                             "_index": "sales",
                             "_type": "_doc",
                             "_id": "AVnNBmauCQpcRyxw6ChL",
                             "_source": {
                                "date": "2015/03/01 00:00:00",
                                "price": 175
                             },
                             "sort": [
                                1425168000000
                             ],
                             "_score": null
                          }
                       ]
                    }
                 }
              },
              {
                 "key": "bag",
                 "doc_count": 1,
                 "top_sales_hits": {
                    "hits": {
                       "total" : {
                           "value": 1,
                           "relation": "eq"
                       },
                       "max_score": null,
                       "hits": [
                          {
                             "_index": "sales",
                             "_type": "_doc",
                             "_id": "AVnNBmatCQpcRyxw6ChH",
                             "_source": {
                                "date": "2015/01/01 00:00:00",
                                "price": 150
                             },
                             "sort": [
                                1420070400000
                             ],
                             "_score": null
                          }
                       ]
                    }
                 }
              }
           ]
        }
      }
    }
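The request above first buckets the sales by type and then, within each bucket, keeps the single most recent document and returns only its date and price. The plain-Python sketch below mirrors that logic. The sales list is hypothetical, with dates and prices chosen so the top hit per type matches the response above.

```python
from collections import defaultdict

# Hypothetical sales documents.
sales = [
    {"type": "hat", "date": "2015/01/01 00:00:00", "price": 100},
    {"type": "hat", "date": "2015/03/01 00:00:00", "price": 200},
    {"type": "t-shirt", "date": "2015/02/01 00:00:00", "price": 120},
    {"type": "t-shirt", "date": "2015/03/01 00:00:00", "price": 175},
    {"type": "bag", "date": "2015/01/01 00:00:00", "price": 150},
]


def top_hit_per_type(docs):
    """Group docs by type and keep date and price of the newest doc per group."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc["type"]].append(doc)
    top = {}
    for key, hits in buckets.items():
        # This date format sorts chronologically when compared as strings.
        newest = max(hits, key=lambda d: d["date"])
        top[key] = {"date": newest["date"], "price": newest["price"]}
    return top


top_sales_hits = top_hit_per_type(sales)
print(top_sales_hits)
```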
    

### `top_metrics`
    
    
    POST /test/_bulk?refresh
    {"index": {}}
    {"s": 1, "m": 3.1415}
    {"index": {}}
    {"s": 2, "m": 1.0}
    {"index": {}}
    {"s": 3, "m": 2.71828}

    POST /test/_search?filter_path=aggregations
    {
      "aggs": {
        "tm": {
          "top_metrics": {
            "metrics": {"field": "m"},
            "sort": {"s": "desc"}
          }
        }
      }
    }
    

Response
    
    
    {
      "aggregations": {
        "tm": {
          "top": [ {"sort": [3], "metrics": {"m": 2.718280076980591 } } ]
        }
      }
    }
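In other words, `top_metrics` picks the document with the largest (or smallest) sort value and returns the requested metric from it. The sketch below reproduces that selection in plain Python. It also shows why the response contains 2.718280076980591 instead of 2.71828, assuming the field `m` was dynamically mapped as a 32-bit float. The helper `as_float32` is introduced here only for illustration.

```python
import struct

# Documents from the bulk request above.
docs = [{"s": 1, "m": 3.1415}, {"s": 2, "m": 1.0}, {"s": 3, "m": 2.71828}]

# sort: {"s": "desc"} means the document with the largest s wins.
top = max(docs, key=lambda d: d["s"])


def as_float32(x):
    """Round-trip a Python float through 32-bit precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


stored_m = as_float32(top["m"])
print(top["s"], top["m"], stored_m)
```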

## Reference Article

[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics.html)