# 11 Detailed Explanation of Aggregations - Metric Aggregations
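In Elasticsearch, an aggregation is a set of computations performed over a group of documents in order to extract summary information from the data. Aggregations are defined in the `aggs` section of a search request, which is expressed in JSON.
Metric aggregations are a family of aggregations used mainly to compute statistics over numeric fields, such as the maximum, minimum, average, and sum of a field.
Some of the available metric aggregations are:
- `avg`: computes the average of a field.
- `max`: computes the maximum of a field.
- `min`: computes the minimum of a field.
- `sum`: computes the sum of a field.
- `stats`: computes summary statistics for a field, including the average, maximum, minimum, and sum.
- `extended_stats`: similar to `stats`, but returns more detailed statistics such as the variance and standard deviation.
- `value_count`: counts the non-null values of a field.
To use a metric aggregation, specify the field to compute over and the aggregation type. For example: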
{
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
},
"max_price": {
"max": {
"field": "price"
}
},
"value_count_price": {
"value_count": {
"field": "price"
}
}
}
}
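The example above computes the average, the maximum, and the number of non-null values of the `price` field.
Metric aggregations return statistics about numeric fields as part of a search request, which helps in understanding the distribution and characteristics of the data. They can also be nested under bucket aggregations and combined with one another to meet more complex statistical needs.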
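As a rough illustration of how such a request body is put together, the Python sketch below assembles the same `aggs` object from a field name and a list of metric types. The helper name `build_metric_aggs` and the `<metric>_<field>` naming scheme are assumptions made for this example; they are not part of any Elasticsearch client.

```python
import json

def build_metric_aggs(field, metrics):
    """Build an `aggs` body with one single-field metric aggregation per type.

    Hypothetical helper: each aggregation is named "<metric>_<field>".
    """
    return {
        "aggs": {
            f"{metric}_{field}": {metric: {"field": field}}
            for metric in metrics
        }
    }

body = build_metric_aggs("price", ["avg", "max", "value_count"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request; the names on the left (`avg_price`, `max_price`, ...) are only labels that reappear as keys in the response.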
## Understanding Metric Aggregations
In the [bucket aggregations] article, I drew a diagram to help you build a mental framework. Now let's do the same for metric aggregations.
If you go straight to the official documentation, you will find more than a dozen types.
So, how do we understand metric aggregations? I think it can be understood from two perspectives:
- From a classification perspective: Metric aggregations can be categorized into single-value analysis and multi-value analysis.
- From a functional perspective: Specific analysis APIs are designed based on different application scenarios, such as geographic location, percentages, etc.
By combining the above two aspects, we can outline a rough mind map:
- Single-value analysis: outputs only one analysis result.
  * Standard statistical types:
    * `avg` - Average
    * `max` - Maximum
    * `min` - Minimum
    * `sum` - Sum
    * `value_count` - Count
  * Other types:
    * `cardinality` - Cardinality (distinct values)
    * `weighted_avg` - Weighted average
    * `median_absolute_deviation` - Median absolute deviation
- Multi-value analysis: outputs more than one analysis result.
  * Stats type:
    * `stats` - Includes avg, max, min, sum, and count
    * `matrix_stats` - Designed for matrix models
    * `extended_stats` - Like `stats`, plus variance, standard deviation, etc.
    * `string_stats` - Designed for strings
  * Percentile type:
    * `percentiles` - Range of percentiles
    * `percentile_ranks` - Percentile ranks
  * Geographic location type:
    * `geo_bounds` - Geo bounds
    * `geo_centroid` - Geo centroid
    * `geo_line` - Geo line
  * Top type:
    * `top_hits` - Top hits after bucketing
    * `top_metrics` - Metrics taken from the top-sorted document
With the list above (I won't draw a diagram here), the framework we build is organized by classification and function rather than by individual items such as `avg` or `percentiles`. This is a different level of understanding: individual items are fragments, while classification and function form the system you need to build. @pdai
## Single-Value Analysis: Standard Statistical Types
### `avg` Average
Calculate the average grade of the class
POST /exams/_search?size=0
{
"aggs": {
"avg_grade": { "avg": { "field": "grade" } }
}
}
Response
{
...
"aggregations": {
"avg_grade": {
"value": 75.0
}
}
}
### `max` Maximum
Calculate the maximum sale price
POST /sales/_search?size=0
{
"aggs": {
"max_price": { "max": { "field": "price" } }
}
}
Response
{
...
"aggregations": {
"max_price": {
"value": 200.0
}
}
}
### `min` Minimum
Calculate the minimum sale price
POST /sales/_search?size=0
{
"aggs": {
"min_price": { "min": { "field": "price" } }
}
}
Response
{
...
"aggregations": {
"min_price": {
"value": 10.0
}
}
}
### `sum` Sum
Calculate the total price of all hats sold (the query restricts the sum to documents whose `type` is `hat`)
POST /sales/_search?size=0
{
"query": {
"constant_score": {
"filter": {
"match": { "type": "hat" }
}
}
},
"aggs": {
"hat_prices": { "sum": { "field": "price" } }
}
}
Response
{
...
"aggregations": {
"hat_prices": {
"value": 450.0
}
}
}
### `value_count` Count
Count the number of values in the `type` field, which here equals the number of sales
POST /sales/_search?size=0
{
"aggs" : {
"types_count" : { "value_count" : { "field" : "type" } }
}
}
Response
{
...
"aggregations": {
"types_count": {
"value": 7
}
}
}
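To make the five standard types concrete, the sketch below reproduces them in plain Python over a small in-memory list. The seven `sales` documents are hypothetical sample data, chosen so that the results line up with the responses shown above; on a real cluster these values are computed server-side.

```python
# Hypothetical sample documents standing in for the `sales` index.
sales = [
    {"type": "hat", "price": 200},
    {"type": "t-shirt", "price": 200},
    {"type": "bag", "price": 150},
    {"type": "hat", "price": 50},
    {"type": "t-shirt", "price": 10},
    {"type": "hat", "price": 200},
    {"type": "t-shirt", "price": 175},
]

prices = [doc["price"] for doc in sales if "price" in doc]

max_price = max(prices)                # what `max` reports
min_price = min(prices)                # what `min` reports
avg_price = sum(prices) / len(prices)  # what `avg` reports
# `sum` combined with a query that keeps only hats
hat_prices = sum(doc["price"] for doc in sales if doc["type"] == "hat")
# `value_count` counts values of a field, duplicates included
types_count = sum(1 for doc in sales if doc.get("type") is not None)

print(max_price, min_price, hat_prices, types_count)
```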
## Single-Value Analysis: Other Types
### `weighted_avg` Weighted Average
POST /exams/_search
{
"size": 0,
"aggs": {
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "grade"
},
"weight": {
"field": "weight"
}
}
}
}
}
Response
{
...
"aggregations": {
"weighted_grade": {
"value": 70.0
}
}
}
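A weighted average is the sum of `value * weight` divided by the sum of the weights. The following sketch applies that formula in plain Python; the two exam documents are hypothetical and were picked so that the result matches the 70.0 shown above.

```python
def weighted_avg(docs, value_field, weight_field):
    """Return sum(value * weight) / sum(weight) over docs that have both fields."""
    pairs = [
        (doc[value_field], doc[weight_field])
        for doc in docs
        if value_field in doc and weight_field in doc
    ]
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return None  # nothing to average
    return sum(value * weight for value, weight in pairs) / total_weight

# Hypothetical exam documents.
exams = [{"grade": 100, "weight": 2}, {"grade": 50, "weight": 3}]
weighted_grade = weighted_avg(exams, "grade", "weight")
print(weighted_grade)
```

A plain `avg` over the same two grades would give 75.0; the heavier weight on the lower grade pulls the weighted result down.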
### `cardinality` Cardinality (Distinct Count)
POST /sales/_search?size=0
{
"aggs": {
"type_count": {
"cardinality": {
"field": "type"
}
}
}
}
Response
{
...
"aggregations": {
"type_count": {
"value": 3
}
}
}
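The difference between `value_count` and `cardinality` is easy to see on a small list: the first counts every value, the second counts distinct values. The sketch below uses an exact Python `set` as a stand-in; Elasticsearch itself computes `cardinality` approximately (with a HyperLogLog++ based algorithm), so on large data sets the real result may deviate slightly. The list of types is hypothetical sample data.

```python
# Hypothetical values of the `type` field across seven sales documents.
types = ["hat", "t-shirt", "bag", "hat", "t-shirt", "hat", "t-shirt"]

value_count = len(types)       # every value counts, duplicates included
cardinality = len(set(types))  # exact distinct count (stand-in for the approximation)

print(value_count, cardinality)
```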
### `median_absolute_deviation` Median Absolute Deviation
GET reviews/_search
{
"size": 0,
"aggs": {
"review_average": {
"avg": {
"field": "rating"
}
},
"review_variability": {
"median_absolute_deviation": {
"field": "rating"
}
}
}
}
Response
{
...
"aggregations": {
"review_average": {
"value": 3.0
},
"review_variability": {
"value": 2.0
}
}
}
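The median absolute deviation is the median of the absolute distances between each value and the median of all values. The sketch below computes it exactly with the standard library; the ratings are hypothetical and chosen so that the output matches the response above. Elasticsearch approximates this metric on large data sets, so real results can differ slightly.

```python
from statistics import mean, median

def median_absolute_deviation(values):
    """Return median(|x - median(values)|) over all x in values."""
    center = median(values)
    return median(abs(x - center) for x in values)

# Hypothetical star ratings.
ratings = [1, 1, 3, 5, 5]
review_average = mean(ratings)
review_variability = median_absolute_deviation(ratings)
print(review_average, review_variability)
```

An average of 3 alone would suggest middling reviews; the deviation of 2 shows that most ratings actually sit far from the middle.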
## Multi-Value Analysis: Stats Type
### `stats` Includes avg, max, min, sum, and count
POST /exams/_search?size=0
{
"aggs": {
"grades_stats": { "stats": { "field": "grade" } }
}
}
Response:
{
...
"aggregations": {
"grades_stats": {
"count": 2,
"min": 50.0,
"max": 100.0,
"avg": 75.0,
"sum": 150.0
}
}
}
### `matrix_stats` Designed for Matrix Models
The following example demonstrates the use of matrix statistics to describe the relationship between income and poverty.
GET /_search
{
"aggs": {
"statistics": {
"matrix_stats": {
"fields": [ "poverty", "income" ]
}
}
}
}
Response:
{
...
"aggregations": {
"statistics": {
"doc_count": 50,
"fields": [
{
"name": "income",
"count": 50,
"mean": 51985.1,
"variance": 7.383377037755103E7,
"skewness": 0.5595114003506483,
"kurtosis": 2.5692365287787124,
"covariance": {
"income": 7.383377037755103E7,
"poverty": -21093.65836734694
},
"correlation": {
"income": 1.0,
"poverty": -0.8352655256272504
}
},
{
"name": "poverty",
"count": 50,
"mean": 12.732000000000001,
"variance": 8.637730612244896,
"skewness": 0.4516049811903419,
"kurtosis": 2.8615929677997767,
"covariance": {
"income": -21093.65836734694,
"poverty": 8.637730612244896
},
"correlation": {
"income": -0.8352655256272504,
"poverty": 1.0
}
}
]
}
}
}
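To see what the `covariance` and `correlation` entries express, here is a small Python sketch that computes both for two fields. The income and poverty figures are hypothetical, and the `n - 1` (sample) normalization is an assumption of this sketch rather than a statement about how Elasticsearch normalizes its result.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equally long lists (n - 1 denominator, assumed)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical per-region figures.
income = [30000.0, 45000.0, 60000.0, 80000.0]
poverty = [18.0, 14.0, 11.0, 8.0]

print(covariance(income, poverty), correlation(income, poverty))
```

Two properties visible in the response above also hold here: the covariance of a field with itself is its variance, and its correlation with itself is 1.0. A negative correlation, as between income and poverty, means one field tends to fall as the other rises.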
### `extended_stats` Extended Statistics
Computes statistics over numeric values extracted from the aggregated documents.
GET /exams/_search
{
"size": 0,
"aggs": {
"grades_stats": { "extended_stats": { "field": "grade" } }
}
}
The above aggregation calculates statistics on the grade field for all documents. The aggregation type is extended_stats and the field setting defines the numeric field of the documents on which the statistics are computed.
{
...
"aggregations": {
"grades_stats": {
"count": 2,
"min": 50.0,
"max": 100.0,
"avg": 75.0,
"sum": 150.0,
"sum_of_squares": 12500.0,
"variance": 625.0,
"variance_population": 625.0,
"variance_sampling": 1250.0,
"std_deviation": 25.0,
"std_deviation_population": 25.0,
"std_deviation_sampling": 35.35533905932738,
"std_deviation_bounds": {
"upper": 125.0,
"lower": 25.0,
"upper_population": 125.0,
"lower_population": 25.0,
"upper_sampling": 145.71067811865476,
"lower_sampling": 4.289321881345245
}
}
}
}
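The figures in this response can be checked by hand. The sketch below recomputes them in plain Python for the two grades implied by the response (50 and 100); the function name and the `sigma` parameter name are choices made for this example, with `sigma=2` meaning that the bounds lie two standard deviations away from the average.

```python
import math

def extended_stats(values, sigma=2.0):
    """Recompute extended_stats-style figures for a non-empty list of numbers."""
    n = len(values)
    total = sum(values)
    avg = total / n
    sum_of_squares = sum(v * v for v in values)
    # Population variance divides by n; sampling variance divides by n - 1.
    var_pop = max(sum_of_squares / n - avg * avg, 0.0)
    var_samp = var_pop * n / (n - 1) if n > 1 else float("nan")
    std_pop, std_samp = math.sqrt(var_pop), math.sqrt(var_samp)
    return {
        "count": n, "min": min(values), "max": max(values),
        "avg": avg, "sum": total, "sum_of_squares": sum_of_squares,
        "variance_population": var_pop, "variance_sampling": var_samp,
        "std_deviation_population": std_pop, "std_deviation_sampling": std_samp,
        "std_deviation_bounds": {
            "upper_population": avg + sigma * std_pop,
            "lower_population": avg - sigma * std_pop,
            "upper_sampling": avg + sigma * std_samp,
            "lower_sampling": avg - sigma * std_samp,
        },
    }

grades_stats = extended_stats([50, 100])
print(grades_stats)
```

In the response, the unqualified `variance`, `std_deviation`, `upper`, and `lower` entries carry the same values as their `*_population` counterparts.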
### `string_stats` Designed for Strings
Computes statistics over string values extracted from the aggregated documents. These values can be retrieved from a specific keyword field.
POST /my-index-000001/_search?size=0
{
"aggs": {
"message_stats": { "string_stats": { "field": "message.keyword" } }
}
}
Response:
{
...
"aggregations": {
"message_stats": {
"count": 5,
"min_length": 24,
"max_length": 30,
"avg_length": 28.8,
"entropy": 3.94617750050791
}
}
}
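The sketch below shows one way to arrive at figures of this shape in plain Python. It assumes that the entropy is the Shannon entropy (in bits) of the character distribution across all collected strings; the sample messages are hypothetical and do not reproduce the numbers above.

```python
import math
from collections import Counter

def string_stats(strings):
    """Length statistics plus Shannon entropy over all characters (assumed definition)."""
    lengths = [len(s) for s in strings]
    char_counts = Counter("".join(strings))
    total_chars = sum(char_counts.values())
    entropy = 0.0
    for count in char_counts.values():
        p = count / total_chars
        entropy -= p * math.log2(p)
    return {
        "count": len(strings),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "avg_length": sum(lengths) / len(lengths) if lengths else None,
        "entropy": entropy,
    }

# Hypothetical messages.
message_stats = string_stats(["disk full", "disk nearly full", "login ok"])
print(message_stats)
```

A higher entropy indicates that the characters are spread more evenly, so the strings carry more variety.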
## Multi-Value Analysis: Percentile Type
### `percentiles` Percentile Ranges
Calculates one or more percentiles over values extracted from the aggregated documents.
GET latency/_search
{
"size": 0,
"aggs": {
"load_time_outlier": {
"percentiles": {
"field": "load_time"
}
}
}
}
By default, percentile metrics generate a range of percentiles: [1, 5, 25, 50, 75, 95, 99].
{
...
"aggregations": {
"load_time_outlier": {
"values": {
"1.0": 5.0,
"5.0": 25.0,
"25.0": 165.0,
"50.0": 445.0,
"75.0": 725.0,
"95.0": 945.0,
"99.0": 985.0
}
}
}
}
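To show what a percentile means, the sketch below computes the default set exactly, using linear interpolation between neighbouring values. This is an illustration only: Elasticsearch uses an approximate algorithm for percentiles, so its results on real data can differ, and the load times here are hypothetical.

```python
def percentile(values, p):
    """Return the p-th percentile (0-100) by linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical page load times in milliseconds.
load_times = [10, 20, 30, 40, 50]
default_percents = [1, 5, 25, 50, 75, 95, 99]
values = {str(float(p)): percentile(load_times, p) for p in default_percents}
print(values)
```

The 50th percentile is the median; high percentiles such as 95 and 99 are the usual way to spot outliers like unusually slow page loads.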
### `percentile_ranks` Percentile Ranks
Calculates one or more percentile ranks over values extracted from the aggregated documents.
GET latency/_search
{
"size": 0,
"aggs": {
"load_time_ranks": {
"percentile_ranks": {
"field": "load_time",
"values": [ 500, 600 ]
}
}
}
}
Returns:
{
...
"aggregations": {
"load_time_ranks": {
"values": {
"500.0": 90.01,
"600.0": 100.0
}
}
}
}
The above results indicate that 90.01% of page loads are completed within 500ms, while 100% of page loads are completed within 600ms.
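A percentile rank answers the reverse question of a percentile: given a threshold, what percentage of the values fall at or below it? The sketch below computes that exactly in Python; the load times are hypothetical, and Elasticsearch computes the real metric approximately.

```python
def percentile_rank(values, threshold):
    """Return the percentage of values less than or equal to `threshold`."""
    if not values:
        raise ValueError("percentile_rank() needs at least one value")
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

# Hypothetical page load times in milliseconds.
load_times = [120, 250, 310, 480, 590]
ranks = {str(float(t)): percentile_rank(load_times, t) for t in [500, 600]}
print(ranks)
```

This is the natural shape for checking a service-level target such as "at least 90% of pages load within 500 ms".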
## Multi-Value Analysis: Geographic Location Type
### `geo_bounds` Geo Bounds
PUT /museums
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
POST /museums/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}
POST /museums/_search?size=0
{
"query": {
"match": { "name": "musée" }
},
"aggs": {
"viewport": {
"geo_bounds": {
"field": "location",
"wrap_longitude": true
}
}
}
}
The aggregation above computes the bounding box of the `location` field for all documents whose `name` matches "musée".
{
...
"aggregations": {
"viewport": {
"bounds": {
"top_left": {
"lat": 48.86111099738628,
"lon": 2.3269999679178
},
"bottom_right": {
"lat": 48.85999997612089,
"lon": 2.3363889567553997
}
}
}
}
}
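The bounding box is simply the extreme latitudes and longitudes of the matching points. The following sketch recomputes it for the two museums whose names match "musée"; it deliberately ignores boxes that cross the international date line, which is the case the `wrap_longitude` option deals with.

```python
def geo_bounds(points):
    """Bounding box of (lat, lon) pairs; does not handle date-line wrapping."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }

# Musée du Louvre and Musée d'Orsay, taken from the bulk request above.
matching = [(48.861111, 2.336389), (48.860000, 2.327000)]
viewport = geo_bounds(matching)
print(viewport)
```

The small differences in the trailing digits of the real response come from how `geo_point` values are stored with limited precision.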
### `geo_centroid` Geo Centroid
PUT /museums
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
POST /museums/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "city": "Amsterdam", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "city": "Amsterdam", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "city": "Amsterdam", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "city": "Antwerp", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "city": "Paris", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "city": "Paris", "name": "Musée d'Orsay"}
POST /museums/_search?size=0
{
"aggs": {
"centroid": {
"geo_centroid": {
"field": "location"
}
}
}
}
The aggregation above computes the centroid of the `location` field across all documents in the index, since the request has no query.
{
...
"aggregations": {
"centroid": {
"location": {
"lat": 51.00982965203002,
"lon": 3.9662131341174245
},
"count": 6
}
}
}
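For point data, the centroid in this response can be reproduced by averaging the latitudes and the longitudes separately. The sketch below does that for the six museums indexed above; treating the centroid as a plain arithmetic mean is an assumption of this example, and it agrees with the response only up to the storage precision of `geo_point`.

```python
def geo_centroid(points):
    """Arithmetic mean of (lat, lon) pairs, plus the number of points."""
    count = len(points)
    return {
        "location": {
            "lat": sum(lat for lat, _ in points) / count,
            "lon": sum(lon for _, lon in points) / count,
        },
        "count": count,
    }

# The six museum locations from the bulk request above.
museums = [
    (52.374081, 4.912350), (52.369219, 4.901618), (52.371667, 4.914722),
    (51.222900, 4.405200), (48.861111, 2.336389), (48.860000, 2.327000),
]
centroid = geo_centroid(museums)
print(centroid)
```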
### `geo_line` Geo Line
PUT test
{
"mappings": {
"dynamic": "strict",
"_source": {
"enabled": false
},
"properties": {
"my_location": {
"type": "geo_point"
},
"group": {
"type": "keyword"
},
"@timestamp": {
"type": "date"
}
}
}
}
POST /test/_bulk?refresh
{"index": {}}
{"my_location": {"lat":37.3450570, "lon": -122.0499820}, "@timestamp": "2013-09-06T16:00:36"}
{"index": {}}
{"my_location": {"lat": 37.3451320, "lon": -122.0499820}, "@timestamp": "2013-09-06T16:00:37Z"}
{"index": {}}
{"my_location": {"lat": 37.349283, "lon": -122.0505010}, "@timestamp": "2013-09-06T16:00:37Z"}
POST /test/_search?filter_path=aggregations
{
"aggs": {
"line": {
"geo_line": {
"point": {"field": "my_location"},
"sort": {"field": "@timestamp"}
}
}
}
}
The aggregation collects all `geo_point` values in the bucket into a LineString ordered by the chosen sort field.
{
"aggregations": {
"line": {
"type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": [
[
-122.049982,
37.345057
],
[
-122.050501,
37.349283
],
[
-122.049982,
37.345132
]
]
},
"properties": {
"complete": true
}
}
}
}
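The idea behind this aggregation can be sketched in a few lines of Python: sort the documents by the sort field, then emit their points as GeoJSON `[lon, lat]` pairs. The timestamps below are hypothetical and distinct, so the order is unambiguous (how ties are ordered is not covered by this sketch). The `max_points` parameter and its default are assumptions that stand in for the aggregation's size limit, which is what the `complete` flag reports on.

```python
from datetime import datetime

def geo_line(docs, point_field, sort_field, max_points=10000):
    """Build a GeoJSON LineString feature from docs ordered by `sort_field`."""
    ordered = sorted(docs, key=lambda d: datetime.fromisoformat(d[sort_field]))
    kept = ordered[:max_points]  # hypothetical limit on the number of points
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            # GeoJSON lists longitude first, then latitude.
            "coordinates": [
                [d[point_field]["lon"], d[point_field]["lat"]] for d in kept
            ],
        },
        "properties": {"complete": len(kept) == len(ordered)},
    }

# Hypothetical documents, deliberately out of order.
docs = [
    {"my_location": {"lat": 37.349283, "lon": -122.050501}, "@timestamp": "2013-09-06T16:00:38"},
    {"my_location": {"lat": 37.345057, "lon": -122.049982}, "@timestamp": "2013-09-06T16:00:36"},
    {"my_location": {"lat": 37.345132, "lon": -122.049982}, "@timestamp": "2013-09-06T16:00:37"},
]
line = geo_line(docs, "my_location", "@timestamp")
print(line["geometry"]["coordinates"])
```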
## Multi-Value Analysis: Top Type
### `top_hits` Top Hits after Bucketing
POST /sales/_search?size=0
{
"aggs": {
"top_tags": {
"terms": {
"field": "type",
"size": 3
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"_source": {
"includes": [ "date", "price" ]
},
"size": 1
}
}
}
}
}
}
Response
{
...
"aggregations": {
"top_tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "hat",
"doc_count": 3,
"top_sales_hits": {
"hits": {
"total" : {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "sales",
"_type": "_doc",
"_id": "AVnNBmauCQpcRyxw6ChK",
"_source": {
"date": "2015/03/01 00:00:00",
"price": 200
},
"sort": [
1425168000000
],
"_score": null
}
]
}
}
},
{
"key": "t-shirt",
"doc_count": 3,
"top_sales_hits": {
"hits": {
"total" : {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "sales",
"_type": "_doc",
"_id": "AVnNBmauCQpcRyxw6ChL",
"_source": {
"date": "2015/03/01 00:00:00",
"price": 175
},
"sort": [
1425168000000
],
"_score": null
}
]
}
}
},
{
"key": "bag",
"doc_count": 1,
"top_sales_hits": {
"hits": {
"total" : {
"value": 1,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "sales",
"_type": "_doc",
"_id": "AVnNBmatCQpcRyxw6ChH",
"_source": {
"date": "2015/01/01 00:00:00",
"price": 150
},
"sort": [
1420070400000
],
"_score": null
}
]
}
}
}
]
}
}
}
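Conceptually, this request groups the documents by `type`, sorts each group by `date` in descending order, and keeps the first hit with only the requested source fields. The sketch below mirrors those steps in plain Python. The seven documents are hypothetical sample data chosen to be consistent with the response above, and the function name is made up for this example.

```python
from collections import defaultdict

# Hypothetical sample documents standing in for the `sales` index.
sales = [
    {"date": "2015/01/01 00:00:00", "type": "hat", "price": 200},
    {"date": "2015/01/01 00:00:00", "type": "t-shirt", "price": 200},
    {"date": "2015/01/01 00:00:00", "type": "bag", "price": 150},
    {"date": "2015/02/01 00:00:00", "type": "hat", "price": 50},
    {"date": "2015/02/01 00:00:00", "type": "t-shirt", "price": 10},
    {"date": "2015/03/01 00:00:00", "type": "hat", "price": 200},
    {"date": "2015/03/01 00:00:00", "type": "t-shirt", "price": 175},
]

def top_hits_per_bucket(docs, bucket_field, sort_field, includes, size=1):
    """Group docs by `bucket_field` and keep the `size` newest hits per group."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc)
    result = {}
    for key, members in buckets.items():
        # The fixed-width date strings sort correctly as plain text.
        members.sort(key=lambda d: d[sort_field], reverse=True)
        hits = [{field: d[field] for field in includes} for d in members[:size]]
        result[key] = {"doc_count": len(members), "hits": hits}
    return result

top_tags = top_hits_per_bucket(sales, "type", "date", includes=["date", "price"])
print(top_tags)
```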
### `top_metrics`
POST /test/_bulk?refresh
{"index": {}}
{"s": 1, "m": 3.1415}
{"index": {}}
{"s": 2, "m": 1.0}
{"index": {}}
{"s": 3, "m": 2.71828}
POST /test/_search?filter_path=aggregations
{
"aggs": {
"tm": {
"top_metrics": {
"metrics": {"field": "m"},
"sort": {"s": "desc"}
}
}
}
}
Response
{
"aggregations": {
"tm": {
"top": [ {"sort": [3], "metrics": {"m": 2.718280076980591 } } ]
}
}
}
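In other words, `top_metrics` picks the document with the largest (or smallest) sort value and returns the requested metric from it. The sketch below does the same over the three documents indexed above. It also shows a plausible reason why the response contains 2.718280076980591 rather than 2.71828: that is what the number looks like after a round trip through single precision, which would be consistent with `m` being stored as a `float`. The mapping is not shown in this article, so treat that explanation as an assumption.

```python
import struct

# The three documents from the bulk request above.
docs = [{"s": 1, "m": 3.1415}, {"s": 2, "m": 1.0}, {"s": 3, "m": 2.71828}]

def top_metrics(docs, metric_field, sort_field, descending=True):
    """Return the metric of the document with the extreme sort value."""
    pick = max if descending else min
    top = pick(docs, key=lambda d: d[sort_field])
    return {"top": [{"sort": [top[sort_field]], "metrics": {metric_field: top[metric_field]}}]}

def as_float32(x):
    """Round-trip a Python float through 32-bit single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

tm = top_metrics(docs, "m", "s")
print(tm)
print(as_float32(tm["top"][0]["metrics"]["m"]))
```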
## Reference Article
[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics.html)