A Basic Guide to Elasticsearch Queries and Aggregations

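1. Introduction to Elasticsearch Queries and Aggregations

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It can store, search, and analyze large volumes of data in near real time.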

Elasticsearch has two core capabilities:

Query: retrieves documents that match specific conditions
Aggregation: analyzes and summarizes data

graph TD
  A[Elasticsearch API] --> B[Query DSL]
  A --> C[Aggregation API]
  B --> D[Full-text queries]
  B --> E[Term-level queries]
  B --> F[Compound queries]
  C --> G[Bucket aggregations]
  C --> H[Metric aggregations]
  C --> I[Pipeline aggregations]
  C --> J[Matrix aggregations]

In this guide, we will learn how to use Elasticsearch's Query DSL (domain-specific language) and aggregations to retrieve and analyze data effectively.

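2. Query Basics

Elasticsearch provides a JSON-based query language (Query DSL) for defining queries. Query clauses fall into two broad categories:

Leaf query clauses

Queries that look for a particular value in a particular field (such as match, term, and range). These queries can be used on their own.

Compound query clauses

Queries that wrap other leaf or compound queries (such as bool and dis_max) in order to combine multiple queries.

The basic structure of a query request is: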

{
  "query": {
    "query_type": {
      "param1": "value1",
      "param2": "value2"
    }
  }
}

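2.1 Match Query

The match query is the standard query for full-text search. It analyzes the input text (for example, by tokenizing it) and then builds a query from the resulting terms.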

GET /my_index/_search
{
  "query": {
    "match": {
      "title": "elasticsearch guide"
    }
  }
}

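The query above searches for documents whose title field contains "elasticsearch" or "guide".

Note: the match query uses the OR operator by default. This can be changed with the operator parameter: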

GET /my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "elasticsearch guide",
        "operator": "and"
      }
    }
  }
}

This requires a document to contain both terms, "elasticsearch" and "guide".
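The difference between the two operators can be illustrated with a small in-memory simulation. This is only a sketch: the analyze and match functions and the sample titles are made up for illustration, and the toy analyzer (lowercase, split on whitespace) is far simpler than a real Elasticsearch analyzer.

```python
def analyze(text):
    """Toy analyzer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def match(doc_text, query_text, operator="or"):
    """Return True if the document text satisfies the toy match query."""
    doc_terms = set(analyze(doc_text))
    hits = [term in doc_terms for term in analyze(query_text)]
    # "or": any query term is enough; "and": every query term is required.
    return all(hits) if operator == "and" else any(hits)


titles = [
    "Elasticsearch Guide for Beginners",
    "A Guide to Lucene",
    "Elasticsearch in Production",
]

or_hits = [t for t in titles if match(t, "elasticsearch guide")]
and_hits = [t for t in titles if match(t, "elasticsearch guide", operator="and")]

print(len(or_hits))  # 3
print(and_hits)      # ['Elasticsearch Guide for Beginners']
```

With the default OR, a title that mentions only one of the two terms still matches; with AND, only the title containing both terms remains.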

2.2 Term Query

The term query matches exact values and does not analyze the search term. It is suited to structured data such as keywords, numbers, and dates.

GET /my_index/_search
{
  "query": {
    "term": {
      "status": {
        "value": "active"
      }
    }
  }
}

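Best practice: for text fields, make sure you understand the difference between analyzed and non-analyzed fields. A term query compares the search value as-is against the indexed terms, so it can miss a text field whose content was lowercased and tokenized at index time. If you need an exact match, use a keyword field or a .keyword subfield.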
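To see why this matters, the following sketch imitates how one raw value would be stored in a text field and in a keyword field. The analyze and term_matches helpers are hypothetical, and the analyzer is a simplification that only lowercases and splits on whitespace.

```python
def analyze(text):
    """Toy analyzer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def term_matches(indexed_terms, value):
    """A term query compares the value unchanged against the indexed terms."""
    return value in indexed_terms


raw = "Active User"
text_field_terms = analyze(raw)   # analyzed text field: ['active', 'user']
keyword_field_terms = [raw]       # keyword field keeps the value whole

print(term_matches(text_field_terms, "Active User"))     # False
print(term_matches(text_field_terms, "active"))          # True
print(term_matches(keyword_field_terms, "Active User"))  # True
```

The exact value is found only in the keyword field, because the text field no longer contains the original string as a single term.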

2.3 Range Query

The range query finds documents whose field value falls within a given range. It applies to numeric, date, and similar types.

GET /my_index/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 20,
        "lte": 40
      }
    }
  }
}

Common range operators:

gt - greater than
gte - greater than or equal to
lt - less than
lte - less than or equal to

A range query on a date field:

GET /my_index/_search
{
  "query": {
    "range": {
      "created_at": {
        "gte": "2021-01-01",
        "lte": "now",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
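The four range operators can be read as plain comparisons. The sketch below is a hypothetical in_range helper that applies them to numbers and dates in memory; it does not cover Elasticsearch date math such as now, so a fixed date stands in for the upper bound.

```python
from datetime import date


def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """Return True if value satisfies every bound that was supplied."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True


ages = [18, 20, 33, 40, 41]
matching_ages = [a for a in ages if in_range(a, gte=20, lte=40)]
print(matching_ages)  # [20, 33, 40]

created_at = date(2021, 3, 15)
print(in_range(created_at, gte=date(2021, 1, 1), lte=date(2021, 12, 31)))  # True
```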

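2.4 Bool Compound Query

The bool query lets you combine multiple query clauses and is the foundation for building complex queries.

A bool query has four clause types:

must: the document must match these conditions, similar to AND
should: the document should match these conditions, similar to OR; when the query also has must or filter clauses, should clauses are optional by default and only raise the score
must_not: the document must not match these conditions, similar to NOT
filter: the document must match, but the clause does not affect the relevance score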

GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch" } }
      ],
      "should": [
        { "match": { "content": "search" } },
        { "match": { "content": "analysis" } }
      ],
      "must_not": [
        { "term": { "status": "deleted" } }
      ],
      "filter": [
        { "range": { "publish_date": { "gte": "2020-01-01" } } }
      ]
    }
  }
}

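Performance tip: for conditions that do not need to contribute to the relevance score, prefer filter over must. Filter clauses skip scoring and their results can be cached, which makes queries more efficient.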
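How the four clause types interact can be sketched with a toy evaluator. Everything here is an assumption made for illustration: clauses are plain Python predicates, the has_word and bool_score helpers are hypothetical, and the "score" is just the number of matching must and should clauses rather than a real relevance score.

```python
def has_word(field, word):
    """Build a predicate that checks whether word occurs in doc[field]."""
    return lambda doc: word in doc[field].split()


def bool_score(doc, must=(), should=(), must_not=(), filter_=()):
    """Return a toy score for doc, or None if doc does not match."""
    if not all(clause(doc) for clause in must):
        return None
    if not all(clause(doc) for clause in filter_):
        return None
    if any(clause(doc) for clause in must_not):
        return None
    should_hits = sum(1 for clause in should if clause(doc))
    # With no must or filter clauses, at least one should clause must match.
    if should and not must and not filter_ and should_hits == 0:
        return None
    # filter_ and must_not decide membership only; they add nothing to the score.
    return len(must) + should_hits


docs = [
    {"title": "elasticsearch basics", "content": "search and analysis",
     "status": "active", "publish_date": "2021-05-01"},
    {"title": "elasticsearch internals", "content": "storage internals",
     "status": "active", "publish_date": "2022-02-10"},
    {"title": "elasticsearch tips", "content": "search",
     "status": "deleted", "publish_date": "2021-07-01"},
    {"title": "elasticsearch history", "content": "search",
     "status": "active", "publish_date": "2019-06-01"},
]

query = dict(
    must=[has_word("title", "elasticsearch")],
    should=[has_word("content", "search"), has_word("content", "analysis")],
    must_not=[lambda d: d["status"] == "deleted"],
    filter_=[lambda d: d["publish_date"] >= "2020-01-01"],  # ISO dates sort as text
)

scores = [bool_score(d, **query) for d in docs]
print(scores)  # [3, 1, None, None]
```

The second document still matches with no should hits, because the must clause is present; the third is excluded by must_not and the fourth by filter.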

2.5 Query String Query

The query_string query offers a compact way to search using a query syntax that lets users write operators (AND, OR, NOT), wildcards, and more.

GET /my_index/_search
{
  "query": {
    "query_string": {
      "query": "elasticsearch AND (guide OR tutorial)",
      "fields": ["title", "content"]
    }
  }
}

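Supported syntax includes:

Boolean operators: AND, OR, NOT
Field qualifiers: title:elasticsearch
Wildcards: elastic*, ?earch
Regular expressions: /elastics[ae]arch/
Fuzzy matching: elasticsearch~2
Ranges: age:[20 TO 30], date:{2020-01-01 TO 2020-12-31} (square brackets are inclusive, curly braces are exclusive)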

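3. Aggregation Basics

Aggregations provide the ability to group data and extract statistics from it. They can be thought of as the Elasticsearch equivalent of GROUP BY and aggregate functions in SQL.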

graph LR
  A[Elasticsearch aggregations] --> B[Bucket aggregations]
  A --> C[Metric aggregations]
  A --> D[Pipeline aggregations]
  A --> E[Matrix aggregations]
  B --> F[Terms]
  B --> G[Date Range]
  B --> H[Histogram]
  C --> I[Sum]
  C --> J[Avg]
  C --> K[Stats]
  D --> L[Avg Bucket]
  D --> M[Cumulative Sum]

The basic structure of the aggregation syntax:

{
  "aggs": {
    "aggregation_name": {
      "aggregation_type": {
        "field": "field_name",
        "other_param": "value"
      }
    }
  }
}

3.1 Bucket Aggregations

Bucket aggregations group documents into buckets, much like GROUP BY in SQL.

Terms aggregation

Creates groups based on field values:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "status_counts": {
      "terms": {
        "field": "status.keyword",
        "size": 10
      }
    }
  }
}

The request above returns document counts grouped by the status field.
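The grouping performed by a terms aggregation can be approximated in memory as follows. This is a sketch with a hypothetical terms_agg helper and made-up documents; it counts exactly, whereas counts from a real multi-shard index can be approximate.

```python
from collections import Counter


def terms_agg(docs, field, size=10):
    """Count documents per field value and keep the `size` largest buckets."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Largest count first; ties are broken by key in ascending order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


docs = [
    {"status": "active"}, {"status": "active"}, {"status": "deleted"},
    {"status": "active"}, {"status": "draft"}, {"status": "deleted"},
]

buckets = terms_agg(docs, "status", size=2)
print(buckets)
# [{'key': 'active', 'doc_count': 3}, {'key': 'deleted', 'doc_count': 2}]
```

Because size is 2, the draft bucket is dropped even though documents with that status exist.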

Date Histogram aggregation

Groups by time interval:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "publish_date",
        "calendar_interval": "month",
        "format": "yyyy-MM"
      }
    }
  }
}
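Monthly bucketing can be imitated by truncating each date to its year and month. The month_histogram helper and the sample dates below are assumptions for illustration. One deliberate simplification: this sketch only emits months that contain documents, whereas Elasticsearch by default also returns empty buckets for the gaps in between.

```python
from collections import Counter
from datetime import date


def month_histogram(dates):
    """Group dates into calendar-month buckets keyed as yyyy-MM."""
    counts = Counter(d.strftime("%Y-%m") for d in dates)
    return [{"key_as_string": key, "doc_count": counts[key]}
            for key in sorted(counts)]


publish_dates = [
    date(2023, 1, 5), date(2023, 1, 28), date(2023, 2, 14), date(2023, 4, 1),
]

buckets = month_histogram(publish_dates)
for bucket in buckets:
    print(bucket["key_as_string"], bucket["doc_count"])
# 2023-01 2
# 2023-02 1
# 2023-04 1
```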

Range aggregation

Creates buckets for custom ranges:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "age_ranges": {
      "range": {
        "field": "age",
        "ranges": [
          { "to": 25 },
          { "from": 25, "to": 35 },
          { "from": 35 }
        ]
      }
    }
  }
}
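In a range aggregation each bucket includes its from value and excludes its to value, which is why 25 and 35 can appear as both an upper and a lower bound above without any document being counted twice. A small sketch with a hypothetical range_agg helper and sample ages:

```python
def range_agg(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = []
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        count = sum(
            1 for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        buckets.append({"from": low, "to": high, "doc_count": count})
    return buckets


ages = [18, 25, 30, 35, 50]
ranges = [{"to": 25}, {"from": 25, "to": 35}, {"from": 35}]

buckets = range_agg(ages, ranges)
print([b["doc_count"] for b in buckets])  # [1, 2, 2]
```

The value 25 lands in the middle bucket and 35 in the last one, so the three counts add up to the five input values.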

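3.2 Metric Aggregations

Metric aggregations compute metrics over a set of documents, like the aggregate functions in SQL (SUM, AVG, MIN, MAX, and so on).

Basic metric aggregations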

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "avg_age": { "avg": { "field": "age" } },
    "max_age": { "max": { "field": "age" } },
    "min_age": { "min": { "field": "age" } },
    "sum_age": { "sum": { "field": "age" } }
  }
}

Stats aggregation

Computes several statistics at once:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "age_stats": {
      "stats": {
        "field": "age"
      }
    }
  }
}

Returns the count, min, max, avg, and sum of the field.
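The five values can be reproduced in memory with a few built-ins. The stats_agg helper below is a hypothetical sketch; how it handles an empty input is a choice made for this example, not a statement about the Elasticsearch response.

```python
def stats_agg(values):
    """Compute count, min, max, avg, and sum in one pass over the values."""
    values = list(values)
    if not values:
        # Assumption for this sketch: no values means no min, max, or avg.
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0}
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


age_stats = stats_agg([20, 30, 40])
print(age_stats)
# {'count': 3, 'min': 20, 'max': 40, 'avg': 30.0, 'sum': 90}
```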

Cardinality aggregation

Computes the approximate cardinality (number of distinct values) of a field:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "unique_statuses": {
      "cardinality": {
        "field": "status.keyword"
      }
    }
  }
}
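For comparison, an exact distinct count over a small in-memory list looks like this. The exact_cardinality helper is hypothetical. It keeps every distinct value in a set, which is what becomes expensive on large data sets; Elasticsearch instead estimates the count with a HyperLogLog++ sketch, so its result is approximate.

```python
def exact_cardinality(docs, field):
    """Count distinct values of a field by collecting them in a set."""
    return len({doc[field] for doc in docs if field in doc})


docs = [
    {"status": "active"}, {"status": "deleted"}, {"status": "active"},
    {"status": "draft"}, {"title": "no status here"},
]

unique_statuses = exact_cardinality(docs, "status")
print(unique_statuses)  # 3
```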

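3.3 Pipeline Aggregations

Pipeline aggregations operate on the output of other aggregations rather than directly on documents.

Avg Bucket aggregation

Computes the average of a metric across the buckets of another aggregation: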

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    },
    "avg_monthly_sales": {
      "avg_bucket": {
        "buckets_path": "sales_per_month>sales"
      }
    }
  }
}
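The two stages of this request, a per-month sum followed by an average over those sums, can be sketched in memory. The sum_per_month helper and the sales rows are invented for illustration.

```python
from collections import defaultdict
from datetime import date


def sum_per_month(rows):
    """Stage 1: sum the price of every row within each calendar month."""
    totals = defaultdict(float)
    for day, price in rows:
        totals[day.strftime("%Y-%m")] += price
    return dict(sorted(totals.items()))


sales = [
    (date(2023, 1, 5), 100.0),
    (date(2023, 1, 20), 50.0),
    (date(2023, 2, 3), 200.0),
    (date(2023, 3, 9), 70.0),
]

monthly = sum_per_month(sales)
# Stage 2: the pipeline step averages the bucket values, not the raw rows.
avg_monthly_sales = sum(monthly.values()) / len(monthly)

print(monthly)            # {'2023-01': 150.0, '2023-02': 200.0, '2023-03': 70.0}
print(avg_monthly_sales)  # 140.0
```

Averaging the raw prices would give 105.0 instead; the pipeline result is 140.0 because it works on the three monthly totals.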

Cumulative Sum aggregation

Computes a running total:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        },
        "cumulative_sales": {
          "cumulative_sum": {
            "buckets_path": "sales"
          }
        }
      }
    }
  }
}

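Key point: the buckets_path parameter of a pipeline aggregation specifies the path to the aggregation to process, in the form aggregation_name>sub_aggregation_name. The path is relative to where the pipeline aggregation itself sits, which is why cumulative_sales refers to its sibling metric simply as sales.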
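A running total over buckets can be sketched as follows. The bucket dictionaries loosely imitate the shape of an aggregation response, and the cumulative_sum helper with its buckets_path and name arguments is a hypothetical stand-in, not the Elasticsearch implementation.

```python
def cumulative_sum(buckets, buckets_path, name):
    """Add a running total of the metric at buckets_path to each bucket."""
    total = 0.0
    result = []
    for bucket in buckets:
        total += bucket[buckets_path]["value"]
        # Copy the bucket and attach the running total under `name`.
        result.append({**bucket, name: {"value": total}})
    return result


sales_per_month = [
    {"key_as_string": "2023-01", "sales": {"value": 150.0}},
    {"key_as_string": "2023-02", "sales": {"value": 200.0}},
    {"key_as_string": "2023-03", "sales": {"value": 70.0}},
]

with_totals = cumulative_sum(sales_per_month, "sales", "cumulative_sales")
print([b["cumulative_sales"]["value"] for b in with_totals])
# [150.0, 350.0, 420.0]
```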

4. Combining Queries and Aggregations

In practice, you often need to combine queries with aggregations to analyze a filtered data set.

GET /my_index/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2020-01-01",
        "lte": "2020-12-31"
      }
    }
  },
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.keyword",
        "size": 10
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "top_products": {
          "top_hits": {
            "size": 3,
            "_source": ["name", "price"],
            "sort": [
              { "price": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}

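The example above:

Selects all documents from 2020
Groups them by category
Computes the average price for each category
Returns the three highest-priced products in each category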
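The four steps can be traced on a handful of in-memory records. This is a sketch only: the summarize helper and the product data are invented, and dates are kept as ISO strings so that they compare chronologically as text.

```python
from collections import defaultdict


def summarize(docs, start, end, top_n=3):
    """Filter by date, group by category, then compute avg price and top hits."""
    groups = defaultdict(list)
    for doc in docs:
        if start <= doc["date"] <= end:          # step 1: the query
            groups[doc["category"]].append(doc)  # step 2: terms buckets
    result = {}
    for category, items in groups.items():
        ranked = sorted(items, key=lambda d: d["price"], reverse=True)
        result[category] = {
            "doc_count": len(items),
            "avg_price": sum(d["price"] for d in items) / len(items),  # step 3
            "top_products": [                                           # step 4
                {"name": d["name"], "price": d["price"]} for d in ranked[:top_n]
            ],
        }
    return result


products = [
    {"name": "b1", "category": "books", "price": 10.0, "date": "2020-03-01"},
    {"name": "b2", "category": "books", "price": 30.0, "date": "2020-05-12"},
    {"name": "b3", "category": "books", "price": 20.0, "date": "2020-08-30"},
    {"name": "b4", "category": "books", "price": 40.0, "date": "2020-11-02"},
    {"name": "b5", "category": "books", "price": 99.0, "date": "2019-12-31"},
    {"name": "t1", "category": "toys", "price": 5.0, "date": "2020-02-14"},
    {"name": "t2", "category": "toys", "price": 15.0, "date": "2020-09-09"},
]

report = summarize(products, "2020-01-01", "2020-12-31")
print(report["books"]["avg_price"])                           # 25.0
print([p["name"] for p in report["books"]["top_products"]])   # ['b4', 'b2', 'b3']
print(report["toys"]["avg_price"])                            # 10.0
```

The 2019 book is removed by the date filter before any grouping happens, so it affects neither the average nor the top products.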

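Use cases for nested aggregations:

Multi-dimensional data analysis
Trend analysis of time-series data
Sales analysis by category
Presentation of hierarchical data

5. Pagination and Sorting

Pagination and sorting are essential when dealing with large numbers of search results.

Basic pagination

Use the from and size parameters to paginate: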

GET /my_index/_search
{
  "from": 10,  // skip the first 10 results
  "size": 20,  // return 20 results
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}

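Note: deep pagination (a large from value) can cause performance problems. By default, Elasticsearch rejects requests where from + size exceeds 10,000 (the index.max_result_window setting). For large result sets, use search_after or the scroll API.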
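The from and size semantics, including the default window limit, can be mimicked with list slicing. The page helper and the MAX_RESULT_WINDOW constant are names chosen for this sketch; raising ValueError simply stands in for the request being rejected.

```python
MAX_RESULT_WINDOW = 10_000  # mirrors the default of index.max_result_window


def page(hits, from_=0, size=10):
    """Return one page of hits, rejecting windows beyond the limit."""
    if from_ + size > MAX_RESULT_WINDOW:
        raise ValueError(
            f"from + size must be <= {MAX_RESULT_WINDOW}, got {from_ + size}"
        )
    return hits[from_:from_ + size]


hits = list(range(100))  # stand-in for hits already sorted by relevance

second_window = page(hits, from_=10, size=20)
print(second_window[0], second_window[-1], len(second_window))  # 10 29 20
```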

Sorting

Use the sort parameter to specify sort fields and order:

GET /my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "publish_date": { "order": "desc" } },
    { "rating": { "order": "desc" } },
    "_score"
  ]
}

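Sort fields need to be numeric, date, or keyword types. For text fields, sort on the .keyword subfield instead.

Search After

For deep pagination, search_after provides cursor-style paging that continues from the last hit of the previous page, avoiding the cost of a large from value: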

GET /my_index/_search
{
  "size": 10,
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  },
  "sort": [
    { "publish_date": { "order": "desc" } },
    { "_id": { "order": "asc" } }
  ],
  "search_after": [1589570400000, "doc_id_123"]
}

The search_after parameter takes the sort values of the last document returned by the previous request.
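The cursor mechanism can be sketched over an in-memory list. The search and sort_key helpers and the sample documents are assumptions for illustration; publish_date is a plain number standing in for epoch milliseconds, and _id acts as the tiebreaker so that no two documents share a sort key.

```python
def sort_key(doc):
    """publish_date descending, then _id ascending as a tiebreaker."""
    return (-doc["publish_date"], doc["_id"])


def search(docs, size, search_after=None):
    """Return one page plus the sort values to pass to the next request."""
    ordered = sorted(docs, key=sort_key)
    if search_after is not None:
        publish_date, doc_id = search_after
        cursor = (-publish_date, doc_id)
        # Keep only documents that sort strictly after the cursor.
        ordered = [d for d in ordered if sort_key(d) > cursor]
    hits = ordered[:size]
    next_cursor = [hits[-1]["publish_date"], hits[-1]["_id"]] if hits else None
    return hits, next_cursor


docs = [
    {"_id": "a", "publish_date": 300},
    {"_id": "b", "publish_date": 200},
    {"_id": "c", "publish_date": 200},
    {"_id": "d", "publish_date": 100},
    {"_id": "e", "publish_date": 300},
]

seen, cursor = [], None
while True:
    hits, cursor = search(docs, size=2, search_after=cursor)
    if not hits:
        break
    seen.extend(d["_id"] for d in hits)

print(seen)  # ['a', 'e', 'b', 'c', 'd']
```

Each request only needs the last sort values, not an offset, and the tiebreaker ensures that documents with the same publish_date are neither skipped nor repeated across pages.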

6. Highlighting Search Results

Highlighting makes it easier for users to see where the matches occur.

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    },
    "pre_tags": [""],
    "post_tags": [""]
  }
}

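Options for customizing highlighting:

fragment_size: the size of each highlighted fragment, in characters
number_of_fragments: the maximum number of fragments to return
require_field_match: whether to highlight only fields that contain a query match
highlight_query: a custom query to use for highlighting

Tip: for long text, the no_match_size parameter sets how many characters to return from the start of a field when it has no matching fragments, so that the result still carries some context.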
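The effect of pre_tags and post_tags can be imitated with a regular expression. The highlight helper below is a hypothetical, much simplified stand-in for a real highlighter: it wraps whole-word, case-insensitive matches and does no fragmenting. The <em> and </em> values are the tags Elasticsearch uses when none are specified.

```python
import re


def highlight(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap each whole-word occurrence of term in the given tags."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, text)


snippet = highlight("Elasticsearch makes search easy", "search")
print(snippet)  # Elasticsearch makes <em>search</em> easy

custom = highlight("Search everything", "search", "<b>", "</b>")
print(custom)   # <b>Search</b> everything
```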

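7. Performance Tips

Queries and aggregations can be resource-intensive. Here are some suggestions for improving performance:

Query optimization

Prefer filter over must where scoring is not needed; filter results can be cached
Avoid leading-wildcard queries where possible (for example, *search)
Use specific queries rather than broad ones
Use an appropriate pagination strategy for large data sets

Aggregation optimization

Apply filters before aggregating
Limit the number of buckets (terms.size)
Choose an appropriate interval for date histograms
Set size: 0 when the search hits themselves are not needed

Using filter context

Put conditions that do not affect relevance scoring into filter context: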

GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch" } }
      ],
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "publish_date": { "gte": "now-1y" } } }
      ]
    }
  }
}

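Index and field mapping optimization

Keep doc_values enabled on fields you aggregate on (enabled by default)
Configure a suitable analyzer for fields that need full-text search
Use the keyword type for fields that do not need analysis
Use fielddata on text fields sparingly (mind the memory cost)

8. Summary

Elasticsearch's query and aggregation features are powerful and flexible, and can cover a wide range of complex search and data analysis needs.

Key points on queries and aggregations:

Understand how query types differ (match vs term)
Use the bool query to combine multiple conditions
Use bucket aggregations to group data
Use metric aggregations to compute statistics
Combine queries and aggregations for complex analysis
Apply appropriate pagination and sorting strategies
Pay attention to performance, especially with large data volumes

As you get to know Elasticsearch better, you can explore more advanced features such as cross-cluster search, search templates, async search, and machine learning integration.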