Elasticsearch 為什麼快：Inverted Index、BM25、HNSW 與實戰索引設計

日期：2026-04-22 情境： 在一套鑑識分析平台中，需要對數百萬筆訊息、通話、帳號、位置做即時全文檢索 + 語意相似搜尋。本文整理底層演算法 + 5 個 index 的設計取捨。

TL;DR

Elasticsearch 快的根本原因是 Lucene 的 inverted index（每個 token → posting list）+ 跳躍指標 (skip lists) 讓 boolean 查詢退化成 set 交集，配合 BM25 評分公式做相關性排序。向量檢索則靠 HNSW (Hierarchical Navigable Small World) 圖把暴力 O(n) 比對降到 O(log n)。

實戰設計：把不同性質資料拆成多 index（共用 all_content 統一檢索欄位）、寫入時關 refresh 開大 buffer、查詢時混搭 BM25 + KNN。

一、為什麼 Elasticsearch 對全文檢索很快

1.1 Inverted Index：從「文件 → 字」翻成「字 → 文件」

關聯式資料庫做全文檢索通常用 LIKE '%keyword%' 或 tsvector。LIKE 是 O(n) 全表掃描，tsvector 雖然有 GIN index 但設計目標是中小規模。

Lucene（Elasticsearch 底層）走完全相反的路：inverted index（倒排索引）。

文件 1: "張三今天去了台北"
文件 2: "李四今天在台中"

正向索引（傳統）：
  文件 1 → ["張三", "今天", "去", "台北"]
  文件 2 → ["李四", "今天", "在", "台中"]

倒排索引（Lucene）：
  "張三" → [1]
  "李四" → [2]
  "今天" → [1, 2]
  "台北" → [1]
  "台中" → [2]

查詢「今天 AND 台北」就變成 set 交集 [1,2] ∩ [1] = [1]，不需要掃任何原始文件。每個 token 的 posting list 在 Lucene 內部用 FST (Finite State Transducer) 壓縮，在 disk 上順序讀取友善。

1.2 Skip List：百萬級 posting list 也能 O(log n) 跳躍

如果 “今天” 出現在 100 萬份文件裡、”台北” 出現在 10 萬份，AND 還是要逐一比對。Lucene 在 posting list 上維護 multi-level skip pointer：

posting list: [1, 5, 9, 12, 18, 25, 33, 41, 50, ...]
skip pointer (每 8 個一階):     [1,           18,            50, ...]
skip pointer (每 64 個一階): [1, ........................... 50, ...]

要找 doc=33 不用走 1→5→9→12→18→25→33 七步，直接從 skip pointer 跳到 18 再走 25→33 三步。AND 兩個 list 時，從稀疏端開始用 skip 往稠密端「找下一個 ≥ 當前」的位置 — 平均 O((m+n) log(N/m)) 而不是 O(N)。

1.3 Analyzer：寫入時就斷好詞

PUT /messages
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "type": "standard" }
      }
    }
  }
}

Standard analyzer 會把「張三今天去了台北」轉成 tokens ["張", "三", "今", "天", "去", "了", "台", "北"]（CJK 預設按字切）。中文常用 icu_analyzer 或 smartcn 改善斷詞，但對「人名搜尋」場景反而單字索引更不會漏。

實戰上鑑識資料有大量人名 / 帳號 / 別名，「按字索引 + match 搜尋」比「斷詞 + match_phrase」漏看少。

二、BM25：相關性怎麼算

ES 預設用 BM25（Best Matching 25）打分，公式：

score(q, d) = Σ IDF(qi) · (f(qi, d) · (k1 + 1))
                          / (f(qi, d) + k1 · (1 - b + b · |d|/avgdl))

各項含義：

f(qi, d)：term frequency — qi 在文件 d 出現幾次
|d|：文件 d 的長度
avgdl：所有文件平均長度
IDF(qi) = log((N - n + 0.5) / (n + 0.5) + 1) — 出現越少的字權重越高
k1 = 1.2：term frequency 飽和點（一個詞出現 1 次和 100 次的差距會收斂）
b = 0.75：文件長度懲罰（長文件被適度扣分，避免 term 多就贏）

直覺理解：

「李四」（罕見字）出現一次 > 「的」（常見字）出現一次
短訊息「李四詐騙集團」 > 萬字長文中夾一句「…李四…」
同一 term 第 50 次出現的貢獻幾乎等於第 5 次（k1 飽和）

vs 老一代的 TF-IDF：BM25 加了 saturation 和 length normalization，避免 term 多就無腦贏。

三、實戰：5 個 index 的設計

該鑑識分析平台拆成 5 個 index 而不是塞一個大 index，每個對應一種解析後的資料：

Index	主資料	shards/replicas	保留期	特殊欄位
`messages`	聊天訊息	3/1	365 天	`embedding` (dense_vector 1024-dim), `body` (text)
`accounts`	使用者帳號	1/1	永久	`account_meta.username/displayName`
`calls`	通話紀錄	2/1	365 天	`call_meta.direction/duration/status`
`locations`	位置點	2/1	365 天	`location` (geo_point)
`conversations`	聊天室	1/1	永久	`conversation_meta.name/messageCount`

3.1 為什麼拆而不合？

Shard 設計獨立：訊息資料量大用 3 shards，帳號小用 1 shard。一個 index 多 shards 適合大資料量平行查詢，但 shard 越多 cluster state 越重 — 小 index 不需要。
Mapping 衝突避免：訊息的 body 是 text、帳號的 username 是 keyword，名稱類似但 type 不同 — 在同一 index 會吵起來。
保留期不同：訊息只存 365 天但帳號要永久 — 用 ILM (Index Lifecycle Management) 個別套政策。

3.2 共用欄位 + `copy_to` 統一檢索欄位

每個 index 都共用一組欄位：

{
  pg_id: keyword,          // PostgreSQL 主鍵（外連用）
  pg_exhibit_id: keyword,  // 證物 ID（routing key + filter 用）
  timestamp: date,
  all_content: text,       // ✨ 核心：統一全文搜尋欄位
  sender: {
    pg_party_id: keyword,
    identifier: keyword,
    aliases: { type: keyword, copy_to: "all_content" }
  },
  participants: nested {
    pg_party_id: keyword,
    aliases: { type: keyword, copy_to: "all_content" }
  }
}

關鍵 trick：copy_to: "all_content"

把 sender.aliases、participants.aliases、訊息 body、通話 caller_name 全部複製進 all_content 一個欄位。前端搜尋只要 match all_content 就能跨欄位命中。

否則使用者搜「張三」會需要 multi_match 列出所有可能欄位，每加一個欄位都要改前端 — copy_to 把這個成本轉到 index time（一次性）。

3.3 Nested vs Object：為什麼 participants 要 nested

// 用 object 會踩坑：
participants: [
  { identifier: "alice", role: "sender" },
  { identifier: "bob",   role: "receiver" }
]

ES 預設把 array of object 拍扁成：

participants.identifier: ["alice", "bob"]
participants.role: ["sender", "receiver"]

查詢 identifier=alice AND role=receiver 會錯誤命中（因為 alice 和 receiver 都在）。

nested type 把每個 element 當獨立子文件存，查詢時用 nested query 確保條件在同一物件內 match。代價是寫入慢、子文件數量算 doc 數量上限 — 但鑑識場景必要。

四、KNN 向量搜尋：HNSW 是怎麼把暴力比對降到 O(log n) 的

訊息 index 多了 embedding: dense_vector(1024-dim) 欄位，存的是 BGE-M3 把訊息文字轉成的語意向量。

4.1 暴力 KNN = O(n)，不可接受

要找跟「我下午去匯款」最相似的 10 筆訊息，最直白做法是把 query 向量跟資料庫每一筆都算 cosine similarity 排序 — 100 萬筆就是 100 萬次浮點運算。即使單次只要 1μs，也要 1 秒。

4.2 HNSW（Hierarchical Navigable Small World）

HNSW 的核心是「多層導航圖」：

graph TB
    subgraph L2 ["Layer 2 (稀疏 - 樞紐節點)"]
        A2((A)) --- E2((E))
    end
    subgraph L1 ["Layer 1 (中等)"]
        A1((A)) --- C1((C)) --- E1((E)) --- G1((G))
    end
    subgraph L0 ["Layer 0 (全量節點)"]
        A0((A)) --- B0((B)) --- C0((C)) --- D0((D)) --- E0((E)) --- F0((F)) --- G0((G)) --- H0((H))
    end

    A2 -.->|跳到下層| A1
    E2 -.->|跳到下層| E1
    A1 -.->|跳到下層| A0
    C1 -.->|跳到下層| C0
    E1 -.->|跳到下層| E0
    G1 -.->|跳到下層| G0

    style A2 fill:#fef3c7,stroke:#92400e,stroke-width:2px
    style E2 fill:#fef3c7,stroke:#92400e,stroke-width:2px

上層只放少量「樞紐」節點，連線稀疏但跨度大
下層節點多、連線密集
每個節點知道它的 neighbors（不是全部）

搜尋演算法：

從最頂層的入口點開始
在當前層找到「跟 query 最近的 neighbor」往那走，直到走不動
跳到下一層繼續
到 layer 0 收集 ef (e.g. 100) 個候選，回傳 top-k

時間複雜度近 O(log N)，記憶體需求 ~4×D bytes/vec（D = 維度）。100 萬個 1024-dim float 向量約 4 GB — 全載 RAM 沒問題。

ES 內部用 Lucene 9.0+ 的 KnnVectorsFormat 實作 HNSW，每個 segment 一份 graph。

4.3 實戰查詢

const queryVector = await getEmbedding("詐騙集團匯款指示");

await elkClient.search({
  index: "messages",
  knn: {
    field: "embedding",
    query_vector: queryVector,
    k: 100,                  // 最終要回幾筆
    num_candidates: 1000,    // HNSW 在每個 shard 找的候選數
  },
  _source: ["pg_id", "body", "timestamp"],
});

num_candidates 越大召回越好但越慢，通常設為 10 × k。

4.4 Hybrid Search：BM25 + KNN 一起來

純向量搜尋容易召回「語意相關但字面無關」的（好/壞兼有）。實戰常做混合：

await elkClient.search({
  index: "messages",
  query: {
    bool: {
      must: [{ exists: { field: "embedding" } }],
      should: [{ match: { body: { query, boost: 1 } } }],   // BM25 權重 1
    },
  },
  knn: {
    field: "embedding",
    query_vector: queryVector,
    k: 100,
    num_candidates: 1000,
    boost: 2,                // KNN 權重 2
  },
  min_score: 2.0,
});

最終 score = BM25 score × 1 + KNN cosine × 2。實戰反覆調 boost 找平衡點，2:1 偏向語意 / 1:2 偏向關鍵字。

五、批量寫入優化：refresh / replica 的取捨

ES 預設每 1 秒 refresh 一次（讓新文件可被搜尋），refresh 會在每個 shard 開新 segment、commit 到 OS file cache。批量寫入時這 1 秒 refresh 是吞吐殺手。

寫入前：

await elkClient.indices.putSettings({
  index: "messages",
  body: {
    "index.refresh_interval": "-1",       // 完全停 refresh
    "index.number_of_replicas": 0,        // 關副本
  },
});
await elkClient.cluster.putSettings({
  transient: { "indices.memory.index_buffer_size": "30%" },  // 加大 buffer
});

寫入後：

await elkClient.indices.putSettings({
  index: "messages",
  body: {
    "index.refresh_interval": "10s",
    "index.number_of_replicas": 1,
  },
});
await elkClient.indices.refresh({ index: "messages" });   // 手動 refresh 一次

實戰測過，這套設定組合可讓 bulk 寫入吞吐量提升 5-10 倍（從 ~5K docs/s 到 ~30K-50K docs/s）。

六、學到的事

Inverted index + skip list 是 ES 快的根本 — 不是 cache 也不是 sharding 而已。理解這個才能解釋為什麼某些 query pattern 慢。
BM25 的 length normalization (b=0.75) 是 vs TF-IDF 最大改進 — 不會被長文件「騙分」。
copy_to 是低成本的多欄位統一檢索 — 把成本轉到 index time，前端只要 query 一個欄位。對鑑識資料這種「同一個概念散在多個欄位」的場景 game changer。
HNSW 是 KNN 從 O(n) 降到 O(log n) 的關鍵 — 但要記得它是 ANN（approximate）— 在意 recall 的場景要調 num_candidates。
Bulk 寫入時關掉 refresh + replica 是免費的 5-10× 提升 — 但別忘了寫完再開回來，且要手動 refresh 一次讓資料可搜尋。
Nested vs Object 是 ES schema 設計第一個要踩的坑 — array of object 會被拍扁。

參考資料

Elasticsearch — Inverted Index (Lucene 內部) — 老但經典的 ES 底層解說系列
BM25 — The Next Generation of Lucene Relevance — BM25 公式直覺與調參
Elasticsearch dense_vector 文件 — dense_vector field 與 knn query 規範
HNSW 原始論文 (Malkov & Yashunin, 2016) — 多層導航圖演算法原文
Lucene 9 KnnVectorsFormat — ES 8+ KNN 的底層實作
Hybrid search in Elasticsearch — RRF / boost 兩種混合策略
BGE-M3 多語言 embedding 模型 — 中英多語言 1024-dim，適合 hybrid search