Neo4j 為什麼快：Native Graph Storage、Index-Free Adjacency 與跨案碰撞偵測

日期：2026-04-22 情境： 鑑識分析平台需要在多個證物（手機）之間找「兩個人是否在不同案件、不同證物上有聯繫」。在 PostgreSQL 上做要寫 5-7 層 join，在 Neo4j 上是一條 Cypher。本文整理為什麼 graph database 對這種查詢天生比 relational 快。

TL;DR

Neo4j 快的根本原因是 index-free adjacency — 每個節點直接持有 neighbor 的物理指標（不是外鍵 ID），traversal 時不需要查 index，是 O(1) per hop。對比 RDBMS 的 join 是 O(log n)，每多一層 hop 在 graph DB 是常數成本，在 RDBMS 是指數爆炸。

實戰：用 5 個 node label + 7 種 relationship 把鑑識資料建模成圖，跨案碰撞變成一條 5 段 path 的 MATCH，回應時間從 PG 上的「跑半小時 timeout」變成 Neo4j 的「秒級」。

一、Native Graph Storage：什麼是 Index-Free Adjacency

1.1 RDBMS 的 join 為什麼貴

關聯式資料庫存「兩個人有通話關係」會這樣設計：

CREATE TABLE party (id, name);
CREATE TABLE call (id, caller_id, callee_id, timestamp);
CREATE INDEX ON call(caller_id);
CREATE INDEX ON call(callee_id);

-- 查 alice 通話過的人：
SELECT p2.name FROM party p1
  JOIN call c ON c.caller_id = p1.id
  JOIN party p2 ON p2.id = c.callee_id
  WHERE p1.name = 'alice';

每個 join 步驟都要 B-tree index lookup：

p1.name = 'alice' → 用 name index 找 p1.id（O(log n)）
c.caller_id = p1.id → 用 caller_id index 找 call rows（O(log n)）
p2.id = c.callee_id → 用 PK index 找 p2 rows（O(log n)）

3 層 join，3 次 B-tree 走訪。每多一層 hop 就多一次 index lookup。

更糟的是 query optimizer 不一定挑最佳 join order。對「查 alice 跟誰共同認識某 X」這種 5-hop 問題，PG 經常要評估 5! = 120 種 join order，optimizer 容易抓錯導致全表掃描。

1.2 Neo4j 的解法：直接記住物理位置

Neo4j 的儲存格式（簡化）：

node 表 (固定 15 bytes/record):
  [in_use][next_rel_id][next_property_id]

relationship 表 (固定 34 bytes/record):
  [in_use][start_node_id][end_node_id][relationship_type_id]
  [first_prev_rel_id][first_next_rel_id]      ← 雙向連結串列
  [second_prev_rel_id][second_next_rel_id]
  [next_property_id]

關鍵是 relationship record 裡直接存「下一個和上一個 relationship 的 record id」（雙向連結）。

從 node alice 出發：

讀 alice 的 node record，拿到 next_rel_id → 跳到第一個 relationship 的 record（直接算 offset，O(1)）
讀那個 relationship 的 first_next_rel_id → 跳到下一個 relationship（O(1)）
連續 traverse，每跳都是 指標跳躍，不是 index lookup

這就是 index-free adjacency：節點之間的連線資訊不需要靠 index 查，因為每個節點本身就是「鄰居清單的入口」。

1.3 複雜度比較

操作	RDBMS	Neo4j
找 1-hop neighbors	O(log n) per hop	O(1) per hop
找 k-hop neighbors	O((log n)^k)	O(d^k)，d = avg degree
Path finding (k 步)	optimizer 可能走死	k × O(d)

對「找 6 度分隔內的所有人」這種題目：

RDBMS：需要 6 個 self-join，計畫爆炸
Neo4j：直接 BFS，只受平均 degree 影響

二、Schema 設計：5 個 node label + 7 種關係

該鑑識分析平台的圖建模：

Node labels：

Party {pg_id, identifier, source, aliases[], display_name}
  // 一個「人」（手機帳號背後的個體）

Case {pg_id, name, caseNumber}
  // 一個案件

Exhibit {pg_id, name, source}
  // 一個證物（通常是手機 / 電腦）

Conversation {pg_id, name, source}
  // 一個聊天室

Call {pg_id, name, status, source, timestamp}
  // 一通通話

Relationship types：

graph LR
    Party((Party))
    Conversation((Conversation))
    Call((Call))
    Exhibit((Exhibit))
    Case((Case))

    Exhibit -->|BELONGS_TO_CASE| Case
    Conversation -->|FROM_EXHIBIT| Exhibit
    Call -->|FROM_EXHIBIT| Exhibit

    Party -->|PARTICIPATES_IN| Conversation
    Party -.->|OWNER_PARTICIPATES_IN ⭐| Conversation
    Party -->|PARTICIPATES_IN_CALL| Call
    Party -.->|OWNER_PARTICIPATES_IN_CALL ⭐| Call

    Party ---|COLLIDED_WITH 雙向| Party

    style Party fill:#fef3c7,stroke:#92400e,stroke-width:2px
    style Exhibit fill:#dbeafe,stroke:#1e40af
    style Case fill:#dcfce7,stroke:#166534
    style Conversation fill:#ede9fe,stroke:#5b21b6
    style Call fill:#fce7f3,stroke:#9f1239

實線 = 一般參與；虛線 ⭐ = 「手機持有人」專屬關係（為什麼拆兩種，見 §2.1）。

2.1 為什麼 OWNER_PARTICIPATES_IN 跟 PARTICIPATES_IN 要分開

這是整個 schema 最關鍵的設計。

「李四出現在某手機的通訊錄裡」可能有兩種意義：

a) 那支手機是李四的（李四是 owner）
b) 李四只是手機主人的某個聯絡人 / 群組成員

跨案碰撞分析要找的是「兩個人在不同手機上都是 owner，且互動過」 — 否則找出「兩個 owner 都認識同一個快遞員」沒意義。

如果用同一個 :PARTICIPATES_IN 關係加 isPhoneOwner: true/false property，每次 query 都要 WHERE r.isPhoneOwner = true filter，graph traversal 時的 relationship type 過濾比 property 過濾快得多（type 是儲存層 byte-level 過濾，property 要讀 property store）。

拆成兩個 type 後：

找 owner-only 的 traversal 直接 MATCH (p)-[:OWNER_PARTICIPATES_IN]->(c)，relationship store 跳過所有非 owner 的紀錄
還是有 isPhoneOwner property 留作向後相容檢查

重複資料的代價 vs 查詢速度：選後者。

三、Index 類型：不是只有 RANGE

Neo4j 5.x 提供 4 種 index：

Index 類型	用途	對應 query operator
RANGE (預設)	等於、範圍查詢	`=`, `<`, `>`, `STARTS WITH`
TEXT	純文字搜尋	`CONTAINS`, `ENDS WITH`
POINT	地理座標	`point.distance`, `point.withinBBox`
FULLTEXT	Lucene 全文索引	`db.index.fulltext.queryNodes`

實戰常用的：

-- 1. UNIQUE constraint（自動建 RANGE index）
CREATE CONSTRAINT party_pg_id IF NOT EXISTS
  FOR (p:Party) REQUIRE p.pg_id IS UNIQUE;

-- 2. 複合 index（多欄位等值查詢）
CREATE INDEX party_identifier_source IF NOT EXISTS
  FOR (p:Party) ON (p.identifier, p.source);

-- 3. 時間範圍 index（Call 按時間排序）
CREATE INDEX call_timestamp IF NOT EXISTS
  FOR (c:Call) ON (c.timestamp);

-- 4. ⭐ Relationship property index（Neo4j 4+ 才有）
CREATE INDEX rel_participates_is_owner IF NOT EXISTS
  FOR ()-[r:PARTICIPATES_IN]-() ON (r.isPhoneOwner);

第 4 個是雙刃劍：對「找 owner-only」這種 query 加速很顯著，但每筆 relationship insert 都要更新 index、寫入慢 20-30%。實戰是：寫入時拉到 batch level、查詢時依靠這個 index 過濾。

3.1 為什麼 UNIQUE constraint 比 plain index 重要

CREATE CONSTRAINT ... REQUIRE p.pg_id IS UNIQUE 不只建 index，還在儲存層 enforce 唯一性。MERGE (p:Party {pg_id: $id}) 會用這個 constraint 做 atomic upsert — 沒它的話高並發 insert 會出現重複節點。

實戰上每次設計新 node label 第一件事就是定 unique constraint。

四、跨案碰撞：5 段 path 的 Cypher

業務需求：給一批證物 ID，找出「兩個人都是其中某些證物的 owner，而且兩人有過互動」。

關係式 SQL 寫法：5-6 個 JOIN + EXISTS 子查詢，計畫複雜難維護。

Cypher 寫法：

MATCH (owner:Party)
WHERE owner.pg_id IN $targetIds                        -- 起點：指定 party

// PART 1: 從 owner A 出發，找他擁有的互動 (使用 OWNER 關係)
MATCH (owner)-[:OWNER_PARTICIPATES_IN | OWNER_PARTICIPATES_IN_CALL]
      ->(interaction)
      -[:FROM_EXHIBIT]->(homeExhibit:Exhibit)
      -[:BELONGS_TO_CASE]->(caseA:Case)

// PART 2: 找出該互動的另一端參與者 (B)
MATCH (interaction)<-[:PARTICIPATES_IN | PARTICIPATES_IN_CALL]
      -(otherOwner:Party)
WHERE elementId(owner) <> elementId(otherOwner)        -- 不是 owner 自己

// PART 3: 驗證 B 是不是某個其他證物的 owner
MATCH (otherOwner)-[:OWNER_PARTICIPATES_IN | OWNER_PARTICIPATES_IN_CALL]
      ->(interaction2)
      -[:FROM_EXHIBIT]->(otherHomeExhibit:Exhibit)
      -[:BELONGS_TO_CASE]->(caseB:Case)

WHERE homeExhibit <> otherHomeExhibit                  -- 必須跨證物

// PART 4: 聚合
WITH owner, otherOwner, caseA, caseB,
     count(DISTINCT interaction) as interactionCount,
     collect(DISTINCT interaction.source)[0..3] as sources

// PART 5: 寫入碰撞關係（雙向，無箭頭）
MERGE (owner)-[rel:COLLIDED_WITH]-(otherOwner)
ON CREATE SET
  rel.first_seen = datetime(),
  rel.count = interactionCount,
  rel.is_new = true
ON MATCH SET
  rel.last_seen = datetime(),
  rel.count = interactionCount,
  rel.is_new = false

RETURN owner, otherOwner, interactionCount, sources
ORDER BY interactionCount DESC LIMIT 20

4.1 為什麼這條 query 在 graph DB 上快

整條 query 的物理 traversal：

Party 起點：用 pg_id UNIQUE constraint 直接跳到節點（O(1)）
從 owner 出發找 interaction：跟著 OWNER_PARTICIPATES_IN 雙向連結串列走（O(degree)）
interaction → exhibit：1 個 FROM_EXHIBIT relationship，O(1)
exhibit → case：1 個 BELONGS_TO_CASE，O(1)
反向找 otherOwner：跟著 PARTICIPATES_IN 串列走（O(degree)）
確認 otherOwner 是 owner：再走一次 OWNER_PARTICIPATES_IN（O(degree)）

完全沒有 index lookup（除了起點）。每個 hop 都是指標跳躍。整體複雜度 O(d^4)，d 通常很小（手機 owner 平均聯絡人 ~100，過濾後更少）。

同樣 query 在 PG：5 個 join + UNION + GROUP BY，optimizer 一定會挑錯路徑，跑 30 分鐘 timeout 是常態。

4.2 雙向關係 `(owner)-[:COLLIDED_WITH]-(otherOwner)`

注意 MERGE 和 MATCH 都用無箭頭 - 而不是 -[]->。這代表「無向關係」 — 對「碰撞」這種對稱概念，A 碰 B 等於 B 碰 A，不需要存兩份。

Cypher 解析時會把無向 pattern 轉成「向量化」邏輯：實體上仍是有向（每個 relationship 還是有 start/end），但 query 兩個方向都會 match。

五、中間人 (middleman) 路徑分析

另一條業務需求：兩個 owner 不直接通話，但都跟某個第三人有交集 → 找出潛在的「中間人」。

MATCH path = (ownerA)-[r1:PARTICIPATES_IN]->(c1:Conversation)
              <-[r2:PARTICIPATES_IN]-(middleman:Party)
              -[r3:PARTICIPATES_IN]->(c2:Conversation)
              <-[r4:PARTICIPATES_IN]-(ownerB)
WHERE c1 <> c2
  AND middleman <> ownerA
  AND middleman <> ownerB

這個 pattern 是 4 個 hop 的 BFS：

ownerA → c1 → middleman → c2 → ownerB
middleman 是「同時參與 c1 和 c2」的人，c1 ownerA 在、c2 ownerB 在

Neo4j 走這條 query 大概需要：

ownerA 的所有 conversation：~50
對每個 conversation 的所有 participant：~10
對每個 middleman 的所有 conversation：~50
看 ownerB 是否在裡面：O(1) 用 PK lookup

總共 50 × 10 × 50 = 25000 次指標跳躍。每次 ~微秒級，整體百毫秒內完成。

PG 上同樣 query 是 4 個 self-join + WHERE NOT EQUAL，optimizer 算不出來。

5.1 Cypher 模式語言的力量

Cypher 把 SQL 裡需要 join + alias + EXISTS 子查詢的東西用 ASCII art 表達：

-- SQL: 找 alice 的朋友的朋友（不含 alice 自己）
SELECT DISTINCT f2.name FROM friend f1
  JOIN friend f2 ON f1.target = f2.source
  WHERE f1.source = 'alice'
    AND f2.target <> 'alice';

-- Cypher:
MATCH (alice {name: 'alice'})-[:KNOWS]->(:Person)-[:KNOWS]->(friend:Person)
WHERE friend <> alice
RETURN DISTINCT friend.name

可讀性差距巨大，且 graph DB 不需要寫 join 細節，traversal 路徑是宣告式的。

六、學到的事

Index-free adjacency 是 graph DB 對 RDBMS 的根本優勢 — 每個 hop 是指標跳躍而非 index lookup。但這也意味著 Neo4j 不適合做大量資料的 aggregate query（沒有 column store 優勢）。
節點數量大不一定慢，但 degree 大會慢 — Neo4j 的 traversal 成本是 O(d^k)，d 是節點 degree。「超級節點」（連幾十萬個 relationship）是設計上要避免的反模式。
Relationship type 拆分（vs 用 property filter）是常見優化 — OWNER_PARTICIPATES_IN vs PARTICIPATES_IN 拆開，用 type 過濾比 property 過濾快很多。
UNIQUE constraint 不只是約束，是 atomic upsert 的基石 — MERGE 沒有 unique constraint 會在高並發下產生重複節點。
Cypher 用 ASCII art 表達 graph pattern — (a)-[:KNOWS]->(b) 比同義 SQL 短一個量級，可讀性遠好。
無向關係 (a)-[]-(b) 是對稱關係（朋友、碰撞）的正確建模 — 不要存雙向兩條 relationship。
WITH 是 Cypher 的 pipeline boundary — 跟 SQL 的 CTE 類似，用來限制中間結果或做 aggregation。

七、演算法深入：Cypher 是怎麼被執行的

Index-free adjacency 是儲存層的優勢，但實際 query 性能還取決於 query planner 和執行引擎。這一節補演算法層。

7.1 Cypher → Operator Tree

Cypher query 會被編譯成 operator tree（類似 SQL 的 physical plan）。例如：

MATCH (a:Party {identifier: 'alice'})-[:KNOWS]->(b:Party)
WHERE b.age > 30
RETURN b.name

Planner 輸出大概是：

graph TD
    PR["ProduceResults
(b.name)"]
    F["Filter
(b.age > 30)"]
    EX["Expand(All)
沿 :KNOWS 展開到 b"]
    NI["NodeIndexSeek
(a:Party, identifier='alice')
起點"]

    PR --> F
    F --> EX
    EX --> NI

    style NI fill:#fef3c7,stroke:#92400e,stroke-width:2px

執行時是反向 pull-based — ProduceResults 向下 pull 一筆，由 NodeIndexSeek 從起點往上推。

每個 operator 是 pull-based iterator：ProduceResults 向 Filter 要下一筆，Filter 向 Expand 要，以此類推。這個設計來自 Volcano model（Graefe 1993）— 跟 PostgreSQL 執行引擎同族。

7.2 關鍵 Operator

Operator	成本	使用時機
NodeByLabelScan	O(N for label)	沒 index 或 match label 沒 filter
NodeIndexSeek	O(log N)	有 property 等值 filter + index
NodeIndexRangeScan	O(log N + 結果數)	range query (timestamp between …)
Expand(All)	O(degree) per source	從起點沿 relationship 展開
Expand(Into)	O(degree)	確認兩個已知節點之間有特定關係
DirectedRelationshipIndexSeek	O(log N)	5.x 新增：relationship property index

Expand(All) 的實作就是 index-free adjacency 的直接應用：

for source in lhs:
  rel_id = source.next_rel_id          # O(1) 從 node record 拿
  while rel_id != null:
    rel = load_relationship(rel_id)    # O(1) 算 offset
    if rel.type == REL_TYPE:
      emit(source, rel, rel.other_end)
    rel_id = rel.next_rel_id           # O(1) 雙向連結

整個迴圈沒有任何 index lookup — 這就是為什麼 graph traversal 的每 hop 成本是 O(d) 不是 O(log n)。

7.3 Cost-Based Optimizer：選起點很關鍵

給一條 (a:Person)-[:KNOWS]->(b:Person)-[:WORKS_AT]->(c:Company)，planner 要決定從哪端開始 expand：

從 a 出發：先 scan Person → expand KNOWS → expand WORKS_AT
從 c 出發：先 scan Company → expand reverse WORKS_AT → reverse KNOWS

Planner 用 statistics 估每個 operator 的 row count：

Person 總數 = 1,000,000，avg(KNOWS) degree = 100
Company 總數 = 5,000，avg(WORKS_AT) degree = 50

從 a 出發：1M × 100 × 50 = 50 億 rows 從 c 出發：5K × 50 × 100 = 2500 萬 rows

Planner 選 c 優先 — 從 selectivity 高的端點開始。

實戰 debug 查詢慢：EXPLAIN 看 operator tree、PROFILE 跑完看實際 row count，跟 estimate 差太多就用 CALL db.stats.retrieve('GRAPH COUNTS') 檢查統計是否過期（需要 CALL db.resampleIndex() 重算）。

7.4 shortestPath：雙向 BFS（Bidirectional Search）

找兩點最短路徑，單向 BFS 複雜度 O(d^k) — k 步距離會爆炸。

Neo4j 的 shortestPath((a)-[*]->(b)) 用 bidirectional BFS：

從 a 往前走 BFS（一層 d 個節點）
同時從 b 往回走 BFS（一層 d 個節點）
當兩邊 frontier 相交 → 找到最短路徑

複雜度從 O(d^k) 降到 O(d^(k/2)) — 對 k=6、d=100 是 10^12 vs 10^6，六個數量級的差。

實作細節：交替擴展兩個方向（每次選當下 frontier 較小的那邊擴），確保 worst-case balanced。

7.5 Pattern Matching：Subgraph Isomorphism 的處理

複雜 pattern (a)-[r1]->(b)-[r2]->(c)<-[r3]-(a)（環）的執行本質是 subgraph isomorphism problem — 一般情況 NP-hard。

Neo4j 不直接跑通用 VF2 / Ullmann 演算法（太貴），而是：

Pattern 先分解成 tree：選一個 relationship 當「閉合邊」把環打破
Tree pattern 用 Expand 鏈式執行（多項式時間）
閉合邊最後用 Expand(Into) 驗證：兩個已知節點之間是否有特定 relationship

起點 pinning（有 indexed property 或 label scan 能縮小候選集）+ Expand 限制讓實際工作量遠低於通用 solver。

7.6 Page Cache：固定長度 record 的 cache-line 優勢

Neo4j 儲存層用固定長度 record：

Node record: 15 bytes
Relationship record: 34 bytes

8 KB page 可以裝 ~546 個 node record 或 ~240 個 relationship record。

對 CPU cache 非常友善：

一次 disk I/O 拉百個節點進 RAM
traversal 時 sequential access pattern → CPU L1/L2 cache hit rate 高
對比 RDBMS 變長 row + TOAST，同樣資料量 cache line 填滿率差很多

實戰配置：Neo4j page cache 設為「資料集大小」或稍大，全圖在 RAM 是理想狀態。對百 GB 圖、128 GB RAM 機器，查詢延遲進入微秒級。

7.7 用 PROFILE 驗證 index 是否被用到

PROFILE
MATCH (p:Party {identifier: 'alice'})-[:KNOWS]->(f)
RETURN f.name

輸出會顯示：

+----------------+----------+---------+------------+
| Operator       | Rows     | DB Hits | Page Hits  |
+----------------+----------+---------+------------+
| ProduceResults |        5 |       0 |          0 |
| Expand(All)    |        5 |       6 |          0 |
| NodeIndexSeek  |        1 |       2 |          1 |
+----------------+----------+---------+------------+

NodeIndexSeek 出現 → index 有用到
若看到 NodeByLabelScan + Filter 對 indexed property → index 沒用到，要查 SHOW INDEXES 看 index 是否 ONLINE、或 planner 決定掃 label 成本更低（通常 label 節點少時會這樣選）

7.8 一點 trade-off：為什麼 Neo4j 對 aggregate 慢

Index-free adjacency 對 traversal 快，但對「count all People」這種 aggregation 沒有優勢 — 一樣要 sequential scan 所有 Person node。沒有 column store（所有屬性存在 property store 分離的位置），要 sum(salary) 得把每個 Person node 的 property chunk 載進來。

實戰上大量 aggregate 放 PG 或 ClickHouse，graph 只處理「要跨節點 traversal」的查詢。

八、什麼時候不該用 Neo4j

graph DB 不是萬能。實戰上以下場景塞 Neo4j 是錯選擇：

大量數值聚合：sum / avg over 1000 萬 rows → 用 PG / ClickHouse / DuckDB
全文搜尋：用 Elasticsearch
時序資料：用 TimescaleDB / InfluxDB
單表簡單 CRUD：PG 就夠了，graph DB 沒優勢且寫入慢

實戰一個健康的系統會把 graph DB 限定在「多跳關係 traversal / path finding」場景，其他資料還是放 PG。資料同步透過 worker 把 PG → Neo4j 的衍生資料推過去。

參考資料

Neo4j — Graph Database vs RDBMS — 官方對比文，含複雜度分析
Neo4j Internals — Storage Engine — Native graph storage 的詳細運作機制
Neo4j Query Planner Manual — operator tree / EXPLAIN vs PROFILE 完整指南
Volcano-style iterator model (Graefe 1993) — Neo4j 執行引擎的理論基礎
Bidirectional search (Wikipedia) — shortestPath 底層演算法
VF2 subgraph isomorphism (Cordella et al., 2004) — 通用 subgraph matching 演算法（Neo4j 不直接用但相關）
Index-Free Adjacency 概念解說 — 為什麼 graph DB 對多跳查詢快
Cypher Manual — Cypher 語法與 best practices
Neo4j 5.x Index types — RANGE / TEXT / POINT / FULLTEXT
Awesome Procedures On Cypher (APOC) — 圖演算法擴充庫，含 PageRank / Community Detection
Designing Data-Intensive Applications, Ch. 2 — Martin Kleppmann 寫的 graph data model 經典論述