Skip to content

test: comprehensive KNN spatial join test suite#798

Open
pierre-warnier wants to merge 4 commits intoduckdb:v1.5-variegatafrom
pierre-warnier:pr/knn-3-tests
Open

test: comprehensive KNN spatial join test suite#798
pierre-warnier wants to merge 4 commits intoduckdb:v1.5-variegatafrom
pierre-warnier:pr/knn-3-tests

Conversation

@pierre-warnier
Copy link
Copy Markdown

Adds SQL logic tests for the KNN spatial join operator introduced in #797.

Depends on: #797 (ST_KNN operator). Full stack: #795#796#797 → this PR.

Files

test/sql/join/spatial_knn_join.test                    | 264 ++  (fast)
test/sql/join/spatial_knn_join_edge.test               | 259 ++  (fast)
test/sql/join/spatial_knn_join_correctness.test_slow   | 100 ++  (slow)
test/sql/join/spatial_knn_join_threads.test_slow       |  61 ++  (slow)
4 files changed, 684 insertions(+)

115 assertions across 4 test cases.

Coverage

spatial_knn_join.test — functional correctness

  • Plan verification: EXPLAIN shows SPATIAL_KNN_JOIN operator
  • count(*) correctness for k ∈ {1, 3, 5, 10, 100}
  • Result cardinality: |probe| * k for INNER, |probe| * max(k, 1) for LEFT
  • LEFT JOIN with no matches → NULL right side
  • NULL / empty geometry handling on both sides
  • Real-world NYC coordinates (Manhattan POIs + boroughs)
  • Column projection and expression evaluation in probe output

spatial_knn_join_edge.test — edge cases

  • Empty build side (INNER → empty, LEFT → NULLs)
  • Empty probe side → empty result
  • k = 1 single row in build
  • k > |build| → capped result
  • Duplicate geometries on the build side → deterministic selection
  • Self-join patterns
  • Point-coincident geometries (zero-distance k-NN)
  • Single-point tables (degenerate R-tree with 1 leaf)

spatial_knn_join_correctness.test_slow — exact distance verification

  • 1K × 1K random points, compare against brute-force ORDER BY ST_Distance LIMIT k
  • Verifies the k returned neighbors match the k smallest distances
  • Boundary conditions: points exactly on R-tree node splits

spatial_knn_join_threads.test_slow — parallelism correctness

  • Runs with threads = 1, 2, 4, 8
  • Deterministic row counts across thread counts
  • Deterministic result content (sorted comparison)
  • Catches races in the global FlatRTree read during probe

What's NOT tested here

All deferred to follow-up PRs when those features land.

Verification

$ build/release/test/unittest --test-dir . "test/sql/join/spatial_knn*"
All tests passed (115 assertions in 4 test cases)

$ build/release/test/unittest --test-dir . "test/sql/join/*"
All tests passed (248 assertions in 14 test cases)

Existing 133 spatial join assertions untouched.

The Copy() method returns a unique_ptr<BindData> which must be
converted to unique_ptr<FunctionData>. Under GCC 11+ this
conversion requires explicit std::move() — newer compilers
reject the implicit conversion as a copy-initialization.

Fixes: error: could not convert 'copy' from
'unique_ptr<ST_Distance_Sphere::BindData>' to 'unique_ptr<FunctionData>'
Copilot AI review requested due to automatic review settings April 15, 2026 09:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a KNN (k-nearest-neighbor) spatial join to the spatial extension, wires it into the optimizer/planner, and introduces a comprehensive SQL test suite to validate correctness, edge cases, and multi-threaded determinism.

Changes:

  • Adds ST_KNN(geom1, geom2, k) as a join-marker scalar function and optimizer rewrite to a dedicated SPATIAL_KNN_JOIN operator.
  • Introduces logical + physical KNN join operators (building a FlatRTree, probing via best-first KNN traversal, and refining by exact distance).
  • Adds SQL tests covering functional behavior, edge cases, brute-force correctness comparisons, and thread-count stability.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
test/sql/join/spatial_knn_join.test Functional correctness coverage for KNN join planning + basic behavior
test/sql/join/spatial_knn_join_edge.test Edge case coverage (empty sides, NULLs, k bounds, degeneracies)
test/sql/join/spatial_knn_join_correctness.test_slow Brute-force distance multiset verification on a larger dataset
test/sql/join/spatial_knn_join_threads.test_slow Multithreaded stability checks against brute-force baseline
src/spatial/modules/main/spatial_functions_scalar.cpp Registers ST_KNN and exposes bind-data extraction helper
src/spatial/operators/spatial_join_optimizer.cpp Detects ST_KNN predicates and rewrites to LogicalSpatialKNNJoin
src/spatial/operators/spatial_knn_join_logical.{hpp,cpp} Logical operator for KNN join + serialization/planning
src/spatial/operators/spatial_knn_join_physical.{hpp,cpp} Physical operator implementing build/probe/refine/emit
src/spatial/operators/flat_rtree.hpp Shared FlatRTree + KNN traversal state/implementation
src/spatial/operators/spatial_operator_extension.cpp Operator extension deserialization support for KNN join
src/spatial/geometry/bbox.hpp Adds MinDistanceSquared helpers used by KNN traversal
src/spatial/util/knn_extract.hpp Declares helper to extract constant k from bind data
src/spatial/operators/CMakeLists.txt Adds new operator sources to build

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/spatial/operators/spatial_join_optimizer.cpp
Comment thread src/spatial/operators/spatial_knn_join_physical.cpp
Comment thread src/spatial/operators/spatial_knn_join_physical.cpp
Comment thread src/spatial/modules/main/spatial_functions_scalar.cpp
Comment thread src/spatial/operators/spatial_operator_extension.cpp Outdated
Comment thread test/sql/join/spatial_knn_join_threads.test_slow
Comment thread test/sql/join/spatial_knn_join_threads.test_slow Outdated
Move FlatRTree from anonymous namespace in spatial_join_physical.cpp
to a shared header (flat_rtree.hpp) so it can be reused by the
upcoming KNN join operator.

Changes:
- Extract FlatRTree class to src/spatial/operators/flat_rtree.hpp
- Add KNNSearch method using Hjaltason-Samet priority queue traversal
- Add FlatRTreeKNNState for KNN search state management
- Add Box2D::MinDistanceSquared for KNN distance lower bound
- Use Allocator instead of BufferManager (in-memory only)
- No functional changes to existing SPATIAL_JOIN behavior
Adds a K-nearest neighbor spatial join to duckdb-spatial:

- ST_KNN(geom1, geom2, k) — scalar stub recognized by the optimizer
- Optimizer rewrites JOIN ON ST_KNN(...) into SPATIAL_KNN_JOIN
- Hjaltason-Samet priority queue KNN on FlatRTree
- Exact distance refinement with adaptive overfetch (2x, retry 8x)
- INNER and LEFT JOIN support
- JOO child swap handling via build_child_idx

Stripped for review (follow-up PRs):
- No out-of-core / spill-to-disk (tree always in memory)
- No spheroid support (planar distance only)
- No partitioning (single R-tree)

The FlatRTree is compact (~24 bytes/entry), so in-memory handles
up to ~100M entries (2.4GB tree) on commodity hardware.
4 test files, 115 assertions:
- spatial_knn_join.test — basic functionality, plan verification, k variants,
  LEFT JOIN, NULL handling, real-world NYC coordinates
- spatial_knn_join_edge.test — edge cases: empty tables, single row,
  k > table size, duplicate geometries, self-join
- spatial_knn_join_correctness.test_slow — exact distance verification,
  result ordering, boundary conditions
- spatial_knn_join_threads.test_slow — multi-threaded correctness
@pierre-warnier
Copy link
Copy Markdown
Author

Addressed in the amended commit (bba96e6):

  1. Thread test 10K × 10K brute-force baseline (100M comparisons) — reduced to 900 × 900 (810K comparisons, ~100ms). Still exercises multi-threaded KNN (PRAGMA threads = 1 vs = 8) with identical-output verification.

  2. PRAGMA threads = 8 / # Reset to default mismatch — removed the redundant reset block.

  3. Error message substring test — kept as-is. "ST_KNN: k must be >= 1" is a valid prefix match for the more informative "ST_KNN: k must be >= 1, got %d" error we throw. sqllogictest uses substring matching so this works and preserves the actual k value in the user-facing message.

All 247 join test assertions pass (133 existing + 114 new KNN).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants