test: comprehensive KNN spatial join test suite#798
test: comprehensive KNN spatial join test suite#798pierre-warnier wants to merge 4 commits intoduckdb:v1.5-variegatafrom
Conversation
The Copy() method returns a unique_ptr<BindData> which must be converted to unique_ptr<FunctionData>. Under GCC 11+ this conversion requires explicit std::move() — newer compilers reject the implicit conversion as a copy-initialization. Fixes: error: could not convert 'copy' from 'unique_ptr<ST_Distance_Sphere::BindData>' to 'unique_ptr<FunctionData>'
There was a problem hiding this comment.
Pull request overview
Adds a KNN (k-nearest-neighbor) spatial join to the spatial extension, wires it into the optimizer/planner, and introduces a comprehensive SQL test suite to validate correctness, edge cases, and multi-threaded determinism.
Changes:
- Adds
ST_KNN(geom1, geom2, k)as a join-marker scalar function and optimizer rewrite to a dedicatedSPATIAL_KNN_JOINoperator. - Introduces logical + physical KNN join operators (building a
FlatRTree, probing via best-first KNN traversal, and refining by exact distance). - Adds SQL tests covering functional behavior, edge cases, brute-force correctness comparisons, and thread-count stability.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| test/sql/join/spatial_knn_join.test | Functional correctness coverage for KNN join planning + basic behavior |
| test/sql/join/spatial_knn_join_edge.test | Edge case coverage (empty sides, NULLs, k bounds, degeneracies) |
| test/sql/join/spatial_knn_join_correctness.test_slow | Brute-force distance multiset verification on a larger dataset |
| test/sql/join/spatial_knn_join_threads.test_slow | Multithreaded stability checks against brute-force baseline |
| src/spatial/modules/main/spatial_functions_scalar.cpp | Registers ST_KNN and exposes bind-data extraction helper |
| src/spatial/operators/spatial_join_optimizer.cpp | Detects ST_KNN predicates and rewrites to LogicalSpatialKNNJoin |
| src/spatial/operators/spatial_knn_join_logical.{hpp,cpp} | Logical operator for KNN join + serialization/planning |
| src/spatial/operators/spatial_knn_join_physical.{hpp,cpp} | Physical operator implementing build/probe/refine/emit |
| src/spatial/operators/flat_rtree.hpp | Shared FlatRTree + KNN traversal state/implementation |
| src/spatial/operators/spatial_operator_extension.cpp | Operator extension deserialization support for KNN join |
| src/spatial/geometry/bbox.hpp | Adds MinDistanceSquared helpers used by KNN traversal |
| src/spatial/util/knn_extract.hpp | Declares helper to extract constant k from bind data |
| src/spatial/operators/CMakeLists.txt | Adds new operator sources to build |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Move FlatRTree from anonymous namespace in spatial_join_physical.cpp to a shared header (flat_rtree.hpp) so it can be reused by the upcoming KNN join operator. Changes: - Extract FlatRTree class to src/spatial/operators/flat_rtree.hpp - Add KNNSearch method using Hjaltason-Samet priority queue traversal - Add FlatRTreeKNNState for KNN search state management - Add Box2D::MinDistanceSquared for KNN distance lower bound - Use Allocator instead of BufferManager (in-memory only) - No functional changes to existing SPATIAL_JOIN behavior
Adds a K-nearest neighbor spatial join to duckdb-spatial: - ST_KNN(geom1, geom2, k) — scalar stub recognized by the optimizer - Optimizer rewrites JOIN ON ST_KNN(...) into SPATIAL_KNN_JOIN - Hjaltason-Samet priority queue KNN on FlatRTree - Exact distance refinement with adaptive overfetch (2x, retry 8x) - INNER and LEFT JOIN support - JOO child swap handling via build_child_idx Stripped for review (follow-up PRs): - No out-of-core / spill-to-disk (tree always in memory) - No spheroid support (planar distance only) - No partitioning (single R-tree) The FlatRTree is compact (~24 bytes/entry), so in-memory handles up to ~100M entries (2.4GB tree) on commodity hardware.
4 test files, 115 assertions: - spatial_knn_join.test — basic functionality, plan verification, k variants, LEFT JOIN, NULL handling, real-world NYC coordinates - spatial_knn_join_edge.test — edge cases: empty tables, single row, k > table size, duplicate geometries, self-join - spatial_knn_join_correctness.test_slow — exact distance verification, result ordering, boundary conditions - spatial_knn_join_threads.test_slow — multi-threaded correctness
edc296b to
bba96e6
Compare
|
Addressed in the amended commit (
All 247 join test assertions pass (133 existing + 114 new KNN). |
Adds SQL logic tests for the KNN spatial join operator introduced in #797.
Depends on: #797 (ST_KNN operator). Full stack: #795 → #796 → #797 → this PR.
Files
115 assertions across 4 test cases.
Coverage
spatial_knn_join.test— functional correctnessEXPLAINshowsSPATIAL_KNN_JOINoperatorcount(*)correctness fork ∈ {1, 3, 5, 10, 100}|probe| * kfor INNER,|probe| * max(k, 1)for LEFTLEFT JOINwith no matches → NULL right sidespatial_knn_join_edge.test— edge casesk = 1single row in buildk > |build|→ capped resultspatial_knn_join_correctness.test_slow— exact distance verificationORDER BY ST_Distance LIMIT kspatial_knn_join_threads.test_slow— parallelism correctnessthreads = 1, 2, 4, 8FlatRTreeread during probeWhat's NOT tested here
All deferred to follow-up PRs when those features land.
Verification
Existing 133 spatial join assertions untouched.