Skip to content

feat: native bulk ingest paths across adapters#540

Merged
cofin merged 28 commits into
mainfrom
feat/native-bulk-ingest
Jun 25, 2026
Merged

feat: native bulk ingest paths across adapters#540
cofin merged 28 commits into
mainfrom
feat/native-bulk-ingest

Conversation

@cofin

@cofin cofin commented Jun 25, 2026

Copy link
Copy Markdown
Member

Summary

Expose native bulk-ingest fast paths across every adapter behind the existing storage-bridge surface (load_from_arrow / load_from_storage / new load_from_records) — no new database-specific public APIs. High-volume writes use each driver's primitive (COPY, LOAD DATA, direct path load, BulkCopy, Arrow ingest, load jobs, mutations) instead of generic row-by-row execution. Builds on the cloud job session controls already on main.

Each change is landed test-first (unit + integration) and is independently revertable.

What's included

PostgreSQL / SQLite already-native paths — covered by a new supports_native_bulk_ingest contract family (row-count fidelity, overwrite, append) for asyncpg, psycopg (sync/async), psqlpy, adbc, duckdb.

MySQL family — opt-in LOAD DATA LOCAL INFILE (pymysql, asyncmy, aiomysql, mysql-connector sync+async) via enable_local_infile_bulk_load, gated on the connection's local-infile setting (raises at config construction otherwise).

Oracle — per-call oracle_batch_errors / oracle_array_dml_row_counts on execute_many (surfaced on SQLResult.metadata); opt-in enable_direct_path_load (Thin mode) for load_from_arrow.

SQLite / aiosqliteload_from_arrow wrapped in one BEGIN IMMEDIATE transaction when the driver owns it (atomic ingest; caller transactions pass through).

BigQueryload_from_storage honors Parquet/JSON/JSONL/CSV/Avro/ORC; opt-in enable_storage_write_api Arrow Storage Write API transport for load_from_arrow (Parquet fallback).

Spannerload_from_arrow uses chunked Transaction.insert_or_update mutations (idempotent upsert); opt-in enable_batch_write_api routes through Database.mutation_groups().batch_write() for high-throughput groups.

SQL Server (mssql-python)load_from_arrow / load_from_storage wired to cursor.bulkcopy() (sync + async); BulkCopy stats regression-locked.

arrow-odbcload_from_arrow / load_from_storage wired to bulk_insert_arrow.

Cross-adapter — new generic load_from_records(table, records, *, columns=None, overwrite=False) on the sync/async base classes: dict or positional records are normalized and routed through each adapter's native load_from_arrow, so every Arrow-capable adapter (now including mssql-python and arrow-odbc) supports it. Invalid input raises ImproperConfigurationError. Covered by a supports_load_from_records contract family.

Docsdocs/usage/bulk_ingest.rst capability matrix + opt-in/security gates + examples.

Notes for review

  • The BigQuery Storage Write API and Spanner Batch Write API paths are unit-tested for call orchestration and fallbacks; the BigQuery Storage Write API needs validation against a live BigQuery instance (the emulator does not implement it), and the Spanner Batch Write integration test skips if the emulator returns UNIMPLEMENTED.

Testing

All new unit tests pass locally; SQLite/aiosqlite atomicity, load_from_records, and the duckdb/adbc-duckdb contract cases run containerless and pass. Docs build clean under sphinx-build -W. make type-check (mypy + pyright) is clean on the changed modules. Container/cloud integration tests run in CI.

@codecov-commenter

codecov-commenter commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.92857% with 108 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.54%. Comparing base (f94fa6c) to head (2abbfa4).

Files with missing lines Patch % Lines
sqlspec/adapters/bigquery/driver.py 69.15% 24 Missing and 9 partials ⚠️
sqlspec/adapters/oracledb/driver.py 76.11% 9 Missing and 7 partials ⚠️
sqlspec/adapters/bigquery/core.py 80.85% 8 Missing and 1 partial ⚠️
sqlspec/adapters/mysqlconnector/driver.py 82.60% 4 Missing and 4 partials ⚠️
sqlspec/adapters/spanner/driver.py 83.33% 4 Missing and 3 partials ⚠️
sqlspec/adapters/mssql_python/driver.py 84.61% 4 Missing and 2 partials ⚠️
sqlspec/utils/arrow_helpers.py 84.61% 5 Missing and 1 partial ⚠️
sqlspec/utils/text.py 88.00% 2 Missing and 4 partials ⚠️
sqlspec/adapters/aiomysql/driver.py 83.33% 2 Missing and 2 partials ⚠️
sqlspec/adapters/pymysql/driver.py 83.33% 2 Missing and 2 partials ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #540      +/-   ##
==========================================
+ Coverage   75.35%   75.54%   +0.19%     
==========================================
  Files         444      444              
  Lines       58256    58791     +535     
  Branches     9030     9132     +102     
==========================================
+ Hits        43896    44412     +516     
+ Misses      11514    11506       -8     
- Partials     2846     2873      +27     
Flag Coverage Δ
integration 59.34% <47.17%> (-0.06%) ⬇️
py3.10 73.91% <83.92%> (+0.21%) ⬆️
py3.11 73.93% <83.92%> (+0.21%) ⬆️
py3.12 73.92% <83.92%> (+0.21%) ⬆️
py3.13 73.92% <83.92%> (+0.21%) ⬆️
py3.14 74.74% <83.78%> (+0.21%) ⬆️
unit 61.41% <82.73%> (+0.45%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sqlspec/adapters/aiomysql/config.py 91.47% <100.00%> (+0.24%) ⬆️
sqlspec/adapters/aiomysql/core.py 74.13% <100.00%> (+1.71%) ⬆️
sqlspec/adapters/aiosqlite/core.py 90.32% <100.00%> (+1.01%) ⬆️
sqlspec/adapters/asyncmy/config.py 89.10% <100.00%> (+0.16%) ⬆️
sqlspec/adapters/asyncmy/core.py 72.05% <100.00%> (ø)
sqlspec/adapters/asyncmy/driver.py 79.86% <100.00%> (+0.13%) ⬆️
sqlspec/adapters/bigquery/config.py 87.67% <100.00%> (+0.08%) ⬆️
sqlspec/adapters/mssql_python/config.py 81.62% <100.00%> (ø)
sqlspec/adapters/mssql_python/events/store.py 85.18% <100.00%> (+7.63%) ⬆️
sqlspec/adapters/mssql_python/migrations.py 72.97% <100.00%> (+3.52%) ⬆️
... and 26 more

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@cofin cofin marked this pull request as ready for review June 25, 2026 17:23
cofin added 15 commits June 25, 2026 12:24
Add enable_local_infile_bulk_load driver feature that routes
load_from_arrow through LOAD DATA LOCAL INFILE when local_infile=True.
TSV encoding helpers duplicated in pymysql/core.py; config raises
ImproperConfigurationError at construction when the gate is unmet.
Add enable_local_infile_bulk_load driver feature routing load_from_arrow
through LOAD DATA LOCAL INFILE; consumes the existing allow_local_infile
gate (effective local_infile must be True). TSV helpers duplicated in
asyncmy/core.py; config raises ImproperConfigurationError when unmet.
Add enable_local_infile_bulk_load driver feature routing load_from_arrow
through LOAD DATA LOCAL INFILE; gate requires enable_local_infile=True
(or local_infile=True). TSV helpers duplicated in aiomysql/core.py.
Add enable_local_infile_bulk_load driver feature to both sync and async
drivers, routing load_from_arrow through LOAD DATA LOCAL INFILE. Gate
requires allow_local_infile=True; both config classes raise
ImproperConfigurationError when unmet. TSV helpers duplicated in core.py.
execute_many now reads oracle_batch_errors / oracle_array_dml_row_counts
from statement_config.execution_args (asyncpg COPY precedent). Errors
surface on SQLResult.metadata[oracle_batch_errors]; array DML counts on
[oracle_dml_row_counts] with rowcount_override = sum(counts). Default
path passes batcherrors=False/arraydmlrowcounts=False (unchanged).
Add enable_direct_path_load driver feature routing load_from_arrow
through Connection.direct_path_load (Thin-mode only). Runtime guard
(hasattr + thin) falls back to executemany for Thick mode or older
drivers. Schema resolved from qualified table or connection.username.
…ction

When the driver owns the transaction (connection not already in one),
overwrite-DELETE and executemany run inside one BEGIN IMMEDIATE -> commit,
with rollback on sqlite3.Error. Caller-managed transactions are left
untouched (no commit/rollback). Makes Arrow ingest atomic.
…nsaction

Async mirror of the sqlite transaction wrap: overwrite-DELETE and
executemany run inside one BEGIN IMMEDIATE -> commit when the driver owns
the transaction, with rollback on aiosqlite/sqlite3 errors. Caller-owned
transactions are passed through untouched.
Lock _coerce_bulk_copy_result stat passthrough (incl. unknown future
keys) and rowcount fallback, and assert bulk_copy() defaults match
upstream cursor.bulkcopy() for both sync and async wrappers. Tests only;
no production change (behavior already shipped).
Extend StorageFormat with avro/orc, map csv/avro/orc in
_map_bigquery_source_format, and drop the stale parquet-only guard in
load_from_storage so the generic storage bridge routes every BigQuery
load_table_from_uri source format. arrow-ipc still raises. No new public
surface.
…tations

Replace the per-row INSERT...VALUES batch_update construction in
load_from_arrow with chunked Transaction.insert_or_update mutations
(<=20k cells per flush). Upsert semantics make re-runs idempotent; no
database handle, no new public surface. overwrite still DELETEs first.
Update _SpannerWriteProtocol.
Add supports_native_bulk_ingest DriverCase flag (True for the six
native-COPY/Arrow adapters: asyncpg, psycopg sync/async, psqlpy, adbc,
duckdb) plus paired sync/async behavior asserts covering row-count
fidelity, overwrite, and append. Non-flagged adapters skip.

Error-surfacing is intentionally not asserted: duckdb/adbc propagate raw
driver exceptions rather than SQLSpecError, so a portable assertion is
not possible without driver-side exception mapping (out of scope here).
…m_arrow

Add enable_storage_write_api driver feature. When enabled (and not
overwrite), load_from_arrow ingests via a PENDING BigQuery Storage write
stream using native arrow_rows (no Parquet round-trip): create stream ->
append serialized Arrow batches -> finalize -> batch-commit. Falls back
to the Parquet load job on ImportError; overwrite always uses Parquet
WRITE_TRUNCATE. Default-off, so existing behavior is unchanged.

The streaming/serialization path is covered by unit tests for call
orchestration and fallbacks; it needs validation against a live BigQuery
instance (the emulator does not implement the Storage Write API).
Add load_from_records(table, records, *, columns=None, overwrite=False)
to the sync and async driver base classes. Records (dict or positional +
columns) are normalized via _records_to_arrow_table and routed through
each adapter's native load_from_arrow path (COPY, executemany, Spanner
mutations, etc.), so every adapter with load_from_arrow gains it for free
and mssql_python/arrow_odbc keep raising. Empty/mismatched input raises
ImproperConfigurationError.

Add supports_load_from_records contract coverage (dict + positional +
input validation).
New docs/usage/bulk_ingest.rst documents load_from_arrow /
load_from_storage / load_from_records, a per-adapter capability matrix,
the security/opt-in gates (MySQL local infile, Oracle direct path load,
BigQuery Storage Write API), and MySQL/Oracle examples. Wired into the
usage toctree with a seealso cross-link from the ETL page.
@cofin cofin force-pushed the feat/native-bulk-ingest branch from 491f90c to a4a6e8d Compare June 25, 2026 17:24
cofin added 12 commits June 25, 2026 17:36
_coerce_bulk_copy_result expects MssqlPythonRawCursor; cast the test's
_FakeCursor at the call sites so pyright (make type-check) passes.
Add load_from_arrow / load_from_storage overrides (sync + async) that
ingest Arrow data through cursor.bulkcopy(), and flip
supports_native_arrow_import. overwrite issues DELETE first. This also
lights up the generic load_from_records for SQL Server via the base
delegation path.
…rt_arrow

Wire arrow-odbc's bulk_insert_arrow into the storage-bridge
load_from_arrow / load_from_storage methods (overwrite issues DELETE
first). This also enables the generic load_from_records for arrow-odbc
through the base delegation path.
Add enable_batch_write_api driver feature. When set, load_from_arrow
routes through Database.mutation_groups().batch_write() (one mutation
group per <=20k-cell chunk) for high-throughput, independently committed
insert_or_update groups; the Database is reached from the connection's
session. Default path (single in-transaction insert_or_update) is
unchanged. Upsert semantics keep replay idempotent.

Docs: refresh the bulk-ingest matrix for the mssql/arrow_odbc
load_from_arrow wiring and the Spanner Batch Write opt-in.
BigQuery Storage Write API now splits AppendRowsRequest payloads to stay under the 20MB append limit, surfacing an oversized single row as a storage error, and the load-job source-format mapping is broadened. Spanner insert_or_update mutation batching is tidied. Records-to-Arrow normalization moves into the shared arrow/storage helpers so compiled drivers reuse it.

Adds unit coverage for the MySQL family LOAD DATA LOCAL INFILE and executemany paths (pymysql, aiomysql, asyncmy, mysql-connector), the Oracle async direct-path-load branch, the sqlite/aiosqlite overwrite and rollback transaction paths, and the BigQuery, Spanner, and load_from_records helpers.
@cofin cofin merged commit 77aafaa into main Jun 25, 2026
27 checks passed
@cofin cofin deleted the feat/native-bulk-ingest branch June 25, 2026 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants