feat: native bulk ingest paths across adapters by cofin · Pull Request #540 · litestar-org/sqlspec

cofin · 2026-06-25T16:46:00Z

Summary

Expose native bulk-ingest fast paths across every adapter behind the existing storage-bridge surface (load_from_arrow / load_from_storage / new load_from_records) — no new database-specific public APIs. High-volume writes use each driver's primitive (COPY, LOAD DATA, direct path load, BulkCopy, Arrow ingest, load jobs, mutations) instead of generic row-by-row execution. Builds on the cloud job session controls already on main.

Each change is landed test-first (unit + integration) and is independently revertable.

What's included

PostgreSQL / SQLite already-native paths — covered by a new supports_native_bulk_ingest contract family (row-count fidelity, overwrite, append) for asyncpg, psycopg (sync/async), psqlpy, adbc, duckdb.

MySQL family — opt-in LOAD DATA LOCAL INFILE (pymysql, asyncmy, aiomysql, mysql-connector sync+async) via enable_local_infile_bulk_load, gated on the connection's local-infile setting (raises at config construction otherwise).

Oracle — per-call oracle_batch_errors / oracle_array_dml_row_counts on execute_many (surfaced on SQLResult.metadata); opt-in enable_direct_path_load (Thin mode) for load_from_arrow.

SQLite / aiosqlite — load_from_arrow wrapped in one BEGIN IMMEDIATE transaction when the driver owns it (atomic ingest; caller transactions pass through).

BigQuery — load_from_storage honors Parquet/JSON/JSONL/CSV/Avro/ORC; opt-in enable_storage_write_api Arrow Storage Write API transport for load_from_arrow (Parquet fallback).

Spanner — load_from_arrow uses chunked Transaction.insert_or_update mutations (idempotent upsert); opt-in enable_batch_write_api routes through Database.mutation_groups().batch_write() for high-throughput groups.

SQL Server (mssql-python) — load_from_arrow / load_from_storage wired to cursor.bulkcopy() (sync + async); BulkCopy stats regression-locked.

arrow-odbc — load_from_arrow / load_from_storage wired to bulk_insert_arrow.

Cross-adapter — new generic load_from_records(table, records, *, columns=None, overwrite=False) on the sync/async base classes: dict or positional records are normalized and routed through each adapter's native load_from_arrow, so every Arrow-capable adapter (now including mssql-python and arrow-odbc) supports it. Invalid input raises ImproperConfigurationError. Covered by a supports_load_from_records contract family.

Docs — docs/usage/bulk_ingest.rst capability matrix + opt-in/security gates + examples.

Notes for review

The BigQuery Storage Write API and Spanner Batch Write API paths are unit-tested for call orchestration and fallbacks; the BigQuery Storage Write API needs validation against a live BigQuery instance (the emulator does not implement it), and the Spanner Batch Write integration test skips if the emulator returns UNIMPLEMENTED.

Testing

All new unit tests pass locally; SQLite/aiosqlite atomicity, load_from_records, and the duckdb/adbc-duckdb contract cases run containerless and pass. Docs build clean under sphinx-build -W. make type-check (mypy + pyright) is clean on the changed modules. Container/cloud integration tests run in CI.

codecov-commenter · 2026-06-25T16:48:44Z

Codecov Report

❌ Patch coverage is 83.92857% with 108 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.54%. Comparing base (f94fa6c) to head (2abbfa4).

Files with missing lines	Patch %	Lines
sqlspec/adapters/bigquery/driver.py	69.15%	24 Missing and 9 partials ⚠️
sqlspec/adapters/oracledb/driver.py	76.11%	9 Missing and 7 partials ⚠️
sqlspec/adapters/bigquery/core.py	80.85%	8 Missing and 1 partial ⚠️
sqlspec/adapters/mysqlconnector/driver.py	82.60%	4 Missing and 4 partials ⚠️
sqlspec/adapters/spanner/driver.py	83.33%	4 Missing and 3 partials ⚠️
sqlspec/adapters/mssql_python/driver.py	84.61%	4 Missing and 2 partials ⚠️
sqlspec/utils/arrow_helpers.py	84.61%	5 Missing and 1 partial ⚠️
sqlspec/utils/text.py	88.00%	2 Missing and 4 partials ⚠️
sqlspec/adapters/aiomysql/driver.py	83.33%	2 Missing and 2 partials ⚠️
sqlspec/adapters/pymysql/driver.py	83.33%	2 Missing and 2 partials ⚠️
... and 4 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #540      +/-   ##
==========================================
+ Coverage   75.35%   75.54%   +0.19%     
==========================================
  Files         444      444              
  Lines       58256    58791     +535     
  Branches     9030     9132     +102     
==========================================
+ Hits        43896    44412     +516     
+ Misses      11514    11506       -8     
- Partials     2846     2873      +27

Flag	Coverage Δ
integration	`59.34% <47.17%> (-0.06%)`	⬇️
py3.10	`73.91% <83.92%> (+0.21%)`	⬆️
py3.11	`73.93% <83.92%> (+0.21%)`	⬆️
py3.12	`73.92% <83.92%> (+0.21%)`	⬆️
py3.13	`73.92% <83.92%> (+0.21%)`	⬆️
py3.14	`74.74% <83.78%> (+0.21%)`	⬆️
unit	`61.41% <82.73%> (+0.45%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
sqlspec/adapters/aiomysql/config.py	`91.47% <100.00%> (+0.24%)`	⬆️
sqlspec/adapters/aiomysql/core.py	`74.13% <100.00%> (+1.71%)`	⬆️
sqlspec/adapters/aiosqlite/core.py	`90.32% <100.00%> (+1.01%)`	⬆️
sqlspec/adapters/asyncmy/config.py	`89.10% <100.00%> (+0.16%)`	⬆️
sqlspec/adapters/asyncmy/core.py	`72.05% <100.00%> (ø)`
sqlspec/adapters/asyncmy/driver.py	`79.86% <100.00%> (+0.13%)`	⬆️
sqlspec/adapters/bigquery/config.py	`87.67% <100.00%> (+0.08%)`	⬆️
sqlspec/adapters/mssql_python/config.py	`81.62% <100.00%> (ø)`
sqlspec/adapters/mssql_python/events/store.py	`85.18% <100.00%> (+7.63%)`	⬆️
sqlspec/adapters/mssql_python/migrations.py	`72.97% <100.00%> (+3.52%)`	⬆️
... and 26 more

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Add enable_local_infile_bulk_load driver feature that routes load_from_arrow through LOAD DATA LOCAL INFILE when local_infile=True. TSV encoding helpers duplicated in pymysql/core.py; config raises ImproperConfigurationError at construction when the gate is unmet.

Add enable_local_infile_bulk_load driver feature routing load_from_arrow through LOAD DATA LOCAL INFILE; consumes the existing allow_local_infile gate (effective local_infile must be True). TSV helpers duplicated in asyncmy/core.py; config raises ImproperConfigurationError when unmet.

Add enable_local_infile_bulk_load driver feature routing load_from_arrow through LOAD DATA LOCAL INFILE; gate requires enable_local_infile=True (or local_infile=True). TSV helpers duplicated in aiomysql/core.py.

Add enable_local_infile_bulk_load driver feature to both sync and async drivers, routing load_from_arrow through LOAD DATA LOCAL INFILE. Gate requires allow_local_infile=True; both config classes raise ImproperConfigurationError when unmet. TSV helpers duplicated in core.py.

execute_many now reads oracle_batch_errors / oracle_array_dml_row_counts from statement_config.execution_args (asyncpg COPY precedent). Errors surface on SQLResult.metadata[oracle_batch_errors]; array DML counts on [oracle_dml_row_counts] with rowcount_override = sum(counts). Default path passes batcherrors=False/arraydmlrowcounts=False (unchanged).

Add enable_direct_path_load driver feature routing load_from_arrow through Connection.direct_path_load (Thin-mode only). Runtime guard (hasattr + thin) falls back to executemany for Thick mode or older drivers. Schema resolved from qualified table or connection.username.

…ction When the driver owns the transaction (connection not already in one), overwrite-DELETE and executemany run inside one BEGIN IMMEDIATE -> commit, with rollback on sqlite3.Error. Caller-managed transactions are left untouched (no commit/rollback). Makes Arrow ingest atomic.

…nsaction Async mirror of the sqlite transaction wrap: overwrite-DELETE and executemany run inside one BEGIN IMMEDIATE -> commit when the driver owns the transaction, with rollback on aiosqlite/sqlite3 errors. Caller-owned transactions are passed through untouched.

Lock _coerce_bulk_copy_result stat passthrough (incl. unknown future keys) and rowcount fallback, and assert bulk_copy() defaults match upstream cursor.bulkcopy() for both sync and async wrappers. Tests only; no production change (behavior already shipped).

Extend StorageFormat with avro/orc, map csv/avro/orc in _map_bigquery_source_format, and drop the stale parquet-only guard in load_from_storage so the generic storage bridge routes every BigQuery load_table_from_uri source format. arrow-ipc still raises. No new public surface.

…tations Replace the per-row INSERT...VALUES batch_update construction in load_from_arrow with chunked Transaction.insert_or_update mutations (<=20k cells per flush). Upsert semantics make re-runs idempotent; no database handle, no new public surface. overwrite still DELETEs first. Update _SpannerWriteProtocol.

Add supports_native_bulk_ingest DriverCase flag (True for the six native-COPY/Arrow adapters: asyncpg, psycopg sync/async, psqlpy, adbc, duckdb) plus paired sync/async behavior asserts covering row-count fidelity, overwrite, and append. Non-flagged adapters skip. Error-surfacing is intentionally not asserted: duckdb/adbc propagate raw driver exceptions rather than SQLSpecError, so a portable assertion is not possible without driver-side exception mapping (out of scope here).

…m_arrow Add enable_storage_write_api driver feature. When enabled (and not overwrite), load_from_arrow ingests via a PENDING BigQuery Storage write stream using native arrow_rows (no Parquet round-trip): create stream -> append serialized Arrow batches -> finalize -> batch-commit. Falls back to the Parquet load job on ImportError; overwrite always uses Parquet WRITE_TRUNCATE. Default-off, so existing behavior is unchanged. The streaming/serialization path is covered by unit tests for call orchestration and fallbacks; it needs validation against a live BigQuery instance (the emulator does not implement the Storage Write API).

Add load_from_records(table, records, *, columns=None, overwrite=False) to the sync and async driver base classes. Records (dict or positional + columns) are normalized via _records_to_arrow_table and routed through each adapter's native load_from_arrow path (COPY, executemany, Spanner mutations, etc.), so every adapter with load_from_arrow gains it for free and mssql_python/arrow_odbc keep raising. Empty/mismatched input raises ImproperConfigurationError. Add supports_load_from_records contract coverage (dict + positional + input validation).

New docs/usage/bulk_ingest.rst documents load_from_arrow / load_from_storage / load_from_records, a per-adapter capability matrix, the security/opt-in gates (MySQL local infile, Oracle direct path load, BigQuery Storage Write API), and MySQL/Oracle examples. Wired into the usage toctree with a seealso cross-link from the ETL page.

_coerce_bulk_copy_result expects MssqlPythonRawCursor; cast the test's _FakeCursor at the call sites so pyright (make type-check) passes.

Add load_from_arrow / load_from_storage overrides (sync + async) that ingest Arrow data through cursor.bulkcopy(), and flip supports_native_arrow_import. overwrite issues DELETE first. This also lights up the generic load_from_records for SQL Server via the base delegation path.

…rt_arrow Wire arrow-odbc's bulk_insert_arrow into the storage-bridge load_from_arrow / load_from_storage methods (overwrite issues DELETE first). This also enables the generic load_from_records for arrow-odbc through the base delegation path.

Add enable_batch_write_api driver feature. When set, load_from_arrow routes through Database.mutation_groups().batch_write() (one mutation group per <=20k-cell chunk) for high-throughput, independently committed insert_or_update groups; the Database is reached from the connection's session. Default path (single in-transaction insert_or_update) is unchanged. Upsert semantics keep replay idempotent. Docs: refresh the bulk-ingest matrix for the mssql/arrow_odbc load_from_arrow wiring and the Spanner Batch Write opt-in.

BigQuery Storage Write API now splits AppendRowsRequest payloads to stay under the 20MB append limit, surfacing an oversized single row as a storage error, and the load-job source-format mapping is broadened. Spanner insert_or_update mutation batching is tidied. Records-to-Arrow normalization moves into the shared arrow/storage helpers so compiled drivers reuse it. Adds unit coverage for the MySQL family LOAD DATA LOCAL INFILE and executemany paths (pymysql, aiomysql, asyncmy, mysql-connector), the Oracle async direct-path-load branch, the sqlite/aiosqlite overwrite and rollback transaction paths, and the BigQuery, Spanner, and load_from_records helpers.

… API support

cofin marked this pull request as ready for review June 25, 2026 17:23

cofin added 15 commits June 25, 2026 12:24

feat(aiomysql): add LOAD DATA LOCAL INFILE bulk-load fast path

c49828b

Add enable_local_infile_bulk_load driver feature routing load_from_arrow through LOAD DATA LOCAL INFILE; gate requires enable_local_infile=True (or local_infile=True). TSV helpers duplicated in aiomysql/core.py.

cofin force-pushed the feat/native-bulk-ingest branch from 491f90c to a4a6e8d Compare June 25, 2026 17:24

cofin added 12 commits June 25, 2026 17:36

test(mssql_python): cast fake cursor to satisfy pyright

9aebcf2

_coerce_bulk_copy_result expects MssqlPythonRawCursor; cast the test's _FakeCursor at the call sites so pyright (make type-check) passes.

Add initial test suite for Arrow ODBC adapter

e9bf844

fix(oracledb): execute overwrite truncate via cursor

df8f6b1

feat(oracledb): default to direct path bulk loads

a656c43

perf(arrow): use pyarrow constructors for record coercion

4109491

feat(bigquery): add load_from_records method and enable storage write…

a15186e

… API support

fix: remove unsupported asyncmy local infile path

0b68195

fix: commit spanner mutation-only write sessions

6eeca5b

fix: parse quoted table identifiers for bulk ingest

2abbfa4

cofin merged commit 77aafaa into main Jun 25, 2026
27 checks passed

cofin deleted the feat/native-bulk-ingest branch June 25, 2026 22:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat: native bulk ingest paths across adapters#540

feat: native bulk ingest paths across adapters#540
cofin merged 28 commits into
mainfrom
feat/native-bulk-ingest

cofin commented Jun 25, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

cofin commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included

Notes for review

Testing

Uh oh!

codecov-commenter commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cofin commented Jun 25, 2026 •

edited

Loading

codecov-commenter commented Jun 25, 2026 •

edited

Loading