Description
Describe the enhancement requested
Arrow Flight SQL has a feature for ingesting massive datasets, Bulk Ingestion: apache/arrow#38255
It would be beneficial for the driver to use the Bulk Ingestion RPC method for batched prepared statement calls when the prepared statement is used strictly for inserting data.
E.g. when Spark writes data, it generates a simple SQL query like "INSERT INTO table(field1, field2, ...) VALUES (?, ?, ...)", creates a prepared statement, and then uses the prepared statement update RPC method to insert the rows of the dataset. If this feature is implemented, the driver could instead use the DoPut(CommandStatementIngest) command associated with the Bulk Ingestion feature.
There are Arrow Flight SQL server implementations that work like this: when a DoAction(ActionCreatePreparedStatementRequest) is executed, the server creates up to two versions of the data structure underlying the PreparedStatement instance. One is a handle to a full-scale query engine execution procedure (e.g. DataFusion's logical plan). The other is a handle to a very simple procedure that just writes the received record batches to storage. Of course, the second procedure can only be generated when the query has a certain form, like the one used by Spark. The point is that this simple procedure also avoids the overhead of the more general prepared statement API: for example, it does not need to transpose PreparedStatement parameters into record batches.
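To make the transposition overhead concrete: batched prepared statement parameters arrive row by row, while record batches are columnar, so something has to pivot them. The following is only a minimal sketch in plain Java with no Arrow types; the class name `ParameterTransposer` and the method `toColumns` are hypothetical and not part of any driver or server.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating the row-to-column pivot of bound parameters. */
public final class ParameterTransposer {
    private ParameterTransposer() {}

    /**
     * Pivots row-wise parameter sets (one Object[] per addBatch() call)
     * into column-wise lists (one list per placeholder).
     */
    public static List<List<Object>> toColumns(List<Object[]> rows, int columnCount) {
        List<List<Object>> columns = new ArrayList<>(columnCount);
        for (int c = 0; c < columnCount; c++) {
            columns.add(new ArrayList<>(rows.size()));
        }
        for (Object[] row : rows) {
            if (row.length != columnCount) {
                throw new IllegalArgumentException(
                    "expected " + columnCount + " parameters, got " + row.length);
            }
            // Every value is touched once; this is the per-row cost that a
            // columnar ingestion path does not pay.
            for (int c = 0; c < columnCount; c++) {
                columns.get(c).add(row[c]);
            }
        }
        return columns;
    }
}
```

With Bulk Ingestion the data is already columnar when it reaches the server, so this step disappears on that side.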
I think it should be possible to move the logic that decides whether to use Bulk Ingestion into the JDBC driver.
The use case for this integration is this: developers of Arrow Flight SQL servers could implement only the bulk ingestion command handlers and avoid implementing special-case logic for batched inserts. The client would then enable a newly introduced driver option that lets the driver decide when to use the bulk ingestion RPC methods for inserting data.
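As a rough illustration of what that decision could look like, here is a minimal sketch that recognizes only the plain "INSERT INTO table(cols) VALUES (?, ...)" shape and extracts the target table and columns. Everything in it is an assumption for discussion: the class `BulkIngestDetector`, the `IngestTarget` type, and the regular expression are hypothetical, and a real driver would probably want a proper SQL parser (quoted identifiers, comments, and dialect differences are not handled here). Anything it does not recognize falls back to the regular prepared statement path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical detector for insert-only statements eligible for bulk ingestion. */
public final class BulkIngestDetector {
    private BulkIngestDetector() {}

    /** Hypothetical result type: target table and column names, in order. */
    public static final class IngestTarget {
        public final String table;
        public final List<String> columns;

        IngestTarget(String table, List<String> columns) {
            this.table = table;
            this.columns = columns;
        }
    }

    // Deliberately conservative: unquoted identifiers, one VALUES tuple.
    private static final Pattern SIMPLE_INSERT = Pattern.compile(
        "^\\s*INSERT\\s+INTO\\s+([A-Za-z_][\\w.]*)\\s*"
            + "\\(([^()]*)\\)\\s*VALUES\\s*\\(([^()]*)\\)\\s*;?\\s*$",
        Pattern.CASE_INSENSITIVE);

    /** Returns the ingest target if the SQL is a simple parameterized insert. */
    public static Optional<IngestTarget> detect(String sql) {
        Matcher m = SIMPLE_INSERT.matcher(sql);
        if (!m.matches()) {
            return Optional.empty();
        }
        List<String> columns = new ArrayList<>();
        for (String raw : m.group(2).split(",", -1)) {
            String name = raw.trim();
            if (name.isEmpty()) {
                return Optional.empty();
            }
            columns.add(name);
        }
        String[] values = m.group(3).split(",", -1);
        if (values.length != columns.size()) {
            return Optional.empty();
        }
        for (String v : values) {
            // Any literal or expression disqualifies the statement.
            if (!v.trim().equals("?")) {
                return Optional.empty();
            }
        }
        return Optional.of(new IngestTarget(m.group(1), columns));
    }
}
```

If `detect` returns a target, the driver could buffer the batched parameters and send them with DoPut(CommandStatementIngest); otherwise it would keep using the prepared statement update RPC as it does today.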