Foreign Table: Persistent Postgres table for managing external data queried by DuckDB readers #951
YuweiXiao wants to merge 12 commits into duckdb:main
This is a much-needed improvement, @YuweiXiao, thank you! Does this implementation of external table support handle partitioned Parquet datasets, for example when using wildcard paths or recursive directory patterns such as:
In other words, if I create an external table pointing to a directory of Parquet partitions, will it automatically discover and read all matching files, or does it only support a single file path per table definition?

Yes. The external table tracks the path and read_options in the Postgres catalog, and the file listing is triggered for each query. Theoretically, all functionality supported by

Thanks for the work on this! I also had something like this in mind, but I was thinking about using FOREIGN TABLEs instead of table access methods for this. So I'm wondering why you went this route instead. (I'm not saying that one is really better than the other, but I'm wondering what tradeoffs you considered.)

Yes, FOREIGN TABLE would definitely work too. I didn't have a strong tradeoff in mind; I mainly wanted to reuse the existing codebase as much as possible, e.g., the DuckDB AM that's already properly hooked and the registered triggers. I'll take another look at the FOREIGN TABLE approach, since it is a better semantic fit (i.e., a metadata-only table).

Thinking about it more, I do think FOREIGN TABLE is a better fit for this semantically, because the CREATE TABLE command that you have now isn't actually creating the backing files. It's only registering some already existing external data in Postgres.

Yeah. I will start a discussion thread so we can define the SQL interface (usage) before implementation.

The above change (or a similar change using an FDW instead) would be great. One of the issues with the current syntax is that it does not play nicely with ORMs, which is a big annoyance for a lot of teams. I could also see a usage pattern with pg_duckdb whereby you keep "live data" in Postgres tables (or partitions) and "archive data" on S3/Parquet. It would be great to be able to access both of these through a uniform interface.
Hi @JelteF, the PR is ready for review:

namespace pgduckdb {

// The name of the foreign server that we use for DuckDB foreign tables.
#define DUCKDB_FOREIGN_SERVER_NAME "ddb_foreign_server"
Let's just use duckdb for this. I think that will make the SQL look nicer:
CREATE FOREIGN TABLE external_parquet ()
SERVER duckdb
OPTIONS (
location '../../data/iris.parquet'
);
- #define DUCKDB_FOREIGN_SERVER_NAME "ddb_foreign_server"
+ #define DUCKDB_FOREIGN_SERVER_NAME "duckdb"
Hi @JelteF, can we target this for the 1.1.0 release? I will put in the effort to address any feedback.

I would love to include this in 1.1.0, but it's primarily my own time that's the bottleneck here (not you). I need to spend some quality time playing with this and reading the code. Your two PRs (this one and the INSERT one) are definitely the number one features that I'd like to get released, but they're both non-trivial to review. Meanwhile, the main branch has been accumulating small fixes that I just want to release sometime in the next few days. That's why I moved this to 1.2.0.

One thing I noticed just now: can you update the PR description to use the new syntax?

I know that's a tough question, but when is version 1.2.0 roughly scheduled for release?

I am also very eager to get this feature; I was so hoping to get it in v1.1.0.

Could you fix the merge conflicts? That will make it easier to review this (which I'm planning to do in early January).
This PR introduces foreign table support, allowing users to persist an external file's view queried through DuckDB's readers (read_csv, read_parquet, and read_json).

Previously, users had to embed file locations and options directly in queries and use the r[xx] syntax for column references. Foreign tables simplify this by defining file paths and reader options once at CREATE time, enabling clean SELECT statements without the r[xx] syntax. This also opens room for access control on external files, such as fine-grained permissions like column-level visibility for different users.

CREATE TABLE Syntax

CREATE FOREIGN TABLE external_csv ()
SERVER duckdb
OPTIONS (
    location '../../data/iris.csv',
    format 'csv',
    options '{"header": true}'
);

-- Query like a regular table
SELECT * FROM external_csv;
SELECT "sepal.length" FROM external_csv;

-- Raw SQL way
SELECT r['sepal.length'] FROM read_csv('../../data/iris.csv') r;

Features

CREATE FOREIGN TABLE, DROP FOREIGN TABLE, ALTER TABLE NAME
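To make the mapping in the description concrete, here is a minimal C++ sketch of how the `location`, `format`, and `options` values of a foreign table could be turned into the DuckDB reader call that a query is rewritten to. This is an illustration under assumptions, not the PR's implementation: `QuoteLiteral`, `ReaderForFormat`, and `BuildReaderCall` are hypothetical names, and the reader options are assumed to have already been parsed from the JSON `options` string into name/value pairs.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers for illustration only; none of these names come from the PR.

// Quote a value as a SQL string literal by doubling embedded single quotes.
std::string QuoteLiteral(const std::string &value) {
    std::string quoted = "'";
    for (char c : value) {
        if (c == '\'') {
            quoted += "''";
        } else {
            quoted += c;
        }
    }
    quoted += "'";
    return quoted;
}

// Map the table's `format` option to one of the readers named in the description.
std::string ReaderForFormat(const std::string &format) {
    if (format == "csv") return "read_csv";
    if (format == "parquet") return "read_parquet";
    if (format == "json") return "read_json";
    throw std::invalid_argument("unsupported format: " + format);
}

// Build the reader call that a SELECT on the foreign table could be rewritten to.
// `reader_options` holds name/value pairs whose values are already rendered as SQL
// (assumed here to be pre-parsed from the JSON `options` string).
std::string BuildReaderCall(const std::string &location, const std::string &format,
                            const std::vector<std::pair<std::string, std::string>> &reader_options) {
    std::string call = ReaderForFormat(format) + "(" + QuoteLiteral(location);
    for (const auto &option : reader_options) {
        call += ", " + option.first + " = " + option.second;
    }
    call += ")";
    return call;
}
```

For the `external_csv` table above, `BuildReaderCall("../../data/iris.csv", "csv", {{"header", "true"}})` yields `read_csv('../../data/iris.csv', header = true)`. Because the location is passed through verbatim, a glob pattern stored in `location` would be expanded by the reader at query time rather than at CREATE time, which matches the per-query file listing mentioned earlier in the thread.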