Commit 70734a2

Integrate/dlt: Pull README and usage guide from upstream repository
Includes:
- Overview of supported features
- Usage guide based on `dlt init`
1 parent 914fd22 commit 70734a2

File tree

2 files changed: +213 −10 lines


docs/integrate/dlt/index.md

Lines changed: 116 additions & 10 deletions
@@ -13,7 +13,9 @@
 [dlt] (data load tool)—think ELT as Python code—is a popular,
 production-ready Python library for moving data. It loads data from
 various and often messy data sources into well-structured, live datasets.
-dlt is used by {ref}`ingestr`.
+dlt supports [30+ databases supported by SQLAlchemy],
+and is also the workhorse behind the {ref}`ingestr` toolkit.
 
 ::::{grid}
 
@@ -75,32 +77,136 @@ pipeline = dlt.pipeline(
 pipeline.run(source)
 ```
 
-## Learn
+## Supported features
+
+### Data loading
+
+Data is loaded into CrateDB using the most efficient method, depending on the data source.
+
+- For local files, the `psycopg2` library is used to load data from files directly into
+  CrateDB tables using the `INSERT` command.
+- For files in remote storage like S3 or Azure Blob Storage,
+  CrateDB data loading functions are used to read the files and insert the data into tables.
+
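The direct loading path boils down to batched SQL `INSERT` statements. As a rough illustration, here is what rendering such a statement looks like; `render_insert` is a hypothetical helper for demonstration, not part of `dlt-cratedb` or `psycopg2`:

```python
# Sketch: render a multi-row INSERT like the direct loading path issues.
# `render_insert` is a hypothetical helper for illustration only; real code
# should use psycopg2 parameter binding instead of naive quoting.

def render_insert(table: str, columns: list, rows: list) -> str:
    """Build a single multi-row INSERT statement (values naively quoted)."""
    def quote(value) -> str:
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    tuples = ", ".join("(" + ", ".join(quote(v) for v in row) + ")" for row in rows)
    return f'INSERT INTO {table} ({", ".join(columns)}) VALUES {tuples};'

sql = render_insert("doc.players", ["id", "name"], [(1, "magnus"), (2, "hikaru")])
print(sql)
# INSERT INTO doc.players (id, name) VALUES (1, 'magnus'), (2, 'hikaru');
```

In production, psycopg2's parameter binding handles quoting and escaping safely; the sketch only shows the shape of the statement.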
+### Datasets
+
+Use `dataset_name="doc"` to address CrateDB's default schema `doc`.
+When addressing other schemas, make sure they contain at least one table. [^create-schema]
+
+### File formats
+
+- The [SQL INSERT file format] is the preferred format for both direct loading and staging.
+
+### Column types
+
+The `cratedb` destination deviates from the default SQL destinations in a few specific ways.
+
+- CrateDB does not support the `time` datatype. Time values will be loaded into a `text` column.
+- CrateDB does not support the `binary` datatype. Binary values will be loaded into a `text` column.
+- CrateDB can produce rounding errors under certain conditions when using the `float/double` datatype.
+  Use the `decimal` datatype if you can't afford rounding errors.
+
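The rounding caveat is a property of binary floating point in general, not something specific to CrateDB. A quick way to see the difference in plain Python:

```python
# Binary floats accumulate rounding error; decimals keep exact decimal values.
from decimal import Decimal

float_sum = 0.1 + 0.2                          # binary floating point
decimal_sum = Decimal("0.1") + Decimal("0.2")  # exact decimal arithmetic

print(float_sum)    # 0.30000000000000004
print(decimal_sum)  # 0.3
```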
+### Column hints
+
+CrateDB supports the following [column hints].
+
+- `primary_key` - marks the column as part of the primary key. Multiple columns can carry this hint to form a composite primary key.
+
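A composite primary key identifies a row by the combination of its columns. The effect can be sketched in plain Python; this illustrates the concept only, not dlt API, and the column names are made up:

```python
# Sketch: rows sharing the same composite key (player_id, game_id) collapse
# to a single record, which is what a composite primary key enforces.

rows = [
    {"player_id": 1, "game_id": "a", "result": "win"},
    {"player_id": 1, "game_id": "b", "result": "loss"},
    {"player_id": 1, "game_id": "a", "result": "draw"},  # same key as the first row
]

unique = {}
for row in rows:
    key = (row["player_id"], row["game_id"])  # composite primary key
    unique[key] = row  # a later row replaces an earlier one with the same key

print(len(unique))  # 2
```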
+### File staging
+
+CrateDB supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations.
+
+`dlt` will upload CSV or JSONL files to the staging location and use CrateDB data loading functions
+to load the data directly from the staged files.
+
+Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.
+
+- [AWS S3]
+- [Azure Blob Storage]
+- [Google Storage]
+
+Invoke a pipeline with staging enabled.
+
+```python
+pipeline = dlt.pipeline(
+    pipeline_name='chess_pipeline',
+    destination='cratedb',
+    staging='filesystem',  # add this to activate staging
+    dataset_name='chess_data'
+)
+```
+
+### dbt support
+
+Integration with [dbt] is generally supported via [dbt-cratedb2], but not tested by us.
+
+### dlt state sync
+
+The CrateDB destination fully supports [dlt state sync].
+
+
+## See also
+
+:::{rubric} Examples
+:::
 
 ::::{grid}
 
+:::{grid-item-card} Usage guide: Load API data with dlt
+:link: dlt-usage
+:link-type: ref
+Exercise a canonical `dlt init` example with CrateDB.
+:::
+
 :::{grid-item-card} Examples: Use dlt with CrateDB
 :link: https://github.com/crate/cratedb-examples/tree/main/framework/dlt
 :link-type: url
-Executable code examples that demonstrate how to use dlt with CrateDB.
+Executable code examples on GitHub that demonstrate how to use dlt with CrateDB.
+:::
+
+::::
+
+:::{rubric} Resources
 :::
 
-:::{grid-item-card} Adapter: The dlt destination adapter for CrateDB
-:link: https://github.com/crate/dlt-cratedb
+::::{grid}
+
+:::{grid-item-card} Package: `dlt-cratedb`
+:link: https://pypi.org/project/dlt-cratedb/
 :link-type: url
-Based on the dlt PostgreSQL adapter, the package enables you to work
-with dlt and CrateDB.
+The dlt destination adapter for CrateDB is
+based on the dlt PostgreSQL adapter.
 :::
 
-:::{grid-item-card} See also: ingestr
+:::{grid-item-card} Related: `ingestr`
 :link: ingestr
 :link-type: ref
-The ingestr data import/export application uses dlt.
+The ingestr data import/export application uses dlt as a workhorse.
 :::
 
 ::::
 
 
+:::{toctree}
+:maxdepth: 1
+:hidden:
+Usage <usage>
+:::
+
+
+[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
+    This means by default, unless any table exists within a schema, the schema appears
+    to not exist at all. However, it also can't be created explicitly. Schemas are
+    currently implicitly created when tables exist in them.
 
-[databases supported by SQLAlchemy]: https://docs.sqlalchemy.org/en/20/dialects/
+[30+ databases supported by SQLAlchemy]: https://dlthub.com/docs/dlt-ecosystem/destinations/sqlalchemy
+[AWS S3]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#aws-s3
+[Azure Blob Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#azure-blob-storage
+[column hints]: https://dlthub.com/docs/general-usage/schema#column-hint-rules
+[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
+[dbt]: https://dlthub.com/docs/hub/features/transformations/dbt-transformations
+[dbt-cratedb2]: https://pypi.org/project/dbt-cratedb2/
 [dlt]: https://dlthub.com/
+[dlt state sync]: https://dlthub.com/docs/general-usage/state#syncing-state-with-destination
+[Google Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#google-storage
+[SQL INSERT file format]: https://dlthub.com/docs/dlt-ecosystem/file-formats/insert-format

docs/integrate/dlt/usage.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+---
+title: CrateDB
+description: CrateDB `dlt` destination
+keywords: [ cratedb, destination, data warehouse ]
+---
+
+(dlt-usage)=
+# Load API data with dlt
+
+:::{div} sd-text-muted
+Exercise a canonical `dlt init` example with CrateDB.
+:::
+
+## Install the package
+
+Install the dlt destination adapter for CrateDB.
+```shell
+pip install dlt-cratedb
+```
+
+## Initialize the dlt project
+
+Start by initializing a new example `dlt` project.
+
+```shell
+export DESTINATION__CRATEDB__DESTINATION_TYPE=postgres
+dlt init chess cratedb
+```
+
+The `dlt init` command will initialize your pipeline with `chess` [^chess-source]
+as the source and `cratedb` as the destination. It generates several files and directories.
+
+## Edit the pipeline definition
+
+The pipeline definition is stored in the Python file `chess_pipeline.py`.
+
+- Because the dlt adapter currently only supports writing to CrateDB's default `doc`
+  schema [^create-schema], replace `dataset_name="chess_players_games_data"`
+  with `dataset_name="doc"` in the generated `chess_pipeline.py` file.
+
+- To initialize the CrateDB destination adapter, insert the `import dlt_cratedb`
+  statement at the top of the file. Otherwise, the destination will not be found,
+  and you will receive a corresponding error [^not-initialized-error].
+
+## Configure credentials
+
+Next, configure the CrateDB credentials in the `.dlt/secrets.toml` file as shown below.
+CrateDB is compatible with PostgreSQL and uses the `psycopg2` driver, like the
+`postgres` destination.
+
+```toml
+[destination.cratedb.credentials]
+host = "localhost"     # CrateDB server host.
+port = 5432            # CrateDB PostgreSQL TCP protocol port, default is 5432.
+username = "crate"     # CrateDB username, default is usually "crate".
+password = "crate"     # CrateDB password, if any.
+database = "crate"     # CrateDB only knows a single database called `crate`.
+connect_timeout = 15
+```
+
+Alternatively, you can pass a database connection string as shown below.
+```toml
+destination.cratedb.credentials="postgres://crate:crate@localhost:5432/"
+```
+Keep it at the top of your TOML file, before any section starts.
+Because CrateDB uses `psycopg2`, the `postgres://` scheme is the right choice.
+
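To sanity-check the parts of such a connection string, you can parse it with Python's standard library; this sketch uses the evaluation defaults from the example above:

```python
# Parse the connection string and inspect its components.
from urllib.parse import urlparse

url = urlparse("postgres://crate:crate@localhost:5432/")

print(url.scheme)    # postgres
print(url.username)  # crate
print(url.hostname)  # localhost
print(url.port)      # 5432
```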
+## Start CrateDB
+
+Use Docker or Podman to run an instance of CrateDB for evaluation purposes.
+```shell
+docker run --rm --name=cratedb --publish=4200:4200 --publish=5432:5432 crate:latest '-Cdiscovery.type=single-node'
+```
+
+## Run the pipeline
+
+```shell
+python chess_pipeline.py
+```
+
+## Explore the data
+
+```shell
+crash -c 'SELECT * FROM players_profiles LIMIT 10;'
+crash -c 'SELECT * FROM players_online_status LIMIT 10;'
+```
+
+
+[^chess-source]: The `chess` dlt source pulls publicly available data from
+    the [Chess.com Published-Data API].
+[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
+    This means by default, unless any table exists within a schema, the schema appears
+    to not exist at all. However, it also can't be created explicitly. Schemas are
+    currently implicitly created when tables exist in them.
+[^not-initialized-error]: `UnknownDestinationModule: Destination "cratedb" is not one of the standard dlt destinations`
+
+[Chess.com Published-Data API]: https://www.chess.com/news/view/published-data-api
+[CRATEDB-14601]: https://github.com/crate/crate/issues/14601