docs/content/docs/dev/table/tuning.md: 49 additions & 0 deletions
@@ -368,3 +368,52 @@ FROM TenantKafka t
LEFT JOIN InventoryKafka i ON t.tenant_id = i.tenant_id AND ...;
```

## Delta Joins

In streaming jobs, a regular join keeps all historical records from both inputs in state so that its results stay correct. When a record arrives on one side, the join queries the state of the other side for matching records to emit and, at the same time, adds the record to the state of its own side.

As a job runs over a long period and more input arrives, the state of a regular join keeps growing. Computational resources can then become a bottleneck, which degrades the overall performance of the job and may cause stability issues.

To address this, Flink provides a delta join operator. Its core idea is to replace the state required by a regular join with a bidirectional lookup join that reads the data directly from the source tables. Compared with a regular join, a delta join significantly reduces state size, improves job stability, and lowers the demand for computational resources.

This feature is currently enabled by default. A regular join is optimized into a delta join when all of the following conditions are met:

1. The SQL pattern satisfies the optimization criteria. For details, see [Supported Features and Limitations]({{< ref "docs/dev/table/tuning" >}}#supported-features-and-limitations).
2. The external storage system of the source tables provides index information that delta joins can use for fast lookups. Currently, [Apache Fluss (Incubating)](https://fluss.apache.org/blog/fluss-open-source/) provides table-level index information to Flink, so Fluss tables can be used as source tables for delta joins. See the [Fluss documentation](https://fluss.apache.org/docs/0.8/engine-flink/delta-joins/#flink-version-support) for more details.

### Working Principle

In Flink, a regular join stores all incoming data from both sides in state and matches it against records that arrive on the opposite side. A delta join instead relies on the indexes provided by the external storage system: rather than querying state, it looks up the matching records in the external storage system by index key. This avoids storing the same data twice, once in the external storage system and once in state.

The delta join optimization is currently enabled by default. To disable it, set the following option. See the [Configuration]({{< ref "docs/dev/table/config" >}}#optimizer-options) page for more details.

```sql
SET 'table.optimizer.delta-join.strategy' = 'NONE';
```
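
The same option can also be set explicitly to a strategy other than `NONE`. The sketch below assumes the values `AUTO` (use a delta join when possible, otherwise fall back to a regular join) and `FORCE` (fail planning when a delta join cannot be applied). These values are not described in this section, so treat them as assumptions and confirm them on the Configuration page before relying on them.

```sql
-- Assumed value: let the optimizer decide and fall back to a regular join if needed.
SET 'table.optimizer.delta-join.strategy' = 'AUTO';

-- Assumed value: require a delta join and fail planning otherwise.
SET 'table.optimizer.delta-join.strategy' = 'FORCE';
```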

Additionally, you can tune the performance of delta joins with the following options. See the [Configuration]({{< ref "docs/dev/table/config" >}}#execution-options) page for more details.

- `table.exec.delta-join.cache-enabled`
- `table.exec.delta-join.left.cache-size`
- `table.exec.delta-join.right.cache-size`

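
As a minimal sketch, the cache options for delta joins could be set in the SQL client as follows. The values are illustrative assumptions, not recommendations, and the comments only reflect what the option names suggest. Check the Configuration page for the actual defaults and semantics.

```sql
-- Turn the delta join cache on (illustrative value).
SET 'table.exec.delta-join.cache-enabled' = 'true';

-- Cache sizes for the left and right sides of the join (illustrative values).
SET 'table.exec.delta-join.left.cache-size' = '10000';
SET 'table.exec.delta-join.right.cache-size' = '10000';
```

Sizing the two sides separately can be useful when one source table is much larger or changes more often than the other.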
### Supported Features and Limitations

Delta joins are continuously evolving and currently support the following features:

1. INSERT-ONLY tables as source tables for delta joins.
2. CDC tables without DELETE changes as source tables for delta joins.
3. Projections and filters between the source and the delta join.
4. Caching inside the delta join.

However, delta joins have the following limitations. A job must satisfy all of these conditions; otherwise, the join cannot be optimized into a delta join.

1. The index key of each table must be included in the equality conditions of the join.
2. The join must be an INNER join.
3. The downstream nodes of the join must be able to accept duplicate changes, for example a sink in UPSERT mode without `upsertMaterialize`.
4. When consuming a CDC stream, the join key used in the delta join must be part of the primary key.
5. When consuming a CDC stream, all filters must be applied on the upsert key.
6. Filters and projections must not contain non-deterministic functions.

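
To make these conditions concrete, here is a sketch of a query shape that could qualify. The tables `orders`, `payments`, and `enriched_orders` and their columns are hypothetical. Both sources are assumed to be insert-only tables in a storage system that exposes an index on `order_id`, and the sink is assumed to accept duplicate changes. Prefixing the statement with `EXPLAIN PLAN FOR` prints the optimized plan, where you can check whether the join was planned as a delta join or as a regular join.

```sql
-- Hypothetical tables and columns, for illustration only.
EXPLAIN PLAN FOR
INSERT INTO enriched_orders
SELECT o.order_id, o.amount, p.status   -- deterministic projection
FROM orders AS o
INNER JOIN payments AS p                -- the join must be an INNER join
  ON o.order_id = p.order_id;           -- equality condition on the assumed index key
```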