Late materialization experiment #12
Conversation
Progress so far:
TODO:
Okay, figured out the issue with @westonpace. I think I'll need to switch to using arrow-rs for the Parquet side of this experiment. It sounds like you might have some code for this already :)
I'll put up a PR with what I have so far tonight.
@wjones127 the parquet random take is here: #13
Force-pushed d754453 to 76a4182
Here is the rendered draft so far: paper.pdf
To demonstrate the performance benefit, we measured the performance of early versus late materialization strategies in Lance and Parquet. We compare against both PyArrow and DataFusion's Parquet scanners. PyArrow is commonly used in Parquet benchmarks in the literature but, unlike DataFusion, lacks a late materialization implementation. Therefore, we only provide results for late materialization in DataFusion. DataFusion currently does not support scanning vector columns (`FixedSizeList` in Arrow parlance), so its results for vector embeddings are omitted as well.
I'm surprised datafusion doesn't support scanning vector columns.
There's a minor bug that prevents it. I actually couldn't reproduce the issue in datafusion proper, so I'll look further into it.
At this row group size, Lance performs the scan faster than the two Parquet implementations, even for the small `int` column. In the early materialization case, Lance reads roughly the same or more bytes from disk as the Parquet scans, as shown in @fig-late-mat-total-bytes. Despite this, Lance is able to read the image column with 4.5x lower latency than Parquet (@fig-late-mat-table). One possible explanation for this difference is that Lance's encodings require less decoding than Parquet's to read into Arrow format. In fact, beyond some concatenation of buffers, Lance requires no transformation of the binary column.
In cases where the projection contains a large column and the filter is relatively selective, Lance is even faster. For the `img` column at 12.1% selectivity, Lance scanned 21 times faster than DataFusion. A significant portion of this difference comes from the amount of data read from disk: during the scan, Lance reads 70% less data than DataFusion does. This difference is enabled by Lance's ability to push down slicing to the IO level, reading only the relevant parts of pages from disk. Meanwhile, DataFusion can only slice Parquet at row group boundaries. (It's possible a future implementation of Parquet could slice at page boundaries.) This pattern is clearly shown in @fig-late-mat-total-bytes, where the bytes read from disk by Lance scale smoothly with the number of rows selected by the filter, while DataFusion's jump each time a row group boundary is crossed.
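The scaling behavior can be sketched with a back-of-envelope model. All sizes and the uniform-match assumption here are hypothetical, chosen for illustration rather than taken from the paper's measurements:

```python
# Hypothetical sizes for illustration only.
CELL_BYTES = 100_000        # bytes per img cell
TOTAL_ROWS = 100_000
ROW_GROUP_ROWS = 10_000     # rows per Parquet row group
selectivity = 0.121

# Cell-level slicing (Lance's flat layout): bytes read scale smoothly
# with the number of selected rows.
cell_level_bytes = int(TOTAL_ROWS * selectivity) * CELL_BYTES

# Row-group pruning: if the matching rows are spread uniformly, every
# row group contains at least one match, so every group is read in full.
matched_groups = TOTAL_ROWS // ROW_GROUP_ROWS
row_group_bytes = matched_groups * ROW_GROUP_ROWS * CELL_BYTES

print(cell_level_bytes, row_group_bytes)
```

Under these assumptions, cell-level slicing reads only the selected 12.1% of the column's bytes, while row-group pruning degenerates to a full read.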
12.1%?
I think that's up-to-date, but I can double check.
Late materialization can reduce IO costs by deferring the decision of whether to load certain cells until a filter has been evaluated. This is especially important when the projected columns are large, since the potential IO cost savings are substantial.
Late materialization is an engine optimization and can be applied to any columnar format. However, the performance benefit of this optimization depends on the page structure of the format. If pages are large and cannot be sliced, then late materialization will only be beneficial to the extent that whole pages can be skipped. Put another way, any IO savings brought by late materialization can be outweighed by the read amplification from the serialization format. In Lance, vector and binary columns are laid out in a flat layout, which can be sliced at the cell level. Therefore, Lance can read these large columns with zero read amplification, if we choose to. In practice, there is often a minimum IO size, which means a small amount of read amplification is actually beneficial to reduce the total number of IO calls.
I'm a little confused. How does a filter select part of a page? I thought pushdown filtering worked on statistics which are only recorded at the page level?
The query here is `SELECT img FROM table WHERE id > 100`. In late materialization, the following steps are performed:
- Read `id`. Find row positions that match the predicate.
- Read `img`, but only the matching pages or portions of the page that match the predicate.
In DataFusion, the second step only seems to prune at the row group level. If an entire row group has no matches found for the predicate, it skips reading the projected columns for that row group.
Meanwhile in Lance, we do this pruning at the page level and can slice into a page. If the predicate only matches the first 50 rows of a page, we can load just those first 50 rows of the `img` column.
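As a rough illustration of the pruning granularity, here is a toy model of the sliced read. The page size, cell size, and one-slice-per-page policy are assumptions made for the sketch, not Lance's actual reader:

```python
PAGE_ROWS = 100      # rows per img page (assumed)
CELL_BYTES = 4096    # fixed cell size (assumed)

def sliced_read_bytes(matching_rows):
    """Bytes of img read when each page can be sliced at cell granularity:
    one contiguous slice per page, covering its first to last match."""
    by_page = {}
    for row in matching_rows:
        page, offset = divmod(row, PAGE_ROWS)
        lo, hi = by_page.get(page, (offset, offset))
        by_page[page] = (min(lo, offset), max(hi, offset))
    return sum((hi - lo + 1) * CELL_BYTES for lo, hi in by_page.values())

# Predicate matches only the first 50 rows of the first page:
# a single sliced read of 50 cells, instead of a whole page or row group.
print(sliced_read_bytes(range(50)))  # 50 * 4096 = 204800 bytes
```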
> If the predicate only matches the first 50 rows of a page
How do we know this? Because of a scalar index? Or is it because the predicate is on the row id column?
Because we load the relevant columns and evaluate the predicate.
This adds an experiment showing the impact of late materialization in combination with buffer slicing. We show that we are able to read less data than either Parquet implementation.
There are significant runtime differences between Lance and Parquet. The one variable I don't feel confident I've isolated is parallelism. It's possible Lance is just faster than Parquet because we do more IOs in parallel.
There also seems to be another caveat: our slicing shows advantages in read amplification over DataFusion because DataFusion seems to be able to do late materialization pruning only at the row group level. However, if page-level pruning were implemented, we might not have as much of an advantage.