Late materialization experiment #12
Conversation
Progress so far:
TODO:
Okay, figured out the issue with @westonpace. I think I'll need to switch to using arrow-rs for the Parquet side of this experiment. It sounds like you might have some code for this already :)
I'll put up a PR with what I have so far tonight.
@wjones127 the parquet random take is here: #13
Force-pushed d754453 to 76a4182
Here is the rendered draft so far: paper.pdf
To demonstrate the performance benefit, we measured the performance of early versus late materialization strategies in Lance and Parquet. We compare against both PyArrow and DataFusion's Parquet scanners. PyArrow is commonly used in Parquet benchmarks in the literature but, unlike DataFusion, lacks a late materialization implementation. Therefore, we only provide results for late materialization in DataFusion. DataFusion currently does not support scanning vector columns (`FixedSizeList` in Arrow parlance), so its results for vector embeddings are omitted as well.
I'm surprised datafusion doesn't support scanning vector columns.
There's a minor bug that prevents it. I actually couldn't reproduce the issue in datafusion proper, so I'll look further into it.
At this row group size, Lance performs the scan faster than the two Parquet implementations, even for the small `int` column. In the early materialization case, Lance reads roughly the same or more bytes from disk as the Parquet scans, as shown in @fig-late-mat-total-bytes. Despite this, Lance is able to read the image column with 4.5x lower latency than Parquet (@fig-late-mat-table). One possible explanation for this difference is that Lance's encodings require less decoding than Parquet's to read into Arrow format. In fact, beyond some concatenation of buffers, Lance requires no transformation of the binary column.
In cases where the projection contains a large column and the filter is relatively selective, Lance is even faster. For the `img` column at 12.1% selectivity, Lance scanned 21 times faster than DataFusion. A significant portion of this difference comes from the amount of data read from disk: during the scan, Lance reads 70% less data than DataFusion does. This difference is enabled by Lance's ability to push down slicing to the IO level, reading only the relevant parts of pages from disk. Meanwhile, DataFusion can only slice Parquet at row group boundaries. (It's possible a future implementation of Parquet could slice at page boundaries.) This pattern is clearly shown in @fig-late-mat-total-bytes, where the bytes read from disk by Lance scale smoothly with the number of rows selected by the filter, while DataFusion's jump each time a row group boundary is crossed.
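The scaling behavior can be sketched with a back-of-envelope model. All sizes and the uniform-match assumption here are hypothetical, chosen for illustration rather than taken from the paper's measurements:

```python
# Hypothetical sizes for illustration only.
CELL_BYTES = 100_000        # bytes per img cell
TOTAL_ROWS = 100_000
ROW_GROUP_ROWS = 10_000     # rows per Parquet row group
selectivity = 0.121

# Cell-level slicing (Lance's flat layout): bytes read scale smoothly
# with the number of selected rows.
cell_level_bytes = int(TOTAL_ROWS * selectivity) * CELL_BYTES

# Row-group pruning: if the matching rows are spread uniformly, every
# row group contains at least one match, so every group is read in full.
matched_groups = TOTAL_ROWS // ROW_GROUP_ROWS
row_group_bytes = matched_groups * ROW_GROUP_ROWS * CELL_BYTES

print(cell_level_bytes, row_group_bytes)
```

Under these assumptions, cell-level slicing reads only the selected 12.1% of the column's bytes, while row-group pruning degenerates to a full read.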
12.1%?
I think that's up-to-date, but I can double check.
Late materialization can reduce IO costs by deferring the decision of whether to load certain cells until a filter has been evaluated. This is especially important when the projected columns are large, since the potential IO cost savings are substantial.
Late materialization is an engine optimization and can be applied to any columnar format. However, the performance benefit of this optimization depends on the page structure of the format. If pages are large and cannot be sliced, then late materialization will only be beneficial to the extent that whole pages can be skipped. Put another way, any IO savings brought by late materialization can be outweighed by the read amplification from the serialization format. In Lance, vector and binary columns are laid out in a flat layout, which can be sliced at the cell level. Therefore, Lance can read these large columns with zero read amplification, if we choose to. In practice, there is often a minimum IO size, which means a small amount of read amplification is actually beneficial to reduce the total number of IO calls.
I'm a little confused. How does a filter select part of a page? I thought pushdown filtering worked on statistics which are only recorded at the page level?
The query here is `SELECT img FROM table WHERE id > 100`. In late materialization, the following steps are performed:
- Read `id`. Find row positions that match the predicate.
- Read `img`, but only the matching pages or portions of the page that match the predicate.
In DataFusion, the second step only seems to prune at the row group level. If an entire row group has no matches found for the predicate, it skips reading the projected columns for that row group.
Meanwhile in Lance, we do this pruning at the page level and can slice into a page. If the predicate only matches the first 50 rows of a page, we can load just those first 50 rows of the `img` column.
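As a rough illustration of the pruning granularity, here is a toy model of the sliced read. The page size, cell size, and one-slice-per-page policy are assumptions made for the sketch, not Lance's actual reader:

```python
PAGE_ROWS = 100      # rows per img page (assumed)
CELL_BYTES = 4096    # fixed cell size (assumed)

def sliced_read_bytes(matching_rows):
    """Bytes of img read when each page can be sliced at cell granularity:
    one contiguous slice per page, covering its first to last match."""
    by_page = {}
    for row in matching_rows:
        page, offset = divmod(row, PAGE_ROWS)
        lo, hi = by_page.get(page, (offset, offset))
        by_page[page] = (min(lo, offset), max(hi, offset))
    return sum((hi - lo + 1) * CELL_BYTES for lo, hi in by_page.values())

# Predicate matches only the first 50 rows of the first page:
# a single sliced read of 50 cells, instead of a whole page or row group.
print(sliced_read_bytes(range(50)))  # 50 * 4096 = 204800 bytes
```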
> If the predicate only matches the first 50 rows of a page
How do we know this? Because of a scalar index? Or is it because the predicate is on the row id column?
Because we load the relevant columns and evaluate the predicate.
This adds an experiment showing the impact of late materialization in combination with buffer slicing. We show that we are able to read less data than either Parquet implementation.
There are significant runtime differences between Lance and Parquet. The one variable I don't feel confident I've isolated is parallelism. It's possible Lance is just faster than Parquet because we do more IOs in parallel.
There also seems to be another caveat: our slicing shows advantages in read amplification over DataFusion because DataFusion seems to be able to do late materialization pruning only at the row group level. However, if page-level pruning were implemented, we might not have as much of an advantage.