Add faq answer comparing kerchunk format to icechunk format #818

TomNicholas · 2025-10-22T22:28:48Z

Closes #578

Changes are documented in docs/releases.rst

ianhi · 2025-10-23T01:29:38Z

docs/faq.md

+    "Kerchunk" is really two things: a python library and an on-disk format for storing virtual references.
+
+    This question compares the Kerchunk python library to the VirtualiZarr python library.
+    For a discussion of the pros and cons of serializing into the Kerchunk references format, see the next question.


jsignell · 2025-10-23T19:24:26Z

docs/faq.md

+- **Read performance** - Reading data from Icechunk is faster than reading from Kerchunk references. This is because reading from Kerchunk references is done using the fsspec python library, whereas reading data from Icechunk (virtual references or native chunks) uses the Icechunk rust library. For this and a number of other reasons, reading data from Icechunk generally provides a much higher throughput.
+- **Incremental overwriting** - VirtualiZarr's `.to_icechunk` API allows you to write to a specific region. This is more difficult to do safely when writing to Kerchunk's format because it would generally require editing part of a single file.
+- **Mix "native" and virtual chunks** - Icechunk's manifests can store any mixture of virtual chunks and "native" zarr chunks. Kerchunk's formats cannot do this ("inlined" chunks are something separate).
+- **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created.


These docs are great! The last bullet here ("Ensure referenced data has not changed") feels like the most important one to me. Maybe it should be first?

maxrjones

Nice, I left a couple of small suggestions to take or leave.

maxrjones · 2025-10-24T20:10:47Z

docs/faq.md

+Conversely, the two Kerchunk formats have some advantages over Icechunk:
+- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.


Suggested change

-Conversely, the two Kerchunk formats have some advantages over Icechunk:
-- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
+Conversely, the two Kerchunk formats have some advantages over Icechunk:
+
+- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.

A blank line is needed before the list for it to be rendered as a bulleted list.

maxrjones · 2025-10-24T20:11:56Z

docs/faq.md

+- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
+- **Write latency** - In theory writing a single JSON or writing Parquet to object storage can be done with a smaller number of roundtrips. However this time taken will almost always be negligible compared to the time taken to parse the archival file formats.
+
+(Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.)


Suggested change

-(Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.)

IMO, this doesn't help answer the question "Which format should I save my virtual references as?"

ianhi · 2025-10-27T20:30:01Z

docs/faq.md

+
+Icechunk provides several compelling advantages over either Kerchunk format:
+
+- **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created.


It's possible that on some systems a malicious actor could change the file and leave the last-modified time as is. But there is no defense against this, so it may not be worth mentioning.
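
To make the check under discussion concrete, here is a minimal Python sketch of the general idea, not Icechunk's actual implementation. A reference records the file's last-modified time when it is created, and a read fails if the current value differs. `VirtualRef`, `StaleReferenceError`, `make_ref`, and `read_chunk` are hypothetical names, and a local `os.stat` call stands in for an object-store metadata lookup.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualRef:
    """Hypothetical virtual reference: a byte range plus the mtime seen at creation."""
    path: str
    offset: int
    length: int
    last_modified_ns: int


class StaleReferenceError(RuntimeError):
    """Raised when the referenced file changed after the reference was created."""


def make_ref(path: str, offset: int, length: int) -> VirtualRef:
    # Record the file's modification time alongside the byte range.
    return VirtualRef(path, offset, length, os.stat(path).st_mtime_ns)


def read_chunk(ref: VirtualRef) -> bytes:
    # Compare the current mtime with the recorded one before reading any bytes.
    if os.stat(ref.path).st_mtime_ns != ref.last_modified_ns:
        raise StaleReferenceError(
            f"{ref.path} was modified after the reference was created"
        )
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)
```

As the comment above points out, a check of this shape only compares timestamps, so a writer who changes the bytes and restores the original timestamp would not be detected.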

ianhi · 2025-10-27T20:30:42Z

docs/faq.md

+Icechunk provides several compelling advantages over either Kerchunk format:
+
+- **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created.
+- **Transactions** - Icechunk stores are updated via commits, each of which is effectively a single database-like transaction. This helps guarantee consistency of the virtual references you write, by making it impossible for someone reading the data to see a half-written state.


"Half-written" is slightly unclear here. Is what's being written the references?
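
One way to picture the "half-written state" in question is a toy model in which a writer stages reference updates privately and readers only ever see the last committed snapshot. This is an illustrative sketch, not Icechunk's API: `ToyStore`, `ToySession`, and the chunk keys and URLs are invented for the example.

```python
from types import MappingProxyType


class ToyStore:
    """Toy store: readers see only the most recently committed snapshot."""

    def __init__(self):
        # History of committed, read-only states; index 0 is the empty store.
        self._snapshots = [MappingProxyType({})]

    def read(self):
        return self._snapshots[-1]

    def session(self):
        return ToySession(self)


class ToySession:
    """Toy writable session that stages changes until commit."""

    def __init__(self, store):
        self._store = store
        self._staged = dict(store.read())

    def set_ref(self, key, ref):
        # Staged changes are invisible to readers until commit() is called.
        self._staged[key] = ref

    def commit(self):
        # Publish every staged change in a single step and return the version index.
        self._store._snapshots.append(MappingProxyType(dict(self._staged)))
        return len(self._store._snapshots) - 1
```

In this model a reader calling `store.read()` between two `set_ref` calls still gets the previous snapshot, so it can never observe only some of the references from one write.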

ianhi · 2025-10-27T20:31:32Z

docs/faq.md

+- **Transactions** - Icechunk stores are updated via commits, each of which is effectively a single database-like transaction. This helps guarantee consistency of the virtual references you write, by making it impossible for someone reading the data to see a half-written state.
+- **Version Control and Time Travel** - Icechunk stores a git-like history of all commits, allowing you to roll back to any previous version, or even create multiple branches and tags. See the [Icechunk docs on Version Control](https://icechunk.io/en/latest/version-control/).
+- **Read performance** - Reading data from Icechunk is faster than reading from Kerchunk references. This is because reading from Kerchunk references is done using the fsspec python library, whereas reading data from Icechunk (virtual references or native chunks) uses the Icechunk rust library. For this and a number of other reasons, reading data from Icechunk generally provides a much higher throughput.
+- **Incremental overwriting** - VirtualiZarr's `.to_icechunk` API allows you to write to a specific region. This is more difficult to do safely when writing to Kerchunk's format because it would generally require editing part of a single file.


I found this a bit confusing. Is it "safely" because I might muck up the Kerchunk JSON file?
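
For context on why editing part of a single file is delicate: a Kerchunk JSON reference set lives in one file, so updating the references for one region means rewriting that file as a whole. The sketch below assumes the version 1 JSON layout (a top-level `refs` mapping from chunk keys to `[url, offset, length]`) and uses a hypothetical helper, `update_refs`. It writes a temporary file and swaps it in with `os.replace`, so a reader on the same local filesystem never sees a partially written file.

```python
import json
import os
import tempfile


def update_refs(path, new_refs):
    """Merge new_refs into a Kerchunk-style JSON file by rewriting it whole."""
    with open(path) as f:
        doc = json.load(f)
    doc["refs"].update(new_refs)

    # Write next to the target, then atomically swap the new file into place.
    target_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(doc, f)
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        os.unlink(tmp)
        raise
```

Without the temporary-file step, a crash or a concurrent reader in the middle of the write could leave or observe a truncated JSON document.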

ianhi · 2025-10-27T20:32:34Z

docs/faq.md

+
+Conversely, the two Kerchunk formats have some advantages over Icechunk:
+- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
+- **Write latency** - In theory writing a single JSON or writing Parquet to object storage can be done with a smaller number of roundtrips. However this time taken will almost always be negligible compared to the time taken to parse the archival file formats.


"Smaller number of roundtrips" than writing an Icechunk manifest?

ianhi · 2025-10-27T20:32:58Z

docs/faq.md

+
+(Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.)
+
+Overall we strongly recommend using Icechunk over the Kerchunk formats, though VirtualiZarr will continue to support writing to both.


I'd put this at the top rather than the bottom.

It might also be nice to clarify what "support" means. Will both formats be supported in perpetuity? For both reading and writing, or eventually only for reading the Kerchunk format?

add faq ans comparing kerchunk format to icechunk

4997a1f

TomNicholas added the documentation Improvements or additions to documentation label Oct 22, 2025

[pre-commit.ci] auto fixes from pre-commit.com hooks

099f730

for more information, see https://pre-commit.ci

pre-commit-ci bot temporarily deployed to test-release October 22, 2025 22:29 Inactive

ianhi reviewed Oct 23, 2025

jsignell reviewed Oct 23, 2025

TomNicholas and others added 6 commits October 24, 2025 09:38

Merge branch 'main' into format_comparison

ddb479a

update release notes

d9982db

move staleness protection to be the first bullet

41c2fa2

add link

51440f9

Merge branch 'format_comparison' of https://github.com/TomNicholas/Vi…

f88dc7f

…rtualiZarr into format_comparison

[pre-commit.ci] auto fixes from pre-commit.com hooks

b3c6438

for more information, see https://pre-commit.ci

pre-commit-ci bot temporarily deployed to test-release October 24, 2025 14:04 Inactive

TomNicholas mentioned this pull request Oct 24, 2025

VirtualiZarr example cubed-dev/cubed#646

Draft

fix bad merge

fd17d62

TomNicholas temporarily deployed to test-release October 24, 2025 19:14 — with GitHub Actions Inactive

Merge branch 'main' into format_comparison

1ba159c

maxrjones temporarily deployed to test-release October 24, 2025 19:52 — with GitHub Actions Inactive

maxrjones approved these changes Oct 24, 2025

ianhi reviewed Oct 27, 2025


