Skip to content

Conversation

@TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Oct 22, 2025

Closes #578

  • Changes are documented in docs/releases.rst

@TomNicholas TomNicholas added the documentation Improvements or additions to documentation label Oct 22, 2025
docs/faq.md Outdated
"Kerchunk" is really two things: a python library and an on-disk format for storing virtual references.

This question compares the Kerchunk python library to the VirtualiZarr python library.
For a discussion of the pros and cons of serializing into the Kerchunk references format, see the next question.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link

docs/faq.md Outdated
- **Read performance** - Reading data from Icechunk is faster than reading from Kerchunk references. This is because reading from Kerchunk references is done using the fsspec python library, whereas reading data from Icechunk (virtual references or native chunks) uses the Icechunk rust library. For this and a number of other reasons, reading data from Icechunk generally provides a much higher throughput.
- **Incremental overwriting** - VirtualiZarr's `.to_icechunk` API allows you to write to a specific region. This is more difficult to do safely when writing to Kerchunk's format because it would generally require editing part of a single file.
- **Mix "native" and virtual chunks** - Icechunk's manifests can store any mixture of virtual chunks and "native" zarr chunks. Kerchunk's formats cannot do this ("inlined" chunks are something separate).
- **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These docs are great! This feels like the most important bullet to me. Maybe it should be first?

Copy link
Member

@maxrjones maxrjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I left a couple small suggestions to take or leave

Comment on lines +185 to +186
Conversely, the two Kerchunk formats have some advantages over Icechunk:
- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Conversely, the two Kerchunk formats have some advantages over Icechunk:
- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
Conversely, the two Kerchunk formats have some advantages over Icechunk:
- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.

Gap is needed for this to be rendered as a bulleted list

- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
- **Write latency** - In theory writing a single JSON or writing Parquet to object storage can be done with a smaller number of roundtrips. However this time taken will almost always be negligible compared to the time taken to parse the archival file formats.

(Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.)

IMO, this doesn't help answer the question "Which format should I save my virtual references as?"


Icechunk provides several compelling advantages over either Kerchunk format:

- **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible that on some systems a malicious actor could change the file and leave the last modified as is. But there is no defense against this, so may not be worht mentioning

Icechunk provides several compelling advantages over either Kerchunk format:

- **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created.
- **Transactions** - Icechunk stores are updated via commits, each of which is effectively a single database-like transaction. This helps guarantee consistency of the virtual references you write, by making it impossible for someone reading the data to see a half-written state.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Half written slightly unclear here, what's being written is the references?

- **Transactions** - Icechunk stores are updated via commits, each of which is effectively a single database-like transaction. This helps guarantee consistency of the virtual references you write, by making it impossible for someone reading the data to see a half-written state.
- **Version Control and Time Travel** - Icechunk stores a git-like history of all commits, allowing you to roll back to any previous version, or even create multiple branches and tags. See the [Icechunk docs on Version Control](https://icechunk.io/en/latest/version-control/).
- **Read performance** - Reading data from Icechunk is faster than reading from Kerchunk references. This is because reading from Kerchunk references is done using the fsspec python library, whereas reading data from Icechunk (virtual references or native chunks) uses the Icechunk rust library. For this and a number of other reasons, reading data from Icechunk generally provides a much higher throughput.
- **Incremental overwriting** - VirtualiZarr's `.to_icechunk` API allows you to write to a specific region. This is more difficult to do safely when writing to Kerchunk's format because it would generally require editing part of a single file.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this a bit confusing. safely because I might muck up the kerchunk json file?


Conversely, the two Kerchunk formats have some advantages over Icechunk:
- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
- **Write latency** - In theory writing a single JSON or writing Parquet to object storage can be done with a smaller number of roundtrips. However this time taken will almost always be negligible compared to the time taken to parse the archival file formats.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"smaller number of roundtrips" than icechunk manifest?


(Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.)

Overall we strongly recommend using Icechunk over the Kerchunk formats, though VirtualiZarr will continue to support writing to both.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put this at the top rather than the bottom.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also might be nice to clarify the support. support both in perpetuity? both read and write, or eventually just reading the kerchunk format?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

References formats comparison

4 participants