Add faq answer comparing kerchunk format to icechunk format #818
for more information, see https://pre-commit.ci
docs/faq.md
Outdated
| "Kerchunk" is really two things: a python library and an on-disk format for storing virtual references. | ||
|
|
||
| This question compares the Kerchunk python library to the VirtualiZarr python library. | ||
| For a discussion of the pros and cons of serializing into the Kerchunk references format, see the next question. |
link
docs/faq.md
Outdated
| - **Read performance** - Reading data from Icechunk is faster than reading from Kerchunk references. This is because reading from Kerchunk references is done using the fsspec python library, whereas reading data from Icechunk (virtual references or native chunks) uses the Icechunk rust library. For this and a number of other reasons, reading data from Icechunk generally provides a much higher throughput. | ||
| - **Incremental overwriting** - VirtualiZarr's `.to_icechunk` API allows you to write to a specific region. This is more difficult to do safely when writing to Kerchunk's format because it would generally require editing part of a single file. | ||
| - **Mix "native" and virtual chunks** - Icechunk's manifests can store any mixture of virtual chunks and "native" zarr chunks. Kerchunk's formats cannot do this ("inlined" chunks are something separate). | ||
| - **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created. |
These docs are great! This feels like the most important bullet to me. Maybe it should be first?
…rtualiZarr into format_comparison
for more information, see https://pre-commit.ci
maxrjones
left a comment
Nice, I left a couple small suggestions to take or leave
| Conversely, the two Kerchunk formats have some advantages over Icechunk: | ||
| - **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable. |
| Conversely, the two Kerchunk formats have some advantages over Icechunk: | |
| - **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable. | |
| Conversely, the two Kerchunk formats have some advantages over Icechunk: | |
|
| - **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable. |
Gap is needed for this to be rendered as a bulleted list
| - **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable. | ||
| - **Write latency** - In theory writing a single JSON or writing Parquet to object storage can be done with a smaller number of roundtrips. However this time taken will almost always be negligible compared to the time taken to parse the archival file formats. | ||
|
|
||
| (Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.) |
| (Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.) |
IMO, this doesn't help answer the question "Which format should I save my virtual references as?"
|
|
||
| Icechunk provides several compelling advantages over either Kerchunk format: | ||
|
|
||
| - **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created. |
It's possible that on some systems a malicious actor could change the file and leave the last-modified time as is. But there is no defense against this, so it may not be worth mentioning.
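For readers following this thread, the read-time check described in the bullet can be sketched with the standard library alone. This is a conceptual illustration, not Icechunk's implementation: `VirtualRef`, `make_ref`, `read_ref`, and `StaleReferenceError` are hypothetical names, and a local file's mtime stands in for an object store's last-modified metadata.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualRef:
    """Hypothetical virtual reference: a byte range plus the mtime seen at creation."""
    path: str
    offset: int
    length: int
    last_modified_ns: int


class StaleReferenceError(RuntimeError):
    """Raised when the referenced file changed after the reference was created."""


def make_ref(path: str, offset: int, length: int) -> VirtualRef:
    # Record the file's modification time alongside the byte range.
    return VirtualRef(path, offset, length, os.stat(path).st_mtime_ns)


def read_ref(ref: VirtualRef) -> bytes:
    # Refuse to read if the modification time no longer matches the stored one.
    if os.stat(ref.path).st_mtime_ns != ref.last_modified_ns:
        raise StaleReferenceError(
            f"{ref.path} was modified after the reference was created"
        )
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)
```

A check of this shape turns a silently wrong read into an explicit error. As the comment above notes, it cannot detect a change that also restores the original timestamp.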
| Icechunk provides several compelling advantages over either Kerchunk format: | ||
|
|
||
| - **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created. | ||
| - **Transactions** - Icechunk stores are updated via commits, each of which is effectively a single database-like transaction. This helps guarantee consistency of the virtual references you write, by making it impossible for someone reading the data to see a half-written state. |
"Half-written" is slightly unclear here. Is what's being written the references?
| - **Transactions** - Icechunk stores are updated via commits, each of which is effectively a single database-like transaction. This helps guarantee consistency of the virtual references you write, by making it impossible for someone reading the data to see a half-written state. | ||
| - **Version Control and Time Travel** - Icechunk stores a git-like history of all commits, allowing you to roll back to any previous version, or even create multiple branches and tags. See the [Icechunk docs on Version Control](https://icechunk.io/en/latest/version-control/). | ||
| - **Read performance** - Reading data from Icechunk is faster than reading from Kerchunk references. This is because reading from Kerchunk references is done using the fsspec python library, whereas reading data from Icechunk (virtual references or native chunks) uses the Icechunk rust library. For this and a number of other reasons, reading data from Icechunk generally provides a much higher throughput. | ||
| - **Incremental overwriting** - VirtualiZarr's `.to_icechunk` API allows you to write to a specific region. This is more difficult to do safely when writing to Kerchunk's format because it would generally require editing part of a single file. |
I found this a bit confusing. "Safely" because I might muck up the Kerchunk JSON file?
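One possible reading of "safely", sketched below under stated assumptions: a Kerchunk JSON reference set is a single document, so changing a few chunk references means loading, editing, and rewriting the whole file. The layout follows the Kerchunk version 1 reference spec (`[url, offset, length]` entries); the URLs, byte ranges, and the `overwrite_region` helper are made up for illustration and are not part of Kerchunk or VirtualiZarr.

```python
import json
import os
import tempfile

# Illustrative Kerchunk-style references; URLs and byte ranges are invented.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temp/0.0": ["s3://bucket/a.nc", 1000, 500],
        "temp/0.1": ["s3://bucket/a.nc", 1500, 500],
    },
}


def overwrite_region(json_path: str, updates: dict) -> None:
    """Hypothetical helper: replace some references by rewriting the entire file."""
    with open(json_path) as f:
        doc = json.load(f)
    doc["refs"].update(updates)
    # Write a temporary file in the same directory, then rename it over the
    # original so a local reader never observes a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(json_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(doc, f)
    os.replace(tmp_path, json_path)
```

The rename trick only gives this guarantee on a local filesystem; object stores behave differently, and two writers doing this concurrently can still overwrite each other's updates.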
|
|
||
| Conversely, the two Kerchunk formats have some advantages over Icechunk: | ||
| - **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable. | ||
| - **Write latency** - In theory writing a single JSON or writing Parquet to object storage can be done with a smaller number of roundtrips. However this time taken will almost always be negligible compared to the time taken to parse the archival file formats. |
"smaller number of roundtrips" than icechunk manifest?
|
|
||
| (Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.) | ||
|
|
||
| Overall we strongly recommend using Icechunk over the Kerchunk formats, though VirtualiZarr will continue to support writing to both. |
I'd put this at the top rather than the bottom.
It might also be nice to clarify the support. Will both be supported in perpetuity? For both reading and writing, or eventually just reading the Kerchunk format?
Closes #578
docs/releases.rst