feat: add segmentation spec #91
base: master
Conversation
jm-clius
left a comment
I have a few comments below, none of them major. Thanks!
> name: Message Segmentation and Reconstruction
> tags: [waku-application, segmentation]
> version: 0.1
> status: draft
jazzz
left a comment
Overall, I think the idea has good bones.
`message_hash`, `segment_index`, `segment_count` provide the base functionality for segmentation within an encrypted context.
Relaxing some of the implementation constraints would make it easier for other engineers to incorporate this work into their projects.
> ### Reed–Solomon
>
> Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize`.
> The last data chunk **MUST** be padded to `segmentSize` for encoding.
What is the motivation for padding to segment size? I can see this being beneficial in some cases, but curious what makes it required?
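For context on the rule being questioned: most Reed–Solomon libraries operate on equal-length shards, so the last chunk is typically zero-padded up to `segmentSize`. A minimal sketch of that padding step (function name and zero-padding choice are hypothetical, not taken from the spec):

```python
def pad_chunks(payload: bytes, segment_size: int) -> list[bytes]:
    """Split payload into segment_size chunks, zero-padding the last one.

    Equal-length shards are what typical Reed-Solomon encoders require;
    the original length would need to travel separately so the receiver
    can strip the padding after reconstruction.
    """
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)]
    if chunks and len(chunks[-1]) < segment_size:
        chunks[-1] = chunks[-1].ljust(segment_size, b"\x00")
    return chunks
```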
> Receivers **MUST** enforce:
>
> - `entire_message_hash.length == 32`
> - **Data segments:** `segments_count >= 2` **AND** `index < segments_count`
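The quoted receiver checks can be sketched as follows (the `Segment` container and function name are hypothetical; the field names follow the quoted spec text):

```python
from dataclasses import dataclass


@dataclass
class Segment:
    entire_message_hash: bytes
    index: int
    segment_count: int  # per the spec draft, 0 marks a parity segment


def validate_data_segment(seg: Segment) -> bool:
    """Enforce the receiver-side checks quoted above for data segments."""
    return (
        len(seg.entire_message_hash) == 32
        and seg.segment_count >= 2
        and seg.index < seg.segment_count
    )
```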
Is it strictly necessary that `segments_count > 1`?
While it would waste 4 bytes, is there anything stopping implementers from wrapping all payloads in a `SegmentMessageProto` message, even if the message is smaller than `segmentSize`?
Adding on here: implementations like ReliableChannels or Conversations may want to include the segmentation frame regardless, to maintain a deterministic parsing tree.
If the output of the "Segmentation" step is sometimes a `SegmentationMessageProto` and sometimes an application frame, this adds extra uncertainty as to how payloads ought to be decoded.
While deterministic parsing trees can be debated, it seems that forcing `segments_count > 1` is an optimization rather than a compatibility issue.
fryorcraken
left a comment
I think we can and should make the spec mostly Waku agnostic. Waku can still be mentioned in the motivation.
> This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a message transport/delivery service with a size limitation, when the original payload exceeds said limitation.
> Applications partition the payload into multiple wire-message envelopes and reconstruct the original on receipt,
> even when segments arrive out of order or up to a **predefined percentage** of segments are lost.
With my proposal https://forum.vac.dev/t/introducing-the-reliable-channel-api/580/14 of applying segmentation before SDS, even if segmentation has reconstructed the message (enough chunks received), SDS may still try to retrieve missing chunks.
What do we think of this?
The alternative is to apply SDS first and then chunk messages, meaning that once enough chunks arrive, the SDS message can be completed and added to the SDS log.
However, it would also mean evolving the retrieval hint so that a list of Waku message ids can be passed.
@jm-clius @jazzz @kaichaosun @shash256 thoughts?
Mmm, open for debate, but I still think it's most natural for SDS to apply after segmentation - otherwise SDS-R would also be a very large hammer, requesting repairs based on a message id that requires multiple broadcast chunks. It will be possible to work around it, but IMO SDS is best applied if the causality tree matches what is actually broadcast.
In my opinion, segmentation and SDS could be optional add-ons for applications, since not every app uses large messages, and SDS may not even be suitable in some cases.
If both segmentation and SDS are required, we actually cannot segment first and then apply SDS later, because SDS adds extra load and may exceed the threshold.
I too am open to discussion, but I'm aligned with @jm-clius on this. Segmentation -> SDS is the most natural flow.

> we actually can not segment first then apply SDS later, because SDS adds extra loads and may exceed the threshold.

@kaichaosun is correct that the payload size of subsequent layers is an issue. However, it's possible to reduce `segmentSize` to account for the extra overhead.
In the case where SDS -> Segmentation is the preferred approach, there will still be overhead to account for, such as application-level encryption and protobuf serialization. Some buffer room will be needed regardless.
Ensuring that protocols have an upper bound on bytes added can help inform implementers of how much extra overhead to expect.
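The budgeting argument above is simple arithmetic: whatever layers wrap a segment after the "Segmentation" step must fit inside the transport's size cap, so the usable `segmentSize` shrinks by their combined upper bound. A sketch (all byte figures below are made-up placeholders, not measured overheads of SDS or any Waku encryption):

```python
def usable_segment_size(max_payload: int, layer_overheads: list[int]) -> int:
    """Reduce the transport's size cap by each later layer's worst-case
    byte overhead (e.g. SDS wrapper, encryption, protobuf framing)."""
    return max_payload - sum(layer_overheads)


# Hypothetical example: 150 kB transport cap, with placeholder overheads
# for an SDS wrapper (64 B), AEAD encryption (28 B), and framing (16 B).
budget = usable_segment_size(150_000, [64, 28, 16])
```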
I make an argument for Segmentation -> SDS -> Encryption here -- Spirited discussion welcome
> - **parity segment**: an erasure-coded segment derived from the set of data segments.
> - **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
> - **`segmentSize`**: configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).
> - **sender public key**: the origin identifier used for indexing persistence.
plausible deniability flag cc @jazzz
Hmm, this is actually only mentioned at the end; maybe we just remove it, because for a reader it is unclear how such a key should be used.
I think it's fair to point out to implementers that, when persisting segments, indexing by both a unique sender identifier and the `entire_message_hash` would be more performant.
I'm not sure that this specification needs to be identity aware, though. I'd leave it generic and remove "sender public key" so as not to confuse readers.
> `uint32 index = 2; // zero-based sequence number; valid only if segments_count > 0`
> `uint32 segment_count = 3; // number of data segments (>= 2)`
There is `segments_count` and `segment_count`. Is that the same?

> valid only if segments_count > 0

Do you mean `segments_count > 1`? Because if we only have one segment, we don't really need segmentation.
It should be `segment_count`. I changed it to singular based on a comment from @jazzz; I might have missed one occurrence in the plural.
ref: #91 (comment)
> - `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and verify reconstruction integrity.
> - `index`: Zero-based sequence number identifying this data segment's position (0, 1, 2, ..., segments_count - 1).
> - `segment_count`: Total number of data segments the original message was split into.
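The field semantics quoted above can be sketched as a segment-producing step. The function and dict shape are hypothetical, and hashlib's `sha3_256` is used only as a stand-in for the Keccak-256 the spec calls for (NIST SHA3-256 differs from Keccak-256 in its padding byte, so a real implementation would need an actual Keccak-256 library):

```python
import hashlib


def make_data_segments(payload: bytes, segment_size: int) -> list[dict]:
    """Split a payload into data segments per the quoted field definitions."""
    h = hashlib.sha3_256(payload).digest()  # stand-in for Keccak-256
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)]
    return [
        {
            "entire_message_hash": h,     # 32-byte hash of the whole payload
            "index": i,                   # zero-based position
            "segment_count": len(chunks),  # total number of data segments
            "payload": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]
```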
Suggested change:

> - `segment_count`: Total number of data segments the original message was split into.
> + `segments_count`: Total number of data segments the original message was split into.

Are they two different things or the same?
> - `parity_segment_index`: Zero-based sequence number for parity segments.
> - `parity_segments_count`: Total number of parity segments generated.
>
> A message is either a **data segment** (when `segment_count > 0`) or a **parity segment** (when `segment_count == 0`).
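The discriminator rule quoted above reduces to a one-line check (function name hypothetical):

```python
def is_parity_segment(segment_count: int) -> bool:
    """Per the quoted rule: segment_count == 0 marks a parity segment,
    whose position is carried in the separate parity_* fields;
    any positive value marks a data segment."""
    return segment_count == 0
```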
Ah ok, so what we are saying is that we set `segments_count: 0` when this is a parity segment. I think it could be written in a clearer manner.
> `entire_message_hash` enables correlation of segments that belong to the same original message but does not reveal content.
> Traffic analysis may still identify segmented flows.
Maybe add encryption considerations.
> ### Storage / Persistence
>
> Segments **MAY** be persisted (e.g., SQLite) and indexed by `entire_message_hash` and sender public key.
Here, maybe just mention "by sender" and something like "sender may be authenticated, out of scope of this spec".
Added the spec for application-level segmentation of messages, enabling messages larger than 150 kB to be sent over Waku by partitioning them at the source and reconstructing them on the recipient.
Already implemented in https://github.com/waku-org/nim-chat-sdk/blob/main/chat_sdk/segmentation.nim