feat: add segmentation spec #91
---
title: Message Segmentation and Reconstruction
name: Message Segmentation and Reconstruction
tags: [waku-application, segmentation]
version: 0.1
status: draft
---
## Abstract

This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a message transport or delivery service with a size limitation, when the original payload exceeds that limitation.
Applications partition the payload into multiple wire-message envelopes and reconstruct the original on receipt,
even when segments arrive out of order or up to a **predefined percentage** of segments are lost.
> **Contributor:** With my proposal https://forum.vac.dev/t/introducing-the-reliable-channel-api/580/14 of applying segmentation before SDS, it means that even if segmentation may have reconstructed the message (enough chunks received), SDS may still try to retrieve missing chunks. What do we think of this? The alternative is to apply SDS first and then chunk messages, meaning that when enough chunks arrive, the SDS message can be completed and added to the SDS log. However, it would also mean evolving the retrieval hint so that a list of Waku message ids can be passed. @jm-clius @jazzz @kaichaosun @shash256 thoughts?
>
> **Contributor:** Mmm, open for debate, but I still think it's most natural for SDS to apply after segmentation. Otherwise SDS-R would also be a very large hammer, requesting repairs based on a message id that requires multiple broadcast chunks. It will be possible to work around it, but IMO SDS is best applied if the causality tree matches what is actually broadcast.
>
> **Contributor:** In my opinion, segmentation and SDS could be optional add-ons for applications, since not every app uses large messages, and SDS may not even be suitable in some cases.
>
> **Contributor:** I too am open to discussion, but I'm aligned with @jm-clius on this: segmentation -> SDS is the most natural flow. @kaichaosun is correct that the payload size of subsequent layers is an issue. However, it is possible to reduce segmentSize to account for the extra overhead. In the case where SDS -> segmentation is the preferred approach, there will still be overhead to account for, such as application-level encryption and protobuf serialization. Some buffer room will be needed regardless. Ensuring that protocols have an upper bound on bytes added can help inform implementers of how much extra overhead to expect.
>
> **Contributor:** I make an argument for [...]
The protocol uses **Reed–Solomon** erasure coding for fault tolerance.
Messages whose payload size is **≤ `segmentSize`** are sent unmodified.
## Motivation

Waku Relay deployments typically propagate envelopes up to **150 KB** as per [64/WAKU2-NETWORK - Message](https://rfc.vac.dev/waku/standards/core/64/network#message-size).
To support larger application payloads,
a segmentation layer is required.
This specification enables larger messages by partitioning them into multiple envelopes and reconstructing them at the receiver.
Erasure-coded parity segments provide resilience against partial loss or reordering.
## Terminology

- **original payload**: the full application payload before segmentation.
- **data segment**: one of the partitioned chunks of the original message payload.
- **parity segment**: an erasure-coded segment derived from the set of data segments.
- **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
- **`segmentSize`**: the configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).
- **sender public key**: the origin identifier used when indexing persisted segments.
> **Contributor:** plausible deniability flag cc @jazzz
>
> **Contributor:** Hum, this is actually only mentioned at the end; maybe we just remove it, because for a reader it is unclear how such a key should be used.
>
> **Contributor:** I think it's fair to point out to implementers that when persisting segments, indexing by both a unique sender identifier and the entire_message_hash would be more performant. I'm not sure that this specification needs to be identity-aware, though. I'd leave it generic and remove "sender public key" so as not to confuse readers.
The key words **"MUST"**, **"MUST NOT"**, **"REQUIRED"**, **"SHALL"**, **"SHALL NOT"**, **"SHOULD"**, **"SHOULD NOT"**, **"RECOMMENDED"**, **"NOT RECOMMENDED"**, **"MAY"**, and **"OPTIONAL"** in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
## Wire Format

Each segmented message is encoded as a `SegmentMessageProto` protobuf message:
```protobuf
syntax = "proto3";

message SegmentMessageProto {
  // Keccak256(original payload), 32 bytes
  bytes entire_message_hash = 1;

  // Data segment indexing
  uint32 index = 2;           // zero-based sequence number; valid only if segments_count > 0
  uint32 segments_count = 3;  // number of data segments (>= 2)

  // Segment payload (data or parity shard)
  bytes payload = 4;

  // Parity segment indexing (used if segments_count == 0)
  uint32 parity_segment_index = 5;   // zero-based sequence number for parity segments
  uint32 parity_segments_count = 6;  // number of parity segments (> 0)
}
```

> **Comment on lines +48 to +49** (the data-segment indexing fields above)
>
> **Contributor:** there is [...]; do you mean [...]?
>
> **Author:** should be [...]
>
> **Contributor:** ref: #91 (comment)
**Field descriptions:**

- `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and to verify reconstruction integrity.
- `index`: Zero-based sequence number identifying this data segment's position (0, 1, 2, ..., segments_count - 1).
- `segments_count`: Total number of data segments the original message was split into.
> **Contributor:** (suggested change) [...] are they 2 different things or the same?
- `payload`: The actual chunk of data or parity information for this segment.
- `parity_segment_index`: Zero-based sequence number for parity segments.
- `parity_segments_count`: Total number of parity segments generated.

A message is either a **data segment** (when `segments_count > 0`) or a **parity segment** (when `segments_count == 0`).
> **Contributor:** ah ok, so what we are saying is that we set [...]
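As a non-normative illustration (the segment count of four and the single parity shard below are assumptions, not mandated values), the following sketch shows how the indexing fields of `SegmentMessageProto` would be populated for one payload split into four data segments plus one parity segment:

```python
# Illustrative only: field values for one original payload split into
# 4 data segments and 1 parity segment (dicts mirror SegmentMessageProto fields).
entire_message_hash = b"\x00" * 32  # placeholder for Keccak256(original_payload)

data_segments = [
    {"entire_message_hash": entire_message_hash,
     "index": i,                # 0, 1, 2, 3
     "segments_count": 4,       # > 0 marks a data segment
     "payload": b"<data chunk>"}
    for i in range(4)
]

parity_segments = [
    {"entire_message_hash": entire_message_hash,
     "segments_count": 0,       # 0 marks a parity segment
     "parity_segment_index": 0,
     "parity_segments_count": 1,
     "payload": b"<parity shard>"}
]
```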
### Validation

Receivers **MUST** enforce:

- `entire_message_hash.length == 32`
- **Data segments:**
  `segments_count >= 2` **AND** `index < segments_count`
- **Parity segments:**
  `segments_count == 0` **AND** `parity_segments_count > 0` **AND** `parity_segment_index < parity_segments_count`

No other combinations are permitted.
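A minimal sketch of these checks, assuming each received segment is represented as a dictionary whose keys mirror the `SegmentMessageProto` fields (the in-memory representation is an assumption of this sketch, not part of the wire format):

```python
def is_valid_segment(seg: dict) -> bool:
    # Hash must be exactly 32 bytes.
    if len(seg["entire_message_hash"]) != 32:
        return False
    if seg["segments_count"] > 0:
        # Data segment: at least 2 data segments, index in range.
        return seg["segments_count"] >= 2 and seg["index"] < seg["segments_count"]
    # Parity segment: segments_count == 0, parity index in range.
    return (seg["parity_segments_count"] > 0
            and seg["parity_segment_index"] < seg["parity_segments_count"])
```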
## Segmentation

### Sending

When the original payload exceeds `segmentSize`, the sender:

- **MUST** compute a 32-byte `entire_message_hash = Keccak256(original_payload)`.
- **MUST** split the payload into two or more **data segments**,
  each of size up to `segmentSize` bytes.
- **MAY** use Reed–Solomon erasure coding at the predefined parity rate.
- Encode each segment as a `SegmentMessageProto` with:
  - the `entire_message_hash`,
  - either data-segment indices (`segments_count`, `index`) or parity-segment indices (`parity_segments_count`, `parity_segment_index`), and
  - the raw payload data.
- Send all segments as individual Waku envelopes,
  preserving application-level metadata (e.g., content topic).

Messages smaller than or equal to `segmentSize` **SHALL** be transmitted unmodified.
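The sketch below illustrates one possible realization of the sending procedure under stated assumptions: `keccak256` and `rs_encode` are stand-ins for a real Keccak-256 primitive and a Reed–Solomon encoder (e.g. Leopard-RS), the zero-byte stubs exist only to keep the example self-contained, and the round-up rule for the number of parity shards is an assumption rather than a normative requirement.

```python
import hashlib
import math

SEGMENT_SIZE = 100 * 1024  # example value; segmentSize is application-configured
PARITY_RATE = 0.125        # fixed parity rate from the Configuration section


def keccak256(data: bytes) -> bytes:
    # Stub: SHA3-256 is NOT Keccak-256 (different padding); real implementations
    # must use an actual Keccak-256 primitive. Used here only to stay runnable.
    return hashlib.sha3_256(data).digest()


def rs_encode(shards: list, parity_count: int) -> list:
    # Stub standing in for a Reed-Solomon encoder such as Leopard-RS.
    # A real encoder requires equal-length shards (last chunk padded to SEGMENT_SIZE).
    return [bytes(SEGMENT_SIZE) for _ in range(parity_count)]


def segment_payload(original_payload: bytes) -> list:
    """Split an oversized payload into data segments plus parity segments."""
    if len(original_payload) <= SEGMENT_SIZE:
        return []  # sent unmodified; no segmentation needed

    msg_hash = keccak256(original_payload)
    chunks = [original_payload[i:i + SEGMENT_SIZE]
              for i in range(0, len(original_payload), SEGMENT_SIZE)]

    segments = [{"entire_message_hash": msg_hash, "index": i,
                 "segments_count": len(chunks), "payload": chunk}
                for i, chunk in enumerate(chunks)]

    parity_count = math.ceil(len(chunks) * PARITY_RATE)  # rounding is an assumption
    segments += [{"entire_message_hash": msg_hash, "segments_count": 0,
                  "parity_segment_index": j, "parity_segments_count": parity_count,
                  "payload": shard}
                 for j, shard in enumerate(rs_encode(chunks, parity_count))]
    return segments
```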
### Receiving

Upon receiving a segmented message, the receiver:

- **MUST** validate each segment according to [Wire Format → Validation](#validation).
- **MUST** cache received segments.
- **MUST** attempt reconstruction when the number of available (data + parity) segments equals or exceeds the data segment count, by:
  - concatenating data segments if all are present, or
  - applying Reed–Solomon decoding if parity segments are available.
- **MUST** verify that `Keccak256(reconstructed_payload)` matches `entire_message_hash`.
  On mismatch,
  the message **MUST** be discarded and logged as invalid.
- Once verified,
  the reconstructed payload **SHALL** be delivered to the application.
- Incomplete reconstructions **SHOULD** be garbage-collected after a timeout.
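A minimal reconstruction sketch under stated assumptions: segments are dictionaries mirroring the wire fields, Reed–Solomon recovery from parity shards is left as a comment, and `keccak256` is a stub standing in for a real Keccak-256 implementation.

```python
import hashlib
from typing import Optional


def keccak256(data: bytes) -> bytes:
    # Stub only: real implementations must use Keccak-256, not SHA3-256.
    return hashlib.sha3_256(data).digest()


def try_reconstruct(segments: list) -> Optional[bytes]:
    """Return the original payload once enough valid segments are cached."""
    data = {s["index"]: s["payload"] for s in segments if s["segments_count"] > 0}
    if not data:
        return None
    total = next(s["segments_count"] for s in segments if s["segments_count"] > 0)

    if len(data) < total:
        # Missing data segments: a real implementation would attempt
        # Reed-Solomon decoding here using the cached parity shards.
        return None

    payload = b"".join(data[i] for i in range(total))
    if keccak256(payload) != segments[0]["entire_message_hash"]:
        return None  # MUST discard and log as invalid on hash mismatch
    return payload
```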
---

## Implementation Suggestions
### Reed–Solomon

Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize`.
The last data chunk **MUST** be padded to `segmentSize` for encoding.
The reference implementation uses **nim-leopard** (Leopard-RS) with a maximum of **256 total shards**.
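For example, the padding step before encoding might look like the following sketch (zero-padding is an assumption of this sketch; the specification only requires fixed-size shards, not a particular padding scheme):

```python
def pad_to_shard_size(chunk: bytes, segment_size: int) -> bytes:
    # All shards handed to the Reed-Solomon encoder must have equal length,
    # so the final (short) data chunk is padded up to segment_size.
    assert len(chunk) <= segment_size
    return chunk + bytes(segment_size - len(chunk))
```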
### Storage / Persistence

Segments **MAY** be persisted (e.g., SQLite) and indexed by `entire_message_hash` and sender public key.

> **Contributor:** Here, maybe just mention "by sender" and something like "sender may be authenticated, out of scope of spec".

Implementations **SHOULD** support:

- Duplicate detection and idempotent saves
- Completion flags to prevent duplicate processing
- Timeout-based cleanup of incomplete reconstructions
- Per-sender quotas for stored bytes and concurrent reconstructions
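As a non-normative illustration of these suggestions (the table name, column names, and layout are assumptions of this sketch), an SQLite schema keyed on `entire_message_hash`, segment kind, and index makes saves idempotent, while a secondary index on the sender supports per-sender quotas:

```python
import sqlite3

# Illustrative schema only; names and types are assumptions, not normative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS segments (
        entire_message_hash BLOB    NOT NULL,
        sender              BLOB    NOT NULL,  -- origin identifier (e.g. public key)
        is_parity           INTEGER NOT NULL,
        segment_index       INTEGER NOT NULL,
        payload             BLOB    NOT NULL,
        received_at         INTEGER NOT NULL,
        completed           INTEGER NOT NULL DEFAULT 0,  -- completion flag
        PRIMARY KEY (entire_message_hash, is_parity, segment_index)
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_segments_sender ON segments (sender)")


def save_segment(hash_, sender, is_parity, index, payload, now):
    # INSERT OR IGNORE makes saves idempotent: duplicates hit the primary key.
    conn.execute(
        "INSERT OR IGNORE INTO segments VALUES (?, ?, ?, ?, ?, ?, 0)",
        (hash_, sender, is_parity, index, payload, now),
    )
    conn.commit()
```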
### Configuration

**Required parameters:**

- `segmentSize`: **REQUIRED** configurable parameter;
  the maximum size in bytes of each data segment's payload chunk (before protobuf serialization).

**Fixed parameters:**

- `parityRate`: fixed at **0.125** (12.5%)
- `maxTotalSegments`: **256**

**Reconstruction capability:**
With the predefined parity rate,
reconstruction is possible if **all data segments** are received or if **any combination of data + parity segments** totals at least `dataSegments` (i.e., up to the predefined percentage of loss is tolerated).
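As a worked example (the 16-segment figure is illustrative and the round-up rule is an assumption of this sketch), a payload split into 16 data segments yields 2 parity segments at the fixed rate, so any 16 of the 18 transmitted shards are sufficient:

```python
import math

data_segments = 16                                  # example only
parity_segments = math.ceil(data_segments * 0.125)  # 2, assuming round-up
total_shards = data_segments + parity_segments      # 18 envelopes on the wire
tolerated_losses = parity_segments                  # any 16 of 18 shards suffice
print(parity_segments, total_shards, tolerated_losses)  # -> 2 18 2
```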
**API simplicity:**
Libraries **SHOULD** require only `segmentSize` from the application for normal operation.

### Support

- **Language / Package:** Nim;
  **Nimble** package manager
- **Intended for:** all Waku nodes at the application layer

---
## Security Considerations

### Privacy

`entire_message_hash` enables correlation of segments that belong to the same original message but does not reveal content.
Traffic analysis may still identify segmented flows.
> **Contributor (on lines +169 to +170):** Maybe add encryption considerations.
### Integrity

Implementations **MUST** verify the Keccak256 hash post-reconstruction and discard on mismatch.

### Denial of Service

To mitigate resource exhaustion:

- Limit concurrent reconstructions and per-sender storage
- Enforce timeouts and size caps
- Validate segment counts (≤ 256)
- Consider rate-limiting using [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay)

### Compatibility

Nodes that do **not** implement this specification cannot reconstruct large messages.

---
## Deployment Considerations

**Overhead:**

- Bandwidth overhead from parity ≈ the predefined parity rate (if enabled)
- Additional per-segment overhead ≤ **100 bytes** (protobuf + metadata)

**Network impact:**

- Larger messages increase gossip traffic and storage;
  operators **SHOULD** consider policy limits

---
## References

1. [10/WAKU2 – Waku](https://rfc.vac.dev/waku/standards/core/10/waku2)
2. [11/WAKU2-RELAY – Relay](https://rfc.vac.dev/waku/standards/core/11/relay)
3. [14/WAKU2-MESSAGE – Message](https://rfc.vac.dev/waku/standards/core/14/message)
4. [64/WAKU2-NETWORK](https://rfc.vac.dev/waku/standards/core/64/network#message-size)
5. [nim-leopard](https://github.com/status-im/nim-leopard) – Nim bindings for Leopard-RS (Reed–Solomon)
6. [Leopard-RS](https://github.com/catid/leopard) – Fast Reed–Solomon erasure coding library
7. [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt) – Key words for use in RFCs to Indicate Requirement Levels
> @Cofson @jimstir this is an interesting case. The spec is introduced as `draft`, likely because a matching reference implementation already exists. However, I think we usually promote `draft` specs to Vac RFC? Should specs skip the `raw` state in such instances?