Skip to content

Conversation

@plopezlpz
Copy link

@plopezlpz plopezlpz commented Oct 19, 2025

Added the specs for Application level segmentation of messages to be able to send messages larger than 150kb using Waku by partitioning them at source and reconstructing them on the recipient

Already implemented in https://github.com/waku-org/nim-chat-sdk/blob/main/chat_sdk/segmentation.nim

Copy link
Contributor

@jm-clius jm-clius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments below, none of them major. Thanks!

name: Message Segmentation and Reconstruction
tags: [waku-application, segmentation]
version: 0.1
status: draft
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Cofson @jimstir this is an interesting case. The spec is introduced as draft, likely because a matching reference implementation already exists. However, I think we usually promote draft specs to Vac RFC? Should specs skip the raw state in such instances?

Copy link
Contributor

@jazzz jazzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think the idea has good bones.

message_hash, segment_index, segment_count provide base functionality for segmentation within an encrypted context.

Relaxing some of the implementation constraints would make it easier for other engineers to incorporate this work into their projects.

### Reed–Solomon

Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize`.
The last data chunk **MUST** be padded to `segmentSize` for encoding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for padding to segment size? I can see this being beneficial in some cases, but curious what makes it required?

Receivers **MUST** enforce:

- `entire_message_hash.length == 32`
- **Data segments:** `segments_count >= 2` **AND** `index < segments_count`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it strictly necessary that segments_count > 1 ?

While it would be a waste of 4 bytes, is there anything stopping implementers from wrapping all payloads in a SegmentMessageProto message even if the message is less than segmentSize

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding on here; Implementations like ReliableChannels or Conversations may want to include the segmentation frame regardless to maintain a deterministic parsing tree.

If the output from the "Segmentation" step is sometimes a SegmentationMessageProto and sometimes a Application frame, this adds extra uncertainty as to how payloads ought to be decoded.

While deterministic parsing trees can be debated, it seems that forcing segments_count > 1 is an optimization rather than a compatibility issue

Copy link
Contributor

@fryorcraken fryorcraken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can and should make the spec mostly Waku agnostic. Waku can still be mentioned in the motivation.

@plopezlpz plopezlpz requested a review from jm-clius October 27, 2025 08:11

This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a message transport/delivery services with size limitation, when the original payload exceeds said limitation.
Applications partition the payload into multiple wire-messages envelopes and reconstruct the original on receipt,
even when segments arrive out of order or up to a **predefined percentage** of segments are lost.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my proposal https://forum.vac.dev/t/introducing-the-reliable-channel-api/580/14 of apply segmentation before SDS, it means that even if segmentation may have re-constructed the messages (enough chunks received), SDS may still try to retrieve missing chunks.

What do we think of this?

The alternative being to apply SDS first, and then chunk messages. Meaning that when enough chunk arrive across, then the SDS message can be completed and added to SDS log.

However, it would also mean evolving the retrieval hint, so that a list of waku message id can be passed.

@jm-clius @jazzz @kaichaosun @shash256 thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, open for debate, but I still think it's most natural for SDS to apply after segmentation - otherwise SDS-R would also be a very large hammer, requesting repairs based on a message id that requires multiple broadcast chunks. It will be possible to work around it, but IMO SDS is best applied if the causality tree matches what is actually broadcast.

Copy link
Contributor

@kaichaosun kaichaosun Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, segmentation and SDS could be optional add-ons for application, since not every app use large messages, and SDS may even not suitable in some cases.
If both segmentation and SDS are required, we actually can not segment first then apply SDS later, because SDS adds extra loads and may exceed the threshold.

Copy link
Contributor

@jazzz jazzz Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too am open to discussion, but I'm aligned with @jm-clius on this . Segmentation -> SDS is the most natural flow.

we actually can not segment first then apply SDS later, because SDS adds extra loads and may exceed the threshold.

@kaichaosun is correct that the payload size of subsequent layers is an issue. However its possible to reduce segmentSize to account for the extra overhead.

In the case where SDS -> Segmentation is the preferred approach, there will still be overhead to account for such as application level encryption and protobuf serialization. Some buffer room will be needed regardless.

Ensuring that protocols have a upper bound on bytes added can be help inform implementers of how much extra overhead they can expect.

Copy link
Contributor

@jazzz jazzz Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I make an argument for Segmentation -> SDS -> Encryption here -- Spirited discussion welcome

- **parity segment**: an erasure-coded segment derived from the set of data segments.
- **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
- **`segmentSize`**: configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).
- **sender public key**: the origin identifier used for indexing persistence.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plausible deniability flag cc @jazzz

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, this is actually only mentioned at the end, maybe we just remove it because from a reader, it is unclear of how such a key should be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its fair to point out to implementers that when persisting segments, indexing by both a unique sender identifier and the entire_message_hash would be more performant.

I'm not sure that this specification needs to be identity aware though. Id leave it generic, and remove "Sender public key" as to not confuse readers.

Comment on lines +48 to +49
uint32 index = 2; // zero-based sequence number; valid only if segments_count > 0
uint32 segment_count = 3; // number of data segments (>= 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is segments_count and segment_count. Is that the same?

valid only if segments_count > 0

do you mean segments_count >1 because if we only have one segment, we don't really need segmentation?

Copy link
Author

@plopezlpz plopezlpz Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be segment_count I changed to singular from a comment from @jazzz, I might have missed one with plural

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


- `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and verify reconstruction integrity.
- `index`: Zero-based sequence number identifying this data segment's position (0, 1, 2, ..., segments_count - 1).
- `segment_count`: Total number of data segments the original message was split into.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `segment_count`: Total number of data segments the original message was split into.
- `segments_count`: Total number of data segments the original message was split into.

are they 2 different things or the same?

- `parity_segment_index`: Zero-based sequence number for parity segments.
- `parity_segments_count`: Total number of parity segments generated.

A message is either a **data segment** (when `segment_count > 0`) or a **parity segment** (when `segment_count == 0`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, so what we are saying is that we set segments_count: 0 when this is a parity segment. I think it could be written in a clearer manner.

Comment on lines +169 to +170
`entire_message_hash` enables correlation of segments that belong to the same original message but does not reveal content.
Traffic analysis may still identify segmented flows.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add encryption considerations.


### Storage / Persistence

Segments **MAY** be persisted (e.g., SQLite) and indexed by `entire_message_hash` and sender public key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, maybe just mention that "by sender" and somethin glike "sender may be authenticated, out of scope of spec"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants