feat: add segmentation spec #91
base: master
Conversation
jm-clius
left a comment
I have a few comments below, none of them major. Thanks!
> name: Message Segmentation and Reconstruction
> tags: [waku-application, segmentation]
> version: 0.1
> status: draft
jazzz
left a comment
Overall, I think the idea has good bones.
`message_hash`, `segment_index`, `segment_count` provide the base functionality for segmentation within an encrypted context.
Relaxing some of the implementation constraints would make it easier for other engineers to incorporate this work into their projects.
> ### Reed–Solomon
>
> Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize`.
> The last data chunk **MUST** be padded to `segmentSize` for encoding.
What is the motivation for padding to segment size? I can see this being beneficial in some cases, but curious what makes it required?
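For context on the rule being questioned: most Reed–Solomon libraries operate on equal-length shards, so the last chunk is typically zero-padded up to `segmentSize`. A minimal sketch of that padding step (function name and zero-padding choice are hypothetical, not taken from the spec):

```python
def pad_chunks(payload: bytes, segment_size: int) -> list[bytes]:
    """Split payload into segment_size chunks, zero-padding the last one.

    Equal-length shards are what typical Reed-Solomon encoders require;
    the original length would need to travel separately so the receiver
    can strip the padding after reconstruction.
    """
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)]
    if chunks and len(chunks[-1]) < segment_size:
        chunks[-1] = chunks[-1].ljust(segment_size, b"\x00")
    return chunks
```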
> Receivers **MUST** enforce:
>
> - `entire_message_hash.length == 32`
> - **Data segments:** `segments_count >= 2` **AND** `index < segments_count`
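The quoted receiver checks can be sketched as follows (the `Segment` container and function name are hypothetical; the field names follow the quoted spec text):

```python
from dataclasses import dataclass


@dataclass
class Segment:
    entire_message_hash: bytes
    index: int
    segment_count: int  # per the spec draft, 0 marks a parity segment


def validate_data_segment(seg: Segment) -> bool:
    """Enforce the receiver-side checks quoted above for data segments."""
    return (
        len(seg.entire_message_hash) == 32
        and seg.segment_count >= 2
        and seg.index < seg.segment_count
    )
```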
Is it strictly necessary that `segments_count > 1`?
While it would waste 4 bytes, is there anything stopping implementers from wrapping all payloads in a `SegmentMessageProto` message, even if the message is smaller than `segmentSize`?
Adding on here: implementations like ReliableChannels or Conversations may want to include the segmentation frame regardless, to maintain a deterministic parsing tree.
If the output of the "Segmentation" step is sometimes a `SegmentationMessageProto` and sometimes an application frame, this adds extra uncertainty as to how payloads ought to be decoded.
While deterministic parsing trees can be debated, it seems that forcing `segments_count > 1` is an optimization rather than a compatibility issue.
fryorcraken
left a comment
I think we can and should make the spec mostly Waku agnostic. Waku can still be mentioned in the motivation.
> This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a message transport/delivery service with a size limitation, when the original payload exceeds said limitation.
> Applications partition the payload into multiple wire-message envelopes and reconstruct the original on receipt,
> even when segments arrive out of order or up to a **predefined percentage** of segments are lost.
With my proposal https://forum.vac.dev/t/introducing-the-reliable-channel-api/580/14 of applying segmentation before SDS, even if segmentation has reconstructed the message (enough chunks received), SDS may still try to retrieve missing chunks.
What do we think of this?
The alternative is to apply SDS first and then chunk messages, meaning that once enough chunks arrive, the SDS message can be completed and added to the SDS log.
However, it would also mean evolving the retrieval hint so that a list of Waku message ids can be passed.
@jm-clius @jazzz @kaichaosun @shash256 thoughts?
Mmm, open for debate, but I still think it's most natural for SDS to apply after segmentation - otherwise SDS-R would also be a very large hammer, requesting repairs based on a message id that requires multiple broadcast chunks. It will be possible to work around it, but IMO SDS is best applied if the causality tree matches what is actually broadcast.
In my opinion, segmentation and SDS could be optional add-ons for applications, since not every app uses large messages, and SDS may not even be suitable in some cases.
If both segmentation and SDS are required, we actually cannot segment first and then apply SDS later, because SDS adds extra load and may exceed the threshold.
I too am open to discussion, but I'm aligned with @jm-clius on this. Segmentation -> SDS is the most natural flow.

> we actually can not segment first then apply SDS later, because SDS adds extra loads and may exceed the threshold.

@kaichaosun is correct that the payload size of subsequent layers is an issue. However, it's possible to reduce `segmentSize` to account for the extra overhead.
In the case where SDS -> Segmentation is the preferred approach, there will still be overhead to account for, such as application-level encryption and protobuf serialization. Some buffer room will be needed regardless.
Ensuring that protocols have an upper bound on bytes added can help inform implementers of how much extra overhead to expect.
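The budgeting argument above is simple arithmetic: whatever layers wrap a segment after the "Segmentation" step must fit inside the transport's size cap, so the usable `segmentSize` shrinks by their combined upper bound. A sketch (all byte figures below are made-up placeholders, not measured overheads of SDS or any Waku encryption):

```python
def usable_segment_size(max_payload: int, layer_overheads: list[int]) -> int:
    """Reduce the transport's size cap by each later layer's worst-case
    byte overhead (e.g. SDS wrapper, encryption, protobuf framing)."""
    return max_payload - sum(layer_overheads)


# Hypothetical example: 150 kB transport cap, with placeholder overheads
# for an SDS wrapper (64 B), AEAD encryption (28 B), and framing (16 B).
budget = usable_segment_size(150_000, [64, 28, 16])
```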
I make an argument for Segmentation -> SDS -> Encryption here -- Spirited discussion welcome
> - **parity segment**: an erasure-coded segment derived from the set of data segments.
> - **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
> - **`segmentSize`**: configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).
> - **sender public key**: the origin identifier used for indexing persistence.
plausible deniability flag cc @jazzz
Hmm, this is actually only mentioned at the end; maybe we just remove it, because for a reader it is unclear how such a key should be used.
I think it's fair to point out to implementers that, when persisting segments, indexing by both a unique sender identifier and the `entire_message_hash` would be more performant.
I'm not sure that this specification needs to be identity aware, though. I'd leave it generic and remove "sender public key" so as not to confuse readers.
> `uint32 index = 2; // zero-based sequence number; valid only if segments_count > 0`
> `uint32 segment_count = 3; // number of data segments (>= 2)`
There is `segments_count` and `segment_count`. Is that the same?

> valid only if segments_count > 0

Do you mean `segments_count > 1`? Because if we only have one segment, we don't really need segmentation.
It should be `segment_count`. I changed it to singular based on a comment from @jazzz; I might have missed one occurrence in the plural.
ref: #91 (comment)
> - `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and verify reconstruction integrity.
> - `index`: Zero-based sequence number identifying this data segment's position (0, 1, 2, ..., segments_count - 1).
> - `segment_count`: Total number of data segments the original message was split into.
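The field semantics quoted above can be sketched as a segment-producing step. The function and dict shape are hypothetical, and hashlib's `sha3_256` is used only as a stand-in for the Keccak-256 the spec calls for (NIST SHA3-256 differs from Keccak-256 in its padding byte, so a real implementation would need an actual Keccak-256 library):

```python
import hashlib


def make_data_segments(payload: bytes, segment_size: int) -> list[dict]:
    """Split a payload into data segments per the quoted field definitions."""
    h = hashlib.sha3_256(payload).digest()  # stand-in for Keccak-256
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)]
    return [
        {
            "entire_message_hash": h,     # 32-byte hash of the whole payload
            "index": i,                   # zero-based position
            "segment_count": len(chunks),  # total number of data segments
            "payload": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]
```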
Suggested change:

> - `segment_count`: Total number of data segments the original message was split into.
> + `segments_count`: Total number of data segments the original message was split into.

Are they two different things or the same?
> - `parity_segment_index`: Zero-based sequence number for parity segments.
> - `parity_segments_count`: Total number of parity segments generated.
>
> A message is either a **data segment** (when `segment_count > 0`) or a **parity segment** (when `segment_count == 0`).
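The discriminator rule quoted above reduces to a one-line check (function name hypothetical):

```python
def is_parity_segment(segment_count: int) -> bool:
    """Per the quoted rule: segment_count == 0 marks a parity segment,
    whose position is carried in the separate parity_* fields;
    any positive value marks a data segment."""
    return segment_count == 0
```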
Ah ok, so what we are saying is that we set `segments_count: 0` when this is a parity segment. I think it could be written in a clearer manner.
> `entire_message_hash` enables correlation of segments that belong to the same original message but does not reveal content.
> Traffic analysis may still identify segmented flows.
Maybe add encryption considerations.
> ### Storage / Persistence
>
> Segments **MAY** be persisted (e.g., SQLite) and indexed by `entire_message_hash` and sender public key.
Here, maybe just mention "by sender" and something like "sender may be authenticated, out of scope of this spec".
Added the spec for application-level segmentation of messages, enabling messages larger than 150 kB to be sent over Waku by partitioning them at the source and reconstructing them on the recipient.
Already implemented in https://github.com/waku-org/nim-chat-sdk/blob/main/chat_sdk/segmentation.nim