Skip to content

Commit 539fe9b

Browse files
committed
Move "Durable Storage Architecture" to design doc
1 parent 001980c commit 539fe9b

File tree

4 files changed

+18
-59
lines changed

4 files changed

+18
-59
lines changed

en_US/design/durable-storage.md

Lines changed: 16 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,10 @@ EMQX provides two embedded backends that do not rely on third-party services:
1515

1616
### Data Storage Hierarchy
1717

18+
The database storage engine powering EMQX's built-in durability facilities organizes data into a hierarchical structure. The following figure illustrates how the durable storage databases are distributed across an EMQX cluster:
19+
20+
![emqx_ds_sharding](./assets/emqx_ds_sharding.png)
21+
1822
Internally, DS organizes data into a multi-layered hierarchy designed for both horizontal scalability and temporal partitioning. The structure is transparent to applications and ensures efficient data management across distributed EMQX nodes.
1923

2024
The complete DS hierarchy can be represented as follows:
@@ -26,7 +30,7 @@ flowchart TB
2630
SH["Shard"]
2731
GEN(["Generation (logical, time-based)"])
2832
SL["Slab (physical container)"]
29-
ST["Stream"]
33+
ST["Durable Storage Stream"]
3034
TTV["Topic–Timestamp–Value (TTV)"]
3135
3236
%% Main hierarchy
@@ -46,7 +50,7 @@ flowchart TB
4650

4751
#### Database (DB)
4852

49-
The top-level logical container for data. Each DS database is independent and manages its own shards, slabs, and streams, and it can be created, managed, and dropped as needed. For instance:
53+
A database is the top-level logical container for data. Each DS database is independent and manages its own shards, slabs, and streams, and it can be created, managed, and dropped as needed. For instance:
5054

5155
- **Sessions DB** stores durable session states.
5256
- **Messages DB** holds the corresponding MQTT message data.
@@ -55,30 +59,32 @@ A single EMQX cluster can host multiple DS databases.
5559

5660
#### Shard
5761

58-
A horizontal partition of the database. Data from different MQTT clients is distributed across shards to support parallelism and high availability.
62+
A shard is the horizontal partition of a durable storage database. Data is distributed across shards based on the publisher's client ID, enabling parallel processing and high availability. Each EMQX node can host one or more shards, and the total number of shards is determined by [n_shards](./managing-replication.md#number-of-shards) configuration parameter during the initial startup of EMQX.
5963

60-
Each EMQX node can host one or more shards, and shards can be replicated across nodes for redundancy.
64+
Shards also serve as the fundamental unit of replication. Each shard is replicated across multiple nodes according to the `durable_storage.messages.replication_factor` setting, ensuring that all replicas maintain identical message sets for redundancy and fault tolerance.
6165

6266
#### Generation
6367

64-
A logical partition of the database by time. Data written at different times may be assigned to different generations. Periodically creating new generations serves several main purposes:
68+
A generation is a logical, time-based partition of the database. Data written during different time periods is grouped into separate generations. New messages are always written to the current generation, while older generations become immutable and read-only. EMQX periodically creates new generations for several main purposes:
6569

6670
1. **Backward compatibility and data migrations:** New data is appended to new generations, possibly with improved encoding, while old generations remain immutable and read-only.
67-
2. **Time-based data retention:** Since each generation covers a specific time period, old data can be removed by dropping entire generations.
71+
2. **Time-based data retention:** Because each generation corresponds to a specific time range, expired data can be efficiently removed by dropping entire generations.
72+
73+
Although conceptually related to slabs, generations are not physical storage units. Instead, they define temporal boundaries that organize slabs within each shard.
6874

69-
Although conceptually related to slabs, generations are not physical containers. They serve as temporal boundaries that organize slabs within each shard.
75+
Generations may also differ in how they internally structure and store data, depending on the configured storage layout. Currently, DS supports a single layout optimized for high-throughput wildcard and single-topic subscriptions. Future releases will introduce additional layouts designed for different workloads. The layout used for new generations is configured via the `durable_storage.messages.layout` parameter, with each layout engine providing its own set of configuration options.
7076

7177
#### Slab
7278

73-
A physical partition of data identified by both shard ID and generation ID. Each slab acts as a durable container for one or more streams. All data in a slab shares the same encoding schema, eliminating the need for storing extra metadata. Atomicity and consistency properties are guaranteed within a slab.
79+
A slab is a physical partition of data identified by both shard ID and generation ID. Each slab acts as a durable container for one or more streams. All data in a slab shares the same encoding schema, eliminating the need for storing extra metadata. Atomicity and consistency properties are guaranteed within a slab.
7480

7581
Example: `shard 2, gen 3` represents a distinct slab that stores all streams written during that generation’s time range.
7682

7783
#### Stream
7884

79-
Logical units of batching and serialization inside each slab. Streams group **Topic–Timestamp–Value (TTV)** triples with similar topics, allowing data to be read in time-ordered, deterministic chunks.
85+
A durable storage stream is a logical unit of batching and serialization inside each slab. Streams group **Topic–Timestamp–Value (TTV)** triples with similar topic structures, allowing data to be read in time-ordered, deterministic chunks. A single durable storage stream may contain messages from multiple topics, and different storage layouts may apply different strategies for mapping topics into streams.
8086

81-
Streams are also the unit of subscription and iteration in DS, enabling efficient handling of wildcard topic filters.
87+
Durable storage streams are also the fundamental unit of subscription and iteration in Durable Storage, enabling efficient handling of wildcard topic filters and consistent replay of ordered data. [Durable sessions](../durability/durability_introduction.md#durable-sessions) read messages from streams in batches, with the batch size controlled by the `durable_sessions.batch_size` configuration parameter.
8288

8389
#### Topic–Timestamp–Value
8490

en_US/durability/durability_introduction.md

Lines changed: 2 additions & 49 deletions
Original file line numberDiff line numberDiff line change
@@ -159,56 +159,9 @@ Even if durable sessions are not enabled, following steps 2-4 will still retain
159159

160160
## Durable Storage Architecture
161161

162-
The database engine powering EMQX's built-in durability facilities organizes data into a hierarchical structure. The following figure illustrates how the durable storage databases are distributed across an EMQX cluster:
162+
Durable Sessions rely on the Durable Storage for persisting session state and messages. To understand how this storage layer is structured and operates, refer to the *Architecture: Backends and Storage Hierarchy* section in [Design for Durable Storage](../design/durable-storage.md).
163163

164-
![Diagram of EMQX durable storage sharding](./assets/emqx_ds_sharding.png)
165-
166-
### Database (DS)
167-
168-
The top-level logical container for data. Each DS database is independent and manages its own shards, slabs, and streams, and it can be created, managed, and dropped as needed. For instance:
169-
170-
- **Sessions DB** stores durable session states.
171-
- **Messages DB** holds the corresponding MQTT message data.
172-
173-
A single EMQX cluster can host multiple DS databases.
174-
175-
### Shard
176-
177-
Messages are segregated by clients and stored in shards based on the publisher's client ID. The number of shards is determined by [n_shards](./managing-replication.md#number-of-shards) configuration parameter during the initial startup of EMQX. A shard is also a unit of replication. Each shard is consistently replicated the number of times specified by `durable_storage.messages.replication_factor` across different nodes, ensuring identical message sets in each replica.
178-
179-
### Generation
180-
181-
A generation is a logical partition of the database by time. Messages are segmented into generations corresponding to specific time frames. New messages are written to the current generation, while previous generations are read-only. EMQX cleans up old MQTT messages by deleting old generations in their entirety. The retention period for old MQTT messages is determined by the `durable_sessions.message_retention_period` parameter.
182-
183-
Generations can organize data differently according to the storage layout specification. Currently, only one layout is supported, optimized for high throughput of wildcard and single-topic subscriptions. Future updates will introduce layouts optimized for different workloads.
184-
185-
The storage layout for new generations is configured by the `durable_storage.messages.layout` parameter, with each layout engine defining its own configuration parameters.
186-
187-
### Slab
188-
189-
A slab is a physical partition of data identified by both shard ID and generation ID. Each slab acts as a durable container for one or more streams. All data in a slab share the same encoding schema, and writes are atomic, eliminating the need for extra metadata.
190-
191-
Example: `shard 2, gen 3` represents a distinct slab that stores all streams written during that generation’s time range.
192-
193-
### Stream
194-
195-
A stream is a logical unit of batching and serialization inside each slab. Streams group **Topic–Timestamp–Value (TTV)** triples with similar topics, allowing data to be read in time-ordered, deterministic chunks.
196-
197-
Streams can contain messages from multiple topics. Various storage layouts can employ different strategies for mapping topics into streams.
198-
199-
Durable sessions fetch messages in batches from the streams, with batch size adjustable via the `durable_sessions.batch_size` parameter.
200-
201-
### Topic–Timestamp–Value
202-
203-
A TTV is a minimal storage unit, representing a single MQTT record. Each TTV includes:
204-
205-
- **Topic:** Follows MQTT semantics.
206-
207-
- **Timestamp:** Write time or logical ordering key.
208-
209-
- **Value:** an arbitrary binary blob.
210-
211-
## Durable Storages Across Cluster
164+
## Durable Storage Across Cluster
212165

213166
Each node within an EMQX cluster is assigned a unique *Site ID*, which serves as a stable identifier, independent of the Erlang node name (`emqx@...`). Site IDs are persistent, and they are randomly generated at the first startup of the node. This stability maintains the integrity of the data, especially in scenarios where nodes might undergo name modifications or reconfigurations.
214167

44.6 KB
Loading

0 commit comments

Comments
 (0)