add state snapshot uploads alongside log uploads #403
Adds state snapshot upload capability to bd-logger:

- StateLogCorrelator tracks state changes and coordinates snapshot uploads
- State snapshots are uploaded before log uploads to ensure consistency
- Integrates with retention registry to manage snapshot lifecycle
- Adds skip_intent support and completion tracking to artifact uploader
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mattklein123 left a comment
Nice, very cool, though it's hard to wrap my head around all the pieces. Maybe we should have a single document that explains everything (maybe that is agents.md, though there are parts in the state/journal code also).
```rust
if let Some(index_to_drop) = self.index.iter().position(|entry| {
  entry.type_id.as_deref() != Some(ArtifactType::StateSnapshot.to_type_id())
}) {
```
But what happens if there is no network? Don't we have to drop eventually? How does this work?
The way this ends up working with this PR is that the list of pending artifacts will grow while preferentially keeping the state snapshots. Once this fills up, we back pressure the state upload and the uploader will stop sending snapshots into bd-artifact-upload. The limit that prevents unbounded growth then becomes the maximum number of snapshots that we retain on disk, as enforced during the rotation process.
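As a rough illustration of the eviction preference in the quoted check, here is a minimal sketch. All names here are invented stand-ins; only the `position` predicate mirrors the PR: pick the first pending entry that is not a state snapshot, and return `None` when only snapshots remain so the caller can back pressure instead of dropping them.

```rust
// Illustrative sketch only; `ArtifactKind` and `pick_index_to_drop` are
// hypothetical stand-ins for the real index entries in bd-artifact-upload.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ArtifactKind {
  StateSnapshot,
  Other,
}

// Mirrors the quoted predicate: drop the first entry that is NOT a state
// snapshot. If every pending entry is a snapshot, return None so the
// caller can back pressure the state upload instead of dropping snapshots.
fn pick_index_to_drop(index: &[ArtifactKind]) -> Option<usize> {
  index
    .iter()
    .position(|kind| *kind != ArtifactKind::StateSnapshot)
}
```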
> ### Snapshot Cooldown
>
> Creating a snapshot on every batch flush during high-volume streaming is wasteful. The worker
> tracks `last_snapshot_creation_micros` and will not create a new snapshot if one was created
> within `snapshot_creation_interval_micros` (a runtime-configurable value). During cooldown, the
> worker defers on-demand creation and keeps pending work for retry.
I'm confused about this part. If the snapshot/state files are growing at the rate of state changes, which is independent and something we can't easily control (can't remember where we landed on safety mechanisms for that part), what is this about? Aren't we basically just uploading the files? Or is this about when we do compaction, which we would have to do more often during streaming? Could you clarify the different cases so it's a bit easier to understand?
During normal operation (no logs are uploaded), snapshot creation is fully driven by the rate of state changes and the file filling up. Once we start streaming logs (default streaming config, session streaming, etc.) we'll need to make sure that we're uploading snapshots that correspond to these logs. To do this in a sane way, I have the uploader check whether or not we have pending changes in the state store, and perform a manual rotation in this case to get a new snapshot. This process then has a configurable cooldown that avoids doing this manual rotation too often.
If we hit the cooldown, what is the implication? I guess when we hydrate sessions things won't line up, or should it still theoretically line up since we have timestamps?
When we try to rotate but the cooldown is active, we'll not perform the rotation, but keep remembering that we need to upload snapshots for the desired time range. There is then a 30s retry loop that will attempt to progress that, which will attempt rotation again. I could probably make this timeout relative to the cooldown; for now, that's the mechanism that backs off and avoids creating snapshots too often in the streaming case.
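The cooldown gate the README describes can be sketched roughly like this. The struct and method names are invented for illustration; only the two `*_micros` field names come from the quoted README. A caller that sees `false` keeps its pending time range and relies on the retry loop described above.

```rust
// Hypothetical sketch of the snapshot cooldown gate; `SnapshotGate` and
// `may_create_snapshot` are invented names, not the real bd-logger types.
struct SnapshotGate {
  last_snapshot_creation_micros: u64,
  snapshot_creation_interval_micros: u64,
}

impl SnapshotGate {
  // Returns true if enough time has passed since the last snapshot that a
  // new one may be created at `now_micros`; otherwise the caller defers and
  // retries later (per the PR, a 30s retry loop drives that).
  fn may_create_snapshot(&self, now_micros: u64) -> bool {
    now_micros.saturating_sub(self.last_snapshot_creation_micros)
      >= self.snapshot_creation_interval_micros
  }
}
```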
Added more inline comments and added a bd-state/README.md that summarizes the entire system.
mattklein123 left a comment
Super cool, LGTM (with the caveat that I can't really process all the edge cases in my head), just some questions.
> (`Store::rotate_journal` from `bd-state`):
>
> 1. create a new active generation,
> 2. rewrite compacted live state into the new journal (preserving entry timestamps),
This is a stupid question I have asked 5 times, I know, but why do we write the compacted live state into the file? Is this simply to make startup faster, so that we don't have to read the previous rotated files? Naively, compaction is only really needed when you roll off all the state files where it was first introduced. I guess what I'm saying is that in the common case of long-lived state you get a lot of duplication in every file, maybe? Just curious really. Maybe update with more info on this and the tradeoffs.
Will add some more details to the doc, but it's because the snapshots are optional (depends on retention period/runtime config) and compressed on disk, so reading them to recover historical state isn't really feasible. With the compaction, the state store doesn't need to care about snapshots and can always consult the main journal file for state recovery at process start, etc.
The duplication would be related to the size of the active state map each device is maintaining and how many state changes are made (how often rotation happens), but this made it easier to manage current state and makes snapshot cleanup a separate problem.
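To make the tradeoff concrete, here is a minimal, hypothetical sketch of what "rewrite compacted live state" means: last write per key wins, and each surviving entry keeps its original timestamp, as the quoted rotation steps require. The real bd-state journal format is of course more involved; the types below are invented.

```rust
use std::collections::BTreeMap;

// Illustrative only: entries are (key, value, timestamp) triples in journal
// order. Compaction keeps the latest entry per key, preserving that entry's
// original timestamp, so recovery never needs older, compressed snapshots.
type TimestampMicros = u64;

fn compact(
  entries: &[(String, String, TimestampMicros)],
) -> Vec<(String, String, TimestampMicros)> {
  let mut latest: BTreeMap<String, (String, TimestampMicros)> = BTreeMap::new();
  // Later journal entries overwrite earlier ones for the same key.
  for (key, value, ts) in entries {
    latest.insert(key.clone(), (value.clone(), *ts));
  }
  latest
    .into_iter()
    .map(|(k, (v, ts))| (k, v, ts))
    .collect()
}
```

The duplication cost discussed above is exactly the size of this compacted map, rewritten at the head of every rotated journal.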
Implements state uploads by introducing a new StateUploadWorker that is notified whenever a log batch is uploaded and attempts to schedule the upload of any state snapshots necessary to provide the state updates required for hydrating the uploaded logs.
The upload flow looks like:

1. As logs (either continuous or trigger) are uploaded, notify the uploader of an ever widening time range that should be covered by state uploads.
2. The uploader attempts to find all existing snapshots that fit within that time range, potentially triggering a new snapshot if there are pending changes in the state store.
3. The file is moved to the artifact uploader.
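The "ever widening time range" bookkeeping described above can be sketched as follows; all type and method names here are invented for illustration, not the real worker API.

```rust
// Hypothetical sketch: each uploaded log batch widens the range that state
// uploads must cover, and an existing snapshot is selected when its
// timestamp falls inside that range.
#[derive(Debug, PartialEq, Clone, Copy)]
struct TimeRange {
  start_micros: u64,
  end_micros: u64,
}

impl TimeRange {
  // Grow this range to also cover `other` (union of the two spans).
  fn widen(&mut self, other: TimeRange) {
    self.start_micros = self.start_micros.min(other.start_micros);
    self.end_micros = self.end_micros.max(other.end_micros);
  }

  // Would a snapshot taken at `snapshot_micros` help hydrate these logs?
  fn contains(&self, snapshot_micros: u64) -> bool {
    snapshot_micros >= self.start_micros && snapshot_micros <= self.end_micros
  }
}
```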
Fixes BIT-7219