add state snapshot uploads alongside log uploads#403

Merged
snowp merged 33 commits into main from state-uploads2-rebased on Mar 3, 2026
Conversation

Contributor

@snowp snowp commented Feb 25, 2026

Implements state uploads by introducing a new StateUploadWorker that is notified whenever a log batch is uploaded. The worker then attempts to schedule the upload of any state snapshots needed to provide the state updates required for hydrating the uploaded logs.

The upload flow looks like:

1. As logs (either continuous or trigger) are uploaded, notify the uploader of an ever-widening time range that should be covered by state uploads.
2. The uploader attempts to find all existing snapshots that fit within that time range, potentially triggering a new snapshot if there are pending changes in the state store.
3. The resulting file is moved to the artifact uploader.
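The widening-time-range bookkeeping described above can be sketched roughly as follows. These type and function names (`PendingRange`, `Snapshot`, `snapshots_to_upload`) are illustrative only, not the PR's actual API:

```rust
// Hypothetical sketch: track an ever-widening time range as log batches are
// uploaded, then select the snapshots whose creation time falls inside it.

#[derive(Debug, Clone, Copy, PartialEq)]
struct PendingRange {
    start_micros: u64,
    end_micros: u64,
}

impl PendingRange {
    /// Widen the pending range to also cover a newly uploaded log batch.
    fn widen(&mut self, batch_start: u64, batch_end: u64) {
        self.start_micros = self.start_micros.min(batch_start);
        self.end_micros = self.end_micros.max(batch_end);
    }
}

#[derive(Debug)]
struct Snapshot {
    created_micros: u64,
}

/// Pick all existing snapshots created inside the pending range; these are
/// the ones needed to hydrate the uploaded logs.
fn snapshots_to_upload(snapshots: &[Snapshot], range: PendingRange) -> Vec<u64> {
    snapshots
        .iter()
        .filter(|s| s.created_micros >= range.start_micros && s.created_micros <= range.end_micros)
        .map(|s| s.created_micros)
        .collect()
}
```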

Fixes BIT-7219

…e log uploads

Adds state snapshot upload capability to bd-logger:
- StateLogCorrelator tracks state changes and coordinates snapshot uploads
- State snapshots are uploaded before log uploads to ensure consistency
- Integrates with retention registry to manage snapshot lifecycle
- Adds skip_intent support and completion tracking to artifact uploader
@snowp snowp force-pushed the state-uploads2-rebased branch from 9d651d9 to be80130 on February 25, 2026 19:20
Contributor

@mattklein123 mattklein123 left a comment


Nice, very cool, though it's hard to wrap my head around all the pieces. Maybe we should have a single document that explains everything (maybe that is agents.md, though there are parts in the state/journal code as well).

Comment on lines +660 to +662
if let Some(index_to_drop) = self.index.iter().position(|entry| {
entry.type_id.as_deref() != Some(ArtifactType::StateSnapshot.to_type_id())
}) {
Contributor

But what happens if there is no network? Don't we have to drop eventually? How does this work?

Contributor Author

The way this ends up working with this PR is that the list of pending artifacts will grow while preferentially keeping the state snapshots. Once this fills up, we back-pressure to the state uploader, which will stop sending snapshots into bd-artifact-upload. The limit that prevents unbounded growth then becomes the maximum number of snapshots we retain on disk, as enforced during the rotation process.
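The eviction preference in the diff hunk above can be sketched like this. The `ArtifactKind` enum and the fallback-to-oldest behavior are assumptions for illustration, not taken from the PR:

```rust
// Hypothetical sketch: when the pending-artifact index is full, drop the
// oldest non-snapshot entry first so state snapshots survive; their growth
// is bounded separately by the on-disk retention cap enforced at rotation.

#[derive(Debug, PartialEq, Clone, Copy)]
enum ArtifactKind {
    StateSnapshot,
    Other,
}

/// Pick the index of the entry to evict, preferring non-snapshot artifacts.
fn index_to_drop(index: &[ArtifactKind]) -> Option<usize> {
    index
        .iter()
        .position(|k| *k != ArtifactKind::StateSnapshot)
        // Assumed fallback: if only snapshots remain, drop the oldest one.
        .or_else(|| if index.is_empty() { None } else { Some(0) })
}
```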

Comment on lines +45 to +50
### Snapshot Cooldown

Creating a snapshot on every batch flush during high-volume streaming is wasteful. The worker
tracks `last_snapshot_creation_micros` and will not create a new snapshot if one was created
within `snapshot_creation_interval_micros` (a runtime-configurable value). During cooldown, the
worker defers on-demand creation and keeps pending work for retry.
Contributor
I'm confused about this part. If the snapshot/state files are growing at the rate of state changes, which is independent and something we can't easily control (I can't remember where we landed on safety mechanisms for that part), what is this about? Aren't we basically just uploading the files? Or is this about when we do compaction, which we would have to do more often during streaming? Could you clarify the different cases so it's a bit easier to understand?

Contributor Author

During normal operation (no logs are uploaded), snapshot creation is driven entirely by the rate of state changes and the file filling up. Once we start streaming logs (default streaming config, session streaming, etc.) we need to make sure that we're uploading snapshots that correspond to these logs. To do this in a sane way, I have the uploader check whether there are pending changes in the state store and perform a manual rotation in that case to get a new snapshot. This process then has a configurable cooldown that avoids doing this manual rotation too often.

Contributor

If we hit the cooldown, what is the implication? I guess when we hydrate sessions things won't line up, or should it still theoretically line up since we have timestamps?

Contributor Author

When we try to rotate while the cooldown is active, we won't perform the rotation, but we keep remembering that we need to upload snapshots for the desired time range. There is then a 30s retry loop that will attempt to make progress, which will attempt the rotation again. I could probably make this timeout relative to the cooldown; for now that's the mechanism that backs off and avoids creating snapshots too often in the streaming case.
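The cooldown check discussed in this thread could look roughly like the following, using the field names from the README excerpt (`last_snapshot_creation_micros`, `snapshot_creation_interval_micros`); the struct and method are otherwise hypothetical:

```rust
// Hypothetical sketch: skip on-demand snapshot creation if one was created
// too recently; pending work is kept around so a periodic retry loop (30s in
// the PR discussion) can attempt the rotation again later.

struct CooldownState {
    last_snapshot_creation_micros: u64,
    snapshot_creation_interval_micros: u64,
}

impl CooldownState {
    /// Returns true if a new on-demand snapshot may be created now.
    fn may_create(&self, now_micros: u64) -> bool {
        now_micros.saturating_sub(self.last_snapshot_creation_micros)
            >= self.snapshot_creation_interval_micros
    }
}
```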

Contributor Author

snowp commented Feb 28, 2026

Added more inline comments and added a bd-state/README.md that summarizes the entire system.

Contributor

@mattklein123 mattklein123 left a comment


Super cool, lgtm (with the caveat that I can't really process all the edge cases in my head), just some questions.


(`Store::rotate_journal` from `bd-state`):

1. create a new active generation,
2. rewrite compacted live state into the new journal (preserving entry timestamps),
Contributor

@mattklein123 mattklein123 Mar 2, 2026


This is a stupid question I have asked 5 times, I know, but why do we write the compacted live state into the file? Is this simply to make startup faster, so that we don't have to read the previous rotated files? Naively, compaction is only really needed when you roll off all the state files where the state was first introduced. I guess what I'm saying is that in the common case of long-lived state you get a lot of duplication in every file, maybe? Just curious really. Maybe update with more info on this and the tradeoffs.

Contributor Author

Will add some more details to the doc, but it's because the snapshots are optional (depends on retention period/runtime config) and compressed on disk, so reading them to recover historical state isn't really feasible. With the compaction, the state store doesn't need to care about snapshots and can always consult the main journal file for state recovery at process start, etc.

The duplication is proportional to the size of the active state map each device maintains and how many state changes are made (i.e. how often rotation happens), but this made it easier to manage current state and makes snapshot cleanup a separate problem.
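The compaction-on-rotation behavior discussed here (rewriting the latest value per key into the new journal while preserving each entry's original timestamp) can be sketched as below. The `Entry` type and `compact` function are illustrative assumptions, not the actual `bd-state` code:

```rust
// Hypothetical sketch: compact a journal down to the latest entry per key so
// the new generation is self-contained and recovery never needs to read the
// old (optional, compressed) snapshot files. Original timestamps are kept.

use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct Entry {
    key: String,
    value: String,
    timestamp_micros: u64,
}

/// Keep only the last write per key, preserving that entry's timestamp.
fn compact(journal: &[Entry]) -> Vec<Entry> {
    let mut latest: BTreeMap<&str, &Entry> = BTreeMap::new();
    for e in journal {
        // Later entries for the same key overwrite earlier ones.
        latest.insert(e.key.as_str(), e);
    }
    latest.into_values().cloned().collect()
}
```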

@snowp snowp merged commit ed1c444 into main Mar 3, 2026
6 checks passed
@snowp snowp deleted the state-uploads2-rebased branch March 3, 2026 04:26
@github-actions github-actions bot locked and limited conversation to collaborators Mar 3, 2026
