add state snapshot uploads alongside log uploads #403
Adds state snapshot upload capability to bd-logger:

- StateLogCorrelator tracks state changes and coordinates snapshot uploads
- State snapshots are uploaded before log uploads to ensure consistency
- Integrates with retention registry to manage snapshot lifecycle
- Adds skip_intent support and completion tracking to artifact uploader
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mattklein123 left a comment
Nice, very cool, though it's hard to wrap my head around all the pieces. Maybe we should have a single document that explains everything (maybe that is agents.md, though there are parts in the state/journal code also).
```rust
if let Some(index_to_drop) = self.index.iter().position(|entry| {
  entry.type_id.as_deref() != Some(ArtifactType::StateSnapshot.to_type_id())
}) {
```
But what happens if there is no network? Don't we have to drop eventually? How does this work?
The way this ends up working with this PR is that the list of pending artifacts will grow while preferentially keeping the state snapshots. Once this fills up, we back pressure the state upload and the uploader will stop sending snapshots into bd-artifact-upload. The limit that prevents unbounded growth then becomes the maximum number of snapshots that we retain on disk, as enforced during the rotation process.
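As a rough illustration of the eviction preference in the quoted check, here is a minimal sketch. All names here are invented stand-ins; only the `position` predicate mirrors the PR: pick the first pending entry that is not a state snapshot, and return `None` when only snapshots remain so the caller can back pressure instead of dropping them.

```rust
// Illustrative sketch only; `ArtifactKind` and `pick_index_to_drop` are
// hypothetical stand-ins for the real index entries in bd-artifact-upload.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ArtifactKind {
  StateSnapshot,
  Other,
}

// Mirrors the quoted predicate: drop the first entry that is NOT a state
// snapshot. If every pending entry is a snapshot, return None so the
// caller can back pressure the state upload instead of dropping snapshots.
fn pick_index_to_drop(index: &[ArtifactKind]) -> Option<usize> {
  index
    .iter()
    .position(|kind| *kind != ArtifactKind::StateSnapshot)
}
```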
> ### Snapshot Cooldown
>
> Creating a snapshot on every batch flush during high-volume streaming is wasteful. The worker
> tracks `last_snapshot_creation_micros` and will not create a new snapshot if one was created
> within `snapshot_creation_interval_micros` (a runtime-configurable value). During cooldown, the
> worker defers on-demand creation and keeps pending work for retry.
I'm confused about this part. If the snapshot/state files are growing at the rate of state changes, which is independent and something we can't easily control (can't remember where we landed on safety mechanisms for that part), what is this about? Aren't we basically just uploading the files? Or is this about when we do compaction, which we would have to do more often during streaming? Could you clarify the different cases so it's a bit easier to understand?
During normal operation (no logs are uploaded), snapshot creation is fully driven by the rate of state changes and the file filling up. Once we start streaming logs (default streaming config, session streaming, etc.) we'll need to make sure that we're uploading snapshots that correspond to these logs. To do this in a sane way, I have the uploader check whether or not we have pending changes in the state store, and perform a manual rotation in this case to get a new snapshot. This process then has a configurable cooldown that avoids doing this manual rotation too often.
If we hit the cooldown, what is the implication? I guess when we hydrate sessions things won't line up, or should it still theoretically line up since we have timestamps?
When we try to rotate but the cooldown is active, we'll not perform the rotation, but keep remembering that we need to upload snapshots for the desired time range. There is then a 30s retry loop that will attempt to progress that, which will attempt rotation again. I could probably make this timeout relative to the cooldown; for now, that's the mechanism that backs off and avoids creating snapshots too often in the streaming case.
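The cooldown gate the README describes can be sketched roughly like this. The struct and method names are invented for illustration; only the two `*_micros` field names come from the quoted README. A caller that sees `false` keeps its pending time range and relies on the retry loop described above.

```rust
// Hypothetical sketch of the snapshot cooldown gate; `SnapshotGate` and
// `may_create_snapshot` are invented names, not the real bd-logger types.
struct SnapshotGate {
  last_snapshot_creation_micros: u64,
  snapshot_creation_interval_micros: u64,
}

impl SnapshotGate {
  // Returns true if enough time has passed since the last snapshot that a
  // new one may be created at `now_micros`; otherwise the caller defers and
  // retries later (per the PR, a 30s retry loop drives that).
  fn may_create_snapshot(&self, now_micros: u64) -> bool {
    now_micros.saturating_sub(self.last_snapshot_creation_micros)
      >= self.snapshot_creation_interval_micros
  }
}
```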
Added more inline comments and added a bd-state/README.md that summarizes the entire system.
mattklein123 left a comment
Super cool, LGTM (with the caveat that I can't really process all the edge cases in my head), just some questions.
> (`Store::rotate_journal` from `bd-state`):
>
> 1. create a new active generation,
> 2. rewrite compacted live state into the new journal (preserving entry timestamps),
This is a stupid question I have asked 5 times, I know, but why do we write the compacted live state into the file? Is this simply to make startup faster, so that we don't have to read the previous rotated files? Naively, compaction is only really needed when you roll off all the state files where it was first introduced. I guess what I'm saying is that in the common case of long-lived state you get a lot of duplication in every file, maybe? Just curious really. Maybe update with more info on this and the tradeoffs.
Will add some more details to the doc, but it's because the snapshots are optional (depends on retention period/runtime config) and compressed on disk, so reading them to recover historical state isn't really feasible. With the compaction, the state store doesn't need to care about snapshots and can always consult the main journal file for state recovery at process start, etc.
The duplication would be related to the size of the active state map each device is maintaining and how many state changes are made (how often rotation happens), but this made it easier to manage current state and makes snapshot cleanup a separate problem.
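To make the tradeoff concrete, here is a minimal, hypothetical sketch of what "rewrite compacted live state" means: last write per key wins, and each surviving entry keeps its original timestamp, as the quoted rotation steps require. The real bd-state journal format is of course more involved; the types below are invented.

```rust
use std::collections::BTreeMap;

// Illustrative only: entries are (key, value, timestamp) triples in journal
// order. Compaction keeps the latest entry per key, preserving that entry's
// original timestamp, so recovery never needs older, compressed snapshots.
type TimestampMicros = u64;

fn compact(
  entries: &[(String, String, TimestampMicros)],
) -> Vec<(String, String, TimestampMicros)> {
  let mut latest: BTreeMap<String, (String, TimestampMicros)> = BTreeMap::new();
  // Later journal entries overwrite earlier ones for the same key.
  for (key, value, ts) in entries {
    latest.insert(key.clone(), (value.clone(), *ts));
  }
  latest
    .into_iter()
    .map(|(k, (v, ts))| (k, v, ts))
    .collect()
}
```

The duplication cost discussed above is exactly the size of this compacted map, rewritten at the head of every rotated journal.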
Implements state uploads by introducing a new StateUploadWorker that is notified whenever a log batch is uploaded and attempts to schedule the upload of any state snapshots necessary to provide the state updates required for hydrating the uploaded logs.
The upload flow looks like:

1. As logs (either continuous or trigger) are uploaded, notify the uploader of an ever widening time range that should be covered by state uploads.
2. The uploader attempts to find all existing snapshots that fit within that time range, potentially triggering a new snapshot if there are pending changes in the state store.
3. The file is moved to the artifact uploader.
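The "ever widening time range" bookkeeping described above can be sketched as follows; all type and method names here are invented for illustration, not the real worker API.

```rust
// Hypothetical sketch: each uploaded log batch widens the range that state
// uploads must cover, and an existing snapshot is selected when its
// timestamp falls inside that range.
#[derive(Debug, PartialEq, Clone, Copy)]
struct TimeRange {
  start_micros: u64,
  end_micros: u64,
}

impl TimeRange {
  // Grow this range to also cover `other` (union of the two spans).
  fn widen(&mut self, other: TimeRange) {
    self.start_micros = self.start_micros.min(other.start_micros);
    self.end_micros = self.end_micros.max(other.end_micros);
  }

  // Would a snapshot taken at `snapshot_micros` help hydrate these logs?
  fn contains(&self, snapshot_micros: u64) -> bool {
    snapshot_micros >= self.start_micros && snapshot_micros <= self.end_micros
  }
}
```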
Fixes BIT-7219