Switching the embedding model in OpenViking requires manual, error-prone steps with no validation and no batch tooling. The process overwrites vectors in-place, causing search quality degradation visible to end-users during the migration window. A blue-green migration strategy would eliminate user-visible impact -- search always hits a complete, consistent vector set.
Current Workflow
```shell
# 1. Admin manually edits ov.conf -- no validation at all
vim ~/.openviking/ov.conf
# modify: embedding.dense.model, dimension, api_base...

# 2. Restart the server -- downtime, no pre-check
docker restart openviking
# or:
openviking-server   # starts without any verification

# 3. Reindex resources one by one -- no batch operation
ov reindex viking://resources/doc-a --regenerate --wait
ov reindex viking://resources/doc-b --regenerate --wait
# ... repeat for every resource

# 4. User-visible impact:
# - Search quality degrades silently (old vectors + new model = mismatch)
# - No notification that the embedding model has changed
# - No way to confirm reindex completion
```
Pain Points
Admin

| Problem | Impact | Evidence |
|---|---|---|
| No endpoint-level dimension validation | Config layer has a VectorDB vs Embedding dimension consistency WARNING (`openviking_cli/utils/config/open_viking_config.py`) and auto-syncs -- but it never verifies whether the configured dimension matches the embedding endpoint's actual output | `openviking-server doctor` only checks whether the API key is set; it does not test endpoint connectivity |
| No bulk reindex | Must run `ov reindex <uri>` individually for each resource | Current reindex endpoint accepts only a single URI |
| No model compatibility check | Config-time model/provider compatibility is not validated; errors surface only at query time. Example: `provider=openai` + a downstream model that doesn't support matryoshka representations (e.g. certain OpenAI-compatible Qwen endpoints) -- v0.3.5 passed through the `dimensions` parameter, causing a 400 error (#1442, fixed in v0.3.6). Or `provider=litellm` + a bare model name (`Qwen3-Embedding-0.6B` instead of `dashscope/Qwen3-Embedding-0.6B`) -- the query fails with `LLM Provider NOT provided` | #1442 -- users only discover config errors at search time |
| No progress visibility | `ov reindex --wait` blocks with no progress indication | No progress events |
| No rollback path | After reindexing with a new model, the old vectors are gone | `build_index()` overwrites in-place |
User

| Problem | Impact |
|---|---|
| Search quality degradation during migration | `reindex` overwrites vectors in-place. During the migration window, query vectors use the new model while some data still has old-model vectors -- results are unpredictable |
| No atomicity | No all-or-nothing switchover. Users may hit a half-old-half-new vector set |
Users should always hit a complete, consistent vector set during migration. Blue-green migration (build new vector set in background, then atomic active-pointer switchover) is the straightforward solution.
Proposed Solution
Dependency on #1439: This proposal builds on #1439 (feat: detect embedding model drift and add rebuild tool), which provides embedding_meta.json persistence, compatibility_identity(), VectorRebuildService, and the openviking-rebuild-vectors CLI.
This proposal adds blue-green migration, dual-write, and atomic switchover on top of #1439. If #1439 is not yet merged, Section 4 (health gate) and Section 6 (migration resilience) will need the embedding_meta.json persistence part implemented first.
1. Extend ov reindex with batch support
ov reindex <URI> currently accepts only a single URI. When switching embedding models, admins need to reindex every resource -- doing this one at a time is impractical.
```shell
# Current:
ov reindex viking://resources/doc-a --regenerate --wait

# Proposed:
# Reindex all resources
ov reindex --all --regenerate --wait=false

# Reindex with glob pattern
ov reindex viking://resources/my-project/** --regenerate

# Dry-run: show what would be reindexed (count + estimated time)
ov reindex --all --dry-run
# Output:
# Resources to reindex: 47
# Estimated time: ~12 min
# Current embedding model: text-embedding-v4 (1024d)
# WARNING: Dimension mismatch detected -- vectordb rebuild may be required
```
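Pattern expansion for `--all` and glob targets could be a thin layer over the resource listing. A minimal sketch, assuming a hypothetical `expand_targets()` helper (not an existing OpenViking function) and an already-fetched URI list:

```python
# Hypothetical sketch: resolving --all or a viking:// glob into concrete URIs.
# expand_targets() is illustrative, not part of OpenViking.
from fnmatch import fnmatch

def expand_targets(pattern: str, all_uris: list[str]) -> list[str]:
    """Resolve '--all' or a glob pattern into the list of URIs to reindex."""
    if pattern == "--all":
        return list(all_uris)
    # fnmatch's '*' is not path-aware, so '**' matches across '/' segments
    return [uri for uri in all_uris if fnmatch(uri, pattern)]

uris = [
    "viking://resources/my-project/doc-a",
    "viking://resources/other/doc-b",
]
print(expand_targets("viking://resources/my-project/**", uris))
# -> ['viking://resources/my-project/doc-a']
```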
2. Extend ov config validate with live endpoint check
openviking-server doctor (openviking_cli/doctor.py) already checks config syntax, Python version, native engine, AGFS, embedding API key existence, VLM config, and disk space. It does not test endpoint connectivity or verify actual embedding dimensions. This proposal adds those checks either to doctor or to a new ov config validate --live command. The two don't conflict -- doctor covers operational health, --live covers pre-change validation.
ov config validate currently only checks config syntax (JSON schema via serde). Extend it to verify the endpoint is reachable and the output dimension matches config.
```shell
# Current:
ov config validate
# -> only checks JSON schema

# Proposed:
ov config validate --live
# Checks:
# PASS: Config syntax valid
# PASS: Embedding endpoint reachable
# PASS: Embedding dimension matches config (1024d)
# WARNING: Embedding model differs from stored model (text-embedding-v3 -> text-embedding-v4)
# Note: Run `ov reindex --all` to rebuild vectors
```
2.1 Config structure for blue-green
During migration, the system needs both the current (active) model config and the target model config. We propose a named embedding.migration map in ov.conf, where each entry is a named migration target. The existing embedding config is implicitly named default.
embedding.migration is a map of named configs, not a single block. This design:
Supports read-only ov.conf -- migration targets are pre-defined, no runtime writes needed (works with container read-only mounts)
Allows multiple migration targets -- admins can pre-configure several models and pick one via CLI
Makes default implicit -- the existing top-level embedding config (dense/sparse/hybrid) is the active profile
Background: EmbeddingConfig supports three embedding types:
dense -- dense vectors (most common)
sparse -- sparse vectors (BM25-style)
hybrid -- single model returning both dense + sparse
get_embedder() logic: if hybrid exists, use hybrid embedder; if both dense and sparse exist, use CompositeHybridEmbedder; if only dense, use dense embedder.
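The selection rules can be modeled compactly. This is a simplified sketch of the decision order only; the real get_embedder() returns embedder objects, not strings:

```python
# Simplified model of the get_embedder() selection rules described above:
# hybrid wins; dense + sparse composes; dense alone falls through.
def select_embedder_kind(cfg: dict) -> str:
    if "hybrid" in cfg:
        return "hybrid"      # single model returning dense + sparse
    if "dense" in cfg and "sparse" in cfg:
        return "composite"   # CompositeHybridEmbedder over two models
    return "dense"

print(select_embedder_kind({"dense": {}, "sparse": {}}))
# -> composite
```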
Migration configs need to cover all three cases. Each migration entry mirrors the model config fields in embedding:
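Written out as ov.conf fragments (API keys elided), the three cases look like:

Case A: dense only (most common)

```json
{
  "embedding": {
    "dense": {
      "provider": "volcengine",
      "model": "doubao-embedding-vision-251215",
      "dimension": 1024,
      "api_base": "https://ark.cn-beijing.volces.com/api/v3",
      "api_key": "..."
    },
    "migration": {
      "openai-v3-large": {
        "dense": {
          "provider": "openai",
          "model": "text-embedding-3-large",
          "dimension": 3072,
          "api_base": "https://api.openai.com/v1",
          "api_key": "..."
        }
      }
    },
    "max_concurrent": 10
  }
}
```

Case B: dense + sparse (composite hybrid)

```json
{
  "embedding": {
    "dense": { "provider": "volcengine", "model": "...", "dimension": 1024 },
    "sparse": { "provider": "volcengine", "model": "..." },
    "migration": {
      "openai-mixed": {
        "dense": { "provider": "openai", "model": "...", "dimension": 3072 },
        "sparse": { "provider": "openai", "model": "..." }
      }
    },
    "max_concurrent": 10
  }
}
```

Case C: hybrid (single-model hybrid)

```json
{
  "embedding": {
    "hybrid": { "provider": "volcengine", "model": "...", "dimension": 1024 },
    "migration": {
      "openai-hybrid": {
        "hybrid": { "provider": "openai", "model": "...", "dimension": 3072 }
      }
    },
    "max_concurrent": 10
  }
}
```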
Note: Each migration entry mirrors only dense/sparse/hybrid model config fields. Top-level runtime settings (max_concurrent, circuit_breaker, max_retries, etc.) are global and don't change during migration.
CLI references migration targets by name:
```shell
# List available migration targets
ov reindex --list-targets
# Output:
# Available migration targets:
# - openai-v3-large (openai/text-embedding-3-large, 3072d)
# - qwen-3-large (dashscope/qwen3-embedding, 1024d)

# Start migration with a pre-configured target
ov reindex --all --target openai-v3-large
```
Lifecycle:

| Phase | Active profile | Migration state | Behavior |
|---|---|---|---|
| Normal | `default` | None | Single-model operation |
| Migration start | `default` | Target name selected via CLI | Dual-write + bulk re-embed to target |
| Migration complete | Auto-switched to target | Target entry removed from config | Single-model; new config becomes `default` |
| Rollback | Reverted to `default` | Target re-added | Dual-write back to old |
3. Blue-green vector migration
Instead of overwriting vectors in-place during reindex, maintain two vector sets ("blue" = current active, "green" = new model being built). Users always query the active set. Once the green set is fully built and verified, atomically promote it to active.
Changing model and changing dimension are the same operation
With in-place overwrite, changing the embedding dimension (e.g. 1024d to 3072d) requires dropping and recreating the entire vectordb -- destructive and irreversible. With blue-green, both "new model" and "new dimension" are handled the same way: write to the inactive collection, then flip the pointer. No schema migration, no data loss, no downtime.
Migration timeline
```mermaid
flowchart TD
    A["Admin starts reindex with new model"] --> B["Phase 1: Enable dual-write<br/>New writes -> default + openai-v3-large<br/>Queries -> default"]
    B --> C["Phase 2: Bulk re-embed existing resources<br/>default (active): text-embedding-v3, 47 resources<br/>openai-v3-large (building): 12/47...<br/>Queries -> default<br/>Dual-write active"]
    C -->|"openai-v3-large complete"| D["Phase 3: Query switchover<br/>Active pointer: default -> openai-v3-large<br/>Queries -> openai-v3-large<br/>Dual-write still active"]
    D --> E["Phase 4: Disable dual-write<br/>Writes -> openai-v3-large only<br/>default retained for rollback until TTL expires"]
    E -->|"TTL expires or admin confirms"| F["Phase 5: Delete old set<br/>Delete default collection<br/>openai-v3-large becomes the new default"]
```
API surface
```python
class VectorDB:
    def get_active_collection(self) -> str:
        """Returns collection name for reads -- e.g. 'default' or 'openai-v3-large'."""
        return self.metadata.get("active_embedding_set", "default")

    def is_dual_write_enabled(self) -> bool:
        return self.metadata.get("dual_write", False)

    def switch_active(self, target: str) -> None:
        """Atomic metadata write -- instant switchover for all reads."""
        self.metadata.set("active_embedding_set", target)

    def set_dual_write(self, enabled: bool) -> None:
        self.metadata.set("dual_write", enabled)

    def upsert(self, resource_uri: str, vector) -> None:
        """In dual-write mode, writes to both active and inactive collections."""
        active = self.get_active_collection()
        self._write_to_collection(active, resource_uri, vector)
        if self.is_dual_write_enabled():
            inactive = self._get_inactive_collection()
            self._write_to_collection(inactive, resource_uri, vector)
```
The migration controller orchestrates phases through these primitives:
set_dual_write(True) -- enable dual-write
Loop: embed_with_new_model() then _write_to_collection(green, ...) -- bulk re-embed to green only
switch_active(green) -- atomic query switchover
set_dual_write(False) -- disable dual-write
delete_collection(blue) -- cleanup old set
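Stitched together, the five phases could look like this sketch. `run_migration` and the `embed_with_new_model` callable are assumptions for illustration; error handling, checkpointing, and the rollback TTL (which would defer the final delete) are omitted:

```python
# Sketch of the five-phase migration controller, driving the VectorDB
# primitives above. run_migration() is hypothetical, not OpenViking API.
def run_migration(db, resources, green: str, embed_with_new_model):
    blue = db.get_active_collection()

    # Phase 1: dual-write, so writes during the window land in both sets
    db.set_dual_write(True)

    # Phase 2: bulk re-embed existing resources into the green set only
    for uri in resources:
        vector = embed_with_new_model(uri)
        db._write_to_collection(green, uri, vector)

    # Phase 3: atomic query switchover
    db.switch_active(green)

    # Phase 4: stop dual-writing; green is now the single write target
    db.set_dual_write(False)

    # Phase 5: cleanup (in practice deferred until the rollback TTL expires)
    db.delete_collection(blue)
```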
Rollback
```shell
# If admin detects quality regression after switchover:
ov reindex --rollback
# Instantly switches back to previous set (still on disk)

# Rollback behavior varies by phase:
# Phase 1-2 (dual-write/building): disable dual-write, discard green set
# Phase 3-4 (switched/dual-write-off): flip pointer back to blue, re-enable dual-write briefly
# Phase 5 (cleanup): too late, blue already deleted -- full reindex required
```
4. Runtime embedding model health gate
On server startup, validate embedding config against existing vectordb state:
```python
# NOTE: Current vectordb metadata does not store embedding_model name (only dimension).
# This pseudocode assumes #1066/#1439's embedding identity metadata is implemented.
current_model = config.embedding.dense.model
stored_model = vectordb.get_metadata("embedding_model")  # requires #1066/#1439
stored_dim = vectordb.get_metadata("embedding_dimension")

if current_model != stored_model or config.embedding.dense.dimension != stored_dim:
    log.warning(
        f"Embedding model changed: {stored_model} ({stored_dim}d) -> "
        f"{current_model} ({config.embedding.dense.dimension}d). "
        f"Run 'ov reindex --all' to rebuild vectors with blue-green migration."
    )
```
5. Config validation on load
When ov.conf is loaded, immediately test the embedding endpoint with a trivial input (not on first search):
```python
# Validate on config load, not on first search
try:
    result = await embedder.embed("health check", is_query=True)
    assert len(result.dense_vector) == config.embedding.dense.dimension
except Exception as e:
    raise ConfigError(f"Embedding endpoint validation failed: {e}")
```
6. Migration resilience
Migration can take hours. If the server restarts or the CLI disconnects, all progress is lost without persistent state.
Building on #1439: #1439 provides embedding identity persistence (embedding_meta.json) and VectorRebuildService (per-account delete + reindex). This section extends that foundation:
Migration state -- #1439's embedding_meta.json records the active embedding identity; migration_state.json (this proposal) records blue-green switchover progress and phase. They complement each other with no overlap.
Bulk re-embed -- #1439's VectorRebuildService.rebuild_account() does an in-place overwrite (delete + rebuild from AGFS). Blue-green replaces that with "write to green set + atomic switchover" but reuses the discover_accounts() and _discover_directories() traversal logic.
Startup check -- #1439's ensure_embedding_collection_compatibility() does fail-fast model drift detection. Blue-green adds interrupted-migration detection (check migration_state.json for incomplete migrations).
Why not TaskTracker: The existing TaskTracker (openviking/service/task_tracker.py) is a pure in-memory registry for short-lived background operations (e.g. session commit). Its design explicitly states "v1 is pure in-memory (no persistence). Tasks are lost on restart." Migration runs for hours and must survive restarts -- TaskTracker is fundamentally unsuitable. Migration state is persisted to disk independently.
Design principle: The five-phase flow (dual-write, then bulk re-embed, then query switchover, then disable dual-write, then cleanup) ensures zero data loss. Dual-write is enabled before bulk re-embed starts, so all writes during the migration window enter both collections.
6.1 Migration state persistence
Migration state is persisted separately from #1439's embedding_meta.json (written by persist_embedding_metadata() and checked by ensure_embedding_collection_compatibility()); migration_state.json additionally records blue-green phase and progress, and is what --resume reads. The two files are independent but complementary.
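An assumed shape for migration_state.json, with field names derived from the startup-detection messages in this section (illustrative only, not a fixed schema):

```python
# Assumed migration_state.json shape -- field names are hypothetical,
# chosen to match the resume/abort/rollback flows in this proposal.
import json

state = {
    "migration_id": "mig_20260417_001",
    "phase": "building",                  # dual_write | building | switched | cleanup
    "blue": "default",
    "green": "openai-v3-large",
    "dual_write": True,
    "progress": {"done": 12, "total": 47},
    "failed": [],                         # URIs for `ov reindex --retry-failed`
}
print(json.dumps(state, indent=2))
```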
On server startup, check for incomplete migration and offer resume:
```shell
# Server startup detection
WARNING: Incomplete migration detected (mig_20260417_001):
  Phase: building (12/47 resources processed)
  Blue: default, Green: openai-v3-large
  Dual-write: enabled
Options:
  1. ov reindex --resume    # Continue bulk re-embed from checkpoint
  2. ov reindex --abort     # Disable dual-write, discard green set
  3. ov reindex --all       # Restart from scratch

# If crashed after query switchover:
WARNING: Migration interrupted after query switchover:
  Active: openai-v3-large, Dual-write: was enabled
  Blue set (default) still on disk (for rollback)
Options:
  1. ov reindex --resume    # Disable dual-write, continue to cleanup
  2. ov reindex --rollback  # Switch queries back to default
  3. ov reindex --abort     # Abort migration entirely
```
6.2 Interrupted migration recovery
When ov reindex --all is interrupted (Ctrl+C, network disconnect, OOM):
| Scenario | Behavior |
|---|---|
| Graceful stop (SIGINT) | Finish current resource, save state, exit cleanly |
| Hard crash (OOM, `kill -9`) | On next startup, detect incomplete state from disk |
| Network disconnect | Embedder retry logic handles transient failures; a persistent failure marks the resource as failed and continues |
| Crash during dual-write | On restart, re-enable dual-write from persisted state, resume bulk re-embed |
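The graceful-stop row can be sketched as follows. This is an assumption about how the CLI might implement it; `bulk_reembed` and `install_sigint_flag` are hypothetical names:

```python
# Sketch (assumption): graceful SIGINT handling for `ov reindex --all`.
# Finish the in-flight resource, checkpoint to disk, exit cleanly;
# `--resume` then continues from the last checkpoint.
import signal

def install_sigint_flag() -> dict:
    state = {"stop": False}
    def _handle(signum, frame):
        state["stop"] = True        # noted here, acted on between resources
    signal.signal(signal.SIGINT, _handle)
    return state

def bulk_reembed(resources, embed_one, save_state, flag=None):
    flag = flag if flag is not None else install_sigint_flag()
    done = 0
    for uri in resources:
        embed_one(uri)
        done += 1
        save_state(done=done)       # checkpoint after every resource
        if flag["stop"]:
            break                   # state is on disk -> resumable
    return done
```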
6.3 Partial failure handling
Individual resource embed failures should not abort the entire migration:
```shell
# After migration completes with some failures:
PASS: 45/47 resources migrated successfully
FAIL: 2 resources failed:
  - viking://resources/doc-x (embedder timeout)
  - viking://resources/doc-y (invalid content)
```
Run `ov reindex --retry-failed` to retry failed resources.
6.4 Disk space pre-check
--dry-run verifies sufficient disk space for the green collection:
```shell
ov reindex --all --dry-run
# Output:
# Resources to reindex: 47
# Estimated green collection size: ~2.3 GB
# Available disk space: 15.7 GB
# Estimated time: ~12 min
```
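The check itself is straightforward; a minimal sketch, assuming a hypothetical `disk_space_ok()` helper and an already-computed size estimate:

```python
# Sketch (assumption): compare the estimated green-collection size
# against free space on the workspace volume before starting.
import shutil

def disk_space_ok(workspace: str, estimated_bytes: int, headroom: float = 1.2) -> bool:
    free = shutil.disk_usage(workspace).free
    print(f"# Estimated green collection size: {estimated_bytes / 1e9:.1f} GB")
    print(f"# Available disk space: {free / 1e9:.1f} GB")
    return free >= estimated_bytes * headroom  # 20% headroom is an assumed margin
```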
6.4.1 Migration state writability check
--dry-run also verifies that migration_state.json can be written and persisted:
```shell
ov reindex --all --dry-run
# Output:
# Resources to reindex: 47
# Estimated green collection size: ~2.3 GB
# Available disk space: 15.7 GB
# Estimated time: ~12 min
#
# Migration state:
# PASS: State path writable: <workspace>/.meta/migration_state.json
# PASS: Test write successful -- state will survive restarts
```

And for a container with a read-only mount:

```shell
ov reindex --all --dry-run
# Output:
# ...
# Migration state:
# FAIL: State path NOT writable: <workspace>/.meta/migration_state.json
# Reason: Read-only filesystem
#
# WARNING: If workspace is on a read-only mount,
# migration state will be lost on container restart.
# A full reindex will be required if migration is interrupted.
#
# To fix: ensure the workspace directory is on a persistent writable volume
```
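One way to implement the probe is an actual write-and-delete next to the state file, which catches read-only filesystems that a plain permission check would miss. A sketch under that assumption (`state_path_writable` is a hypothetical helper):

```python
# Sketch (assumption): verify migration_state.json's directory is writable
# by doing a real write-and-delete of a probe file.
import os
import tempfile

def state_path_writable(meta_dir: str) -> bool:
    try:
        os.makedirs(meta_dir, exist_ok=True)
        fd, probe = tempfile.mkstemp(dir=meta_dir, suffix=".probe")
        os.write(fd, b"{}")
        os.close(fd)
        os.remove(probe)
        return True
    except OSError:
        return False  # e.g. EROFS on a read-only mount
```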
6.5 Rollback TTL
Blue collection retention after switchover is configurable:
```jsonc
// ov.conf
{
  "embedding": {
    "migration": {
      "rollback_ttl_hours": 72,  // Delete blue set after 72h if not rolled back
      "auto_confirm": false      // If true, auto-cleanup after TTL without admin confirmation
    }
  }
}
```
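The cleanup decision these two settings imply can be expressed in a few lines. A sketch, with `should_delete_blue` as a hypothetical name:

```python
# Sketch (assumption): blue is deleted only after the TTL expires, and only
# with auto_confirm enabled or explicit admin confirmation.
from datetime import datetime, timedelta, timezone

def should_delete_blue(switched_at: datetime, ttl_hours: int,
                       auto_confirm: bool, admin_confirmed: bool = False) -> bool:
    expired = datetime.now(timezone.utc) - switched_at > timedelta(hours=ttl_hours)
    return expired and (auto_confirm or admin_confirmed)
```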
Alternatives Considered
Status quo (manual, in-place overwrite) -- works but tedious, and users see search quality degradation during the migration window. Unacceptable. Additionally, changing model or dimension requires manually dropping and recreating the entire vectordb -- a destructive operation.
Automatic reindex on config change -- too risky. A typo could trigger an expensive reindex. Better to keep it explicit via ov reindex --all.
Block search during migration -- too aggressive. Blue-green achieves zero downtime.
Expose migration state in API responses -- leaks internal operational details to users. Blue-green keeps migration invisible to users, which is the better approach.
Expected Outcome
| Who | Before | After |
|---|---|---|
| Admin | Manual config edit, restart without validation, individual reindex, no progress, no rollback | `ov config validate --live`, then `ov reindex --all --dry-run`, then `ov reindex --all` with progress and `--rollback` |
| Admin | Changing model/dimension requires dropping and recreating the entire vectordb | Blue-green: changing model and changing dimension are the same operation |
| Admin | Migration progress lost on interruption; no state after container restart | Migration state persisted to `<workspace>/.meta/`; `--resume` continues from checkpoint |
| User | Search quality degrades during migration (half-old-half-new vectors) | Zero user-visible impact -- search always hits a complete, consistent vector set |
Related Issues
feat: detect embedding model drift and add rebuild tool #1439 -- [Open PR] feat: detect embedding model drift and add rebuild tool. This proposal's prerequisite: provides embedding_meta.json persistence, compatibility_identity(), VectorRebuildService, openviking-rebuild-vectors CLI.
Feature Area

Configuration / CLI / Server Runtime