
[Feature] Improve embedder model migration experience #1523

@A0nameless0man

Description


Problem

Switching the embedding model in OpenViking requires manual, error-prone steps with no validation and no batch tooling. The process overwrites vectors in-place, causing search quality degradation visible to end-users during the migration window. A blue-green migration strategy would eliminate user-visible impact -- search always hits a complete, consistent vector set.

Current Workflow

# 1. Admin manually edits ov.conf -- no validation at all
vim ~/.openviking/ov.conf
#   modify: embedding.dense.model, dimension, api_base...

# 2. Restart the server -- downtime, no pre-check
docker restart openviking
# or:
openviking-server  # starts without any verification

# 3. Reindex resources one by one -- no batch operation
ov reindex viking://resources/doc-a --regenerate --wait
ov reindex viking://resources/doc-b --regenerate --wait
# ... repeat for every resource

# 4. User-visible impact:
#   - Search quality degrades silently (old vectors + new model = mismatch)
#   - No notification that embedding has changed
#   - No way to confirm reindex completion

Pain Points

Admin

| Problem | Impact | Evidence |
|---|---|---|
| No endpoint-level dimension validation | Config layer has a VectorDB vs Embedding dimension consistency WARNING (openviking_cli/utils/config/open_viking_config.py) and auto-syncs, but it never verifies whether the configured dimension matches the embedding endpoint's actual output | openviking-server doctor only checks if the API key is set; it does not test endpoint connectivity |
| No bulk reindex | Must run ov reindex <uri> individually for each resource | Current reindex endpoint accepts only a single URI |
| No model compatibility check | Config-time model/provider compatibility is not validated; errors surface only at query time. Example: provider=openai + a downstream model that doesn't support matryoshka representations (e.g. certain OpenAI-compatible Qwen endpoints) -- v0.3.5 passed through the dimensions parameter, causing a 400 error (#1442, fixed in v0.3.6). Or provider=litellm + a bare model name (Qwen3-Embedding-0.6B instead of dashscope/Qwen3-Embedding-0.6B) -- the query fails with "LLM Provider NOT provided" | #1442 -- users only discover config errors at search time |
| No progress visibility | ov reindex --wait blocks with no progress indication | No progress events |
| No rollback path | After reindexing with a new model, old vectors are gone | build_index() overwrites in-place |

User

| Problem | Impact |
|---|---|
| Search quality degradation during migration | reindex overwrites vectors in-place. During the migration window, query vectors use the new model while some data still has old model vectors -- results are unpredictable |
| No atomicity | No all-or-nothing switchover. Users may hit a half-old-half-new vector set |

Users should always hit a complete, consistent vector set during migration. Blue-green migration (build new vector set in background, then atomic active-pointer switchover) is the straightforward solution.

Proposed Solution

Dependency on #1439: This proposal builds on #1439 (feat: detect embedding model drift and add rebuild tool), which provides:

  • embedding_compat.py -- embedding identity persistence (embedding_meta.json), compatibility_identity(), and startup-time ensure_embedding_collection_compatibility() check
  • vector_rebuild.py -- VectorRebuildService (discover_accounts(), rebuild_account(), rebuild_accounts())
  • openviking-rebuild-vectors CLI -- --all-accounts batch rebuild entry point
  • EmbeddingCompatibilityError -- startup fail-fast mechanism

This proposal adds blue-green migration, dual-write, and atomic switchover on top of #1439. If #1439 is not yet merged, Section 4 (health gate) and Section 6 (migration resilience) will need the embedding_meta.json persistence part implemented first.

1. Extend ov reindex with batch support

ov reindex <URI> currently accepts only a single URI. When switching embedding models, admins need to reindex every resource -- doing this one at a time is impractical.

# Current:
ov reindex viking://resources/doc-a --regenerate --wait

# Proposed:

# Reindex all resources
ov reindex --all --regenerate --wait=false

# Reindex with glob pattern
ov reindex viking://resources/my-project/** --regenerate

# Dry-run: show what would be reindexed (count + estimated time)
ov reindex --all --dry-run
# Output:
#   Resources to reindex: 47
#   Estimated time: ~12 min
#   Current embedding model: text-embedding-v4 (1024d)
#   WARNING: Dimension mismatch detected -- vectordb rebuild may be required
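The glob form could be expanded client-side against the known resource URIs. A minimal sketch, assuming the server can list all URIs; the helper name is hypothetical, and `**` is simplified to fnmatch's `*` (which already matches across path segments):

```python
from fnmatch import fnmatch

def expand_reindex_targets(pattern: str, all_uris: list[str]) -> list[str]:
    """Expand a viking:// glob pattern against the known resource URIs.

    fnmatch has no special `**` handling, so normalize it to `*`; this is
    a simplification of full glob semantics.
    """
    normalized = pattern.replace("**", "*")
    return [uri for uri in all_uris if fnmatch(uri, normalized)]

uris = [
    "viking://resources/my-project/doc-a",
    "viking://resources/my-project/sub/doc-b",
    "viking://resources/other/doc-c",
]
matched = expand_reindex_targets("viking://resources/my-project/**", uris)
```

With this expansion, `--all` is just the degenerate pattern that matches every URI.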

2. Extend ov config validate with live endpoint check

openviking-server doctor (openviking_cli/doctor.py) already checks config syntax, Python version, native engine, AGFS, embedding API key existence, VLM config, and disk space. It does not test endpoint connectivity or verify actual embedding dimensions. This proposal adds those checks either to doctor or to a new ov config validate --live command. The two don't conflict -- doctor covers operational health, --live covers pre-change validation.

ov config validate currently only checks config syntax (JSON schema via serde). Extend it to verify the endpoint is reachable and the output dimension matches config.

# Current:
ov config validate
# -> only checks JSON schema

# Proposed:
ov config validate --live
# Checks:
#   PASS: Config syntax valid
#   PASS: Embedding endpoint reachable
#   PASS: Embedding dimension matches config (1024d)
#   WARNING: Embedding model differs from stored model (text-embedding-v3 -> text-embedding-v4)
#   Note: Run `ov reindex --all` to rebuild vectors

2.1 Config structure for blue-green

During migration, the system needs both the current (active) model config and the target model config. We propose a named embedding.migration map in ov.conf, where each entry is a named migration target. The existing embedding config is implicitly named default.

embedding.migration is a map of named configs, not a single block. This design:

  1. Supports read-only ov.conf -- migration targets are pre-defined, no runtime writes needed (works with container read-only mounts)
  2. Allows multiple migration targets -- admins can pre-configure several models and pick one via CLI
  3. Makes default implicit -- the existing top-level embedding config (dense/sparse/hybrid) is the active profile

Background: EmbeddingConfig supports three embedding types:

  • dense -- dense vectors (most common)
  • sparse -- sparse vectors (BM25-style)
  • hybrid -- single model returning both dense + sparse

get_embedder() logic: if hybrid exists, use hybrid embedder; if both dense and sparse exist, use CompositeHybridEmbedder; if only dense, use dense embedder.
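That precedence (hybrid over dense+sparse composite over dense-only) can be sketched as a small dispatcher. This is illustrative only: the real get_embedder() returns embedder objects, while this sketch returns labels so the selection order is easy to see:

```python
def select_embedder_kind(config: dict) -> str:
    """Pick the embedder per the precedence described above:
    hybrid > dense+sparse composite > dense only."""
    if "hybrid" in config:
        return "hybrid"
    if "dense" in config and "sparse" in config:
        return "composite"  # CompositeHybridEmbedder in the real code
    if "dense" in config:
        return "dense"
    raise ValueError("no embedding model configured")
```

A migration target must resolve through the same precedence, which is why each migration entry mirrors the dense/sparse/hybrid structure.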

Migration configs need to cover all three cases. Each migration entry mirrors the model config fields in embedding:

Case A: dense only (most common)

{
  "embedding": {
    "dense": {
      "provider": "volcengine",
      "model": "doubao-embedding-vision-251215",
      "dimension": 1024,
      "api_base": "https://ark.cn-beijing.volces.com/api/v3",
      "api_key": "..."
    },
    "migration": {
      "openai-v3-large": {
        "dense": {
          "provider": "openai",
          "model": "text-embedding-3-large",
          "dimension": 3072,
          "api_base": "https://api.openai.com/v1",
          "api_key": "..."
        }
      }
    },
    "max_concurrent": 10
  }
}

Case B: dense + sparse (composite hybrid)

{
  "embedding": {
    "dense": { "provider": "volcengine", "model": "...", "dimension": 1024 },
    "sparse": { "provider": "volcengine", "model": "..." },
    "migration": {
      "openai-mixed": {
        "dense": { "provider": "openai", "model": "...", "dimension": 3072 },
        "sparse": { "provider": "openai", "model": "..." }
      }
    },
    "max_concurrent": 10
  }
}

Case C: hybrid (single-model hybrid)

{
  "embedding": {
    "hybrid": { "provider": "volcengine", "model": "...", "dimension": 1024 },
    "migration": {
      "openai-hybrid": {
        "hybrid": { "provider": "openai", "model": "...", "dimension": 3072 }
      }
    },
    "max_concurrent": 10
  }
}

Note: Each migration entry mirrors only dense/sparse/hybrid model config fields. Top-level runtime settings (max_concurrent, circuit_breaker, max_retries, etc.) are global and don't change during migration.

CLI references migration targets by name:

# List available migration targets
ov reindex --list-targets
# Output:
#   Available migration targets:
#   - openai-v3-large  (openai/text-embedding-3-large, 3072d)
#   - qwen-3-large     (dashscope/qwen3-embedding, 1024d)

# Start migration with a pre-configured target
ov reindex --all --target openai-v3-large
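Resolving `--target` against the config map might look like the following sketch; the function name is an assumption, and it accounts for runtime keys (e.g. rollback_ttl_hours from Section 6.5) coexisting in the migration map:

```python
def resolve_migration_target(embedding_cfg: dict, name: str) -> dict:
    """Look up a named migration target; fail with a clear error listing
    what is available."""
    targets = embedding_cfg.get("migration", {})
    # only dict-valued entries are named targets; scalar keys like
    # rollback_ttl_hours are runtime settings, not targets
    named = {k: v for k, v in targets.items() if isinstance(v, dict)}
    if name not in named:
        raise KeyError(
            f"unknown migration target {name!r}; available: {sorted(named)}"
        )
    return named[name]

cfg = {
    "dense": {"provider": "volcengine", "model": "m", "dimension": 1024},
    "migration": {
        "rollback_ttl_hours": 72,
        "openai-v3-large": {"dense": {"provider": "openai", "dimension": 3072}},
    },
}
target = resolve_migration_target(cfg, "openai-v3-large")
```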

Lifecycle:

| Phase | Active profile | Migration state | Behavior |
|---|---|---|---|
| Normal | default | None | Single-model operation |
| Migration start | default | Target name selected via CLI | Dual-write + bulk re-embed to target |
| Migration complete | Auto-switched to target | Target entry removed from config | Single-model, new config becomes default |
| Rollback | Reverted to default | Target re-added | Dual-write back to old |

3. Blue-green vector migration

Instead of overwriting vectors in-place during reindex, maintain two vector sets ("blue" = current active, "green" = new model being built). Users always query the active set. Once the green set is fully built and verified, atomically promote it to active.

Changing model and changing dimension are the same operation

With in-place overwrite, changing the embedding dimension (e.g. 1024d to 3072d) requires dropping and recreating the entire vectordb -- destructive and irreversible. With blue-green, both "new model" and "new dimension" are handled the same way: write to the inactive collection, then flip the pointer. No schema migration, no data loss, no downtime.

Migration timeline

flowchart TD
    A["Admin starts reindex with new model"] --> B["Phase 1: Enable dual-write<br/>New writes -> default + openai-v3-large<br/>Queries -> default"]
    B --> C["Phase 2: Bulk re-embed existing resources<br/>default (active): text-embedding-v3, 47 resources<br/>openai-v3-large (building): 12/47...<br/>Queries -> default<br/>Dual-write active"]
    C -->|"openai-v3-large complete"| D["Phase 3: Query switchover<br/>Active pointer: default -> openai-v3-large<br/>Queries -> openai-v3-large<br/>Dual-write still active"]
    D --> E["Phase 4: Disable dual-write<br/>Writes -> openai-v3-large only<br/>default retained for rollback until TTL expires"]
    E -->|"TTL expires or admin confirms"| F["Phase 5: Delete old set<br/>Delete default collection<br/>openai-v3-large becomes the new default"]

API surface

class VectorDB:
    def get_active_collection(self) -> str:
        """Returns collection name for reads -- e.g. 'default' or 'openai-v3-large'."""
        return self.metadata.get("active_embedding_set", "default")

    def is_dual_write_enabled(self) -> bool:
        return self.metadata.get("dual_write", False)

    def switch_active(self, target: str) -> None:
        """Atomic metadata write -- instant switchover for all reads."""
        self.metadata.set("active_embedding_set", target)

    def set_dual_write(self, enabled: bool) -> None:
        self.metadata.set("dual_write", enabled)

    def upsert(self, resource_uri: str, vector) -> None:
        """In dual-write mode, writes to both active and inactive collections."""
        active = self.get_active_collection()
        self._write_to_collection(active, resource_uri, vector)
        if self.is_dual_write_enabled():
            inactive = self._get_inactive_collection()
            self._write_to_collection(inactive, resource_uri, vector)

The migration controller orchestrates phases through these primitives:

  1. set_dual_write(True) -- enable dual-write
  2. Loop: embed_with_new_model() then _write_to_collection(green, ...) -- bulk re-embed to green only
  3. switch_active(green) -- atomic query switchover
  4. set_dual_write(False) -- disable dual-write
  5. delete_collection(blue) -- cleanup old set
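The five steps above compose into a small controller loop. A minimal sketch against an in-memory stand-in for the VectorDB metadata API (FakeVectorDB and run_migration are this sketch's inventions, not proposed names):

```python
class FakeVectorDB:
    """In-memory stand-in for the VectorDB primitives sketched above."""
    def __init__(self):
        self.meta = {"active_embedding_set": "default", "dual_write": False}
        self.collections = {"default": {}}

    def switch_active(self, target: str) -> None:
        self.meta["active_embedding_set"] = target

    def set_dual_write(self, enabled: bool) -> None:
        self.meta["dual_write"] = enabled

    def delete_collection(self, name: str) -> None:
        self.collections.pop(name, None)

def run_migration(db, green: str, uris, embed) -> None:
    db.set_dual_write(True)                              # phase 1
    for uri in uris:                                     # phase 2: bulk re-embed
        db.collections.setdefault(green, {})[uri] = embed(uri)
    db.switch_active(green)                              # phase 3
    db.set_dual_write(False)                             # phase 4
    db.delete_collection("default")                      # phase 5
```

The key property: queries only ever see the collection named by active_embedding_set, so every phase boundary is a single metadata write.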

Rollback

# If admin detects quality regression after switchover:
ov reindex --rollback
# Instantly switches back to previous set (still on disk)

# Rollback behavior varies by phase:
#   Phase 1-2 (dual-write/building): disable dual-write, discard green set
#   Phase 3-4 (switched/dual-write-off): flip pointer back to blue, re-enable dual-write briefly
#   Phase 5 (cleanup): too late, blue already deleted -- full reindex required
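The phase-dependent rollback behavior is essentially a dispatch on the persisted phase; a sketch with assumed phase names (the real names would come from migration_state.json):

```python
def rollback_action(phase: str) -> str:
    """Map the current migration phase to the rollback behavior
    described above. Phase names are this sketch's assumption."""
    if phase in ("dual_write", "building"):
        return "disable dual-write, discard green set"
    if phase in ("switched", "dual_write_off"):
        return "flip pointer back to blue, re-enable dual-write"
    if phase == "cleanup_done":
        return "too late: blue deleted, full reindex required"
    raise ValueError(f"unknown phase: {phase}")
```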

4. Runtime embedding model health gate

On server startup, validate embedding config against existing vectordb state:

# NOTE: Current vectordb metadata does not store embedding_model name (only dimension).
# This pseudocode assumes #1066/#1439's embedding identity metadata is implemented.
current_model = config.embedding.dense.model
stored_model = vectordb.get_metadata("embedding_model")  # requires #1066/#1439
stored_dim = vectordb.get_metadata("embedding_dimension")

if current_model != stored_model or config.embedding.dense.dimension != stored_dim:
    log.warning(
        f"Embedding model changed: {stored_model} ({stored_dim}d) -> {current_model} ({config.embedding.dense.dimension}d). "
        f"Run 'ov reindex --all' to rebuild vectors with blue-green migration."
    )

5. Config validation on load

When ov.conf is loaded, immediately test the embedding endpoint with a trivial input (not on first search):

# Validate on config load, not on first search
try:
    result = await embedder.embed("health check", is_query=True)
    assert len(result.dense_vector) == config.embedding.dense.dimension
except Exception as e:
    raise ConfigError(f"Embedding endpoint validation failed: {e}")

6. Migration resilience

Migration can take hours. If the server restarts or the CLI disconnects, all progress is lost without persistent state.

Building on #1439: #1439 provides embedding identity persistence (embedding_meta.json) and VectorRebuildService (per-account delete + reindex). This section extends that foundation:

Why not TaskTracker: The existing TaskTracker (openviking/service/task_tracker.py) is a pure in-memory registry for short-lived background operations (e.g. session commit). Its design explicitly states "v1 is pure in-memory (no persistence). Tasks are lost on restart." Migration runs for hours and must survive restarts -- TaskTracker is fundamentally unsuitable. Migration state is persisted to disk independently.

Design principle: The five-phase flow (dual-write, then bulk re-embed, then query switchover, then disable dual-write, then cleanup) ensures zero data loss. Dual-write is enabled before bulk re-embed starts, so all writes during the migration window enter both collections.

6.1 Migration state persistence

Migration state is persisted separately from #1439's embedding_meta.json -- independent but complementary:

| File | Source | Content | Purpose |
|---|---|---|---|
| embedding_meta.json (#1439) | persist_embedding_metadata() | Active embedding identity (provider, model, dimension, mode) | Model drift detection at startup (ensure_embedding_collection_compatibility()) |
| migration_state.json (this proposal) | Migration controller | Blue-green switchover progress (phase, progress, blue/green names) | Incomplete migration detection at startup, supports --resume |
# Stored in <workspace>/.meta/migration_state.json
# e.g. ./data/.meta/migration_state.json (workspace defaults to ./data)
{
  "migration_id": "mig_20260417_001",
  "blue_name": "default",
  "green_name": "openai-v3-large",
  "active_name": "default",
  "phase": "building",
  "progress": {
    "total": 47,
    "processed": 12,
    "failed": 0,
    "failed_uris": []
  },
  "started_at": "2026-04-17T10:00:00Z",
  "updated_at": "2026-04-17T10:15:00Z"
}
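Since a crash can happen mid-write, the state file itself should be written atomically. A sketch of one standard approach (temp file in the same directory, fsync, then os.replace, which is an atomic rename on POSIX); the function name is illustrative:

```python
import json
import os
import tempfile

def save_migration_state(path: str, state: dict) -> None:
    """Persist migration state atomically: a crash mid-write can never
    leave a torn or half-written migration_state.json behind."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())          # force bytes to disk before rename
        os.replace(tmp, path)             # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)
        raise
```

Writing the temp file in the same directory matters: os.replace is only atomic within a single filesystem.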

On server startup, check for incomplete migration and offer resume:

# Server startup detection
WARNING: Incomplete migration detected (mig_20260417_001):
   Phase: building (12/47 resources processed)
   Blue: default, Green: openai-v3-large
   Dual-write: enabled

   Options:
   1. ov reindex --resume        # Continue bulk re-embed from checkpoint
   2. ov reindex --abort         # Disable dual-write, discard green set
   3. ov reindex --all           # Restart from scratch

# If crashed after query switchover:
WARNING: Migration interrupted after query switchover:
   Active: openai-v3-large, Dual-write: was enabled
   Blue set (default) still on disk (for rollback)

   Options:
   1. ov reindex --resume        # Disable dual-write, continue to cleanup
   2. ov reindex --rollback      # Switch queries back to default
   3. ov reindex --abort         # Abort migration entirely

6.2 Interrupted migration recovery

When ov reindex --all is interrupted (Ctrl+C, network disconnect, OOM):

| Scenario | Behavior |
|---|---|
| Graceful stop (SIGINT) | Finish current resource, save state, exit cleanly |
| Hard crash (OOM, kill -9) | On next startup, detect incomplete state from disk |
| Network disconnect | Embedder retry logic handles transient failures; a persistent failure marks the resource as failed and continues |
| Crash during dual-write | On restart, re-enable dual-write from persisted state, resume bulk re-embed |
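The startup prompts in Section 6.1 follow directly from the persisted phase. A sketch of that mapping, with assumed phase names mirroring the rollback sketch:

```python
def recovery_options(state: dict) -> list[str]:
    """Suggest CLI recovery flags from persisted migration state,
    per the startup prompts described above."""
    phase = state.get("phase")
    if phase == "building":
        return ["--resume", "--abort", "--all"]
    if phase in ("switched", "dual_write_off"):
        return ["--resume", "--rollback", "--abort"]
    return []  # no incomplete migration: nothing to offer
```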

6.3 Partial failure handling

Individual resource embed failures should not abort the entire migration:

# After migration completes with some failures:
PASS: 45/47 resources migrated successfully
FAIL: 2 resources failed:
   - viking://resources/doc-x (embedder timeout)
   - viking://resources/doc-y (invalid content)

Run `ov reindex --retry-failed` to retry failed resources.

6.4 Disk space pre-check

--dry-run verifies sufficient disk space for the green collection:

ov reindex --all --dry-run
# Output:
#   Resources to reindex: 47
#   Estimated green collection size: ~2.3 GB
#   Available disk space: 15.7 GB
#   Estimated time: ~12 min
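One way to estimate the green set is from the blue set's on-disk size scaled by the dimension ratio (3072/1024 = 3.0 for the example target), plus headroom; shutil.disk_usage supplies the free-space side. A sketch with an assumed function name and heuristic:

```python
import shutil

def check_disk_for_green(workspace: str, blue_bytes: int,
                         dim_ratio: float = 1.0,
                         headroom: float = 1.2) -> bool:
    """Rough pre-check: scale the blue collection's size by the new/old
    dimension ratio, add headroom for temporary files, and compare to
    free space on the workspace filesystem."""
    estimated = int(blue_bytes * dim_ratio * headroom)
    return shutil.disk_usage(workspace).free >= estimated
```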

6.4.1 Migration state writability check

--dry-run also verifies that migration_state.json can be written and persisted:

ov reindex --all --dry-run
# Output:
#   Resources to reindex: 47
#   Estimated green collection size: ~2.3 GB
#   Available disk space: 15.7 GB
#   Estimated time: ~12 min
#
#   Migration state:
#   PASS: State path writable: <workspace>/.meta/migration_state.json
#   PASS: Test write successful -- state will survive restarts

# WARNING: container with read-only mount
ov reindex --all --dry-run
# Output:
#   ...
#   Migration state:
#   FAIL: State path NOT writable: <workspace>/.meta/migration_state.json
#      Reason: Read-only filesystem
#
#   WARNING: If workspace is on a read-only mount,
#      migration state will be lost on container restart.
#      A full reindex will be required if migration is interrupted.
#
#   To fix: ensure workspace directory is on a persistent writable volume

6.5 Rollback TTL

Blue collection retention after switchover is configurable:

// ov.conf
{
  "embedding": {
    "migration": {
      "rollback_ttl_hours": 72,    // Delete blue set after 72h if not rolled back
      "auto_confirm": false         // If true, auto-cleanup after TTL without admin confirmation
    }
  }
}

Alternatives Considered

  1. Status quo (manual, in-place overwrite) -- works but tedious, and users see search quality degradation during the migration window. Unacceptable. Additionally, changing model or dimension requires manually dropping and recreating the entire vectordb -- a destructive operation.
  2. Automatic reindex on config change -- too risky. A typo could trigger an expensive reindex. Better to keep it explicit via ov reindex --all.
  3. Block search during migration -- too aggressive. Blue-green achieves zero downtime.
  4. Expose migration state in API responses -- leaks internal operational details to users. Blue-green keeps migration invisible to users, which is the better approach.

Expected Outcome

| Who | Before | After |
|---|---|---|
| Admin | Manual config edit, restart without validation, individual reindex, no progress, no rollback | ov config validate --live, then ov reindex --all --dry-run, then ov reindex --all with progress and --rollback |
| Admin | Changing model/dimension requires dropping and recreating the entire vectordb | Blue-green: changing model and changing dimension are the same operation |
| Admin | Migration progress lost on interruption, no state after container restart | Migration state persisted to <workspace>/.meta/, --resume continues from checkpoint |
| User | Search quality degrades during migration (half-old-half-new vectors) | Zero user-visible impact -- search always hits a complete, consistent vector set |

Related Issues

Feature Area

Configuration / CLI / Server Runtime

Contribution

  • I am willing to contribute to implementing this feature
