task: optimize vector indexes for brujula investigate index rebuild core 104 #690

Merged
irumvanselme merged 4 commits into main from
task/optimize-vector-indexes-for-brujula-investigate-index-rebuild-CORE-104
Mar 10, 2026

Conversation

@irumvanselme
Collaborator

No description provided.

Comment on lines +422 to +432
preferredLabel=skill_relationship.get("skills").get("preferredLabel", ""),
description=skill_relationship.get("skills").get("description", ""),
scopeNote=skill_relationship.get("skills").get("scopeNote", ""),
altLabels=skill_relationship.get("skills").get("altLabels", []),
skillType=skill_relationship.get("skills").get("skillType", ""),
originUUID=skill_relationship.get("skills").get("originUUID", ""),
UUIDHistory=skill_relationship.get("skills").get("UUIDHistory", []),
relationType=skill_relationship.get("relationType", ""),
signallingValueLabel=skill_relationship.get("signallingValueLabel", ""),
score=0.0
) for skill in skills]
) for skill_relationship in skills_relationships]

Bug: The removal of the $group stage from the aggregation pipeline may lead to duplicate skills being returned if the underlying relations collection contains duplicate entries.
Severity: MEDIUM

Suggested Fix

Reintroduce a deduplication mechanism. The safest approach is to add a $group stage back into the MongoDB aggregation pipeline to group results by skillId, mirroring the previous implementation's defensive approach. Alternatively, perform deduplication in the application code after fetching the results.
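
If the application-side route is taken, a minimal sketch could look like this (the function name and input shape are illustrative assumptions, since the surrounding PR code is not shown here; `skillId` mirrors the key the removed `$group` stage used):

```python
# Sketch of the alternative fix: deduplicate in application code after the
# aggregation pipeline runs. Keeps the first occurrence of each skillId and
# drops later duplicates, mirroring what the old $group stage guaranteed.
def dedup_by_skill_id(fetched: list[dict]) -> list[dict]:
    seen: set = set()
    unique: list[dict] = []
    for doc in fetched:
        key = doc.get("skillId")
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```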

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: backend/app/vector_search/esco_search_service.py#L414-L432

Potential issue: The MongoDB aggregation pipeline was modified by removing a `$group`
stage that previously deduplicated results by `skillId`. The underlying `relations`
collection does not enforce uniqueness on the combination of `requiringOccupationId` and
`requiredSkillId`. If duplicate relation documents exist in the collection, this change
will cause the query to return duplicate `AssociatedSkillEntity` entries, whereas the
previous implementation would have returned a unique list. This is a functional
regression that relies on an unverified assumption that the source data contains no
duplicates.


Collaborator Author

@sentry, since I added a `$limit: 1` in the previous pipeline stage, won't that solve the problem of duplicates?

@irumvanselme irumvanselme force-pushed the task/optimize-vector-indexes-for-brujula-investigate-index-rebuild-CORE-104 branch from 750f4f4 to 8ff4d7b Compare March 10, 2026 09:40
@irumvanselme irumvanselme requested a review from Copilot March 10, 2026 09:43
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes the vector search infrastructure used by the "Brujula" feature by improving MongoDB index design, query pipeline efficiency, and fixing a minor script execution log placement bug.

Changes:

  • Adds a new (skillId, modelId) compound index optimized for the $lookup access pattern in create_skills_indexes, and refactors the _find_skills_of_occupation aggregation pipeline to use a $limit: 1 inside the lookup sub-pipeline (replacing a $group deduplication stage) and excludes large embedding fields from query results.
  • Restructures main() in the embedding generation script to correctly guard steps 1-4 under if opts.generate_embeddings:, and fixes a bug where logger.info("Script execution completed") was at module scope (causing it to run at import time) instead of inside main().
  • Collapses all Swagger UI endpoints by default via docExpansion: none.
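
The pipeline change in the first bullet can be sketched roughly as follows; the collection and field names (skills, requiredSkillId, modelId, the embedding fields) are inferred from the review comments and may differ from the actual code:

```python
# Rough sketch of the optimized $lookup sub-pipeline (MongoDB 5.0+ syntax,
# which allows combining localField/foreignField with a pipeline).
def build_skills_lookup(model_id: str) -> dict:
    return {
        "$lookup": {
            "from": "skills",
            "localField": "requiredSkillId",
            "foreignField": "skillId",
            "pipeline": [
                # Keep only documents belonging to the active embedding model.
                {"$match": {"modelId": model_id}},
                # Each skill has 3 documents (one per embedded field); take one.
                {"$limit": 1},
                # Exclude the large embedding fields to cut network transfer.
                {"$project": {"embedding": 0, "embedded_field": 0, "embedded_text": 0}},
            ],
            "as": "skills",
        }
    }
```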

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • backend/scripts/embeddings/_common.py: Adds skill_id_model_id_index with skillId as the leading field to support the $lookup join pattern
  • backend/app/vector_search/esco_search_service.py: Optimizes the _find_skills_of_occupation pipeline: adds $project to strip unused fields, uses $limit: 1 in the lookup sub-pipeline to avoid $group deduplication, and restructures result mapping
  • backend/scripts/embeddings/generate_taxonomy_embeddings.py: Nests embedding generation steps inside the guard block and fixes the logger.info indentation bug
  • backend/app/server.py: Adds swagger_ui_parameters to collapse the Swagger UI by default
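
As a sketch of the index addition in _common.py (the index name comes from the description above; the exact key order is an inference from the $lookup access pattern being optimized):

```python
# Key specification for the compound index, in the (key, direction) form that
# pymongo's Collection.create_index accepts. skillId leads so the index serves
# the $lookup equality match on skillId; modelId covers the follow-up $match.
def skill_id_model_id_index_spec() -> list[tuple[str, int]]:
    return [("skillId", 1), ("modelId", 1)]

# Usage (assuming a pymongo collection handle named db.skills):
#   db.skills.create_index(skill_id_model_id_index_spec(),
#                          name="skill_id_model_id_index")
```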

Comment on lines +407 to +408
# Limit 1 of the skills (Note: each skill has 3 corresponding documents, pick the first.
{"$limit": 1},

Copilot AI Mar 10, 2026

The $limit: 1 stage in the inner $lookup pipeline picks an arbitrary document (the first one returned by MongoDB) out of the potentially 3 copies of a skill document (one per embedded field: description, preferredLabel, altLabels). Because there is no $sort before the $limit, the document returned is non-deterministic and may vary between queries or after index changes. While the non-embedding fields should be identical across all 3 copies, it would be safer and clearer to add a $sort: {"embedded_field": 1} (or any stable sort) before $limit: 1 to ensure a deterministic and predictable result.
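
The suggested change, sketched against the snippet above (the placeholder model id is illustrative):

```python
# Sketch: a stable $sort before $limit makes the picked document deterministic,
# instead of depending on whatever order MongoDB happens to return.
def deterministic_sub_pipeline(model_id: str) -> list[dict]:
    return [
        {"$match": {"modelId": model_id}},
        # Any stable key works; embedded_field distinguishes the 3 copies.
        {"$sort": {"embedded_field": 1}},
        {"$limit": 1},
    ]
```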

Copilot uses AI. Check for mistakes.
{"$match": {"modelId": self._model_id}},
# Limit 1 of the skills (Note: each skill has 3 corresponding documents, pick the first.
{"$limit": 1},
# Ignore the embedding from the result for latency fixup.
Copilot AI Mar 10, 2026

The comment on line 409 describes the purpose of {"$project": { "embedding": 0, "embedded_field": 0, "embedded_text": 0 }} as "latency fixup," but this phrasing is non-standard and unclear. A more descriptive comment such as "Exclude large/irrelevant fields from the result to reduce network transfer and improve latency" would better communicate the intent.
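
Applied to the stage in question, the suggestion amounts to (field names taken from the comment above):

```python
# Exclude large/irrelevant fields from the result to reduce network transfer
# and improve latency (the clearer phrasing the reviewer suggests).
projection_stage = {
    "$project": {"embedding": 0, "embedded_field": 0, "embedded_text": 0}
}
```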

@irumvanselme irumvanselme force-pushed the task/optimize-vector-indexes-for-brujula-investigate-index-rebuild-CORE-104 branch from 8ff4d7b to 2943fc2 Compare March 10, 2026 09:59
@irumvanselme irumvanselme merged commit 48d7a83 into main Mar 10, 2026
1 of 4 checks passed
@irumvanselme irumvanselme deleted the task/optimize-vector-indexes-for-brujula-investigate-index-rebuild-CORE-104 branch March 10, 2026 09:59
