Task: optimize vector indexes for Brujula (investigate index rebuild, core 104#690)
Conversation
```
preferredLabel=skill_relationship.get("skills").get("preferredLabel", ""),
description=skill_relationship.get("skills").get("description", ""),
scopeNote=skill_relationship.get("skills").get("scopeNote", ""),
altLabels=skill_relationship.get("skills").get("altLabels", []),
skillType=skill_relationship.get("skills").get("skillType", ""),
originUUID=skill_relationship.get("skills").get("originUUID", ""),
UUIDHistory=skill_relationship.get("skills").get("UUIDHistory", []),
relationType=skill_relationship.get("relationType", ""),
signallingValueLabel=skill_relationship.get("signallingValueLabel", ""),
score=0.0
) for skill in skills]
) for skill_relationship in skills_relationships]
```
Bug: The removal of the $group stage from the aggregation pipeline may lead to duplicate skills being returned if the underlying relations collection contains duplicate entries.
Severity: MEDIUM
Suggested Fix
Reintroduce a deduplication mechanism. The safest approach is to add a $group stage back into the MongoDB aggregation pipeline to group results by skillId, mirroring the previous implementation's defensive approach. Alternatively, perform deduplication in the application code after fetching the results.
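Both options can be sketched roughly as follows. This is an illustrative sketch only; the stage shape and the `skillId` field name are taken from this review, while the helper function and variable names are hypothetical, not the service's actual identifiers.

```python
# Option 1: restore a $group stage in the MongoDB aggregation pipeline,
# keeping one document per skillId (mirrors the previous defensive approach).
group_stage = {
    "$group": {
        "_id": "$skillId",
        "doc": {"$first": "$$ROOT"},
    }
}

# Option 2: deduplicate in application code after fetching the results.
def dedupe_by_skill_id(entities: list[dict]) -> list[dict]:
    """Keep the first entity seen for each skillId, preserving order."""
    seen: set = set()
    unique: list[dict] = []
    for entity in entities:
        if entity["skillId"] not in seen:
            seen.add(entity["skillId"])
            unique.append(entity)
    return unique
```

Option 1 pushes the work to the database; option 2 keeps the pipeline simple at the cost of transferring duplicates over the wire.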
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: backend/app/vector_search/esco_search_service.py#L414-L432
Potential issue: The MongoDB aggregation pipeline was modified by removing a `$group`
stage that previously deduplicated results by `skillId`. The underlying `relations`
collection does not enforce uniqueness on the combination of `requiringOccupationId` and
`requiredSkillId`. If duplicate relation documents exist in the collection, this change
will cause the query to return duplicate `AssociatedSkillEntity` entries, whereas the
previous implementation would have returned a unique list. This is a functional
regression that relies on an unverified assumption that the source data contains no
duplicates.
@sentry, won't the `$limit: 1` I added in the previous pipeline stage solve the problem of removing duplicates?
Pull request overview
This PR optimizes the vector search infrastructure used by the "Brujula" feature by improving MongoDB index design and query pipeline efficiency, and by fixing a minor log placement bug in the embedding generation script.
Changes:
- Adds a new `(skillId, modelId)` compound index optimized for the `$lookup` access pattern in `create_skills_indexes`, and refactors the `_find_skills_of_occupation` aggregation pipeline to use a `$limit: 1` inside the lookup sub-pipeline (replacing a `$group` deduplication stage) and exclude large embedding fields from query results.
- Restructures `main()` in the embedding generation script to correctly guard steps 1-4 under `if opts.generate_embeddings:`, and fixes a bug where `logger.info("Script execution completed")` was at module scope (causing it to run at import time) instead of inside `main()`.
- Collapses all Swagger UI endpoints by default via `docExpansion: "none"`.
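As a rough sketch of the lookup sub-pipeline shape these changes describe (the `model_id` placeholder is an assumption for illustration; the field names `modelId`, `embedding`, `embedded_field`, and `embedded_text` come from this PR's discussion):

```python
model_id = "some-model-id"  # placeholder; the service reads this from its config

# Sub-pipeline executed inside the $lookup against the skills embeddings collection.
lookup_sub_pipeline = [
    {"$match": {"modelId": model_id}},
    # Each skill has 3 documents (one per embedded field); keep only one.
    {"$limit": 1},
    # Exclude the large embedding fields from the joined result.
    {"$project": {"embedding": 0, "embedded_field": 0, "embedded_text": 0}},
]
```

Excluding the embedding fields inside the sub-pipeline means they are never serialized into the join output, which is what reduces the query's network transfer.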
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `backend/scripts/embeddings/_common.py` | Adds `skill_id_model_id_index` with `skillId` as leading field to support the `$lookup` join pattern |
| `backend/app/vector_search/esco_search_service.py` | Optimizes `_find_skills_of_occupation` pipeline: adds `$project` to strip unused fields, uses `$limit: 1` in lookup sub-pipeline to avoid `$group` deduplication, and restructures result mapping |
| `backend/scripts/embeddings/generate_taxonomy_embeddings.py` | Nests embedding generation steps inside guard block and fixes `logger.info` indentation bug |
| `backend/app/server.py` | Adds `swagger_ui_parameters` to collapse Swagger UI by default |
```python
# Limit 1 of the skills (Note: each skill has 3 corresponding documents, pick the first).
{"$limit": 1},
```
The $limit: 1 stage in the inner $lookup pipeline picks an arbitrary document (the first one returned by MongoDB) out of the potentially 3 copies of a skill document (one per embedded field: description, preferredLabel, altLabels). Because there is no $sort before the $limit, the document returned is non-deterministic and may vary between queries or after index changes. While the non-embedding fields should be identical across all 3 copies, it would be safer and clearer to add a $sort: {"embedded_field": 1} (or any stable sort) before $limit: 1 to ensure a deterministic and predictable result.
```python
{"$match": {"modelId": self._model_id}},
# Limit 1 of the skills (Note: each skill has 3 corresponding documents, pick the first.
{"$limit": 1},
# Ignore the embedding from the result for latency fixup.
```
The comment on line 409 describes the purpose of {"$project": { "embedding": 0, "embedded_field": 0, "embedded_text": 0 }} as "latency fixup," but this phrasing is non-standard and unclear. A more descriptive comment such as "Exclude large/irrelevant fields from the result to reduce network transfer and improve latency" would better communicate the intent.
…ed MongoDB lookups to fix the issues of index missing on some forks
…ails if the script is only generating indexes