Task: optimize vector indexes for Brujula (investigate index rebuild, core 104#690)
Conversation
```
preferredLabel=skill_relationship.get("skills").get("preferredLabel", ""),
description=skill_relationship.get("skills").get("description", ""),
scopeNote=skill_relationship.get("skills").get("scopeNote", ""),
altLabels=skill_relationship.get("skills").get("altLabels", []),
skillType=skill_relationship.get("skills").get("skillType", ""),
originUUID=skill_relationship.get("skills").get("originUUID", ""),
UUIDHistory=skill_relationship.get("skills").get("UUIDHistory", []),
relationType=skill_relationship.get("relationType", ""),
signallingValueLabel=skill_relationship.get("signallingValueLabel", ""),
score=0.0
) for skill in skills]
) for skill_relationship in skills_relationships]
```
Bug: The removal of the $group stage from the aggregation pipeline may lead to duplicate skills being returned if the underlying relations collection contains duplicate entries.
Severity: MEDIUM
Suggested Fix
Reintroduce a deduplication mechanism. The safest approach is to add a $group stage back into the MongoDB aggregation pipeline to group results by skillId, mirroring the previous implementation's defensive approach. Alternatively, perform deduplication in the application code after fetching the results.
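Both options can be sketched roughly as follows. This is an illustrative sketch only; the stage shape and the `skillId` field name are taken from this review, while the helper function and variable names are hypothetical, not the service's actual identifiers.

```python
# Option 1: restore a $group stage in the MongoDB aggregation pipeline,
# keeping one document per skillId (mirrors the previous defensive approach).
group_stage = {
    "$group": {
        "_id": "$skillId",
        "doc": {"$first": "$$ROOT"},
    }
}

# Option 2: deduplicate in application code after fetching the results.
def dedupe_by_skill_id(entities: list[dict]) -> list[dict]:
    """Keep the first entity seen for each skillId, preserving order."""
    seen: set = set()
    unique: list[dict] = []
    for entity in entities:
        if entity["skillId"] not in seen:
            seen.add(entity["skillId"])
            unique.append(entity)
    return unique
```

Option 1 pushes the work to the database; option 2 keeps the pipeline simple at the cost of transferring duplicates over the wire.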
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: backend/app/vector_search/esco_search_service.py#L414-L432
Potential issue: The MongoDB aggregation pipeline was modified by removing a `$group`
stage that previously deduplicated results by `skillId`. The underlying `relations`
collection does not enforce uniqueness on the combination of `requiringOccupationId` and
`requiredSkillId`. If duplicate relation documents exist in the collection, this change
will cause the query to return duplicate `AssociatedSkillEntity` entries, whereas the
previous implementation would have returned a unique list. This is a functional
regression that relies on an unverified assumption that the source data contains no
duplicates.
@sentry, won't the `$limit: 1` I added in the previous pipeline stage solve the problem of removing duplicates?
Pull request overview
This PR optimizes the vector search infrastructure used by the "Brujula" feature by improving MongoDB index design and query pipeline efficiency, and by fixing a minor log placement bug in the embedding generation script.
Changes:
- Adds a new `(skillId, modelId)` compound index optimized for the `$lookup` access pattern in `create_skills_indexes`, and refactors the `_find_skills_of_occupation` aggregation pipeline to use a `$limit: 1` inside the lookup sub-pipeline (replacing a `$group` deduplication stage) and exclude large embedding fields from query results.
- Restructures `main()` in the embedding generation script to correctly guard steps 1-4 under `if opts.generate_embeddings:`, and fixes a bug where `logger.info("Script execution completed")` was at module scope (causing it to run at import time) instead of inside `main()`.
- Collapses all Swagger UI endpoints by default via `docExpansion: "none"`.
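As a rough sketch of the lookup sub-pipeline shape these changes describe (the `model_id` placeholder is an assumption for illustration; the field names `modelId`, `embedding`, `embedded_field`, and `embedded_text` come from this PR's discussion):

```python
model_id = "some-model-id"  # placeholder; the service reads this from its config

# Sub-pipeline executed inside the $lookup against the skills embeddings collection.
lookup_sub_pipeline = [
    {"$match": {"modelId": model_id}},
    # Each skill has 3 documents (one per embedded field); keep only one.
    {"$limit": 1},
    # Exclude the large embedding fields from the joined result.
    {"$project": {"embedding": 0, "embedded_field": 0, "embedded_text": 0}},
]
```

Excluding the embedding fields inside the sub-pipeline means they are never serialized into the join output, which is what reduces the query's network transfer.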
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `backend/scripts/embeddings/_common.py` | Adds `skill_id_model_id_index` with `skillId` as leading field to support the `$lookup` join pattern |
| `backend/app/vector_search/esco_search_service.py` | Optimizes `_find_skills_of_occupation` pipeline: adds `$project` to strip unused fields, uses `$limit: 1` in lookup sub-pipeline to avoid `$group` deduplication, and restructures result mapping |
| `backend/scripts/embeddings/generate_taxonomy_embeddings.py` | Nests embedding generation steps inside guard block and fixes `logger.info` indentation bug |
| `backend/app/server.py` | Adds `swagger_ui_parameters` to collapse Swagger UI by default |
```python
# Limit 1 of the skills (Note: each skill has 3 corresponding documents, pick the first).
{"$limit": 1},
```
The $limit: 1 stage in the inner $lookup pipeline picks an arbitrary document (the first one returned by MongoDB) out of the potentially 3 copies of a skill document (one per embedded field: description, preferredLabel, altLabels). Because there is no $sort before the $limit, the document returned is non-deterministic and may vary between queries or after index changes. While the non-embedding fields should be identical across all 3 copies, it would be safer and clearer to add a $sort: {"embedded_field": 1} (or any stable sort) before $limit: 1 to ensure a deterministic and predictable result.
```python
{"$match": {"modelId": self._model_id}},
# Limit 1 of the skills (Note: each skill has 3 corresponding documents, pick the first.
{"$limit": 1},
# Ignore the embedding from the result for latency fixup.
```
The comment on line 409 describes the purpose of {"$project": { "embedding": 0, "embedded_field": 0, "embedded_text": 0 }} as "latency fixup," but this phrasing is non-standard and unclear. A more descriptive comment such as "Exclude large/irrelevant fields from the result to reduce network transfer and improve latency" would better communicate the intent.
…ed MongoDB lookups to fix the issues of index missing on some forks
…ails if the script is only generating indexes