refactor(metadatakeys): merge duplicate keys across datasets and track group references #2697
Open
Junjiequan wants to merge 11 commits into master from
Conversation
Hey - I've found 6 issues, and left some high level feedback:
- The migration uses `aggregate(...).toArray()` with `$merge`, which will materialize all results in memory; consider iterating the cursor (or omitting `toArray()` entirely when `$merge` is the terminal stage) to avoid potential OOM issues on large datasets.
- Several unit tests for `MetadataKeysService` assert on specific `updateMany` call indices (e.g. `mock.calls[1]`); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
- `MetadataSourceDoc.sourceType` is now derived from `model.collection.name` (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the `sourceType` convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The migration uses `aggregate(...).toArray()` with `$merge`, which will materialize all results in memory; consider iterating the cursor (or omitting `toArray()` entirely when `$merge` is the terminal stage) to avoid potential OOM issues on large datasets.
- Several unit tests for `MetadataKeysService` assert on specific `updateMany` call indices (e.g. `mock.calls[1]`); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
- `MetadataSourceDoc.sourceType` is now derived from `model.collection.name` (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the `sourceType` convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.
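The argument-shape matching suggested for the brittle tests could look like the following minimal sketch. The `updateManyCalls` array is a hypothetical stand-in for a jest mock's `mock.calls`, and the filter shapes are illustrative, not the service's actual queries:

```javascript
// Hypothetical stand-in for jest's fn.mock.calls: each entry is the
// argument list of one updateMany(filter, update) invocation.
const updateManyCalls = [
  [{ _id: "Dataset_temperature_" }, { $set: { isPublished: true } }],
  [{ _id: "Dataset_pressure_" }, { $inc: { "userGroupCounts.group1": 1 } }],
];

// Instead of asserting on mock.calls[1], locate the call by argument shape,
// so reordering calls inside the service does not break the test.
function findCallByFilter(calls, predicate) {
  return calls.find(([filter]) => predicate(filter));
}

const pressureCall = findCallByFilter(
  updateManyCalls,
  (filter) => filter._id === "Dataset_pressure_",
);
```

The assertion then targets `pressureCall` regardless of where it lands in the call order.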
## Individual Comments
### Comment 1
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="274-283" />
<code_context>
private async updateSharedKeys(
</code_context>
<issue_to_address>
**issue (bug_risk):** Keys whose humanReadableName changed are updated under the new _id before being treated as delete+insert, which can cause inconsistent counters and partial documents.
In `updateSharedKeys`, the group/publish update block runs for all `sharedKeys`, including those with changed `human_name`. For these, `ids` are built from the new `humanReadableName`, so `updateMany` may target documents that don’t yet exist (and should instead be created via `insertManyFromSource` with full `$setOnInsert`). Later, `humanNameChangedKeys` are processed again via `deleteMany(oldDoc)` + `insertManyFromSource(newDoc)`, so the same logical key is updated twice under different assumptions. To prevent double-counting and partial documents, exclude `humanNameChangedKeys` from the initial group/publish updates and rely only on the delete+insert path for name changes.
</issue_to_address>
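One way to realize the exclusion described above is to partition `sharedKeys` before the group/publish update runs. This is a minimal sketch; the `oldDoc`/`newDoc` objects and the `humanReadableName` field mirror the review text but are assumptions about the actual shapes in `updateSharedKeys`:

```javascript
// Split sharedKeys so that name-changed keys only go through the
// delete+insert path and never the in-place group/publish updateMany.
function partitionSharedKeys(sharedKeys) {
  const humanNameChangedKeys = sharedKeys.filter(
    (k) => k.oldDoc.humanReadableName !== k.newDoc.humanReadableName,
  );
  const stableKeys = sharedKeys.filter(
    (k) => k.oldDoc.humanReadableName === k.newDoc.humanReadableName,
  );
  return { stableKeys, humanNameChangedKeys };
}

// Example: one stable key, one renamed key.
const { stableKeys, humanNameChangedKeys } = partitionSharedKeys([
  { oldDoc: { humanReadableName: "Temp" }, newDoc: { humanReadableName: "Temp" } },
  { oldDoc: { humanReadableName: "Temp" }, newDoc: { humanReadableName: "Temperature" } },
]);
```

The group/publish `updateMany` would then iterate `stableKeys` only, while `humanNameChangedKeys` is handled exclusively by `deleteMany` + `insertManyFromSource`.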
### Comment 2
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="297-307" />
<code_context>
+ const publishedFlippedOn = !oldDoc.isPublished && newDoc.isPublished;
+ const hasGroupChanges = addedGroups.length > 0 || removedGroups.length > 0;
+
+ if (hasGroupChanges || publishedFlippedOn) {
+ await this.metadataKeyModel.updateMany(filter, {
+ ...(addedGroups.length > 0 && {
+ $addToSet: { userGroups: { $each: addedGroups } },
+ }),
+ $inc: {
+ ...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
+ ...(removedGroups.length > 0 &&
+ this.groupCountDeltas(removedGroups, -1)),
+ },
+ ...(publishedFlippedOn && { $set: { isPublished: true } }),
+ });
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The `$inc` operator is always included, even when no groups changed, which risks sending an empty `$inc` object.
In this branch, when `publishedFlippedOn` is true but there are no added/removed groups, the update still includes `$inc` with no fields. MongoDB may reject or mis-handle an empty `$inc`. Consider constructing the `$inc` object separately and only spreading it into the update when it has at least one key (i.e., when there is a non-empty group delta).
```suggestion
const incUpdate = {
...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
...(removedGroups.length > 0 &&
this.groupCountDeltas(removedGroups, -1)),
};
await this.metadataKeyModel.updateMany(filter, {
...(addedGroups.length > 0 && {
$addToSet: { userGroups: { $each: addedGroups } },
}),
...(Object.keys(incUpdate).length > 0 && { $inc: incUpdate }),
...(publishedFlippedOn && { $set: { isPublished: true } }),
});
```
</issue_to_address>
### Comment 3
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="8-9" />
<code_context>
+List metadata keys visible to the current user for a given source type:
```json
{
"where": {
- "sourceType": "dataset",
</code_context>
<issue_to_address>
**issue (bug_risk):** The combination of `preserveNullAndEmptyArrays` and the subsequent `$match` drops group-less datasets from `usageCount`, contrary to the comment.
Using `$unwind` with `preserveNullAndEmptyArrays: true` creates records with `userGroups: null` for group-less datasets, but the subsequent `$match: { userGroups: { $nin: [null, ""] } }` removes them. As a result, datasets without groups never reach the later grouping and are missing from `usageCount`, which contradicts the intent described in the comment. If group-less datasets should still contribute to `usageCount`, the pipeline needs to retain them (e.g., by handling `userGroups: null` explicitly or isolating the group-based logic so it doesn’t drop these records).
</issue_to_address>
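One way to retain group-less datasets would be to normalize null groups after the preserving `$unwind` instead of `$match`-ing them away. This is a sketch of the stage objects only, not a tested migration; the sentinel-based `$addFields` normalization is an assumption about one possible fix:

```javascript
// After $unwind with preserveNullAndEmptyArrays, fold null/"" groups into a
// sentinel value instead of dropping the record, so group-less datasets
// still reach the later $group stages and contribute to usageCount.
// Later stages would strip the sentinel from userGroups/userGroupCounts.
const retainGrouplessStages = [
  { $unwind: { path: "$userGroups", preserveNullAndEmptyArrays: true } },
  {
    $addFields: {
      userGroups: {
        $cond: [
          { $in: ["$userGroups", [null, ""]] },
          "__no_group__", // hypothetical sentinel, removed before $merge
          "$userGroups",
        ],
      },
    },
  },
];
```

Alternatively, the usage counting could be computed in a separate branch (e.g. via `$facet`) before the group unwind, so the group-based logic never sees the group-less records at all.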
### Comment 4
<location path="test/MetadataKeys.js" line_range="514-419" />
<code_context>
+ it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test that updates `accessGroups` (not just `ownerGroup`) and asserts that `userGroups` and access control on keys are updated accordingly.
Since `MetadataKeys` derives `userGroups` from both `ownerGroup` and `accessGroups`, and the service updates counts based on `userGroups`, this path should be tested as well. Please also add a test that:
- Starts with a dataset with `accessGroups: ["group1"]` and a key
- Patches the dataset to change `accessGroups` (e.g. to `["group2"]`)
- Asserts via `/metadatakeys` that:
- `userGroups` on the key reflects the updated access groups
- A user who only belonged to the removed group can no longer see the key
This will cover the `accessGroups` mutation path and its impact on access control and group reference counting.
Suggested implementation:
```javascript
await deleteDataset(pid);
});
it("0641: changing accessGroups updates userGroups and key visibility", async () => {
const pid = await createDataset({
...TestData.DatasetWithScientificMetadataV4,
ownerGroup: "group-owner",
accessGroups: ["group1"],
isPublished: false,
scientificMetadata: { access_group_change_key: { value: 42 } },
});
const filterKey = {
where: { sourceType: "Dataset", key: "access_group_change_key" },
};
// user belonging to group1 can see the key initially
const keysForGroup1Before = await getMetadataKeys(
filterKey,
accessTokenGroup1
);
keysForGroup1Before.body.should.have.lengthOf(1);
keysForGroup1Before.body[0].userGroups.should.contain("group1");
keysForGroup1Before.body[0].userGroups.should.not.contain("group2");
// patch dataset to switch accessGroups from group1 to group2
await patchDataset(
pid,
{ accessGroups: ["group2"] },
accessTokenAdmin
);
// after patch, key should reflect updated userGroups (derived from updated accessGroups)
const keysForGroup2After = await getMetadataKeys(
filterKey,
accessTokenGroup2
);
keysForGroup2After.body.should.have.lengthOf(1);
keysForGroup2After.body[0].userGroups.should.not.contain("group1");
keysForGroup2After.body[0].userGroups.should.contain("group2");
// user that only belonged to removed group1 can no longer see the key
const keysForGroup1After = await getMetadataKeys(
filterKey,
accessTokenGroup1
);
keysForGroup1After.body.should.have.lengthOf(0);
await deleteDataset(pid);
});
it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {
```
To make this compile and align with your existing tests, you may need to:
1. Ensure `accessTokenGroup1` and `accessTokenGroup2` exist and represent users who:
- Both can see datasets/keys owned by `group-owner`.
- `accessTokenGroup1` has group membership `group1` but not `group2`.
- `accessTokenGroup2` has group membership `group2` (and ideally not `group1`).
If your test suite uses a different naming convention (e.g. `accessTokenUserGroup1`), adjust the token names in the new test accordingly.
2. Confirm that `patchDataset` is available in this file and patches `accessGroups` without resetting other fields. If the helper has a different signature, update the call in the new test to match.
3. If your `getMetadataKeys` helper or the `/metadatakeys` response shape differs (e.g. `userGroups` lives under a nested property), adapt the assertions to reference the correct path.
</issue_to_address>
### Comment 5
<location path="docs/developer-guide/migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.md" line_range="143" />
<code_context>
+{ datasetId: "uuid-B", key: "temperature", isPublished: false, humanReadableName: "Temperature", userGroups: ["group-1"] }
+```
+
+> `ownerGroup` is mandatory field so no null fallback is needed. `accessGroups` is optional so it falls back to `[]`.
+
+---
</code_context>
<issue_to_address>
**nitpick (typo):** Add an article in “ownerGroup is mandatory field” for correct grammar.
You could rephrase this as: “`ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.”
```suggestion
> `ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.
```
</issue_to_address>
### Comment 6
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="3" />
<code_context>
+const SOURCE_COLLECTIONS = ["Dataset"];
+
+function buildPipeline(sourceType) {
+ return [
+ // -------------------------------------------------------------------------
</code_context>
<issue_to_address>
**issue (complexity):** Consider decomposing the monolithic aggregation pipeline into smaller helper functions and a dedicated merge-stage builder so each phase of the migration is easier to understand and maintain.
You can keep all current behavior and significantly reduce cognitive load by decomposing the pipeline and isolating the merge logic.
### 1. Split the monolithic `buildPipeline` into small, named helpers
Right now `buildPipeline` mixes projection, grouping, and merge configuration in one large array. You can keep the exact aggregation semantics but break it up into helpers so each concern is localized and easier to test and review.
For example:
```js
function buildInitialStages() {
return [
{ $match: { scientificMetadata: { $exists: true, $type: "object" } } },
{
$project: {
datasetId: "$_id",
ownerGroup: 1,
accessGroups: 1,
isPublished: 1,
metaArr: { $objectToArray: "$scientificMetadata" },
},
},
{ $unwind: "$metaArr" },
{
$project: {
datasetId: 1,
key: "$metaArr.k",
isPublished: 1,
humanReadableName: { $ifNull: ["$metaArr.v.human_name", ""] },
userGroups: {
$setUnion: [["$ownerGroup"], { $ifNull: ["$accessGroups", []] }],
},
},
},
{
$unwind: {
path: "$userGroups",
preserveNullAndEmptyArrays: true,
},
},
{ $match: { userGroups: { $nin: [null, ""] } } },
];
}
function groupByKeyAndGroup(sourceType) {
return [
{
$group: {
_id: {
metaKeyId: {
$concat: [`${sourceType}_`, "$key", "_", "$humanReadableName"],
},
group: "$userGroups",
},
key: { $first: "$key" },
humanReadableName: { $first: "$humanReadableName" },
isPublished: { $max: "$isPublished" },
groupCount: { $sum: 1 },
datasetIds: { $addToSet: "$datasetId" },
},
},
{
$group: {
_id: "$_id.metaKeyId",
key: { $first: "$key" },
humanReadableName: { $first: "$humanReadableName" },
isPublished: { $max: "$isPublished" },
userGroups: { $push: "$_id.group" },
userGroupCountsArr: { $push: { k: "$_id.group", v: "$groupCount" } },
datasetIdSets: { $push: "$datasetIds" },
},
},
];
}
function projectFinalShape(sourceType) {
return [
{
$project: {
_id: 1,
key: 1,
sourceType: { $literal: sourceType },
humanReadableName: 1,
isPublished: 1,
userGroups: 1,
userGroupCounts: { $arrayToObject: "$userGroupCountsArr" },
usageCount: {
$size: {
$reduce: {
input: "$datasetIdSets",
initialValue: [],
in: { $setUnion: ["$$value", "$$this"] },
},
},
},
createdBy: { $literal: "migration" },
createdAt: { $toDate: "$$NOW" },
},
},
];
}
function buildPipeline(sourceType) {
return [
...buildInitialStages(),
...groupByKeyAndGroup(sourceType),
...projectFinalShape(sourceType),
buildMergeStage(),
];
}
```
This keeps the same pipeline stages but makes each phase independently understandable.
### 2. Extract the `$merge` `whenMatched` pipeline into a dedicated builder
The `whenMatched` expression is dense and currently buried at the bottom of `buildPipeline`. Extracting it into its own function makes it clearer that it’s “just” merge policy and decouples it from the shape-building stages.
```js
function buildWhenMatchedPipeline() {
return [
{
$replaceWith: {
$mergeObjects: [
"$$new",
{
createdAt: { $ifNull: ["$createdAt", "$$new.createdAt"] },
createdBy: { $ifNull: ["$createdBy", "$$new.createdBy"] },
updatedBy: { $literal: "migration" },
updatedAt: { $toDate: "$$NOW" },
userGroups: { $setUnion: ["$userGroups", "$$new.userGroups"] },
userGroupCounts: {
$arrayToObject: {
$map: {
input: {
$setUnion: [
{
$map: {
input: { $objectToArray: "$userGroupCounts" },
as: "e",
in: "$$e.k",
},
},
{
$map: {
input: { $objectToArray: "$$new.userGroupCounts" },
as: "e",
in: "$$e.k",
},
},
],
},
as: "group",
in: {
k: "$$group",
v: {
$add: [
{
$ifNull: [
{
$getField: {
field: "$$group",
input: "$userGroupCounts",
},
},
0,
],
},
{
$ifNull: [
{
$getField: {
field: "$$group",
input: "$$new.userGroupCounts",
},
},
0,
],
},
],
},
},
},
},
},
isPublished: { $or: ["$isPublished", "$$new.isPublished"] },
usageCount: { $add: ["$usageCount", "$$new.usageCount"] },
},
],
},
},
];
}
function buildMergeStage() {
return {
$merge: {
into: "MetadataKeys",
on: "_id",
whenMatched: buildWhenMatchedPipeline(),
whenNotMatched: "insert",
},
};
}
```
This preserves all existing merge behavior (including future multi-source support), but makes the migration file far easier to scan: `buildPipeline` reads as “shape → aggregate → finalize → merge”, and the complex merge internals live in a clearly named helper that can be documented and unit-tested separately.
</issue_to_address>
Description
MetadataKeys are now deduplicated: instead of one entry per key per dataset, a single entry is shared across all datasets containing the same key, with `userGroups`, `usageCount`, and `userGroupCounts` merged.
userGroupCounts
New internal field that tracks how many datasets per group reference this key.
When a group's count hits 0 it is removed from `userGroups`. When `usageCount` hits 0 the entry is deleted.
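That bookkeeping can be sketched as a plain-object helper (the function name and count shape are illustrative, not the service's actual API):

```javascript
// Apply a +1/-1 delta per group to userGroupCounts and prune any group
// whose count reaches 0. Returns the updated counts plus the surviving
// userGroups list derived from them.
function applyGroupDeltas(userGroupCounts, groups, delta) {
  const counts = { ...userGroupCounts };
  for (const group of groups) {
    counts[group] = (counts[group] ?? 0) + delta;
    if (counts[group] <= 0) delete counts[group];
  }
  return { userGroupCounts: counts, userGroups: Object.keys(counts) };
}

// Example: the last dataset owned by group1 is deleted, so group1's count
// drops from 1 to 0 and group1 disappears from userGroups.
const result = applyGroupDeltas({ group1: 1, group2: 3 }, ["group1"], -1);
```

In the real service the same deltas are expressed as `$inc` fields in an `updateMany`, with a follow-up cleanup for zeroed groups and entries whose `usageCount` reaches 0.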
Changes
dataset,proposal) permission changes (ownerGroup,accessGroups) update userGroups and userGroupCounts accordinglyisPublished: true, the metadata key will haveisPublished:trueas well and is visible to all usersMotivation
Fixes
Tests included
Documentation
official documentation info
Summary by Sourcery
Track metadata keys at a global level across datasets and other entities, aggregating usage and access control by key rather than per-source document.
New Features:
Enhancements:
Documentation:
Tests: