refactor(metadatakeys): merge duplicate keys across datasets and track group references#2697

Open
Junjiequan wants to merge 11 commits into master from refactor-metadatakeys-service

Conversation


@Junjiequan Junjiequan commented Apr 20, 2026

Description

MetadataKeys are now deduplicated: instead of one entry per key per dataset, a single entry is shared across all datasets that contain the same key, with userGroups, usageCount, and userGroupCounts merged.

userGroupCounts

New internal field that tracks how many datasets per group reference this key:

dataset1: { ownerGroup: "group1", userGroups: ["group2"] }
dataset2: { ownerGroup: "group1" }
dataset3: { ownerGroup: "group1" }

"userGroupCounts": { "group1": 3, "group2": 1 },
"userGroups": ["group1", "group2"],
"usageCount": 3

When a group's count hits 0, it is removed from userGroups. When usageCount hits 0, the entry is deleted.
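The removal bookkeeping described above can be sketched as a small in-memory helper (illustrative only; the actual service expresses the same logic through MongoDB `$inc`/`$pull` updates, and the function name here is hypothetical):

```javascript
// Hypothetical in-memory sketch of the counter bookkeeping; the real service
// performs equivalent updates with MongoDB update operators.
function removeDatasetReference(entry, datasetGroups) {
  // Decrement the per-group reference count for each group of the removed dataset.
  for (const group of datasetGroups) {
    entry.userGroupCounts[group] = (entry.userGroupCounts[group] || 0) - 1;
    // When a group's count hits 0, it is removed from userGroups.
    if (entry.userGroupCounts[group] <= 0) {
      delete entry.userGroupCounts[group];
      entry.userGroups = entry.userGroups.filter((g) => g !== group);
    }
  }
  entry.usageCount -= 1;
  // When usageCount hits 0, the entry is deleted (signalled here by returning null).
  return entry.usageCount <= 0 ? null : entry;
}
```

Running this against the three-dataset example above, removing dataset1 drops group2 entirely (its only reference) while group1 stays at 2; removing the remaining two datasets drives usageCount to 0 and the entry is gone.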

Changes

  • Parent document (dataset, proposal) permission changes (ownerGroup, accessGroups) update userGroups and userGroupCounts accordingly
  • If any parent document has isPublished: true, the metadata key has isPublished: true as well and is visible to all users
  • Migration script included with documentation explaining the aggregation pipeline to populate MetadataKeys from existing Dataset documents — must be run manually before deploying (duration depends on dataset count and scientificMetadata count per dataset)
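The visibility rules in the bullets above can be condensed into a small sketch (the helper and its name are illustrative, not code from this PR):

```javascript
// Illustrative only: how parent documents' fields map onto a metadata key's
// visibility, per the rules above.
function deriveKeyVisibility(parentDocs) {
  const userGroups = new Set();
  let isPublished = false;
  for (const doc of parentDocs) {
    // userGroups is the union of ownerGroup and accessGroups across parents.
    userGroups.add(doc.ownerGroup);
    for (const g of doc.accessGroups || []) userGroups.add(g);
    // If any parent document has isPublished: true, the key is public.
    isPublished = isPublished || !!doc.isPublished;
  }
  return { userGroups: [...userGroups].sort(), isPublished };
}
```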

Motivation

Fixes

  • Bug fixed (#X)

Tests included

  • Included for each change/fix?
  • Passing?

Documentation

  • swagger documentation updated (required for API changes)
  • official documentation updated

official documentation info

Summary by Sourcery

Track metadata keys at a global level across datasets and other entities, aggregating usage and access control by key rather than per-source document.

New Features:

  • Expose metadata key usage counts and per-group reference counts, enabling queries over how often and by whom metadata keys are used.
  • Add a migration to backfill the MetadataKeys collection from existing dataset scientificMetadata, merging keys across datasets and computing group usage.
  • Extend metadata key search capabilities with filtering on humanReadableName, regex matching, field projection, and pagination semantics.

Enhancements:

  • Refine metadata key aggregation to use sourceType plus key and humanReadableName as the identifier, allowing the same key to be merged across multiple datasets and groups.
  • Update datasets, samples, proposals, and instruments services to emit unified MetadataSourceDoc structures that carry resolved userGroups and drive consistent metadata key bookkeeping.
  • Strengthen access-control semantics so metadata keys respect publication status and user group visibility while avoiding exposure of group-private keys to unauthorized users.
  • Improve documentation for the MetadataKeys module and its migration, including examples of filters and an architectural overview of synchronization behavior.

Documentation:

  • Expand developer documentation for the MetadataKeys module to describe the new schema, synchronization flow, filter options, and initial migration requirements.
  • Document the new migration that rebuilds the MetadataKeys collection from dataset scientificMetadata, including pipeline details and operational guidance.

Tests:

  • Add extensive unit tests for MetadataKeysService covering aggregation pipelines, insert/delete/replace behaviors, and group/publish state transitions.
  • Add end-to-end tests around the /metadatakeys API to validate access control, search behavior, pagination, projection, and usageCount semantics across different user roles.

@Junjiequan Junjiequan requested a review from a team as a code owner April 20, 2026 18:41

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 6 issues, and left some high level feedback:

  • The migration uses aggregate(...).toArray() with $merge, which will materialize all results in memory; consider iterating the cursor (or omitting toArray() entirely when $merge is the terminal stage) to avoid potential OOM issues on large datasets.
  • Several unit tests for MetadataKeysService assert on specific updateMany call indices (e.g. mock.calls[1]); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
  • MetadataSourceDoc.sourceType is now derived from model.collection.name (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the sourceType convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The migration uses `aggregate(...).toArray()` with `$merge`, which will materialize all results in memory; consider iterating the cursor (or omitting `toArray()` entirely when `$merge` is the terminal stage) to avoid potential OOM issues on large datasets.
- Several unit tests for `MetadataKeysService` assert on specific `updateMany` call indices (e.g. `mock.calls[1]`); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
- `MetadataSourceDoc.sourceType` is now derived from `model.collection.name` (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the `sourceType` convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.

## Individual Comments

### Comment 1
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="274-283" />
<code_context>
   private async updateSharedKeys(
</code_context>
<issue_to_address>
**issue (bug_risk):** Keys whose humanReadableName changed are updated under the new _id before being treated as delete+insert, which can cause inconsistent counters and partial documents.

In `updateSharedKeys`, the group/publish update block runs for all `sharedKeys`, including those with changed `human_name`. For these, `ids` are built from the new `humanReadableName`, so `updateMany` may target documents that don’t yet exist (and should instead be created via `insertManyFromSource` with full `$setOnInsert`). Later, `humanNameChangedKeys` are processed again via `deleteMany(oldDoc)` + `insertManyFromSource(newDoc)`, so the same logical key is updated twice under different assumptions. To prevent double-counting and partial documents, exclude `humanNameChangedKeys` from the initial group/publish updates and rely only on the delete+insert path for name changes.
</issue_to_address>
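The exclusion the comment asks for amounts to partitioning the shared keys before the group/publish update. A minimal sketch, assuming each shared-key record carries `oldDoc`/`newDoc` with a `humanReadableName` (names here mirror the comment but the helper itself is hypothetical):

```javascript
// Split shared keys so renamed keys skip the in-place update and are handled
// solely by the delete+insert path, avoiding double-counting.
function partitionSharedKeys(sharedKeys) {
  const humanNameChangedKeys = sharedKeys.filter(
    (k) => k.oldDoc.humanReadableName !== k.newDoc.humanReadableName,
  );
  // Only keys whose name is unchanged go through the group/publish updateMany.
  const updatableKeys = sharedKeys.filter(
    (k) => k.oldDoc.humanReadableName === k.newDoc.humanReadableName,
  );
  return { updatableKeys, humanNameChangedKeys };
}
```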

### Comment 2
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="297-307" />
<code_context>
+    const publishedFlippedOn = !oldDoc.isPublished && newDoc.isPublished;
+    const hasGroupChanges = addedGroups.length > 0 || removedGroups.length > 0;
+
+    if (hasGroupChanges || publishedFlippedOn) {
+      await this.metadataKeyModel.updateMany(filter, {
+        ...(addedGroups.length > 0 && {
+          $addToSet: { userGroups: { $each: addedGroups } },
+        }),
+        $inc: {
+          ...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
+          ...(removedGroups.length > 0 &&
+            this.groupCountDeltas(removedGroups, -1)),
+        },
+        ...(publishedFlippedOn && { $set: { isPublished: true } }),
+      });
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The `$inc` operator is always included, even when no groups changed, which risks sending an empty `$inc` object.

In this branch, when `publishedFlippedOn` is true but there are no added/removed groups, the update still includes `$inc` with no fields. MongoDB may reject or mis-handle an empty `$inc`. Consider constructing the `$inc` object separately and only spreading it into the update when it has at least one key (i.e., when there is a non-empty group delta).

```suggestion
      const incUpdate = {
        ...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
        ...(removedGroups.length > 0 &&
          this.groupCountDeltas(removedGroups, -1)),
      };

      await this.metadataKeyModel.updateMany(filter, {
        ...(addedGroups.length > 0 && {
          $addToSet: { userGroups: { $each: addedGroups } },
        }),
        ...(Object.keys(incUpdate).length > 0 && { $inc: incUpdate }),
        ...(publishedFlippedOn && { $set: { isPublished: true } }),
      });
```
</issue_to_address>

### Comment 3
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="8-9" />
<code_context>
+List metadata keys visible to the current user for a given source type:

 ```json
 {
   "where": {
-    "sourceType": "dataset",
</code_context>
<issue_to_address>
**issue (bug_risk):** The combination of `preserveNullAndEmptyArrays` and the subsequent `$match` drops group-less datasets from `usageCount`, contrary to the comment.

Using `$unwind` with `preserveNullAndEmptyArrays: true` creates records with `userGroups: null` for group-less datasets, but the subsequent `$match: { userGroups: { $nin: [null, ""] } }` removes them. As a result, datasets without groups never reach the later grouping and are missing from `usageCount`, which contradicts the intent described in the comment. If group-less datasets should still contribute to `usageCount`, the pipeline needs to retain them (e.g., by handling `userGroups: null` explicitly or isolating the group-based logic so it doesn’t drop these records).
</issue_to_address>
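The interaction can be reproduced standalone with plain-JavaScript stand-ins for the two stages (the records below are made up for illustration):

```javascript
// After $unwind with preserveNullAndEmptyArrays: true, a group-less dataset
// yields a record with userGroups: null...
const unwound = [
  { datasetId: "A", key: "temperature", userGroups: "group1" },
  { datasetId: "B", key: "temperature", userGroups: null }, // group-less dataset
];
// ...which the subsequent $match { userGroups: { $nin: [null, ""] } } discards,
// so dataset B never reaches the $group stages that compute usageCount.
const afterMatch = unwound.filter((r) => ![null, ""].includes(r.userGroups));
const usageCount = new Set(afterMatch.map((r) => r.datasetId)).size;
```

With both datasets carrying the key, `usageCount` should be 2, yet the filtered pipeline yields 1, which is the undercount the comment describes.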

### Comment 4
<location path="test/MetadataKeys.js" line_range="514-419" />
<code_context>
+  it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test that updates `accessGroups` (not just `ownerGroup`) and asserts that `userGroups` and access control on keys are updated accordingly.

Since `MetadataKeys` derives `userGroups` from both `ownerGroup` and `accessGroups`, and the service updates counts based on `userGroups`, this path should be tested as well. Please also add a test that:

- Starts with a dataset with `accessGroups: ["group1"]` and a key
- Patches the dataset to change `accessGroups` (e.g. to `["group2"]`)
- Asserts via `/metadatakeys` that:
  - `userGroups` on the key reflects the updated access groups
  - A user who only belonged to the removed group can no longer see the key

This will cover the `accessGroups` mutation path and its impact on access control and group reference counting.

Suggested implementation:

```javascript
    await deleteDataset(pid);
  });

  it("0641: changing accessGroups updates userGroups and key visibility", async () => {
    const pid = await createDataset({
      ...TestData.DatasetWithScientificMetadataV4,
      ownerGroup: "group-owner",
      accessGroups: ["group1"],
      isPublished: false,
      scientificMetadata: { access_group_change_key: { value: 42 } },
    });

    const filterKey = {
      where: { sourceType: "Dataset", key: "access_group_change_key" },
    };

    // user belonging to group1 can see the key initially
    const keysForGroup1Before = await getMetadataKeys(
      filterKey,
      accessTokenGroup1
    );
    keysForGroup1Before.body.should.have.lengthOf(1);
    keysForGroup1Before.body[0].userGroups.should.contain("group1");
    keysForGroup1Before.body[0].userGroups.should.not.contain("group2");

    // patch dataset to switch accessGroups from group1 to group2
    await patchDataset(
      pid,
      { accessGroups: ["group2"] },
      accessTokenAdmin
    );

    // after patch, key should reflect updated userGroups (derived from updated accessGroups)
    const keysForGroup2After = await getMetadataKeys(
      filterKey,
      accessTokenGroup2
    );
    keysForGroup2After.body.should.have.lengthOf(1);
    keysForGroup2After.body[0].userGroups.should.not.contain("group1");
    keysForGroup2After.body[0].userGroups.should.contain("group2");

    // user that only belonged to removed group1 can no longer see the key
    const keysForGroup1After = await getMetadataKeys(
      filterKey,
      accessTokenGroup1
    );
    keysForGroup1After.body.should.have.lengthOf(0);

    await deleteDataset(pid);
  });

  it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {

```

To make this compile and align with your existing tests, you may need to:
1. Ensure `accessTokenGroup1` and `accessTokenGroup2` exist and represent users who:
   - Both can see datasets/keys owned by `group-owner`.
   - `accessTokenGroup1` has group membership `group1` but not `group2`.
   - `accessTokenGroup2` has group membership `group2` (and ideally not `group1`).
   If your test suite uses a different naming convention (e.g. `accessTokenUserGroup1`), adjust the token names in the new test accordingly.
2. Confirm that `patchDataset` is available in this file and patches `accessGroups` without resetting other fields. If the helper has a different signature, update the call in the new test to match.
3. If your `getMetadataKeys` helper or the `/metadatakeys` response shape differs (e.g. `userGroups` lives under a nested property), adapt the assertions to reference the correct path.
</issue_to_address>

### Comment 5
<location path="docs/developer-guide/migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.md" line_range="143" />
<code_context>
+{ datasetId: "uuid-B", key: "temperature", isPublished: false, humanReadableName: "Temperature", userGroups: ["group-1"] }
+```
+
+> `ownerGroup` is mandatory field so no null fallback is needed. `accessGroups` is optional so it falls back to `[]`.
+
+---
</code_context>
<issue_to_address>
**nitpick (typo):** Add an article in “ownerGroup is mandatory field” for correct grammar.

You could rephrase this as: “`ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.”

```suggestion
> `ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.
```
</issue_to_address>

### Comment 6
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="3" />
<code_context>
+const SOURCE_COLLECTIONS = ["Dataset"];
+
+function buildPipeline(sourceType) {
+  return [
+    // -------------------------------------------------------------------------
</code_context>
<issue_to_address>
**issue (complexity):** Consider decomposing the monolithic aggregation pipeline into smaller helper functions and a dedicated merge-stage builder so each phase of the migration is easier to understand and maintain.

You can keep all current behavior and significantly reduce cognitive load by decomposing the pipeline and isolating the merge logic.

### 1. Split the monolithic `buildPipeline` into small, named helpers

Right now `buildPipeline` mixes projection, grouping, and merge configuration in one large array. You can keep the exact aggregation semantics but break it up into helpers so each concern is localized and easier to test and review.

For example:

```js
function buildInitialStages() {
  return [
    { $match: { scientificMetadata: { $exists: true, $type: "object" } } },
    {
      $project: {
        datasetId: "$_id",
        ownerGroup: 1,
        accessGroups: 1,
        isPublished: 1,
        metaArr: { $objectToArray: "$scientificMetadata" },
      },
    },
    { $unwind: "$metaArr" },
    {
      $project: {
        datasetId: 1,
        key: "$metaArr.k",
        isPublished: 1,
        humanReadableName: { $ifNull: ["$metaArr.v.human_name", ""] },
        userGroups: {
          $setUnion: [["$ownerGroup"], { $ifNull: ["$accessGroups", []] }],
        },
      },
    },
    {
      $unwind: {
        path: "$userGroups",
        preserveNullAndEmptyArrays: true,
      },
    },
    { $match: { userGroups: { $nin: [null, ""] } } },
  ];
}

function groupByKeyAndGroup(sourceType) {
  return [
    {
      $group: {
        _id: {
          metaKeyId: {
            $concat: [`${sourceType}_`, "$key", "_", "$humanReadableName"],
          },
          group: "$userGroups",
        },
        key: { $first: "$key" },
        humanReadableName: { $first: "$humanReadableName" },
        isPublished: { $max: "$isPublished" },
        groupCount: { $sum: 1 },
        datasetIds: { $addToSet: "$datasetId" },
      },
    },
    {
      $group: {
        _id: "$_id.metaKeyId",
        key: { $first: "$key" },
        humanReadableName: { $first: "$humanReadableName" },
        isPublished: { $max: "$isPublished" },
        userGroups: { $push: "$_id.group" },
        userGroupCountsArr: { $push: { k: "$_id.group", v: "$groupCount" } },
        datasetIdSets: { $push: "$datasetIds" },
      },
    },
  ];
}

function projectFinalShape(sourceType) {
  return [
    {
      $project: {
        _id: 1,
        key: 1,
        sourceType: { $literal: sourceType },
        humanReadableName: 1,
        isPublished: 1,
        userGroups: 1,
        userGroupCounts: { $arrayToObject: "$userGroupCountsArr" },
        usageCount: {
          $size: {
            $reduce: {
              input: "$datasetIdSets",
              initialValue: [],
              in: { $setUnion: ["$$value", "$$this"] },
            },
          },
        },
        createdBy: { $literal: "migration" },
        createdAt: { $toDate: "$$NOW" },
      },
    },
  ];
}

function buildPipeline(sourceType) {
  return [
    ...buildInitialStages(),
    ...groupByKeyAndGroup(sourceType),
    ...projectFinalShape(sourceType),
    buildMergeStage(),
  ];
}
```

This keeps the same pipeline stages but makes each phase independently understandable.

### 2. Extract the `$merge` `whenMatched` pipeline into a dedicated builder

The `whenMatched` expression is dense and currently buried at the bottom of `buildPipeline`. Extracting it into its own function makes it clearer that it’s “just” merge policy and decouples it from the shape-building stages.

```js
function buildWhenMatchedPipeline() {
  return [
    {
      $replaceWith: {
        $mergeObjects: [
          "$$new",
          {
            createdAt: { $ifNull: ["$createdAt", "$$new.createdAt"] },
            createdBy: { $ifNull: ["$createdBy", "$$new.createdBy"] },
            updatedBy: { $literal: "migration" },
            updatedAt: { $toDate: "$$NOW" },

            userGroups: { $setUnion: ["$userGroups", "$$new.userGroups"] },

            userGroupCounts: {
              $arrayToObject: {
                $map: {
                  input: {
                    $setUnion: [
                      {
                        $map: {
                          input: { $objectToArray: "$userGroupCounts" },
                          as: "e",
                          in: "$$e.k",
                        },
                      },
                      {
                        $map: {
                          input: { $objectToArray: "$$new.userGroupCounts" },
                          as: "e",
                          in: "$$e.k",
                        },
                      },
                    ],
                  },
                  as: "group",
                  in: {
                    k: "$$group",
                    v: {
                      $add: [
                        {
                          $ifNull: [
                            {
                              $getField: {
                                field: "$$group",
                                input: "$userGroupCounts",
                              },
                            },
                            0,
                          ],
                        },
                        {
                          $ifNull: [
                            {
                              $getField: {
                                field: "$$group",
                                input: "$$new.userGroupCounts",
                              },
                            },
                            0,
                          ],
                        },
                      ],
                    },
                  },
                },
              },
            },

            isPublished: { $or: ["$isPublished", "$$new.isPublished"] },
            usageCount: { $add: ["$usageCount", "$$new.usageCount"] },
          },
        ],
      },
    },
  ];
}

function buildMergeStage() {
  return {
    $merge: {
      into: "MetadataKeys",
      on: "_id",
      whenMatched: buildWhenMatchedPipeline(),
      whenNotMatched: "insert",
    },
  };
}
```

This preserves all existing merge behavior (including future multi-source support), but makes the migration file far easier to scan: `buildPipeline` reads as “shape → aggregate → finalize → merge”, and the complex merge internals live in a clearly named helper that can be documented and unit-tested separately.
</issue_to_address>



@Junjiequan Junjiequan force-pushed the refactor-metadatakeys-service branch from bd604d9 to 8c915ea Compare April 23, 2026 14:10
