feat(schema): metadata field for dataFiles#2683

Open
abdellah257 wants to merge 3 commits into SciCatProject:master from abdellah257:datafile-metadata

@abdellah257
Contributor

Description

Adds a metadata field to each dataFile entry to hold file-specific variables:

[
  {
    "dataFileList": [
      {
        "path": "string",
        "size": 0,
        "time": "2026-04-14T12:05:20.052Z",
        "chk": "string",
        "uid": "string",
        "gid": "string",
        "perm": "string",
        "type": "string",
        "metadata": {}
      }
    ]
  }
]

Motivation

When a dataset contains a large number of files, it is very useful to collect file-specific metadata to help distinguish between them.
The following is a use case example:
[image]

Changes:

Adds a metadata field to the dataFile DTO, schema, and interface.
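
As a rough illustration of how the new field could be used, here is a minimal TypeScript sketch. The `DataFile` interface below only mirrors the JSON example in the description; it is not copied from the SciCat source, and which fields are optional is an assumption. The `filterByMetadata` helper, the `role` key, and the file paths are hypothetical and not part of this PR.

```typescript
// Hypothetical shape mirroring the JSON example in the description.
// Field optionality is an assumption, not taken from the SciCat source.
interface DataFile {
  path: string;
  size: number;
  time: string; // ISO 8601 timestamp
  chk?: string;
  uid?: string;
  gid?: string;
  perm?: string;
  type?: string;
  // New optional field: free-form, file-specific key/value pairs.
  metadata?: Record<string, unknown>;
}

// Illustrative helper (an assumption, not part of this PR): select the
// files whose metadata has `key` set to exactly `value`.
function filterByMetadata(
  files: DataFile[],
  key: string,
  value: unknown,
): DataFile[] {
  return files.filter((f) => f.metadata?.[key] === value);
}

// Made-up example data; the "role" key is purely illustrative.
const dataFileList: DataFile[] = [
  {
    path: "run/0001.nxs",
    size: 1024,
    time: "2026-04-14T12:05:20.052Z",
    metadata: { role: "calibration" },
  },
  {
    path: "run/0002.nxs",
    size: 2048,
    time: "2026-04-14T12:05:20.052Z",
    metadata: { role: "sample" },
  },
  // A file without metadata stays valid because the field is optional.
  { path: "run/0003.nxs", size: 4096, time: "2026-04-14T12:05:20.052Z" },
];

const calibrationFiles = filterByMetadata(dataFileList, "role", "calibration");
```

Because the field is optional in this sketch, entries that omit it still type-check, which matches the intent that existing deployments are unaffected if the feature is not used.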

Tests included

  • Included for each change/fix?
  • Passing?

Documentation

  • swagger documentation updated (required for API changes)
  • official documentation updated

@abdellah257 abdellah257 requested a review from a team as a code owner April 14, 2026 12:07
@abdellah257 abdellah257 changed the title feat: metadata field for dataFiles feat(schema): metadata field for dataFiles Apr 14, 2026
@HayenNico
Member

This feature was already proposed last year in this PR: #1967. For the same reason as that time, I think we should not allow arbitrary metadata in multiple places to avoid ambiguity between the responsibilities of Dataset and DataFile.

@abdellah257
Contributor Author

> This feature was already proposed last year in this PR: #1967. For the same reason as that time, I think we should not allow arbitrary metadata in multiple places to avoid ambiguity between the responsibilities of Dataset and DataFile.

I see, thank you for referencing that issue. It helps bring that discussion back, as I have arrived at the same point my predecessor reached in the end. I am very open to other suggestions, but as of now it's the only option I can see.
I can also argue that the point about conflict between a Dataset and a DataFile is not clear, especially at ILL: because a Dataset is so general, it can have up to 64 000 files, with a subset of these having a very specific function for later processing.
Including this information in a Dataset as a map<file, metadata> under scientific_metadata is also not a viable option due to the size of our datasets, and it also conflicts with the purpose of scientific_metadata in the first place.

@Junjiequan
Member

The changes look fine to me. The concern around unclear responsibilities between dataset metadata and datafile metadata makes sense, but I think it's mainly an issue if we assume a 1:1 relationship between datasets and datafiles, which I don't think is the case for everyone. To me, datafile metadata feels like a good fit for file-specific details.

That being said, I'm not ignoring the risk of users mixing up the responsibilities of dataset and dataFile. I just think there's a more practical way to address it without restricting the change. We could:

  • Add a clear field description that indicates datafile metadata is not a substitute for dataset metadata
  • Document use case boundaries for each

I'd love to discuss more about the potential risks, but my main point is that optional dataFile metadata is a good addition

@nitrosx
Member

nitrosx commented Apr 16, 2026

I understand the worries about having metadata in both datasets and files, but I think there is value in having metadata associated with a specific file, specifically when your datasets contain multiple files.
Clear documentation with examples should be able to address all the questions, from how to use it to data curation.
Also, we clearly have a use case from a supporting facility, and it will not impact existing deployments if the feature is not used or, even better, disabled.

I see ESS using this feature in the near future, specifically for derived datasets.
I support the feature. Of course, I will be more than happy to discuss the topic further if that helps clarify and dispel the worries expressed above.

@Junjiequan Junjiequan changed the title feat(schema): metadata field for dataFiles feat(schema): metadata field for dataFiles Apr 17, 2026
@Junjiequan
Member

Junjiequan commented Apr 17, 2026

If possible, please hold off on merging until the next SciCat meeting. There might be different opinions and suggestions.
