Conversation
NOTE: There is an example_data dir in the tests dir, as well as the one in examples. I really like the idea that the tests be self-contained to the degree possible. So ideally:
We have a system like this set up for the PYGNOME project. But:
NOTE: another category of tests (and examples) would point directly to an online source -- that's a whole other thing ...
Is something like this where you are headed?
Not really, but we can adapt. At the moment everything is in a binary blob that is in the support-data release [1]. The data release is made from an empty branch [2], where we keep only a README file with the file list + checksums. The actual data is added as a zip file via GitHub release assets. That is a trick to avoid having the files in the commit history while keeping them in the same repo. The goal is to have a single zip file for easy use. However, if this file passes 2 GB, not only will GitHub block us from adding it as a download asset, but we also start to lose efficiency. We can add more 2 GB files, though; GitHub doesn't limit that (yet).

With that said, the aim is to keep the workflow seamless. When running the tests as usual, the data download happens only once and automatically, thanks to pixi. It is cached, and the zip file will be updated only when it changes. That means the files won't be in the commit history, but they will be on local computers and CIs.

We can add more granularity and work with individual files. It is a trade-off between easier data maintenance at the cost of a bigger download, vs. easier downloads but a more complex organization. IMO we should use separate files only if they are bigger than 50 MB. Why 50 MB? Completely arbitrary, kind of based on my bad network connection's limit to run this "fast enough."

What do you think? Is this OK for now, or do you want to start with more granular control now?

PS: You can test this branch now with [1]

[1] https://github.com/ioos/xarray-subset-grid/releases/tag/2026.02.06
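The "download once, update only when changed" check described above boils down to comparing the cached zip against the checksum published in the README. A minimal sketch (function names are hypothetical; the actual mechanism is whatever pixi runs):

```python
import hashlib
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 of a file in streaming chunks (zip may be ~2 GB)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(zip_path: Path, expected_checksum: str) -> bool:
    """Re-download only when the cached zip is missing or its checksum changed."""
    if not zip_path.exists():
        return True
    return sha256sum(zip_path) != expected_checksum
```

With this, CI and local runs pay the download cost once, and a changed release asset is picked up automatically because its checksum in the README no longer matches the cached file.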
We can try pytest-vcr for those. It is nice to avoid hitting the online service too much. However, sometimes checking that the data is there is part of the test, so maybe those would fall into a category of "pytest-tests-that-can-fail-but-it-is-ok" (pytest plugin pending ;-p).
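For the record, a sketch of what those two categories could look like (test names and URLs are made up; `@pytest.mark.vcr` is the pytest-vcr marker, and a non-strict `xfail` is stock pytest's way of saying "allowed to fail"):

```python
import pytest


# Recorded once to a "cassette" file, then replayed offline (pytest-vcr plugin).
@pytest.mark.vcr()
def test_remote_catalog():
    import requests  # assumption: any HTTP client works here

    r = requests.get("https://example.com/catalog.json")  # made-up URL
    assert r.status_code == 200


# The "can-fail-but-it-is-ok" category: the failure is reported but does not break CI.
@pytest.mark.xfail(reason="remote data may be temporarily unavailable", strict=False)
def test_remote_data_is_still_there():
    import requests

    assert requests.head("https://example.com/data.zip").ok  # made-up URL
```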
OK, I think I kinda got it -- using releases, it's a whole lot easier to have one (or a few) files rather than, say, hundreds. That makes sense. So we can think of it as collections of files (probably only one for now, but maybe more in the future).

So here's a thought -- could we have a separate repo just for data files? In that case, you might have sets of files that are used by multiple projects; each one could point to the same "data" repo's assets. Now that I think about it, if you had a repo that was only data files, and nothing else, maybe they could simply be served up from there -- they'd be big, but they wouldn't change (much), so the history wouldn't be an issue -- and if they did change, you could edit the history (or start fresh).

I think "bundling" them is a good idea anyway -- and maybe keep the bundles smaller than 2 GB -- 550 MB or so? I'm making that up ... Then any test, example, etc. would need to know which bundle of files it needs. But we can start with one :-) I'm still confused about how to add a new file to the bundle, though -- but I can wait until you have it set up, and then you can tell me :-)

NOTE: I'm about to merge a branch that moved some of the smaller test files to inside the test dir. I think I still want to keep those local ... I suppose I could trim down the 6.4 MB one, though that's not that big these days ...
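The "which bundle does this test need" bookkeeping could be as simple as a manifest checked into the repo (all names below are hypothetical, just to illustrate the shape):

```python
# Hypothetical manifest mapping bundle names to their release asset and contents.
BUNDLES = {
    "core": {
        "asset": "core-data.zip",  # each asset kept well under the 2 GB limit
        "files": ["small_grid.nc", "ugrid_example.nc"],
    },
}


def bundle_for(filename: str) -> str:
    """Return the bundle name a test should request for a given data file."""
    for name, meta in BUNDLES.items():
        if filename in meta["files"]:
            return name
    raise KeyError(f"{filename} is not in any bundle")
```

Adding a new file would then mean adding it to one bundle's `files` list and re-uploading that bundle's zip, so tests never need to know which release asset holds what.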
hmm -- this triggered conda-forge/xarray-subset-grid-feedstock#1, which is no biggie, unless someone gets confused. Maybe a reason to have the data in a separate repo?
Or, now that I think about it, change the feedstock to point to PyPI instead ...
That has an easy fix, like switching to PyPI or telling the bot to pick up only SemVer releases.
pixi conflicts are due to my adding the full docs requirements ... Doing a rebase or merge from main should fix it.
Force-pushed from f4ff4dc to b3faf74.
Force-pushed from b3faf74 to 369c849.
No problem.
Done! The steps to update the data files are:

It is a bit involved, but we do not expect to do that very often. Suggestions on how to automate some of those steps are welcome! I would use this until we have the need for common data files in another repo. It will be relatively easy to move things. The main advantages here are that: (a) with pixi the data download step is seamless, and (b) no more binary files polluting the history. We can even prune the old ones from the history, but that would require everyone to re-fork to avoid re-introducing them.
Force-pushed from eac7157 to 04b449f.
Force-pushed from 3c19c59 to 878f2ac.
This is not ready yet. It is pending:

I'll try to finish those steps later today. The first 3 should be a bit cumbersome the first time, but after that, updating the data should be relatively easy.