Split tests and data #107

Open
ocefpaf wants to merge 1 commit into ioos:main from ocefpaf:split_tests_and_data

Conversation

@ocefpaf (Member) commented Feb 6, 2026

This is not ready yet. It is pending:

  • create the empty data branch for data releases
  • add a README with the data checksums
  • make a first release with the zipped data
  • run download scripts in both docs and tests GHA

I'll try to finish those steps later today. The first three should be a bit cumbersome the first time, but after that, updating the data should be relatively easy.

@ChrisBarker-NOAA (Contributor)

NOTE:

There is an example_data dir in the tests dir, as well as the one in examples.

I really like the idea that the tests be self-contained to the degree possible.

So ideally:

  • we have a set of small test files that can be kept with the tests for quick and easy test running.
  • There may be a set of larger test files, required in particular for performance evaluation.
    • these would be downloaded on demand one way or another.
    • the tests that use these should be turned on or off with a pytest flag (there is an --online flag now that can be used)
    • It probably makes sense to use (some of) these same large files for the examples.
  • I think it's better to have the individual files downloaded on demand, rather than a bulk download
    • I think pooch makes that pretty easy (see the sketch after this list).
    • That would also make it possible to have the tests download what they need where they need it, and the examples the same, but not in the same place.
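
For reference, a minimal sketch of how pooch could handle the per-file, on-demand downloads, gated behind the existing --online flag (the release URL, registry entry, checksum, and the "online" marker wiring are illustrative, not the project's actual setup):

    # Sketch only: the registry entry and checksum below are placeholders.
    import pooch
    import pytest

    TEST_DATA = pooch.create(
        path=pooch.os_cache("xarray-subset-grid"),
        base_url="https://github.com/ioos/xarray-subset-grid/releases/download/2026.02.06/",
        registry={
            # filename -> checksum; a missing or corrupted file is re-fetched
            "AMSEAS-subset.nc": "md5:0123456789abcdef0123456789abcdef",
        },
    )

    @pytest.mark.online  # skipped unless pytest is run with --online
    def test_subset_large_file():
        path = TEST_DATA.fetch("AMSEAS-subset.nc")  # downloads once, then cached
        ...  # open the dataset from `path` and run the actual checks

Each test (or example) would call fetch() for exactly the files it needs; pooch keeps them in one shared cache, so the same file is never downloaded twice even when tests and examples use it from different places.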

We have a system like this set up for the PYGNOME project. But:

  • We have our own pooch-lite -- not as robust, but pooch wasn't around back then...
  • We are hosting a data server on our systems -- not well suited to a community / open-source project.

NOTE: another category of tests (and examples) would point directly to online source -- that's a whole other thing ...
I suspect many of the examples might use these, but hope few, if any, of the tests do (they do now, and it's fragile)

Is something like this where you are headed?

@ocefpaf (Member, Author) commented Feb 6, 2026

Is something like this where you are headed?

Not really, but we can adapt. At the moment everything is in a binary blob that is in the support data release [1]. The data release is made from an empty branch [2], where we keep only a README file with the file list + checksums. The actual data is added as a zipfile via GitHub release assets. That is a trick to avoid having them in the commit history while keeping them in the same repo.

The goal is to have a single zip file for easy use. However, if this file passes 2 GB, then not only will GitHub block us from adding it as a download asset, but we also start to lose efficiency. We can add more 2 GB files though. GitHub doesn't limit that (yet).

With that said, the aim is to keep the workflow seamless. When running the tests as usual, the data download happens only once and automatically, thanks to pixi. It is cached, and the zip file will be updated only when changed. Meaning the files won't be in the commit history, but they will be on local computers and CIs.
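
To make the mechanics concrete, here is a rough sketch of what such a download task could do (the zip name, release URL, and checksum are placeholders; the real values live in this branch's download script):

    # Sketch only: URL, checksum, and paths are placeholders.
    import hashlib
    import zipfile
    from pathlib import Path
    from urllib.request import urlretrieve

    RELEASE_URL = "https://github.com/ioos/xarray-subset-grid/releases/download/2026.02.06/data.zip"
    EXPECTED_MD5 = "0123456789abcdef0123456789abcdef"  # from the data_files README
    ARCHIVE = Path("data.zip")

    def md5sum(path: Path) -> str:
        digest = hashlib.md5()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Download only when missing or stale, then unpack; the zip mirrors
    # the repo layout, so files land where the tests and docs expect them.
    if not ARCHIVE.exists() or md5sum(ARCHIVE) != EXPECTED_MD5:
        urlretrieve(RELEASE_URL, ARCHIVE)
    assert md5sum(ARCHIVE) == EXPECTED_MD5, "data zip checksum mismatch"
    with zipfile.ZipFile(ARCHIVE) as zf:
        zf.extractall(".")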

We can add more granularity and work with individual files. It is a trade-off between easier data maintenance at the cost of a bigger download, vs. easier downloads but more complex organization.

IMO we should split out separate files only if they are bigger than 50 MB. Why 50 MB? Completely arbitrary; kind of based on what my bad network connection can handle while still running this "fast enough."

What do you think? Is this OK for now, or do you want to start with more granular control now?

PS: You can test this branch now with pixi run --environment test313 test_all and the data download will be taken care of by pixi's task.

[1] https://github.com/ioos/xarray-subset-grid/releases/tag/2026.02.06
[2] https://github.com/ioos/xarray-subset-grid/tree/data_files

ocefpaf marked this pull request as ready for review February 6, 2026 17:39
@ocefpaf (Member, Author) commented Feb 6, 2026

NOTE: another category of tests (and examples) would point directly to online source -- that's a whole other thing ...
I suspect many of the examples might use these, but hope few, if any, of the tests do (they do now, and it's fragile)

We can try pytest-vcr for those. It is nice to avoid hitting the online service too much. However, sometimes checking that the data is there is part of the test, so maybe those would fall into a "pytest-tests-that-can-fail-but-it-is-ok" category. (Pytest plugin pending ;-p)
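
A minimal sketch of what a pytest-vcr test could look like (the URL is illustrative, not one the project hits): the first run records the HTTP exchange into a cassette; later runs replay it offline.

    # Sketch only: the URL is a placeholder.
    import pytest
    import requests

    @pytest.mark.vcr  # records a cassette on the first run, replays it after
    def test_remote_catalog_reachable():
        response = requests.get("https://example.com/thredds/catalog.xml")
        assert response.status_code == 200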

@ChrisBarker-NOAA (Contributor) commented Feb 6, 2026

OK, I think I kinda got it -- using releases, it's a whole lot easier to have one (or a few) files rather than, say, hundreds. That makes sense.

So we can think of it as collections of files (probably only one for now, but maybe more in the future).

Not really, but we can adapt. At the moment everything is in a binary blob that is in the support data release [1]. The data release is made from an empty branch [2], where we keep only a README file with the file list + checksums. The actual data is added as a zipfile via GitHub release assets. That is a trick to avoid having them in the commit history while keeping them in the same repo.

So here's a thought -- could we have a separate repo just for data files? In that case, you might have sets of files that are used by multiple projects; each one could point to the same "data" repo's assets.

Now that I think about it, if you had a repo that was only data files, and nothing else, maybe they could simply be served up from there -- they'd be big, but they wouldn't change (much), so the history wouldn't be an issue -- and if they did change, you could edit the history (or start fresh).

The goal is to have a single zip file for easy use. However, if this file passes 2 GB, then not only will GitHub block us from adding it as a download asset, but we also start to lose efficiency. We can add more 2 GB files though. GitHub doesn't limit that (yet).

I think "bundling" them is a good idea anyway -- an maybe keep the bundles smaller than 2G -- 550MB or so? I"m making that up ...

Then any test, example, etc. would need to know which bundle of files it needs.

But we can start with one :-)

I'm still confused about how to add a new file to the bundle though -- but I can wait until you have it set up, and then you can tell me :-)

NOTE: I'm about to merge a branch that moved some of the smaller test files to inside the test dir. I think I still want to keep those local ...

  119K  Feb  4 12:12  2D-rectangular_grid_wind.nc
  6.4M  Jun  5  2025  AMSEAS-subset.nc
   67K  Jun  5  2025  arakawa_c_test_grid.nc
   63K  Jun  5  2025  arakawa_c_test_grid.png
  119K  Feb  4 12:12  rectangular_grid_decreasing.nc
  2.6M  Jun  5  2025  SFBOFS_subset1.nc
   45K  Jun  5  2025  small_ugrid_zero_based.nc
  216K  Jun  5  2025  tris_and_bounds.nc

I suppose I could trim down the 6.4 MB one, though that's not that big these days ...

@ChrisBarker-NOAA (Contributor)

hmm -- this triggered:

conda-forge/xarray-subset-grid-feedstock#1

which is no biggie, unless someone gets confused.

Maybe a reason to have the data in a separate repo?

@ChrisBarker-NOAA (Contributor)

Or, now that I think about it, change the feedstock to point to PyPI instead ...

@ocefpaf (Member, Author) commented Feb 6, 2026

Maybe a reason to have the data in a separate repo?

That has an easy fix, like switching to PyPI or telling the bot to pick up only SemVer releases.
With that said, if the dataset will be shared with other projects, a separate shared repo is not a bad idea.

ocefpaf mentioned this pull request Feb 7, 2026
@ChrisBarker-NOAA (Contributor)

pixi conflicts are due to my adding the full docs requirements ....

Doing a rebase or merge from main should fix it.

@ocefpaf (Member, Author) commented Feb 9, 2026

pixi conflicts are due to my adding the full docs requirements ....

No problem.

Doing a rebase or merge from main should fix it.

Done!

The steps to update the data files are:

  1. Download the zip and unzip it in a new folder;
  2. Add the new files using the same directory structure as the repo, so they'll be in the right place when unzipped;
  3. Update the data_files branch README.md file with the new checksums (find -type f \( -not -name "README.md" \) -exec md5sum '{}' \; > README.md);
  4. Issue a new data release from that branch using CalVer;
  5. Update the download script with the new data release and checksum.

It is a bit involved, but we do not expect to do that very often. Suggestions on how to automate some of those steps are welcome!
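
For what it's worth, the bookkeeping in steps 2-3 could be scripted roughly like this (the folder and zip names are placeholders; the checksum listing is equivalent to the find/md5sum one-liner above):

    # Sketch only: DATA_DIR and the zip name are placeholders.
    import hashlib
    import zipfile
    from pathlib import Path

    DATA_DIR = Path("unzipped_data")  # step 1's folder, with the new files added

    def md5sum(path: Path) -> str:
        return hashlib.md5(path.read_bytes()).hexdigest()

    files = sorted(
        p for p in DATA_DIR.rglob("*") if p.is_file() and p.name != "README.md"
    )

    # Step 3: regenerate the checksum list for the data_files branch README.
    lines = [f"{md5sum(p)}  ./{p.relative_to(DATA_DIR)}" for p in files]
    (DATA_DIR / "README.md").write_text("\n".join(lines) + "\n")

    # Rebuild the release asset, preserving the repo directory layout.
    with zipfile.ZipFile("data.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            zf.write(p, p.relative_to(DATA_DIR))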

I would use this until we need common data files shared with another repo. It will be relatively easy to move things. The main advantages here are that: (a) with pixi the data download step is seamless, and (b) no more binary files polluting the history. We can even prune the old ones from history, but it would require everyone to re-fork to avoid re-introducing them.
