Split tests and data #107

Open
ocefpaf wants to merge 1 commit into ioos:main from ocefpaf:split_tests_and_data

Conversation

@ocefpaf (Member) commented Feb 6, 2026

This is not ready yet. It is pending:

  • create the empty data branch for data releases
  • add a README with the data checksums
  • make a first release with the zipped data
  • run download scripts in both docs and tests GHA

I'll try to finish those steps later today. The first three should be a bit cumbersome the first time, but after that, updating the data should be relatively easy.

@ChrisBarker-NOAA (Contributor)

NOTE:

There is an example_data dir in the tests dir, as well as the one in examples.

I really like the idea that the tests be self-contained to the degree possible.

So ideally:

  • we have a set of small test files that can be kept with the tests for quick and easy test running.
  • There may be a set of larger test files, required in particular for performance evaluation.
    • these would be downloaded on demand one way or another.
    • the tests that use these should be turned on or off with a pytest flag (there is an --online flag now that can be used)
    • It probably makes sense to use (some of) these same large files for the examples.
  • I think it's better to have the individual files downloaded on demand, rather than a bulk download
    • I think pooch makes that pretty easy (see the sketch after this list).
    • That would also make it possible to have the tests download what they need where they need it, and the examples the same, but not in the same place.
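
For reference, a minimal sketch of how pooch could handle the per-file, on-demand downloads, gated behind the existing --online flag (the release URL, registry entry, checksum, and the "online" marker wiring are illustrative, not the project's actual setup):

    # Sketch only: the registry entry and checksum below are placeholders.
    import pooch
    import pytest

    TEST_DATA = pooch.create(
        path=pooch.os_cache("xarray-subset-grid"),
        base_url="https://github.com/ioos/xarray-subset-grid/releases/download/2026.02.06/",
        registry={
            # filename -> checksum; a missing or corrupted file is re-fetched
            "AMSEAS-subset.nc": "md5:0123456789abcdef0123456789abcdef",
        },
    )

    @pytest.mark.online  # skipped unless pytest is run with --online
    def test_subset_large_file():
        path = TEST_DATA.fetch("AMSEAS-subset.nc")  # downloads once, then cached
        ...  # open the dataset from `path` and run the actual checks

Each test (or example) would call fetch() for exactly the files it needs; pooch keeps them in one shared cache, so the same file is never downloaded twice even when tests and examples use it from different places.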

We have a system like this set up for the PYGNOME project. But:

  • We have our own pooch-lite -- not as robust, but pooch wasn't around back then...
  • We are hosting a data server on our systems -- not well suited to a community / open-source project.

NOTE: another category of tests (and examples) would point directly to online source -- that's a whole other thing ...
I suspect many of the examples might use these, but hope few, if any, of the tests do (they do now, and it's fragile)

Is something like this where you are headed?

@ocefpaf (Member, Author) commented Feb 6, 2026

Is something like this where you are headed?

Not really, but we can adapt. At the moment everything is in a binary blob that is in the support data release [1]. The data release is made from an empty branch [2], where we keep only a README file with the file list + checksums. The actual data is added as a zipfile via GitHub release assets. That is a trick to avoid having them in the commit history while keeping them in the same repo.

The goal is to have a single zip file for easy use. However, if this file passes 2 GB, then not only will GitHub block us from adding it as a download asset, but we also start to lose efficiency. We can add more 2 GB files though. GitHub doesn't limit that (yet).

With that said, the aim is to keep the workflow seamless. When running the tests as usual, the data download happens only once and automatically, thanks to pixi. It is cached, and the zip file will be updated only when changed. Meaning the files won't be in the commit history, but they will be on local computers and CIs.
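
To make the mechanics concrete, here is a rough sketch of what such a download task could do (the zip name, release URL, and checksum are placeholders; the real values live in this branch's download script):

    # Sketch only: URL, checksum, and paths are placeholders.
    import hashlib
    import zipfile
    from pathlib import Path
    from urllib.request import urlretrieve

    RELEASE_URL = "https://github.com/ioos/xarray-subset-grid/releases/download/2026.02.06/data.zip"
    EXPECTED_MD5 = "0123456789abcdef0123456789abcdef"  # from the data_files README
    ARCHIVE = Path("data.zip")

    def md5sum(path: Path) -> str:
        digest = hashlib.md5()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Download only when missing or stale, then unpack; the zip mirrors
    # the repo layout, so files land where the tests and docs expect them.
    if not ARCHIVE.exists() or md5sum(ARCHIVE) != EXPECTED_MD5:
        urlretrieve(RELEASE_URL, ARCHIVE)
    assert md5sum(ARCHIVE) == EXPECTED_MD5, "data zip checksum mismatch"
    with zipfile.ZipFile(ARCHIVE) as zf:
        zf.extractall(".")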

We can add more granularity and work with individual files. It is a trade-off between easier data maintenance at the cost of a bigger download, vs. easier downloads but more complex organization.

IMO we should split out separate files only if they are bigger than 50 MB. Why 50 MB? Completely arbitrary; kind of based on what my bad network connection can handle while still running this "fast enough."

What do you think? Is this OK for now, or do you want to start with more granular control now?

PS: You can test this branch now with pixi run --environment test313 test_all and the data download will be taken care of by pixi's task.

[1] https://github.com/ioos/xarray-subset-grid/releases/tag/2026.02.06
[2] https://github.com/ioos/xarray-subset-grid/tree/data_files

ocefpaf marked this pull request as ready for review February 6, 2026 17:39
@ocefpaf (Member, Author) commented Feb 6, 2026

NOTE: another category of tests (and examples) would point directly to online source -- that's a whole other thing ...
I suspect many of the examples might use these, but hope few, if any, of the tests do (they do now, and it's fragile)

We can try pytest-vcr for those. It is nice to avoid hitting the online service too much. However, sometimes checking that the data is there is part of the test, so maybe those would fall into a "pytest-tests-that-can-fail-but-it-is-ok" category. (Pytest plugin pending ;-p)
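
A minimal sketch of what a pytest-vcr test could look like (the URL is illustrative, not one the project hits): the first run records the HTTP exchange into a cassette; later runs replay it offline.

    # Sketch only: the URL is a placeholder.
    import pytest
    import requests

    @pytest.mark.vcr  # records a cassette on the first run, replays it after
    def test_remote_catalog_reachable():
        response = requests.get("https://example.com/thredds/catalog.xml")
        assert response.status_code == 200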

@ChrisBarker-NOAA (Contributor) commented Feb 6, 2026

OK, I think I kinda got it -- using releases, it's a whole lot easier to have one (or a few) files rather than, say, hundreds. That makes sense.

So we can think of it as collections of files (probably only one for now, but maybe more in the future).

Not really, but we can adapt. At the moment everything is in a binary blob that is in the support data release [1]. The data release is made from an empty branch [2], where we keep only a README file with the file list + checksums. The actual data is added as a zipfile via GitHub release assets. That is a trick to avoid having them in the commit history while keeping them in the same repo.

So here's a thought -- could we have a separate repo just for data files? In that case, you might have sets of files that are used by multiple projects; each one could point to the same "data" repo's assets.

Now that I think about it, if you had a repo that was only data files, and nothing else, maybe they could simply be served up from there -- they'd be big, but they wouldn't change (much), so the history wouldn't be an issue -- and if they did change, you could edit the history (or start fresh).

The goal is to have a single zip file for easy use. However, if this file passes 2 GB, then not only will GitHub block us from adding it as a download asset, but we also start to lose efficiency. We can add more 2 GB files though. GitHub doesn't limit that (yet).

I think "bundling" them is a good idea anyway -- an maybe keep the bundles smaller than 2G -- 550MB or so? I"m making that up ...

Then any test, example, etc. would need to know which bundle of files it needs.

But we can start with one :-)

I'm still confused about how to add a new file to the bundle though -- but I can wait until you have it set up, and then you can tell me :-)

NOTE: I'm about to merge a branch that moved some of the smaller test files to inside the test dir. I think I still want to keep those local ...

  119K  Feb  4 12:12  2D-rectangular_grid_wind.nc
  6.4M  Jun  5  2025  AMSEAS-subset.nc
   67K  Jun  5  2025  arakawa_c_test_grid.nc
   63K  Jun  5  2025  arakawa_c_test_grid.png
  119K  Feb  4 12:12  rectangular_grid_decreasing.nc
  2.6M  Jun  5  2025  SFBOFS_subset1.nc
   45K  Jun  5  2025  small_ugrid_zero_based.nc
  216K  Jun  5  2025  tris_and_bounds.nc

I suppose I could trim down the 6.4 MB one, though that's not that big these days ...

@ChrisBarker-NOAA (Contributor)

hmm -- this triggered:

conda-forge/xarray-subset-grid-feedstock#1

which is no biggie, unless someone gets confused.

Maybe a reason to have the data in a separate repo?

@ChrisBarker-NOAA (Contributor)

Or, now that I think about it, change the feedstock to point to PyPI instead ...

@ocefpaf (Member, Author) commented Feb 6, 2026

Maybe a reason to have the data in a separate repo?

That has an easy fix, like switching to PyPI or telling the bot to pick up only SemVer releases.
With that said, if the dataset will be shared with other projects, a separate shared repo is not a bad idea.

ocefpaf mentioned this pull request Feb 7, 2026
@ChrisBarker-NOAA (Contributor)

pixi conflicts are due to my adding the full docs requirements ....

Doing a rebase or merge from main should fix it.

@ocefpaf (Member, Author) commented Feb 9, 2026

pixi conflicts are due to my adding the full docs requirements ....

No problem.

Doing a rebase or merge from main should fix it.

Done!

The steps to update the data files are:

  1. Download the zip and unzip it in a new folder;
  2. Add the new files using the same directory structure as the repo, so they'll be in the right place when unzipped;
  3. Update the data_files branch README.md file with the new checksums (find -type f \( -not -name "README.md" \) -exec md5sum '{}' \; > README.md);
  4. Issue a new data release from that branch using CalVer;
  5. Update the download script with the new data release and checksum.

It is a bit involved, but we do not expect to do that very often. Suggestions on how to automate some of those steps are welcome!
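
For what it's worth, the bookkeeping in steps 2-3 could be scripted roughly like this (the folder and zip names are placeholders; the checksum listing is equivalent to the find/md5sum one-liner above):

    # Sketch only: DATA_DIR and the zip name are placeholders.
    import hashlib
    import zipfile
    from pathlib import Path

    DATA_DIR = Path("unzipped_data")  # step 1's folder, with the new files added

    def md5sum(path: Path) -> str:
        return hashlib.md5(path.read_bytes()).hexdigest()

    files = sorted(
        p for p in DATA_DIR.rglob("*") if p.is_file() and p.name != "README.md"
    )

    # Step 3: regenerate the checksum list for the data_files branch README.
    lines = [f"{md5sum(p)}  ./{p.relative_to(DATA_DIR)}" for p in files]
    (DATA_DIR / "README.md").write_text("\n".join(lines) + "\n")

    # Rebuild the release asset, preserving the repo directory layout.
    with zipfile.ZipFile("data.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            zf.write(p, p.relative_to(DATA_DIR))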

I would use this until we need common data files shared with another repo. It will be relatively easy to move things. The main advantages here are that: (a) with pixi the data download step is seamless, and (b) no more binary files polluting the history. We can even prune the old ones from history, but it would require everyone to re-fork to avoid re-introducing them.
