
Conversation

@wrongu (Contributor) commented May 26, 2023

Add conftest.py and use it to set a random seed for all pytest tests.

This PR makes the pytest suite deterministic and solves the problem of tests randomly failing due to unlucky RNG draws. There are pros and cons to this, of course; I find that it tends to help in CI settings.
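Roughly, the idea is an autouse fixture along these lines (the fixture name and seed value here are illustrative, not necessarily what the PR's conftest.py uses):

```python
# conftest.py -- minimal sketch of seeding every test; the fixture name and
# seed value are placeholders.
import random

import numpy as np
import pytest


@pytest.fixture(autouse=True)
def _fixed_seed():
    """Seed the stdlib and NumPy RNGs before each test runs."""
    random.seed(1234)
    np.random.seed(1234)
```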

@WardBrian (Collaborator)

We need to work on this issue to make CI less of a hassle, but I'm not convinced fixing a seed is the approach we should take.

@codecov-commenter

Codecov Report

Merging #31 (ccee362) into main (f845628) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main      #31   +/-   ##
=======================================
  Coverage   95.76%   95.76%           
=======================================
  Files           9        9           
  Lines         260      260           
=======================================
  Hits          249      249           
  Misses         11       11           


@wrongu (Contributor, Author) commented May 26, 2023

This could alternatively be configured to run each test twice, once with the fixed seed and once without. Or n times with n seeds, or any further combination.

Ultimately up to you of course. Just a suggestion.
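As a sketch of the n-seeds variant (the seed values and test body are placeholders; a real test would call the sampler rather than draw plain normals):

```python
# Parametrize a fixture over several seeds; include None to also get an
# unseeded run. Seeds and the test body are placeholders.
import numpy as np
import pytest


@pytest.fixture(params=[1234, 5678, None])  # None leaves the RNG unseeded
def rng(request):
    return np.random.default_rng(request.param)


def test_sample_mean(rng):
    draws = rng.normal(size=1000)
    assert abs(draws.mean()) < 0.15  # ~4.7 standard errors, so rarely fails
```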

@bob-carpenter (Collaborator)

I didn't see a related issue this solves, but the PR description indicates it addresses failures under some RNG seeds. So the obvious question is why there are failures. Is it just that a test interval is too strict, e.g., testing that 95% intervals cover the true results, which causes failures 5% of the time? In that case, we can either (a) widen the test interval, (b) do repeated tests (if we do 20 tests, we expect Binomial(20, 0.05) failures and can test that expectation), or (c) fix seeds.

The downside of (a) is that it reduces test power, the downside of (b) is a lot more computation, and the downside of (c) is brittle tests. I don't think there's an optimal solution, so we just need to pick something.
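For concreteness, here is a sketch of what (b) could look like, with the sampler call stubbed out by a plain normal draw and the allowed number of failures taken from the binomial quantile:

```python
# Option (b) sketch: repeat a 95%-interval check and bound the number of
# misses by a Binomial(20, 0.05) quantile instead of requiring every run
# to pass. The "sampler" here is a stand-in normal draw.
import numpy as np
from scipy.stats import binom


def interval_covers_truth(seed, n=1000, true_mean=0.0):
    rng = np.random.default_rng(seed)
    draws = rng.normal(loc=true_mean, size=n)
    half_width = 1.96 * draws.std(ddof=1) / np.sqrt(n)
    return abs(draws.mean() - true_mean) < half_width


def test_repeated_interval_coverage():
    misses = sum(not interval_covers_truth(seed) for seed in range(20))
    # binom.ppf(0.99, 20, 0.05) == 4, so allowing up to 4 misses keeps this
    # test's false-alarm rate below 1%.
    assert misses <= binom.ppf(0.99, 20, 0.05)
```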

@WardBrian (Collaborator)

Yes, the tests for e.g. HMC have tolerances tight enough that they still fail on some runs.

Our current test suite runs in ~12 seconds, so we have a lot of headroom to run more iterations/tests, but we need to keep an eye on this as the number of algorithms increases. At the moment, for the purposes of reviewing, I've been doing a sort of hand-wavey equivalent of (b) by checking that only one or two test runs (out of the 12 we do in our matrix of Python versions and OSes) failed.

WardBrian mentioned this pull request on Jun 14, 2023
@WardBrian (Collaborator)

I recently stumbled on https://github.com/pytest-dev/pytest-rerunfailures, which could be another approach we use (maybe only as a stopgap).
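For example (the rerun count and test name below are placeholders), the plugin can be applied per test with its flaky marker, or to the whole suite by running pytest --reruns 2:

```python
# Sketch of pytest-rerunfailures usage; rerun count and test name are
# placeholders.
import pytest


@pytest.mark.flaky(reruns=2)
def test_hmc_posterior_mean():
    ...
```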

@roualdes (Collaborator)

I see the appeal of pytest-rerunfailures, especially as a stopgap.

Brian's "hand-wavey" strategy seems good to me, especially if we just couch it in stats-speak. Modify Bob's suggestion (b) to test for 90% coverage (or some such number not too much smaller than 95%), e.g. at least 18 of 20 repeated tests pass, instead of testing the exact binomial expectation. I think this would allow more wiggle room while still protecting test power.

Then add pytest-rerunfailures with maybe one or two reruns, in case we didn't build in enough wiggle room and find ourselves frustrated by the tests.
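A rough sketch of the 18-of-20 idea, where the per-run check (sample mean within a 95% z-interval) stands in for a real sampler test:

```python
# "At least 18 of 20 repeated runs pass" sketch; one_run_passes is a
# stand-in for one run of an actual sampler test.
import numpy as np


def one_run_passes(seed, n=1000):
    rng = np.random.default_rng(seed)
    draws = rng.normal(size=n)
    return abs(draws.mean()) < 1.96 / np.sqrt(n)  # passes ~95% of the time


def test_coverage_at_least_18_of_20():
    passes = sum(one_run_passes(seed) for seed in range(20))
    assert passes >= 18
```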

