affine invariant sampler + type cleanup #10
base: main
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main      #10      +/-   ##
==========================================
- Coverage   94.19%   93.30%   -0.89%
==========================================
  Files           9        9
  Lines         224      269      +45
==========================================
+ Hits          211      251      +40
- Misses         13       18       +5
@WardBrian: This one's not quite ready for re-review until I get test coverage up to 100%. Is there a way to see what the coverage is? I already know that I'm not testing things like all the paths through RNG creation or initialization yet. I want to get this class for ensemble sampling squared away and then I'll go back and get all the other ones up to form with doc and tests.
For test coverage, I would look here: https://app.codecov.io/gh/flatironinstitute/bayes-kit/blob/affine-invariant-walker-v1/bayes_kit/ensemble.py. That will get updated every time CI is run on this branch. Agreed it’s good to finish things. I’m a bit hesitant about going all in on docstrings right now, especially if we think names of things or structure of APIs could still evolve - docstrings are much more annoying to keep updated than types and tests in my experience, since even Sphinx won’t warn you if the docstring doesn’t match the actual arguments that the function has.
I'm in favor of getting the details into doc sooner rather than later for several reasons in decreasing importance.
I agree that there's debt to pay from thorough doc if the code changes. If the code changes but not the doc, that's even worse than no doc. I don't imagine this is going to be so volatile that we won't be able to keep up, but time will tell.
bayes_kit/ensemble.py (outdated)

```python
"""
return self.sample()


def draw_z(self) -> Sample:
```
- What's a `z`?
- Should we prefix "private" methods with an underscore, as we do with fields? Doing so can make it more obvious what's the intended API.
`z` is the name of the variable in the paper.

The doc explains how the function works:

    Return a random draw of `z` in `(1/a, a)` with `p(z) propto 1 / sqrt(z)`

The usage shows that it's an interpolation ratio:

```python
z = self.draw_z()
theta_star: NDArray[np.float64] = theta_j + z * (theta_k - theta_j)
```

Are you asking indirectly what it is because you think it should have a more descriptive variable name?
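(For context, a minimal sketch of how a draw with that density can be generated by inverse-CDF sampling; the standalone function and its `a`/`rng` arguments are illustrative, not necessarily how `ensemble.py` implements `draw_z`:)

```python
import numpy as np


def draw_z(a: float, rng: np.random.Generator) -> float:
    """Draw z in (1/a, a) with p(z) proportional to 1 / sqrt(z)."""
    u = rng.uniform()
    # Under this density, sqrt(z) is uniform on (1/sqrt(a), sqrt(a)),
    # so invert the CDF: z = (1/sqrt(a) + u * (sqrt(a) - 1/sqrt(a)))**2.
    return (1.0 / np.sqrt(a) + u * (np.sqrt(a) - 1.0 / np.sqrt(a))) ** 2
```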
I'm OK prefixing internal methods with an underscore. I wasn't even aware of that convention. Please bear with me for a couple of years as I come up to speed on Python!
```python
sampler = AffineInvariantWalker(model_dummy)


def test_aiw_std_normal() -> None:
```
I appreciate the elegance of delegating to the actual sampling test; however, I think there are two potential issues:
- This can get a bit burdensome in test execution time; and
- it's not clear that the "sampler worked" base test is the right level of granularity for the features being tested.
On the second point:
- If the `run_sampling_test` call passes with the sampler's default parameters, it isn't clear that rerunning it with different parameters can capture how those parameters have changed the underlying behavior. Like, would specifying a different value for `a` be expected to produce a tighter tolerance, convergence after fewer iterations, etc.? For another example, as written, we test that an initialization value can be passed successfully, but it's not clear whether that value was actually honored, e.g. by seeing whether choosing a good initial value results in faster convergence/tighter tolerance/etc. Essentially, if the tolerances in `run_sampling_test` are loose enough, that function might pass cases where the desired behavior from the extra sampler parameters isn't actually implemented.
- Conversely, we may not need to run even 10k iterations to confirm that the desired behavior is being honored. For instance, when setting the number of walkers, it may be enough to construct the sampler object, and then assert on the dimensions of its `_thetas` member (along with maybe some tests for e.g. `stretch_move` specifically, to make sure that function is actually using all available walkers appropriately).
I'm not proposing to test a bunch of volatile implementation details, but piercing the veil a little bit might lead to more specific tests and a faster test suite overall.
I'm not entirely sure what you're asking me to do here.
this can get a bit burdensome in test execution time
Our whole test suite, including this one, runs in 13s on my desktop. Is this really a concern or can I just ignore this for now?
it's not clear that the "sampler worked" base test is the right level of granularity
If the run_sampling_test call passes with the sampler's default parameters, it isn't clear that rerunning it with different parameters can capture how those parameters have changed the underlying behavior.
There's no good theory around this to test in the affine-invariant case. The hyperparameter a is similar to the proposal scale in random-walk Metropolis. There will be an optimal value. In many cases we know what that is for Metropolis but there's not much theory around the affine-invariant ensemble methods. Should we start by testing the behavior of differently scaled or differently shaped proposals in Metropolis?
If we're willing to open up the box and test internals, I could maybe rewrite parts of the code to expose pieces as standalone functions rather than leaving them inline. For example, we could test that the variance of proposals is higher with larger `a`.
assert on the dimensions of its _thetas member (along with maybe some tests for e.g. stretch_move specifically, to make sure that function is actually using all available walkers appropriately).

That won't test that they're being used appropriately. Everything's stochastic all the way down: updating a single walker involves selecting a random complementary walker, proposing, then doing an MH accept/reject step. I'm not sure how to test things like whether we have a uniform distribution over all complementary walkers. Again, we'd have to decompose the sampler into finer-grained pieces and test those.
For another example, as written, we test that an initialization value can be passed successfully, but it's not clear whether that value was actually honored, e.g. by seeing whether choosing a good initial value results in faster convergence/tighter tolerance/etc.
Again, same problem for Metropolis and every other sampler that we use. And again, we can open up the box and test that we really are setting the values appropriately on initialization. We might want to test that everything breaks if each ensemble member is initialized at the same value.
First off, this does not seem to me to be a blocker for the PR going through--as @roualdes says, it's definitely a good step forward!--and I don't intend to single out this PR for particular scrutiny; it just happened to be what made me realize that we need to revisit the topic of testing strategy.
Also, I wasn't clear enough in my earlier comment. I'm not requesting even-stricter tests; I just want to make sure that each test adds new information. If we have several tests that don't capture different specific behavior of the subject-under-test, one option is to write them to test more specific details; another is to say "These are testing the same thing, maybe we'd be fine cutting some of them to save runtime." Both are valid responses, as long as we're satisfied with the resulting coverage.
Here, I'm hearing you say that for the affine-invariant sampler, there's no good theory for how changing the hyperparameters should affect overall algorithmic performance. To me that sounds like it means that even if we rerun the convergence test, we don't have an outcome to assert on afterwards. So I'm thinking there's no need to spend time rerunning it--I'd be happy to cut entirely the tests against a specific initialization point, initializing with different random generators, etc.
(That said, if we do want to test those things, rerunning the full integration test for them will take longer & probably give less new information than testing something more specific. So for the initialization point, instead of repeating the test for convergence, I'd pick a start point out in the boonies with a model centered at 0, take one draw, and assert that the result is within $APPROPRIATE_VALUE of the initialization; for the random generators, pass in a mock generator and assert that that mock was used; etc.)
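(For the initialization-point version, a hedged sketch of that one-draw check might look like the following; the constructor keyword `init`, the return type of `sample()`, and the tolerance constant are assumptions rather than the actual bayes-kit API:)

```python
import numpy as np

APPROPRIATE_VALUE = 10.0  # placeholder tolerance, not a derived bound


def test_initial_value_is_honored() -> None:
    far_away = 1_000.0                       # start point far from the mode at 0
    init = np.full(1, far_away)
    # Hypothetical constructor keyword; the real API may spell this differently.
    sampler = AffineInvariantWalker(model_dummy, init=init)
    theta = sampler.sample()                 # assumes sample() returns the draw itself
    # A single draw should still be near the initialization, not near the mode.
    assert np.abs(np.asarray(theta) - far_away).max() < APPROPRIATE_VALUE
```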
I do love the idea of testing for higher-order properties that any correct implementation ought to have--you mentioned ensuring correct distribution over walkers; hearing you mention it, that sounds like a great thing to check. For that specifically, we can mock _stretch_move with a function that captures the parameters it was called with, run 500 or so draws (which will be instantaneous, since we don't actually need to evaluate the densities), and then do some assertions on the distribution of parameters used to call it--I'm happy to help with setting this up, it'll be good for me too to figure out the mechanics. But again, I don't think that needs to be present for this PR to go through. (Similarly with checking for higher proposal variance with higher a--that sounds useful to add, and maybe something we can test either through a test against _draw_z, or with some clever mocking or a minor refactor, but I think that's more an "open an issue" thing than a blocker for the PR.)
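(For the mechanics of that parameter capture, one hedged option with unittest.mock is sketched below; the constructor call and the `_stretch_move` name/signature are assumptions about the code, and because this wraps the real method it still evaluates densities--replacing it outright would need a stand-in return value:)

```python
from unittest import mock

# Hypothetical sketch: the constructor arguments and the _stretch_move
# name/signature are assumptions, not the actual bayes_kit API.
sampler = AffineInvariantWalker(model_dummy)
with mock.patch.object(sampler, "_stretch_move", wraps=sampler._stretch_move) as spy:
    for _ in range(500):
        sampler.sample()
# spy.call_args_list records every call; assuming the walker index is the
# first positional argument, we can check its empirical distribution.
walker_indices = [call.args[0] for call in spy.call_args_list]
```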
Our whole test suite, including this one, runs in 13s on my desktop. Is this really a concern or can I just ignore this for now?
Basically, it's a real concern that you can ignore for now 😄 I'm just looking at the possible matrix of parameters and thinking that even if each call to run_sampling_tests takes (say) half a second, the overall runtime could get noticeable. (And again, my objection to that is in the context where we don't have a good theoretical basis to test for new outcomes from the additional integration test calls--I'm perfectly sanguine about taking the extra time if it gives us key information, but it sounds like it might not here.)
Should we start by testing the behavior of differently scaled or differently shaped proposals in Metropolis?
I think this gets into the tension between testing the algorithm and testing the implementation of the algorithm. For the purposes of testing the implementation, I don't think we need to go beyond testing that Metropolis calls what it's given; past that point, it seems like we'd be more testing the Metropolis algorithm itself rather than this implementation of it.
I think this is the key issue that @jsoules raised:
we'd be more testing the Metropolis algorithm itself rather than this implementation of it
I agree that we should test our implementations. And that what we have now mainly tests the algorithm.
We can test that our implementation does the right thing with changes in `a`, namely that it gives you a broader range of proposals. That's easy to check. And that's why we were talking about mock objects on the way to lunch. It does require refactoring the code so we can test the internal details of the algorithm. Otherwise, all we have is something we can test end-to-end. You can see that I just pushed the new tests for ESS and added the IAT functionality, where the IAT functionality is required for the ESS functionality. We can do the same thing here.
Part of what I was doing with all those calls is making sure that we have code coverage. The tests can be simplified.
Sounds like we're on the same page!
Re: doing the right thing with changes in a, would you consider hitting _draw_z for N draws each with M different values of a (well, _inv_sqrt_a and _sqrt_a) to be sufficient? Or would you prefer to test only through "public" parts of the API?
My understanding was that the z interpolation factor was pretty specific to this sampler and so isn't likely to have outside consumers if promoted to a public API endpoint, so I'm comfortable leaving the code as currently structured and just testing the private function if it avoids some of the refactoring. Also happy to go along with a bigger restructure if there's a practical or policy reason to do so--just looking for an appropriately minimized amount of effort here.
(To be clear, I'm imagining a test that in pseudocode says basically:
```python
import numpy as np


def test_variance_of_interpolation_factor_increases_as_a_increases() -> None:
    def draw_zs_for(a: float) -> list[float]:
        # instantiate an AffineInvariantWalker with the given a
        # (constructor arguments here are illustrative)
        sampler = AffineInvariantWalker(model_dummy, a=a)
        return [sampler._draw_z() for _ in range(100)]

    a_values = [1, 2, 5, 10, 100]
    variances = [np.var(draw_zs_for(a)) for a in a_values]
    assert variances[0] < variances[1] < variances[2] < variances[3] < variances[4]
```

Naturally we could get a lot more specific about the statistical properties we'd expect from `_draw_z()` and the tolerance to which we'd want those to be realized; this is just what I had in mind right now.)
I brought this up to date with the existing branch. All tests are passing and mypy strict passes. None of our samplers are tested up to the standard that @jsoules is suggesting in code review for this PR. If @WardBrian and @roualdes think that all the samplers should be tested to this level, I think we should start with our existing implementations as they're all more important than this one. Given that this is the PR where I fixed all the mypy errors, I'm stuck. I have the ESS/R-hat thing ready to go, but @WardBrian doesn't want me to put that in until it passes mypy. But without it, we can't do anything practical.
I do think it'd make sense to test the other internal functions, which I haven't done yet. And if we do need to test the internal algorithm details, then it'd make sense to break functions out of the internals that we can test independently.
I’m not very picky about fixing the type errors before anything is merged - in particular, we already have versions merged to
I think this is a good start for testing. Perfect? No, and that's ok. In line with Brian's comment, I think this PR is a good improvement over no affine invariant ensemble sampler and better than an untested implementation.

More specific tests on say I imagine a more strenuous test that might necessitate changing the default value of For instance, testing exclusively on 1d Std Normal models is not sufficiently validating this sampler, in my opinion. So maybe another test on say a 100d Normal, where decreasing

I'll start a new issue and make some suggestions for testing in general and maybe there can be some work along the lines of my suggestion in future PRs.

P.S. @jsoules I absolutely love the yellow-billed magpie photo you've got.
Given @WardBrian's comments about typing, I'll try to get the ESS updates in first, which will in turn make it easier to write a generic testing framework, which will in turn make it trivial to write individual sampler tests.
Add affine invariant ensemble sampler.
Cleanup typing.
0 mypy errors with all tests passing, with