
Conversation

@bob-carpenter (Collaborator) commented Jan 27, 2023

Add affine invariant ensemble sampler.

Clean up typing.

0 mypy errors with all tests passing, with

  • numpy 1.24.1
  • mypy 0.991

@bob-carpenter changed the title from "Affine invariant walker v1" to "affine invariant sampler + type cleanup" on Jan 27, 2023
@codecov-commenter commented Jan 27, 2023

Codecov Report

Merging #10 (6d32f46) into main (a8e4f60) will decrease coverage by 0.89%.
The diff coverage is 89.47%.

@@            Coverage Diff             @@
##             main      #10      +/-   ##
==========================================
- Coverage   94.19%   93.30%   -0.89%     
==========================================
  Files           9        9              
  Lines         224      269      +45     
==========================================
+ Hits          211      251      +40     
- Misses         13       18       +5     
Impacted Files          Coverage Δ
bayes_kit/rwm.py        91.66% <ø> (ø)
bayes_kit/ess.py        90.74% <83.33%> (ø)
bayes_kit/ensemble.py   90.00% <89.36%> (-10.00%) ⬇️
bayes_kit/__init__.py   100.00% <100.00%> (ø)
bayes_kit/rhat.py       100.00% <100.00%> (ø)


@bob-carpenter (Collaborator Author)

@WardBrian: This one's not quite ready for re-review until I get test coverage up to 100%. Is there a way to see what the coverage is? I already know that I'm not testing things like all the paths through RNG creation or initialization yet. I want to get this class for ensemble sampling squared away and then I'll go back and get all the other ones up to form with doc and tests.

@WardBrian (Collaborator)

For test coverage, I would look here: https://app.codecov.io/gh/flatironinstitute/bayes-kit/blob/affine-invariant-walker-v1/bayes_kit/ensemble.py

That will get updated every time CI is run on this branch.

Agreed it’s good to finish things. I’m a bit hesitant about going all in on docstrings right now, especially if we think the names of things or the structure of the APIs could still evolve. Docstrings are much more annoying to keep updated than types and tests, in my experience, since even Sphinx won’t warn you if the docstring doesn’t match the actual arguments of the function.

@bob-carpenter (Collaborator Author)

I’m a bit hesitant about going all in on docstrings right now

I'm in favor of getting the details into doc sooner rather than later, for several reasons, in decreasing order of importance.

  1. I find it very helpful to document a function thoroughly to help work through boundary and exception conditions.

  2. It's useful for users who want to try the API.

  3. It sends the signal to users and potential devs that we're serious about the software.

  4. It's just as quick for me to add thorough doc as it is to think each time about how little doc I can get away with.

I agree that there's debt to pay from thorough doc if the code changes. If the code changes but not the doc, that's even worse than no doc. I don't imagine this is going to be so volatile that we won't be able to keep up, but time will tell.

"""
return self.sample()

def draw_z(self) -> Sample:
Collaborator

  • What's a z?

  • Should we prefix "private" methods with an underscore, as we do with fields? Doing so can make it more obvious what's the intended API.

@bob-carpenter (Collaborator Author), Feb 9, 2023

z is the name of the variable in the paper.

The doc explains how the function works:

Return a random draw of `z` in `(1/a, a)` with `p(z) propto 1 / sqrt(z)`

The usage shows that it's an interpolation ratio:

        z = self.draw_z()
        theta_star: NDArray[np.float64] = theta_j + z * (theta_k - theta_j)

Are you asking indirectly what it is because you think it should have a more descriptive variable name?

I'm OK prefixing internal methods with an underscore. I wasn't even aware of that convention. Please bear with me for a couple of years as I come up to speed on Python!
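
For concreteness, a draw with that density can be made by inverting the CDF. Here's a minimal sketch under that assumption; it's illustrative only and isn't necessarily how `_draw_z` is implemented in bayes_kit:

import numpy as np

def draw_z(a: float, rng: np.random.Generator) -> float:
    """Draw z in [1/a, a] with density proportional to 1/sqrt(z), by inverse CDF."""
    sqrt_a = np.sqrt(a)
    inv_sqrt_a = 1.0 / sqrt_a
    # The CDF is (sqrt(z) - 1/sqrt(a)) / (sqrt(a) - 1/sqrt(a)) on [1/a, a];
    # solving CDF(z) = u for z gives the expression below.
    u = rng.uniform()
    return (inv_sqrt_a + u * (sqrt_a - inv_sqrt_a)) ** 2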

sampler = AffineInvariantWalker(model_dummy)


def test_aiw_std_normal() -> None:
@jsoules (Collaborator), Feb 9, 2023

I appreciate the elegance of delegating to the actual sampling test, however I think there are two potential issues:

  • This can get a bit burdensome in test execution time; and
  • it's not clear that the "sampler worked" base test is the right level of granularity for the features being tested.

On the second point:

  • If the run_sampling_test call passes with the sampler's default parameters, it isn't clear that rerunning it with different parameters can capture how those parameters have changed the underlying behavior. Like, would specifying a different value for a be expected to produce a tighter tolerance, convergence after fewer iterations, etc.? For another example, as written, we test that an initialization value can be passed successfully, but it's not clear whether that value was actually honored, e.g. by seeing whether choosing a good initial value results in faster convergence/tighter tolerance/etc. Essentially, if the tolerances in run_sampling_test are loose enough, that function might pass cases where the desired behavior from the extra sampler parameters isn't actually implemented.
  • Conversely, we may not need to run even 10k iterations to confirm that the desired behavior is being honored. For instance, when setting the number of walkers, it may be enough to construct the sampler object and then assert on the dimensions of its _thetas member, as sketched at the end of this comment (along with maybe some tests for e.g. stretch_move specifically, to make sure that function is actually using all available walkers appropriately).

I'm not proposing to test a bunch of volatile implementation details, but piercing the veil a little bit might lead to more specific tests and a faster test suite overall.
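
For the _thetas shape check mentioned above, I'm imagining something like the following sketch (the walkers constructor argument and the dims() accessor are guesses at names, not the actual bayes_kit API):

import numpy as np

def test_num_walkers_sets_ensemble_shape() -> None:
    # 'walkers' is a hypothetical argument name; substitute whatever the class actually uses.
    sampler = AffineInvariantWalker(model_dummy, walkers=10)
    thetas = np.asarray(sampler._thetas)
    # Expect one row per walker and one column per model dimension.
    assert thetas.shape == (10, model_dummy.dims())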

Collaborator Author

I'm not entirely sure what you're asking me to do here.

this can get a bit burdensome in test execution time

Our whole test suite, including this one, runs in 13s on my desktop. Is this really a concern or can I just ignore this for now?

it's not clear that the "sampler worked" base test is the right level of granularity
If the run_sampling_test call passes with the sampler's default parameters, it isn't clear that rerunning it with different parameters can capture how those parameters have changed the underlying behavior.

There's no good theory around this to test in the affine-invariant case. The hyperparameter a is similar to the proposal scale in random-walk Metropolis. There will be an optimal value. In many cases we know what that is for Metropolis but there's not much theory around the affine-invariant ensemble methods. Should we start by testing the behavior of differently scaled or differently shaped proposals in Metropolis?

If we're willing to open up the box and test internals, I could rewrite parts of the code to expose the internal steps as standalone functions rather than leaving them inline. For example, we could test that the variance of proposals is higher with larger a.

assert on the dimensions of its _thetas member (along with maybe some tests for e.g. stretch_move specifically, to make sure that function is actually using all available walkers appropriately).

That won't test that they're being used appropriately. Everything's stochastic all the way down, in that updating a single walker involves selecting a random complementary walker, proposing, and then doing an MH accept/reject step. I'm not sure how to test things like whether we have a uniform distribution over all complementary walkers. Again, we'd have to decompose the sampler into finer-grained pieces and test those.

For another example, as written, we test that an initialization value can be passed successfully, but it's not clear whether that value was actually honored, e.g. by seeing whether choosing a good initial value results in faster convergence/tighter tolerance/etc.

Again, same problem for Metropolis and every other sampler that we use. And again, we can open up the box and test that we really are setting the values appropriately on initialization. We might want to test that everything breaks if each ensemble member is initialized at the same value.

Collaborator

First off, this does not seem to me to be a blocker for the PR going through--as @roualdes says, it's definitely a good step forward!--and I don't intend to single out this PR for particular scrutiny; it just happened to be what made me realize that we need to revisit the topic of testing strategy.

Also, I wasn't clear enough in my earlier comment. I'm not requesting even-stricter tests; I just want to make sure that each test adds new information. If we have several tests that don't capture different specific behavior of the subject-under-test, one option is to write them to test more specific details; another is to say "These are testing the same thing, maybe we'd be fine cutting some of them to save runtime." Both are valid responses, as long as we're satisfied with the resulting coverage.

Here, I'm hearing you say that for the affine-invariant sampler, there's no good theory for how changing the hyperparameters should affect overall algorithmic performance. To me that sounds like it means that even if we rerun the convergence test, we don't have an outcome to assert on afterwards. So I'm thinking there's no need to spend time rerunning it--I'd be happy to cut entirely the tests against a specific initialization point, initializing with different random generators, etc.

(That said, if we do want to test those things, rerunning the full integration test for them will take longer & probably give less new information than testing something more specific. So for the initialization point, instead of repeating the test for convergence, I'd pick a start point out in the boonies with a model centered at 0, take one draw, and assert that the result is within $APPROPRIATE_VALUE of the initialization; for the random generators, pass in a mock generator and assert that that mock was used; etc.)

I do love the idea of testing for higher-order properties that any correct implementation ought to have--you mentioned ensuring correct distribution over walkers; hearing you mention it, that sounds like a great thing to check. For that specifically, we can mock _stretch_move with a function that captures the parameters it was called with, run 500 or so draws (which will be instantaneous, since we don't actually need to evaluate the densities), and then do some assertions on the distribution of parameters used to call it--I'm happy to help with setting this up, it'll be good for me too to figure out the mechanics. But again, I don't think that needs to be present for this PR to go through. (Similarly with checking for higher proposal variance with higher a--that sounds useful to add, and maybe something we can test either through a test against _draw_z, or with some clever mocking or a minor refactor, but I think that's more an "open an issue" thing than a blocker for the PR.)
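
Roughly what I have in mind for the walker-distribution check, with the caveat that the call signature of _stretch_move and how sample() delegates to it are assumptions here rather than facts about the current code:

from unittest.mock import MagicMock

def test_sampling_selects_every_walker() -> None:
    sampler = AffineInvariantWalker(model_dummy)
    # Wrap the real method so sampling still works, but record every call made to it.
    spy = MagicMock(wraps=sampler._stretch_move)
    sampler._stretch_move = spy
    for _ in range(500):
        sampler.sample()
    # Assumes the index of the walker being updated is the first positional argument.
    updated = [call.args[0] for call in spy.call_args_list]
    assert set(updated) == set(range(len(sampler._thetas)))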

Our whole test suite, including this one, runs in 13s on my desktop. Is this really a concern or can I just ignore this for now?

Basically, it's a real concern that you can ignore for now 😄 I'm just looking at the possible matrix of parameters and thinking that even if each call to run_sampling_tests takes (say) half a second, the overall runtime could get noticeable. (And again, my objection to that is in the context where we don't have a good theoretical basis to test for new outcomes from the additional integration test calls--I'm perfectly sanguine about taking the extra time if it gives us key information, but it sounds like it might not here.)

Should we start by testing the behavior of differently scaled or differently shaped proposals in Metropolis?

I think this gets into the tension between testing the algorithm and testing the implementation of the algorithm. For the purposes of testing the implementation, I don't think we need to go beyond testing that Metropolis calls what it's given; past that point, it seems like we'd be more testing the Metropolis algorithm itself rather than this implementation of it.

Collaborator Author

I think this is the key issue that @jsoules raised:

we'd be more testing the Metropolis algorithm itself rather than this implementation of it

I agree that we should test our implementations, and that what we have now mainly tests the algorithm.

We can test that our implementation does the right thing with changes in a, namely that larger a gives you a broader range of proposals. That's easy to check, and that's why we were talking about mock objects on the way to lunch. It does require refactoring the code so we can test the internal details of the algorithm; otherwise, all we have is something we can test end-to-end. You can see that I just pushed the new tests for ESS and added the IAT functionality, where the IAT functionality is required for the ESS functionality. We can do the same thing here.

Part of what I was doing with all those calls is making sure that we have code coverage. The tests can be simplified.

Collaborator

Sounds like we're on the same page!

Re: doing the right thing with changes in a, would you consider hitting _draw_z for N draws each with M different values of a (well, _inv_sqrt_a and _sqrt_a) to be sufficient? Or would you prefer to test only through "public" parts of the API?

My understanding was that the z interpolation factor was pretty specific to this sampler and so isn't likely to have outside consumers if promoted to a public API endpoint, so I'm comfortable leaving the code as currently structured and just testing the private function if it avoids some of the refactoring. Also happy to go along with a bigger restructure if there's a practical or policy reason to do so--just looking for an appropriately minimized amount of effort here.

(To be clear, I'm imagining a test that in pseudocode says basically:

def test_variance_of_interpolation_factor_increases_as_a_increases() -> None:
    def test_an_a(a: float) -> list[float]:
        sampler = ...  # instantiate an AffineInvariantWalker with the given a
        return [sampler._draw_z() for _ in range(100)]

    a_values = [1, 2, 5, 10, 100]
    zs = [test_an_a(a) for a in a_values]
    # Do something to fix dimensions, then
    assert np.var(zs[0]) < np.var(zs[1]) < np.var(zs[2]) < np.var(zs[3]) < np.var(zs[4])

Naturally we could get a lot more specific about the statistical properties we'd expect from _draw_z() and the tolerance to which we'd want those to be realized, this is just what I had in mind right now.)

@WardBrian mentioned this pull request Feb 17, 2023
@bob-carpenter (Collaborator Author)

I brought this up to date with the existing branch. All tests are passing and mypy strict passes.

None of our samplers are tested up to the standard that @jsoules is suggesting in code review for this PR. If @WardBrian and @roualdes think that all the samplers should be tested to this level, I think we should start with our existing implementations as they're all more important than this one.

Given that this is the PR where I fixed all the mypy errors, I'm stuck. I have the ESS/R-hat thing ready to go, but @WardBrian doesn't want me to put that in until it passes mypy. But without it, we can't do anything practical.

@bob-carpenter (Collaborator Author)

I do think it'd make sense to test the other internal functions, which I haven't done yet. And if we do need to test the internal algorithm details, then it'd make sense to break functions out of the internals that we can test independently.

@WardBrian (Collaborator)

I’m not very picky about fixing the type errors before anything is merged - in particular, we already have versions merged to main which are both wrong and have type errors, so any marginal improvement is fine by me as long as we eventually end up in a nice state.

@roualdes (Collaborator)

I think this is a good start for testing. Perfect? No, and that's ok. In line with Brian's comment, I think this PR is a good improvement over no affine invariant ensemble sampler and better than an untested implementation.

More specific tests on say init or seed or num_walkers wouldn't be bad and I wouldn't turn any such tests down, but I'm certainly more concerned about the general validity of the implementation.

I imagine a more strenuous test that might necessitate changing the default value of a could be beneficial for validating this sampler. Though this certainly won't lead to a faster test suite.

For instance, testing exclusively on 1d Std Normal models does not sufficiently validate this sampler, in my opinion. So maybe another test on, say, a 100d Normal, where decreasing a might improve the sampling efficiency, could help clear up whether or not this sampler works more generally and simultaneously test the argument a.
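
As a rough sketch of the kind of test I mean (the model protocol, the constructor arguments, the shape of sample()'s return value, and the tolerances are all assumptions that would need to be adapted to the actual code):

import numpy as np

class StdNormal100:
    def dims(self) -> int:
        return 100

    def log_density(self, theta: np.ndarray) -> float:
        return -0.5 * float(theta @ theta)

def test_aiw_100d_std_normal() -> None:
    sampler = AffineInvariantWalker(StdNormal100(), a=1.25)  # smaller a than usual
    # Assumes each call to sample() returns a single parameter vector.
    draws = np.array([sampler.sample() for _ in range(10_000)])
    assert np.allclose(draws.mean(axis=0), 0.0, atol=0.1)
    assert np.allclose(draws.var(axis=0), 1.0, atol=0.2)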

I'll start a new issue and make some suggestions for testing in general and maybe there can be some work along the lines of my suggestion in future PRs.

P.S. @jsoules I absolutely love the yellow-billed magpie photo you've got.

@bob-carpenter (Collaborator Author)

Given @WardBrian's comments about typing, I'll try to get the ESS updates in first, which will in turn make it easier to write a generic testing framework, which will in turn make it trivial to write individual sampler tests.

