Description
PQ construction relies on this method:
static List<VectorFloat<?>> extractTrainingVectors(RandomAccessVectorValues ravv, ForkJoinPool parallelExecutor) {
    // limit the number of vectors we train on
    var P = min(1.0f, MAX_PQ_TRAINING_SET_SIZE / (float) ravv.size());
    var ravvCopy = ravv.threadLocalSupplier();
    return parallelExecutor.submit(() -> IntStream.range(0, ravv.size()).parallel()
                    .filter(i -> ThreadLocalRandom.current().nextFloat() < P)
                    .mapToObj(targetOrd -> {
                        var localRavv = ravvCopy.get();
                        VectorFloat<?> v = localRavv.getVector(targetOrd);
                        return localRavv.isValueShared() ? v.copy() : v;
                    })
                    .collect(Collectors.toList()))
            .join();
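}
This method of producing a list of vectors is not guaranteed to produce a list of size MAX_PQ_TRAINING_SET_SIZE or ravv.size(): each vector is kept independently with probability P, so the result size varies from run to run and can land above or below the target. How much of a problem is that?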
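To put a rough number on the variation, here is a back-of-the-envelope estimate. It assumes the per-element draws are independent and writes $N$ for ravv.size(), $M$ for MAX_PQ_TRAINING_SET_SIZE, and $K$ for the size of the returned list (these symbols are mine, not from the code):

```latex
\[
P = \min\!\left(1, \frac{M}{N}\right), \qquad K \sim \mathrm{Binomial}(N, P)
\]
\[
\mathbb{E}[K] = NP = M, \qquad
\operatorname{Var}(K) = NP(1-P) = M\left(1 - \frac{M}{N}\right)
\quad \text{for } N \ge M
\]
\[
N \gg M,\; M = 300: \quad \sigma_K \approx \sqrt{300} \approx 17.3
\]
```

So when $N \ge M$ the target is only hit on average, and with $M = 300$ the list size typically lands within a few tens of vectors of 300, on either side. When $N < M$, $P = 1$ and every vector is returned, so the size is exactly ravv.size().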
It also seems like we could skip the random filter: map all of the elements to a list, call Collections.shuffle() on it (which has linear time complexity), and take the first N values. I like the predictability of this approach, though it might not be efficient when ravv.size() >> MAX_PQ_TRAINING_SET_SIZE.
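One way to get the same predictability without materializing and shuffling every element is reservoir sampling (Algorithm R) over the ordinals. This is a different technique from the shuffle idea above, not something the current code does. The sketch below is only an illustration: the class name `TrainingSample` and the method `sampleOrdinals` are hypothetical, and it works on plain int ordinals rather than the real RandomAccessVectorValues type so that it stays self-contained.

```java
import java.util.Random;

// Hypothetical helper, not part of the library: picks ordinals to train on.
final class TrainingSample {
    /**
     * Returns exactly min(n, size) distinct ordinals from [0, size),
     * each subset equally likely, using reservoir sampling (Algorithm R).
     * Time is O(size); extra memory is O(n).
     */
    static int[] sampleOrdinals(int size, int n, Random rng) {
        int k = Math.min(n, size);
        int[] reservoir = new int[k];
        // Start with the first k ordinals.
        for (int i = 0; i < k; i++) {
            reservoir[i] = i;
        }
        // Each later ordinal i replaces a random slot with probability k / (i + 1).
        for (int i = k; i < size; i++) {
            int j = rng.nextInt(i + 1);
            if (j < k) {
                reservoir[j] = i;
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        int[] picked = sampleOrdinals(10_000, 300, new Random(42));
        System.out.println(picked.length); // prints 300
    }
}
```

The selection loop is sequential, but it only touches integers; fetching (and copying) the vectors for the chosen ordinals could presumably still run in parallel the way the existing mapToObj step does.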
Reproducing the issue
You can reproduce getting more than MAX_PQ_TRAINING_SET_SIZE vectors by setting MAX_PQ_TRAINING_SET_SIZE = 300 and then adding a size check that throws to the method:
static List<VectorFloat<?>> extractTrainingVectors(RandomAccessVectorValues ravv, ForkJoinPool parallelExecutor) {
    // limit the number of vectors we train on
    var P = min(1.0f, MAX_PQ_TRAINING_SET_SIZE / (float) ravv.size());
    var ravvCopy = ravv.threadLocalSupplier();
    var result = parallelExecutor.submit(() -> IntStream.range(0, ravv.size()).parallel()
                    .filter(i -> ThreadLocalRandom.current().nextFloat() < P)
                    .mapToObj(targetOrd -> {
                        var localRavv = ravvCopy.get();
                        VectorFloat<?> v = localRavv.getVector(targetOrd);
                        return localRavv.isValueShared() ? v.copy() : v;
                    })
                    .collect(Collectors.toList()))
            .join();
    if (result.size() > MAX_PQ_TRAINING_SET_SIZE) {
        throw new IllegalStateException("Got " + result.size() + " vectors, which is more than " + MAX_PQ_TRAINING_SET_SIZE);
    }
    return result;
}
Finally, run the TestProductQuantization test suite, which will fail with the IllegalStateException thrown by the size check.