Many NVIDIA GPUs have a (supposedly, kind-of) adjustable maximum amount of shared memory per block, with the default being 48 KiB or so. We currently do nothing to adjust this - and simply schedule the compiled kernel. We should, instead, try to arrange it so that our larger-shared-memory-utilization kernels are accepted and scheduled rather than rejected.