Without an exact match, it is difficult to guarantee that we haven't introduced small errors.
Some ideas about why there could be differences:
- The result is compiler-dependent and the baseline comes from gcc
- The result depends on the number of ranks, but we have just one for all tests
- We may need the config REPRODUCIBLE_REDUCTIONS=true
- We may need 'math_uniform' in the GPU run
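
To illustrate the number-of-ranks and reduction-order points, here is a minimal Python sketch, not taken from our code. It shows that regrouping a floating-point sum (as a different rank count would) can change the last bits of the result. The helper names (`naive_sum`, `reduce_over_ranks`) and the `1e-12` tolerance are hypothetical, chosen only for this example.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point accumulation, with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


def reduce_over_ranks(chunks):
    """Sum each chunk separately (one chunk per 'rank'), then sum the partials."""
    return naive_sum([naive_sum(chunk) for chunk in chunks])


# Same three values, grouped as one rank vs. two ranks.
one_rank = reduce_over_ranks([[0.1, 0.2, 0.3]])
two_ranks = reduce_over_ranks([[0.1], [0.2, 0.3]])

print(one_rank == two_ranks)  # False: the grouping changes the rounding
# Assumed tolerance, for illustration only.
print(math.isclose(one_rank, two_ranks, rel_tol=1e-12))  # True
```

If the differences we see are of this kind, a tolerance-based comparison would pass where an exact match fails. It would not, by itself, rule out a real small error hiding below the tolerance.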