Initialization time is too long to try a scale-out run (# of GPUs > 2000).
When I was trying QWEN3 235B scale out run (2048 GPU-scale), it took 1 hour to see the following setup complete log.
============================================================
SETUP COMPLETE
============================================================
This is wasting 2048 GPU hours just for the initialization.
It is crucial to reduce this latency to initiate further scale-out performance study.