Hi, thanks for sharing this interesting work and making the code available. I'm trying to reproduce the results reported in the paper, but I've encountered several issues that make it difficult to achieve the claimed performance. I'd appreciate your clarification on the following points:
The paper states:

> "All the experiments are implemented with the PyTorch platform and trained/tested on 4 NVIDIA A100 GPUs."
However, the current codebase does not appear to fully support multi-GPU training:
- The TODO list includes an unchecked item, "Fix bugs in Multi-GPU parallel", suggesting known issues in distributed training.
- The training script (`train.py`) relies on `CUDA_VISIBLE_DEVICES` and single-process execution; it does not use `torch.distributed` or `DistributedDataParallel` (DDP). This limits training to a single GPU or the inefficient `DataParallel` mode (see the sketch after this list).
- There is no use of `local_rank`, `DistributedSampler`, or proper process group initialization.
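For reference, below is a minimal sketch of what I would expect a 4-GPU DDP setup to look like when launched with `torchrun`. The model, dataset, and script name are placeholders of my own, not the repo's actual classes; the repo's 3D network and data pipeline would need to be dropped in.

```python
# Minimal DDP training sketch (hypothetical names, not this repo's API).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model and dataset; replace with the repo's actual 3D model/dataset.
    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    sampler = DistributedSampler(dataset)  # shards the data across processes
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # ensures a different shuffle each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()  # DDP all-reduces gradients across GPUs here
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=4 train_ddp.py` (filename hypothetical). If the released `train.py` was meant to be run this way, it would be very helpful to document it.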
Could you clarify:
- Were the reported results indeed obtained using 4 A100 GPUs in a distributed setting?
- If so, was a different (internal) version of the code used? Could you release that version, or provide guidance on how to properly enable multi-GPU training with the current code?
Without a working multi-GPU setup, it's challenging to train at the scale described in the paper, especially for 3D medical data.