Discrepancy between paper and code: Multi-GPU training support #139

@ZXYBUAA301

Description

Hi, thanks for sharing this interesting work and making the code available. I'm trying to reproduce the results reported in the paper, but I've encountered several issues that make it difficult to achieve the claimed performance. I'd appreciate your clarification on the following points:

The paper states:

"All the experiments are implemented with the PyTorch platform and trained/tested on 4 NVIDIA A100 GPUs."*

However, the current codebase does not appear to fully support multi-GPU training:

  • The TODO list includes an unchecked item: "Fix bugs in Multi-GPU parallel", suggesting known issues in distributed training.
  • The training script (train.py) relies on CUDA_VISIBLE_DEVICES and single-process execution, and does not use torch.distributed or DistributedDataParallel (DDP). This limits training to a single GPU or to the inefficient DataParallel mode.
  • There is no use of local_rank, DistributedSampler, or proper process group initialization (a minimal sketch of these missing pieces is included after this list).
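For reference, here is a minimal sketch of the kind of DDP setup I would expect for a 4-GPU run. The model and dataset below are placeholders standing in for the repository's actual network and 3D data loader, so the real integration into train.py would of course look different:

```python
# Sketch of the DDP pieces that seem to be missing from train.py.
# Model/dataset are placeholders, not the repository's actual classes.
# Launch with: torchrun --nproc_per_node=4 train_ddp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model and dataset.
    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    sampler = DistributedSampler(dataset)  # shards the data across the 4 processes
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # keep shuffling consistent across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()  # DDP all-reduces gradients during backward
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

None of these elements (process group initialization, per-rank device selection, DistributedSampler, the DDP wrapper) appear in the current training script, which is why I suspect the released code is not the version used for the 4-GPU experiments.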

Could you clarify:

  • Were the reported results indeed obtained using 4 A100 GPUs in a distributed setting?
  • If so, was a different (internal) version of the code used? If yes, could you release the fixed version or provide guidance on how to properly enable multi-GPU training?

Without a working multi-GPU setup, it's challenging to train at the scale described in the paper, especially for 3D medical data.
