Description
Training runs normally on a single A800 GPU. When I switch to 2 GPUs, however, it hangs at the model initialization step and reports no error. Output:
[2025-09-03 17:03:50,412] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 09-03 17:03:52 __init__.py:190] Automatically detected platform cuda.
INFO 09-03 17:03:52 __init__.py:190] Automatically detected platform cuda.
[2025-09-03 17:03:53,038] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-09-03 17:03:53,038] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-09-03 17:03:53,103] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-09-03 17:03:53,164] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-09-03 17:03:53,398] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
With print debugging, I found that it hangs at the line marked below:
if "Qwen2-VL" in model_id:
    model = Qwen2VLForConditionalGeneration.from_pretrained(model, **model_init_kwargs)  # <---- hangs here
elif "Qwen2.5-VL" in model_id:
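
To narrow this down further than prints allow, one option is a watchdog that dumps every thread's Python stack if a block runs too long. This is only a minimal sketch using the standard-library `faulthandler` module. The helper name `hang_watchdog` and the timeout value are made up for illustration and are not part of the training code.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=120.0, stream=None):
    """Dump all thread stacks to `stream` if the wrapped block exceeds timeout_s.

    Hypothetical helper for debugging: `stream` must be a real file object
    (it needs a file descriptor); it defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    # repeat=True keeps dumping every timeout_s seconds while the block is stuck
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)
    try:
        yield
    finally:
        # Always disarm the watchdog once the block finishes or raises
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the suspect call, for example `with hang_watchdog(120): model = Qwen2VLForConditionalGeneration.from_pretrained(model, **model_init_kwargs)`, should print a traceback from each rank showing the frame it is stuck in inside `from_pretrained`. Launching with the environment variable `NCCL_DEBUG=INFO` may additionally show whether the ranks are waiting on NCCL communication.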