Hello,
I encountered several issues while trying to reproduce the results:
- Evaluation Dataset
The dataset required for evaluation is not provided.
The evaluation path is hard-coded to a location on the author's local machine, so evaluation cannot be run after training.
Could you please provide the dataset or a valid path?
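For reference, even a simple command-line argument or environment variable would unblock evaluation. This is only an illustrative sketch; the argument name and variable are hypothetical, not taken from the repository:

```python
# Illustrative only: --eval_data_path / EVAL_DATA_PATH are hypothetical names,
# not the repository's actual options.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    "--eval_data_path",
    default=os.environ.get("EVAL_DATA_PATH"),
    help="Path to the evaluation dataset (instead of the hard-coded local path).",
)
args = parser.parse_args()
if args.eval_data_path is None:
    raise ValueError("Please provide --eval_data_path or set EVAL_DATA_PATH.")
```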
- Training Instability
When running the experiments, the training loss consistently drops to 0.
In GRPO, the Advantage term always becomes 0 (see huggingface/open-r1#239 (comment)), but training can still proceed because the KL divergence term is non-zero.
However, in Visual RFT, the KL divergence coefficient is set to 0, which makes the training dynamics unclear.
I am unsure if this is intended or a mistake.
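To make the concern concrete, here is a minimal sketch (not the repo's actual code) of the simplified GRPO per-token loss; the tensor shapes and epsilon value are assumptions for illustration. It shows that when every completion in a group receives the same reward (so the normalized advantages are 0) and the KL coefficient beta is 0, the loss and its gradient are exactly 0:

```python
# Minimal sketch of a simplified GRPO per-token loss, for illustration only.
import torch

def grpo_loss(per_token_logps, old_per_token_logps, ref_per_token_logps,
              advantages, beta=0.0, eps=0.2):
    # Importance ratio between the current and old policy.
    ratio = torch.exp(per_token_logps - old_per_token_logps)
    # Clipped policy-gradient term, weighted by the group-relative advantage.
    pg1 = ratio * advantages
    pg2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    pg_loss = -torch.min(pg1, pg2)
    # k3 estimator of the KL to the reference policy, scaled by beta.
    kl = (torch.exp(ref_per_token_logps - per_token_logps)
          - (ref_per_token_logps - per_token_logps) - 1)
    return (pg_loss + beta * kl).mean()

# All-zero advantages (identical rewards within the group) and beta = 0
# give a loss of 0, which matches the behavior observed in training.
logps = torch.zeros(4, 8)
zero_adv = torch.zeros(4, 8)
print(grpo_loss(logps, logps, logps, zero_adv, beta=0.0))  # prints a zero loss
```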
Could you please share the missing hyperparameters or supplementary material that includes detailed training logs? This would help in reproducing and validating the reported results.
Thank you!