Contributor

@akoumpa akoumpa commented Oct 4, 2025

The automodel CLI enables users to run jobs interactively or in batch mode (e.g., on SLURM or k8s) based on the contents of a YAML config. Weights & Biases (W&B) is often used to log metrics (such as loss and gradient_norm), and it also stores the config that was used to launch a job.

This PR introduces two more commands to the automodel CLI:

  • replicate
  • resume

Running automodel replicate <wandb_job_url> launches the same job again on the user's hardware. This is useful for collaboration and for debugging, since uv or Docker can be used to reproduce the exact same environment.

Finally, automodel resume <wandb_job_url> lets users resume a previously run job, assuming that the checkpoints for the job are still accessible.
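
For illustration only, here is a minimal sketch of how a W&B URL could be turned back into a launch config. It is not the code in this PR. It assumes the URL has the usual W&B run shape, https://wandb.ai/<entity>/<project>/runs/<run_id>, and the helper names parse_wandb_run_url and fetch_run_config are hypothetical. The only W&B calls used are wandb.Api().run(path) and run.config from the W&B public API.

```python
from urllib.parse import urlparse


def parse_wandb_run_url(url: str) -> tuple[str, str, str]:
    """Split a W&B run URL into (entity, project, run_id).

    Hypothetical helper: assumes the path looks like
    /<entity>/<project>/runs/<run_id>, optionally followed by more segments.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 4 or parts[2] != "runs":
        raise ValueError(f"not a W&B run URL: {url!r}")
    entity, project, _, run_id = parts[:4]
    return entity, project, run_id


def fetch_run_config(url: str) -> dict:
    """Return the config logged with the run (needs network access and wandb)."""
    # Imported lazily so that URL parsing works without wandb installed.
    import wandb

    entity, project, run_id = parse_wandb_run_url(url)
    run = wandb.Api().run(f"{entity}/{project}/{run_id}")
    return dict(run.config)
```

The dictionary returned by fetch_run_config could then be written out as YAML and passed to the usual launch path. How the CLI actually does this, and how resume locates the checkpoints, is defined by the PR itself and not by this sketch.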

Signed-off-by: Alexandros Koumparoulis <[email protected]>

copy-pr-bot bot commented Oct 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

akoumpa changed the title from "feat: easy job replication from W&B" to "feat: easy job replication from W&B job url" on Oct 4, 2025