Run is the primary unit of workload in dstack. Users can:

- Submit a run using `dstack apply` or the API.
- Stop a run using `dstack stop` or the API.
Runs are created from run configurations. There are three types of run configurations:

- `dev-environment` — runs a VS Code server.
- `task` — runs the user's bash script until completion.
- `service` — runs the user's bash script and exposes a port through `dstack-proxy`.
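For illustration, minimal configurations of each type might look like the fragments below. Field names follow dstack's public configuration format, but treat the exact values (IDE, commands, port) as placeholders:

```yaml
# Illustrative minimal examples of the three configuration types
type: dev-environment
ide: vscode
---
type: task
commands:
  - python train.py
---
type: service
port: 8000
commands:
  - python serve.py
```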
A run can spawn one or multiple jobs, depending on the configuration. A task that specifies multiple nodes spawns a job for every node (a multi-node task). A service that specifies multiple replicas spawns a job for every replica. A job submission is always assigned to one particular instance. If a job fails and the configuration allows retrying, the server creates a new job submission for the job.
- STEP 1: The user submits the run. `services.runs.submit_run` creates jobs with status `SUBMITTED`. The run starts in `SUBMITTED`.
- STEP 2: `RunPipeline` continuously processes unfinished runs.
    - For active runs, it derives the run status from the latest job states in priority order:
        - If any non-retryable failure is present, the run becomes `TERMINATING` with the relevant `RunTerminationReason`.
        - If `stop_criteria == MASTER_DONE` and the master job is done, the run becomes `TERMINATING` with `ALL_JOBS_DONE`.
        - Otherwise, if any job is `RUNNING`, the run becomes `RUNNING`.
        - Otherwise, if any job is `PROVISIONING` or `PULLING`, the run becomes `PROVISIONING`.
        - Otherwise, if jobs are still waiting for placement or provisioning, the run stays `SUBMITTED`.
        - Otherwise, if all contributing jobs are `DONE`, the run becomes `TERMINATING` with `ALL_JOBS_DONE`.
        - Otherwise, if no active replicas remain and the run should be retried, the run becomes `PENDING`.
    - Retryable replica failures are handled before the final transition is applied:
        - If a replica fails with a retryable reason while other replicas are still active, `RunPipeline` creates a new `SUBMITTED` submission for that replica and terminates the old jobs in that replica.
        - If all remaining work is retryable, the run ends up in `PENDING`.
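The priority order above can be sketched as a single function. This is a simplified model of the behavior described in STEP 2, not dstack's actual `RunPipeline` code; the status strings and the `retryable` flag are illustrative:

```python
def derive_run_status(jobs, stop_criteria=None, master_done=False):
    """Derive a run status from its jobs' latest states, in priority order.

    Simplified sketch of the STEP 2 rules above; the real logic lives in
    dstack's RunPipeline and covers more cases.
    """
    statuses = {j["status"] for j in jobs}
    # Highest priority: any non-retryable failure terminates the run.
    if any(j["status"] == "FAILED" and not j.get("retryable") for j in jobs):
        return "TERMINATING"  # with the relevant RunTerminationReason
    if stop_criteria == "MASTER_DONE" and master_done:
        return "TERMINATING"  # reason: ALL_JOBS_DONE
    if "RUNNING" in statuses:
        return "RUNNING"
    if statuses & {"PROVISIONING", "PULLING"}:
        return "PROVISIONING"
    if "SUBMITTED" in statuses:
        return "SUBMITTED"  # jobs still waiting for placement or provisioning
    if statuses == {"DONE"}:
        return "TERMINATING"  # reason: ALL_JOBS_DONE
    return "PENDING"  # no active replicas remain; the run should be retried
```

Evaluating the branches top to bottom is what makes the order a priority: a single `RUNNING` job keeps the run `RUNNING` even if sibling jobs are already `DONE`.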
- STEP 3: If the run is `PENDING`, `RunPipeline` processes it in the pending phase.
    - For retrying runs, it waits for an exponential backoff before resubmitting.
    - For scheduled runs, it waits until `next_triggered_at`.
    - For scaled-to-zero services, it can keep the run in `PENDING` until autoscaling wants replicas again.
    - Once the run is ready to continue, `RunPipeline` creates new `SUBMITTED` jobs and moves the run back to `SUBMITTED`.
- STEP 4: If the run is `TERMINATING`, `RunPipeline` marks active jobs as `TERMINATING` and assigns the corresponding `JobTerminationReason`.
- STEP 5: Once all jobs are finished, the terminating phase of `RunPipeline` either:
    - assigns the final run status (`TERMINATED`, `DONE`, or `FAILED`), or
    - for scheduled runs that were not stopped or aborted by the user, returns the run to `PENDING` and computes a new `next_triggered_at`.
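The exponential backoff mentioned in STEP 3 can be sketched as below. The base delay and cap are illustrative constants, not dstack's actual values:

```python
def retry_delay(attempt: int, base: float = 15.0, cap: float = 600.0) -> float:
    """Delay (seconds) before resubmitting a PENDING run.

    Sketch of exponential backoff: the wait doubles with each failed
    attempt until it hits a cap. `base` and `cap` are made-up constants.
    """
    return min(base * (2 ** attempt), cap)
```

So consecutive retries would wait 15 s, 30 s, 60 s, and so on, never exceeding the 10-minute cap.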
Services' run lifecycle has some modifications:

- During STEP 1, the service itself is registered on the gateway or the in-server proxy. If the gateway is not accessible or the domain name is taken, submission fails.
- During STEP 2, active run processing also computes desired replica counts from gateway stats and handles scale-up, scale-down, rolling deployment, and cleanup of removed replica groups.
- During STEP 2, jobs already marked `SCALED_DOWN` do not contribute to the run status.
- During STEP 3, a service can stay in `PENDING` when autoscaling currently wants zero replicas.
- During STEP 5, the terminating phase of `RunPipeline` unregisters the service from the gateway.
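Deriving a desired replica count from gateway stats can be sketched as target-tracking autoscaling. This is an assumption about the general shape of the computation, not dstack's actual policy; the parameter names are hypothetical:

```python
import math

def desired_replicas(rps: float, target_rps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Desired replica count from observed request rate (sketch).

    Target-tracking autoscaling: enough replicas so that each one serves
    at most `target_rps_per_replica`, clamped to the configured bounds.
    """
    if rps <= 0:
        # No traffic: scale-to-zero services may have min_replicas == 0.
        return min_replicas
    want = math.ceil(rps / target_rps_per_replica)
    return max(min_replicas, min(want, max_replicas))
```

With `min_replicas: 0`, a quiet service scales to zero and the run stays `PENDING` until traffic returns, matching the STEP 3 modification above.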
dstack retries the run only if:

- The configuration enables `retry`.
- The job termination reason is covered by `retry.on_events`.
- The `retry.duration` is not exceeded.
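All three conditions must hold; a sketch of the decision, using dictionaries instead of the server's actual models (the `retry` field shape mirrors the configuration format, with `duration` taken here in seconds):

```python
from datetime import datetime, timedelta, timezone

def should_retry(config: dict, termination_reason: str,
                 submitted_at: datetime) -> bool:
    """Whether a failed job warrants a new submission (simplified sketch)."""
    retry = config.get("retry")
    if not retry:
        return False  # retry is not enabled
    if termination_reason not in retry["on_events"]:
        return False  # this failure kind is not covered
    elapsed = datetime.now(timezone.utc) - submitted_at
    return elapsed <= timedelta(seconds=retry["duration"])
```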
- STEP 1: A newly submitted job has status `SUBMITTED`. It is not assigned to any instance yet.
- STEP 2: `JobSubmittedPipeline` tries to assign an existing instance or provision new capacity.
    - On success, the job becomes `PROVISIONING`.
    - On failure, the job becomes `TERMINATING`. `JobTerminatingPipeline` later assigns the final failed status.
- STEP 3: `JobRunningPipeline` processes `PROVISIONING`, `PULLING`, and `RUNNING` jobs.
    - While `dstack-shim`/`dstack-runner` is not responding, the job stays `PROVISIONING`.
    - Once `dstack-shim` (for VM-featured backends) becomes available, the pipeline submits the image and the job becomes `PULLING`.
    - Once `dstack-runner` inside the container becomes available, the pipeline uploads the code and job spec, and the job becomes `RUNNING`.
    - While the job is `RUNNING`, the pipeline keeps collecting logs and runner status.
    - If startup, runner communication, or replica registration fails, the job becomes `TERMINATING`.
- STEP 4: Once the job is actually ready, `JobRunningPipeline` initializes probes.
- STEP 5: `JobTerminatingPipeline` processes `TERMINATING` jobs.
    - If the job has `remove_at` in the future, it waits. This gives the job time for a graceful stop.
    - Once `remove_at` is in the past, it stops the container, detaches volumes, unregisters service replicas if needed, and releases the instance assignment.
    - If some volumes are not detached yet, the job stays `TERMINATING` and is retried.
    - When cleanup is complete, the job becomes `TERMINATED`, `DONE`, `FAILED`, or `ABORTED` based on `JobTerminationReason`.
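One pass of the terminating phase in STEP 5 can be sketched as below. This models only the behavior described above; the field names, the reason strings, and the reason-to-status mapping are illustrative, not dstack's actual `JobTerminatingPipeline` internals:

```python
from datetime import datetime, timezone

def process_terminating_job(job: dict, now: datetime) -> str:
    """One pass over a TERMINATING job (simplified sketch).

    Returning "TERMINATING" means the pipeline will retry this job on
    its next pass rather than finalize it now.
    """
    if job["remove_at"] > now:
        return "TERMINATING"  # graceful-stop window still open: wait
    if not job["volumes_detached"]:
        return "TERMINATING"  # cleanup incomplete: retried next pass
    # Cleanup done: the final status follows from the termination reason.
    reason_to_status = {
        "done_by_runner": "DONE",
        "terminated_by_user": "TERMINATED",
        "aborted_by_user": "ABORTED",
    }
    return reason_to_status.get(job["termination_reason"], "FAILED")
```

Keeping the job in `TERMINATING` until every cleanup step succeeds is what lets the pipeline be re-entrant: each pass re-checks `remove_at` and the volume state instead of tracking partial progress.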
Services' job lifecycle has some modifications:

- During STEP 3, once the primary job of a replica is `RUNNING` and ready to receive traffic, `JobRunningPipeline` registers that replica on the gateway. If the gateway is not accessible, the job fails with a gateway-related termination reason.
- During STEP 5, `JobTerminatingPipeline` unregisters the replica from receiving requests before the job is fully cleaned up.