Trainer control
createTrainer returns a Trainer object with three methods: start(), wait(), and cancel().
arkor start and Studio’s “Run training” button both call start() followed by wait(). You only call them yourself when you wire training into your own code (a server, a script, a custom CLI).
start()
- Submits the job to the cloud API and resolves once the backend has accepted it. The returned jobId is the same id you see in Studio and in the SDK’s TrainingJob.id.
- Idempotent: calling start() a second time on the same trainer returns the same jobId without resubmitting (packages/arkor/src/core/trainer.ts:275-289).
- Does not open the event stream and does not dispatch any callbacks. wait() is what does that.
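The idempotency described above can be illustrated with a promise cache. This is a minimal sketch of one common way to get that behavior, not a copy of what trainer.ts does; submitJob and makeIdempotentStart are hypothetical names, and the `{ jobId }` return shape is an assumption.

```typescript
// Hypothetical stand-in for the backend submission; not the SDK's real call.
let submissions = 0;
async function submitJob(): Promise<{ jobId: string }> {
  submissions += 1;
  return { jobId: `job-${submissions}` };
}

// Caches the first promise so every later call reuses the same submission.
function makeIdempotentStart(submit: () => Promise<{ jobId: string }>) {
  let started: Promise<{ jobId: string }> | undefined;
  return function start(): Promise<{ jobId: string }> {
    if (started === undefined) {
      started = submit();
    }
    return started;
  };
}

const start = makeIdempotentStart(submitJob);
const first = start();
const second = start();
console.log(first === second, submissions); // true 1
```

Because the cached value is the promise itself, a second call made while the first is still in flight also avoids a duplicate submission.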
wait()
- Opens the SSE event stream for the run, dispatches each frame to your callbacks, and resolves with the terminal TrainingResult when the stream reports training.completed or training.failed.
- Calls start() for you if you have not called it yet.
- All five lifecycle callbacks fire from inside wait(). If you call start() without wait(), no callbacks run, even though the run continues on the backend.
- Reconnects on transient SSE errors (see below).
cancel()
- If start() has not been called yet, cancel() is a no-op (early return at :388-389).
- Otherwise it sends a cancel request to the backend.
- Best-effort. The SDK does not short-circuit on terminal status; if the run already completed, failed, or was cancelled, the backend may return a non-2xx and cancel() rejects. Wrap in try / catch if you call it speculatively.
abortSignal
abortSignal controls only your local wait() loop. When the signal aborts:
- The pending SSE fetch is aborted (trainer.ts:325-328).
- Any active reconnect-backoff delay rejects with signal.reason (trainer.ts:178).
- handleFailure re-throws when the signal is aborted (trainer.ts:308).
wait() therefore rejects, not resolves, when you abort.
Aborting the signal does not call cancel() and does not send anything to the backend. The job keeps using GPU time on the managed side.
If you want both effects (stop waiting locally and stop the run on the backend), do them separately.
Use abortSignal for “I no longer care about waiting on this run” (request timeout, parent process exit). Use cancel() for “stop the run on the backend”.
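Doing both separately might look like the sketch below. It assumes the AbortController's signal is the one you supplied as abortSignal; CancellableTrainer and stopWaitingAndCancel are hypothetical names, and the fake trainer stands in for a real Trainer whose run has already finished.

```typescript
// Minimal hypothetical shape; the real Trainer exposes more than this.
interface CancellableTrainer {
  cancel(): Promise<void>;
}

// Aborts the local wait() first, then asks the backend to stop the run.
// cancel() is best-effort, so a rejection is reported instead of thrown.
async function stopWaitingAndCancel(
  controller: AbortController,
  trainer: CancellableTrainer,
): Promise<{ aborted: boolean; cancelled: boolean }> {
  controller.abort(); // local only: wait() rejects, the job keeps running
  try {
    await trainer.cancel(); // remote: sends the cancel request
    return { aborted: controller.signal.aborted, cancelled: true };
  } catch {
    // For example, the run was already terminal and the backend returned a non-2xx.
    return { aborted: controller.signal.aborted, cancelled: false };
  }
}

// Fake trainer whose cancel() rejects, as it might for an already-finished run.
const finishedTrainer: CancellableTrainer = {
  cancel: () => Promise.reject(new Error("run already terminal")),
};
const controller = new AbortController();
const outcome = stopWaitingAndCancel(controller, finishedTrainer);
outcome.then((r) => console.log(r)); // { aborted: true, cancelled: false }
```

Swallowing the cancel() rejection is a design choice for speculative calls; code that must know whether the backend accepted the cancel should let the error propagate instead.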
Reconnects
wait() keeps the SSE stream alive across transient failures by default:
- A clean stream EOF after at least one received frame triggers an immediate reconnect at the base delay (initialReconnectDelayMs, default 1000 ms) without counting against the failure budget. The stream resumes via Last-Event-ID.
- A connect error or a stream EOF without any received frame counts as a failure and goes through handleFailure: exponential backoff via initialReconnectDelayMs * 2 ** attempt, with the per-attempt delay clamped at maxReconnectDelayMs (default 60 000 ms) and the consecutive-failure count capped at maxReconnectAttempts.
- maxReconnectAttempts defaults to undefined (unlimited consecutive failures). It is not configurable through TrainerInput; the only way to set it (along with reconnectDelayMs and maxReconnectDelayMs) is the second context argument to createTrainer, annotated @internal and subject to change. For most projects this means transient SSE failures are silently retried for as long as the job runs.
If you need an upper bound on retries, abort the abortSignal to cause wait() to reject.
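The backoff arithmetic described above can be checked with a small standalone function. This is a sketch under stated assumptions, not the SDK's code: backoffDelayMs is a hypothetical helper, the attempt counter is assumed to be 0-based, and the budget is assumed to be exhausted once attempt reaches maxReconnectAttempts.

```typescript
// Option names follow the text above; the helper itself is hypothetical.
interface ReconnectOptions {
  initialReconnectDelayMs: number; // base delay
  maxReconnectDelayMs: number; // per-attempt clamp
  maxReconnectAttempts?: number; // omitted means unlimited
}

// Defaults as stated in the text: 1000 ms base, 60 000 ms clamp, no attempt cap.
const DEFAULTS: ReconnectOptions = {
  initialReconnectDelayMs: 1000,
  maxReconnectDelayMs: 60_000,
};

// Returns the delay before retry `attempt` (0-based, an assumption),
// or null once the consecutive-failure budget is used up.
function backoffDelayMs(
  attempt: number,
  opts: ReconnectOptions = DEFAULTS,
): number | null {
  if (opts.maxReconnectAttempts !== undefined && attempt >= opts.maxReconnectAttempts) {
    return null;
  }
  const raw = opts.initialReconnectDelayMs * 2 ** attempt;
  return Math.min(raw, opts.maxReconnectDelayMs);
}

const delays = [0, 1, 2, 5, 6, 10].map((a) => backoffDelayMs(a));
console.log(delays); // [ 1000, 2000, 4000, 32000, 60000, 60000 ]
```

With the defaults, the clamp takes effect from the seventh consecutive failure onward (1000 * 2 ** 6 = 64 000 ms, clamped to 60 000 ms).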
Two-process pattern
A common shape for non-CLI use is to keep a long-lived trainer reference and let your own code orchestrate start, wait, and cancel.
This is what arkor start does, minus the entry resolution from runTrainer.
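The orchestration can be sketched against a minimal interface. Everything here is illustrative: TrainerLike, JobHandle, ResultLike, runToCompletion, and makeFakeTrainer are hypothetical names, the status field on the result is an assumption, and the in-memory fake replaces the trainer you would normally get from createTrainer.

```typescript
// Hypothetical minimal shapes; only jobId is taken from the text above.
interface JobHandle {
  jobId: string;
}
interface ResultLike {
  status: "completed" | "failed";
}
interface TrainerLike {
  start(): Promise<JobHandle>;
  wait(): Promise<ResultLike>;
  cancel(): Promise<void>;
}

// Mirrors the described flow: start() submits, then wait() runs to a terminal result.
async function runToCompletion(
  trainer: TrainerLike,
): Promise<{ jobId: string; status: string }> {
  const { jobId } = await trainer.start();
  const result = await trainer.wait();
  return { jobId, status: result.status };
}

// In-memory fake so the sketch runs without the SDK or a backend.
function makeFakeTrainer(): TrainerLike & { calls: string[] } {
  const calls: string[] = [];
  return {
    calls,
    start: async () => {
      calls.push("start");
      return { jobId: "job-123" };
    },
    wait: async () => {
      calls.push("wait");
      return { status: "completed" as const };
    },
    cancel: async () => {
      calls.push("cancel");
    },
  };
}

const fake = makeFakeTrainer();
const finished = runToCompletion(fake);
finished.then((r) => console.log(r, fake.calls));
```

Calling start() explicitly before wait() is optional, since wait() starts the job when needed, but it makes the jobId available before the event stream opens.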