Lifecycle callbacks
Pass callbacks under createTrainer({ callbacks: { ... } }). All five are optional; the SDK type is Partial<TrainerCallbacks>. They run inside trainer.wait(), dispatched from the backend’s SSE event stream.
When each callback fires
If you call start() without wait(), no callbacks ever run. arkor start calls both for you; programmatic callers must do the same.
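As a rough illustration of pairing the two calls, a programmatic caller might wrap them in one helper. The TrainerLike interface and startAndWait function are hypothetical stand-ins; only the start() and wait() method names come from this page, and their return types are assumptions.

```typescript
// Hypothetical structural type: only the method names are taken from this
// page. The real trainer returned by createTrainer may differ in detail.
interface TrainerLike {
  start(): unknown;
  wait(): unknown;
}

// Start the run, then block on wait() so that callbacks are dispatched.
async function startAndWait(trainer: TrainerLike): Promise<unknown> {
  await trainer.start();
  return await trainer.wait();
}
```

Calling only the first half of this helper would start the job but leave every callback silent.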
onStarted({ job })
Fires when the SSE stream reports training.started. Use it for log lines or a “training started” notification.
onLog({ step, loss, evalLoss, learningRate, epoch, samplesPerSecond, job })
Fires repeatedly as training progresses. Each numeric field is number | null: backends only fill in fields they have on a given step (so evalLoss is null on non-eval steps, learningRate may be null between LR-scheduler updates, etc.).
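Because any numeric field can be null on a given step, code that prints or stores these values needs to skip the missing ones. The sketch below shows one way to do that. LogFields and formatLogLine are hypothetical names, and job is left out because its type is not described here.

```typescript
// Shape mirrored from the onLog parameter list above (without `job`).
type LogFields = {
  step: number | null;
  loss: number | null;
  evalLoss: number | null;
  learningRate: number | null;
  epoch: number | null;
  samplesPerSecond: number | null;
};

// Hypothetical helper: build a log line from whichever fields are present.
function formatLogLine(fields: LogFields): string {
  const parts: string[] = [];
  for (const [name, value] of Object.entries(fields)) {
    // Skip fields the backend did not fill in on this step.
    if (value === null) continue;
    parts.push(`${name}=${value}`);
  }
  return parts.join(" ");
}
```

Inside onLog, something like console.log(formatLogLine(ctx)) would then print only the fields that arrived on that step.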
abortSignal only stops your local wait(); call trainer.cancel() afterwards to actually stop the GPU on the backend.
onCheckpoint({ step, adapter, job, infer, artifacts })
Fires when an adapter checkpoint is saved on the backend, while the run is still going. adapter is { kind: "checkpoint", jobId, step }. infer is described in detail on the infer page; in short it takes a chat-style request and returns a raw Response.
onCompleted({ job, artifacts })
Fires once on success. artifacts is unknown[]: the raw artifact list the backend sent. Schemas evolve, so the SDK does not narrow it.
onFailed({ job, error })
Fires once on a backend-reported failure. error is a string (the message the backend sent), not an Error instance.
onFailed is only for backend-side failures. Exceptions thrown inside your other callbacks do not reach onFailed; see below for what does happen to them.
Sequencing
Each callback is awaited before the next event is dispatched. You can return a promise (writing to a database, posting to Slack, calling infer) and the SDK will wait for it before processing the next frame. There are no concurrent callback invocations for the same trainer.
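The ordering guarantee can be pictured with a small stand-in loop. This is not the SDK's dispatcher; dispatchSequentially and the string events are hypothetical, and the sketch only demonstrates that an awaited handler finishes before the next event is handed over.

```typescript
type Handler = (event: string) => void | Promise<void>;

// Stand-in for the behaviour described above: each handler call is awaited
// before the next event is looked at.
async function dispatchSequentially(events: string[], handler: Handler): Promise<void> {
  for (const event of events) {
    await handler(event);
  }
}

const order: string[] = [];
const done = dispatchSequentially(["a", "b"], async (event) => {
  order.push(`start ${event}`);
  await Promise.resolve(); // simulate asynchronous work inside the callback
  order.push(`end ${event}`);
});
```

One consequence of this design is that a slow callback delays every later event, so long-running work may be better handed off to a queue of your own.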
Exception handling (read carefully)
Throwing inside a callback does not behave like a normal Promise rejection. The SDK’s event loop wraps dispatch in a try/catch and routes any throw to the SSE reconnect handler (packages/arkor/src/core/trainer.ts:335-364, then handleFailure at :307-320):
- If abortSignal.aborted is set, the error re-throws and wait() rejects.
- Otherwise, if maxReconnectAttempts was configured and the counter is exceeded, wait() rejects with a wrapping error.
- Otherwise, the SDK delays and reopens the SSE stream.
maxReconnectAttempts defaults to undefined (unlimited). It is not configurable through TrainerInput; the only way to set it is the second context argument to createTrainer, which is annotated @internal and may change without notice. In practice, with default settings, a thrown callback is caught and retried, possibly indefinitely. If Last-Event-ID advances across the retry, the originally failing event is also skipped.
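The three branches can be restated as a small pure function, which makes the default behaviour easier to see. This is a sketch of the rules as described here, not the SDK's code: decideOnFailure, its option names, and the exact comparison against the counter are assumptions.

```typescript
type FailureAction = "rethrow" | "reject-wrapped" | "reconnect";

// Hypothetical restatement of the failure handling described above.
function decideOnFailure(opts: {
  aborted: boolean;
  attempts: number;
  maxReconnectAttempts?: number;
}): FailureAction {
  // An aborted wait() surfaces the original error.
  if (opts.aborted) return "rethrow";
  // A configured limit that has been exceeded ends the wait with a wrapper.
  if (opts.maxReconnectAttempts !== undefined && opts.attempts > opts.maxReconnectAttempts) {
    return "reject-wrapped";
  }
  // With no limit configured (the default), this branch is always taken.
  return "reconnect";
}
```

With maxReconnectAttempts left undefined, no attempt count ever reaches the second branch, which is why a throwing callback can be retried indefinitely.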
For deterministic error handling, catch inside the callback:
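A minimal sketch of that pattern, assuming you want every callback to share one error handler. The guarded helper is hypothetical and not part of arkor; it wraps a callback so that a throw or rejection is handled locally and never reaches the SDK's dispatch loop.

```typescript
// Hypothetical helper: catch synchronous throws and async rejections from a
// callback and pass them to a local handler instead of letting them escape.
function guarded<T>(
  callback: (ctx: T) => void | Promise<void>,
  onError: (error: unknown) => void,
): (ctx: T) => Promise<void> {
  return async (ctx: T) => {
    try {
      await callback(ctx);
    } catch (error) {
      onError(error);
    }
  };
}

const seen: unknown[] = [];
const onLog = guarded(
  (ctx: { step: number | null }) => {
    if (ctx.step === null) throw new Error("missing step");
  },
  (error) => seen.push(error), // record the failure; replace with real reporting
);
```

The wrapped function always resolves, so from the SDK's point of view the callback succeeded and no reconnect is triggered.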
Type sketches
TrainingLogContext and CheckpointContext are not exported by name from arkor; mirror the shapes inline if you want typed callback parameters in your own code.
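One possible inline mirror, built only from the parameter lists on this page. Everything beyond the field names is an assumption: JobLike is a placeholder, jobId is assumed to be a string, the checkpoint step is assumed to be a plain number, and infer is typed loosely even though the page describes it as returning a raw Response.

```typescript
// Placeholder: the job type is not described in this section.
type JobLike = unknown;

type TrainingLogContext = {
  step: number | null;
  loss: number | null;
  evalLoss: number | null;
  learningRate: number | null;
  epoch: number | null;
  samplesPerSecond: number | null;
  job: JobLike;
};

type CheckpointContext = {
  step: number; // assumed non-null for a saved checkpoint
  adapter: { kind: "checkpoint"; jobId: string; step: number };
  job: JobLike;
  infer: (request: unknown) => Promise<unknown>; // loosely typed stand-in
  artifacts: unknown[]; // assumed to match the onCompleted artifact list
};

// Example callbacks typed with the mirrored shapes.
const losses: number[] = [];
const checkpointSteps: number[] = [];
const callbacks = {
  onLog: (ctx: TrainingLogContext) => {
    if (ctx.loss !== null) losses.push(ctx.loss);
  },
  onCheckpoint: (ctx: CheckpointContext) => {
    checkpointSteps.push(ctx.adapter.step);
  },
};
```

If the SDK later exports these types by name, the inline copies can be deleted without touching the callback bodies.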