# Lifecycle callbacks

> The five callbacks Arkor fires during a training run, and what each one is good for.

Arkor fires five callbacks as a training run progresses. All of them are optional, and each one is a plain TypeScript function that runs in your process. Because of that, the training loop feels like the rest of your application: no notebook, no out-of-band dashboard, no separate config language.

```ts theme={null}
createTrainer({
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  callbacks: {
    onStarted, onLog, onCheckpoint, onCompleted, onFailed,
  },
});
```

## When each callback fires

```
.start()                 ── submits the job, returns { jobId }. No callbacks here.
   │
   ▼
.wait()                  ── opens the SSE event stream. Callbacks dispatch from here.
   │
   ▼
onStarted ─── once, when the stream reports `training.started`
   │
   ▼
onLog     ─── many times, once per training-step batch of metrics
   │
   ▼
onCheckpoint ── several times, when an adapter checkpoint is saved
   │
   ▼
onCompleted  ── once, on successful finish
or
onFailed     ── once, on backend-reported failure
```

All five callbacks are dispatched from inside `wait()`. If you call `start()` without later calling `wait()`, **no callbacks fire**, even though the run is still happening on the backend. `arkor start` calls `wait()` for you; if you wire training into your own code outside the CLI, make sure you do too.

`onCompleted` and `onFailed` are mutually exclusive: at most one of them fires per run. If `wait()` throws before a terminal event arrives (for example when `abortSignal` is aborted, or reconnect attempts are exhausted), it is possible that neither one fires.
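If you drive a run from your own code, the two points above can be handled together. The following is a minimal sketch, not Arkor's API: `TrainerLike`, `createOutcomeTracker`, and `runToEnd` are hypothetical names, the `jobId: string` type is an assumption, and the return value of `wait()` is left as `unknown` because it is not described here.

```typescript
// Hypothetical minimal shape covering only the two methods discussed above.
// The trainer type Arkor actually exports may differ.
interface TrainerLike {
  start(): Promise<{ jobId: string }>; // assumption: jobId is a string
  wait(): Promise<unknown>;
}

type RunOutcome = "completed" | "failed" | "no-terminal-event";

// Records which terminal callback fired, if any.
function createOutcomeTracker() {
  let outcome: RunOutcome = "no-terminal-event";
  return {
    // Merge these into the `callbacks` option you pass to your trainer.
    callbacks: {
      onCompleted: (): void => {
        outcome = "completed";
      },
      onFailed: (): void => {
        outcome = "failed";
      },
    },
    get outcome(): RunOutcome {
      return outcome;
    },
  };
}

// Submits the job, then waits so that callbacks are actually dispatched.
async function runToEnd(
  trainer: TrainerLike,
  tracker: { readonly outcome: RunOutcome },
): Promise<{ outcome: RunOutcome; error: unknown }> {
  await trainer.start();
  let error: unknown;
  try {
    await trainer.wait();
  } catch (err) {
    // wait() threw; a terminal callback may not have fired before this point.
    error = err;
  }
  return { outcome: tracker.outcome, error };
}
```

An outcome of `"no-terminal-event"` together with a non-`undefined` `error` is the "neither one fires" case: the run may still be going on the backend, so decide explicitly whether to cancel it or reattach.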

Returning a promise from any callback is fine. Arkor awaits it before moving on, so you can do async work (writing to a database, posting to Slack, calling `infer`) without races.
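To make that guarantee concrete, here is a small model of "await each callback before moving on". It is an illustration of the behavior described above, not Arkor's dispatch code; `dispatchSequentially` and `sleep` are hypothetical helpers.

```typescript
type Handler<E> = (event: E) => void | Promise<void>;

// Resolves after roughly `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Delivers events one at a time and awaits the handler before continuing,
// so a slow handler delays later events instead of overlapping with them.
async function dispatchSequentially<E>(
  events: readonly E[],
  handler: Handler<E>,
): Promise<void> {
  for (const event of events) {
    await handler(event);
  }
}

// Example: each "event" is a delay in milliseconds. Because every handler is
// awaited, `seen` is filled in event order, not in order of completion time.
async function demo(): Promise<number[]> {
  const seen: number[] = [];
  await dispatchSequentially([30, 10, 0], async (delay) => {
    await sleep(delay);
    seen.push(delay);
  });
  return seen;
}
```

The trade-off is that a slow callback holds up the events behind it, so keep long-running work short or hand it off to a queue.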

## `onStarted({ job })`

Fires once, when the SSE stream opened by `wait()` reports a `training.started` event. Note that this is not the same moment `start()` resolves: `start()` only submits the job and returns its `jobId`.

```ts theme={null}
onStarted: ({ job }) => {
  console.log(`Run ${job.id} started`);
},
```

Use it for log lines, metric counters, or sending a "training started" notification.

## `onLog({ step, loss, evalLoss, learningRate, epoch, samplesPerSecond, job })`

Fires repeatedly as training progresses. Each numeric field can be `null` when the backend has not produced that metric yet (for example, `evalLoss` is only populated on eval steps).

```ts theme={null}
onLog: ({ step, loss, evalLoss }) => {
  if (loss !== null) {
    console.log(`step=${step} loss=${loss.toFixed(4)} evalLoss=${evalLoss ?? "-"}`);
  }
},
```

Common uses:

* Forward to your own metrics pipeline (e.g. PostHog, Datadog).
* Detect divergence early: if `loss` is climbing, abort `wait()` via `abortSignal` and call `trainer.cancel()` to stop the GPU on the backend.
* Implement custom early stopping: abort a run automatically when metrics regress (see the [Early stopping recipe](/cookbook/early-stopping)).
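The divergence check can be sketched as a small stateful `onLog` handler. This is one possible heuristic under stated assumptions, not a built-in feature: `createDivergenceGuard`, `patience`, and `onDiverge` are hypothetical names, and "loss rose on N consecutive readings" is a deliberately simple rule.

```typescript
// Subset of the onLog payload used here; `loss` may be null as noted above.
interface LossEvent {
  loss: number | null;
}

// Returns an onLog-compatible handler that calls `onDiverge` once, the first
// time loss has increased on `patience` consecutive non-null readings.
function createDivergenceGuard(
  patience: number,
  onDiverge: () => void,
): (event: LossEvent) => void {
  let previous: number | null = null;
  let rising = 0;
  let tripped = false;

  return ({ loss }: LossEvent): void => {
    // Ignore events without a loss value, and do nothing after tripping.
    if (tripped || loss === null) return;

    if (previous !== null && loss > previous) {
      rising += 1;
    } else {
      rising = 0; // loss fell or stayed flat: reset the streak
    }
    previous = loss;

    if (rising >= patience) {
      tripped = true;
      onDiverge();
    }
  };
}
```

For `onDiverge` you might pass `() => controller.abort()`, where `controller` is an `AbortController` whose signal you supplied as `abortSignal`, and then call `trainer.cancel()` once `wait()` has rejected. Raw step-to-step loss is noisy, so a smoothed value or a larger `patience` is usually a safer trigger.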

## `onCheckpoint({ step, adapter, job, infer, artifacts })`

Fires when a checkpoint is saved on the backend, while the run is still going.

```ts theme={null}
onCheckpoint: async ({ step, infer }) => {
  const res = await infer({
    messages: [{ role: "user", content: "Can't log in" }],
  });
  console.log(`step=${step} sample=`, await res.text());
},
```

`adapter` is a small object identifying the checkpoint (`{ kind: "checkpoint", jobId, step }`). `infer` is a function: it takes a chat-style request and returns a raw `Response`. You call `await res.text()` (or `res.json()`, or stream the body) to read it.

In practice this is often the most useful callback, because it lets you sanity-check the model mid-run rather than waiting until the end. If a checkpoint is already worse than the base model, you know to stop.
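A mid-run sanity check usually means running the same few prompts at every checkpoint so the outputs are comparable. The helper below is a sketch under stated assumptions: `sampleCheckpoint`, `TextResponse`, `ChatRequest`, and `Infer` are hypothetical names, and the request shape is inferred from the example above. `TextResponse` lists only the standard Fetch `Response` members the helper uses, so a real `Response` satisfies it.

```typescript
// Structural subset of the Fetch `Response` that this helper relies on.
interface TextResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

// Assumed request shape, based on the `messages` array in the example above.
interface ChatRequest {
  messages: { role: string; content: string }[];
}

type Infer = (request: ChatRequest) => Promise<TextResponse>;

interface Sample {
  prompt: string;
  output: string | null; // null when the response was not OK
  status: number;
}

// Sends each prompt to the checkpoint in turn and collects the raw bodies.
async function sampleCheckpoint(
  infer: Infer,
  prompts: readonly string[],
): Promise<Sample[]> {
  const samples: Sample[] = [];
  for (const prompt of prompts) {
    const res = await infer({ messages: [{ role: "user", content: prompt }] });
    // Read the body only for successful responses.
    const output = res.ok ? await res.text() : null;
    samples.push({ prompt, output, status: res.status });
  }
  return samples;
}
```

Inside `onCheckpoint` you could call `await sampleCheckpoint(infer, prompts)` and store the result keyed by `step`, which makes it easy to compare checkpoints side by side.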

## `onCompleted({ job, artifacts })`

Fires once, on success. `artifacts` lists what the backend produced for this run. Use it to:

* Persist the final adapter ID where the rest of your app can find it.
* Run a final smoke test before promoting the model.
* Send a "training done" notification.

```ts theme={null}
onCompleted: ({ job, artifacts }) => {
  console.log(`Run ${job.id} done, ${artifacts.length} artifacts`);
},
```

## `onFailed({ job, error })`

Fires once if the backend reports a failure. Note that `error` is a `string` (the message the backend sent), not an `Error` instance:

```ts theme={null}
onFailed: ({ job, error }) => {
  console.error(`Run ${job.id} failed: ${error}`);
},
```

`onFailed` is for backend-reported failures only. A thrown exception inside one of your own callbacks does **not** route through `onFailed`, and it is also not guaranteed to fail `wait()` cleanly: the current runtime catches errors thrown during event dispatch and treats them as SSE failures, which feed into the reconnect loop. The original exception can end up retried or skipped rather than surfacing where you would expect. If you need deterministic behavior, catch errors inside the callback and decide what to do (log, abort, persist) before they escape.
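One way to keep exceptions from escaping is to wrap each callback before handing it to the trainer. This is a minimal sketch; `guarded` is a hypothetical helper, not part of Arkor.

```typescript
// Wraps a callback so that anything it throws (or rejects with) is passed to
// `onError` instead of propagating to whoever invoked the callback.
function guarded<A>(
  callback: (arg: A) => void | Promise<void>,
  onError: (error: unknown, arg: A) => void,
): (arg: A) => Promise<void> {
  return async (arg: A): Promise<void> => {
    try {
      await callback(arg);
    } catch (error) {
      // Decide here what to do: log, set a flag, trigger an abort, persist.
      onError(error, arg);
    }
  };
}
```

For example, `onLog: guarded(sendToMetrics, (err) => console.error("onLog failed:", err))`, where `sendToMetrics` stands for your own function. Keep `onError` itself simple, since an exception thrown there would still escape.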

## A complete example

```ts theme={null}
import { createTrainer } from "arkor";

export const trainer = createTrainer({
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  lora: { r: 16, alpha: 16 },
  maxSteps: 100,
  callbacks: {
    onStarted: ({ job }) => console.log(`started ${job.id}`),
    onLog: ({ step, loss }) => {
      if (loss !== null) console.log(`step=${step} loss=${loss.toFixed(4)}`);
    },
    onCheckpoint: async ({ step, infer }) => {
      const res = await infer({
        messages: [{ role: "user", content: "Hello!" }],
      });
      console.log(`ckpt @ ${step}:`, await res.text());
    },
    onCompleted: ({ job }) => console.log(`done ${job.id}`),
    onFailed: ({ error }) => console.error(`failed: ${error}`),
  },
});
```

Once you understand callbacks, almost everything else in Arkor is just configuration on top.
