DatasetSource

createTrainer accepts a single dataset, expressed as a discriminated union keyed on the type field:
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob";        url: string;  token?: string };
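
Because type is the discriminant, a switch on it lets TypeScript narrow to the matching variant, so name is only reachable in the huggingface branch and url only in the blob branch. The sketch below repeats the union so that it type-checks on its own; describeSource is a hypothetical helper for illustration and is not part of the SDK.

```typescript
// Local copy of the union shown above, so this sketch is self-contained.
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob"; url: string; token?: string };

// Hypothetical helper (not an SDK function): builds a short label for logs.
function describeSource(source: DatasetSource): string {
  switch (source.type) {
    case "huggingface": {
      // Narrowed: name, split and subset are available here.
      const subset = source.subset ? `/${source.subset}` : "";
      const split = source.split ? ` [${source.split}]` : "";
      return `hf:${source.name}${subset}${split}`;
    }
    case "blob":
      // Narrowed: url and token are available here. The token is never printed.
      return `blob:${source.url}${source.token ? " (token set)" : ""}`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a variant is added.
      const unreachable: never = source;
      throw new Error(`Unknown dataset source: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The never assignment in the default branch is a common pattern: if DatasetSource gains a third variant, the compiler flags every switch that has not been updated.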

HuggingFace

dataset: {
  type: "huggingface",
  name: "arkorlab/triage-demo",
  // split: "train",
  // subset: "v1",
}
| Field  | Type          | Notes |
|--------|---------------|-------|
| type   | "huggingface" | Discriminant. |
| name   | string        | Repository name (e.g. arkorlab/triage-demo). Public repos work without further auth. |
| split  | string?       | Override the default split. Optional. |
| subset | string?       | For datasets that publish multiple subsets. Optional. |

This is the form the bundled templates (triage / translate / redaction) use. Most projects start here.

Blob URL

dataset: {
  type: "blob",
  url: "https://example.com/data.jsonl",
  // token: process.env.DATASET_TOKEN,
}
| Field | Type    | Notes |
|-------|---------|-------|
| type  | "blob"  | Discriminant. |
| url   | string  | HTTPS URL the backend can fetch. |
| token | string? | Forwarded to the cloud-api in the job config; the backend uses it when fetching the blob. The exact HTTP wire format (header, scheme, etc.) is backend-defined and not part of the SDK contract. |

Use this for datasets you host yourself (signed S3 URL, internal CDN, etc.). The backend pulls the URL once at the start of the run.
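
Since the url must be an HTTPS URL and the token is optional, it can help to build the blob source in one place and fail early on a bad scheme. The following is a minimal sketch under those assumptions; makeBlobSource is a hypothetical helper, not an SDK export, and the HTTPS check is a local convenience rather than something the SDK is known to enforce.

```typescript
// Local copy of the blob variant, so this sketch is self-contained.
type BlobSource = { type: "blob"; url: string; token?: string };

// Hypothetical helper (not an SDK function): validates the scheme and
// attaches the token only when one is actually provided.
function makeBlobSource(url: string, token?: string): BlobSource {
  if (!/^https:\/\//i.test(url)) {
    throw new Error(`Blob dataset URL must use HTTPS, got: ${url}`);
  }
  const source: BlobSource = { type: "blob", url };
  if (token !== undefined && token !== "") {
    // An unset or empty token is omitted rather than sent as an empty string.
    source.token = token;
  }
  return source;
}
```

In a trainer config this could be used as dataset: makeBlobSource("https://example.com/data.jsonl", process.env.DATASET_TOKEN), which leaves the token out when the environment variable is not set.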

Picking a form

  • Reach for huggingface when the dataset is already on the Hub. It is the most-tested path.
  • Reach for blob when you need a dataset that cannot live on the Hub (proprietary content, signed URL, internal-only).
Local files ({ type: "file", path: "./data.jsonl" }) are not part of DatasetSource today. To use a local file, first host it at an HTTPS URL and reference it as a blob, or upload it to a private HuggingFace repo.
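
When the dataset config comes from an untyped source such as a JSON file, the compiler cannot catch an unsupported form like { type: "file" }. The sketch below shows one way to validate such input at runtime and point the user at the two supported forms. parseDatasetSource and its error messages are hypothetical and not part of the SDK.

```typescript
// Local copies of the two variants, so this sketch is self-contained.
type HuggingFaceSource = { type: "huggingface"; name: string; split?: string; subset?: string };
type BlobSource = { type: "blob"; url: string; token?: string };
type DatasetSource = HuggingFaceSource | BlobSource;

// Hypothetical helper (not an SDK function): checks untyped input at runtime.
function parseDatasetSource(raw: unknown): DatasetSource {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("dataset must be an object");
  }
  const obj = raw as Record<string, unknown>;
  const kind = obj.type;
  if (kind === "huggingface") {
    const name = obj.name;
    if (typeof name !== "string") throw new Error('huggingface source needs a string "name"');
    const source: HuggingFaceSource = { type: "huggingface", name };
    const split = obj.split;
    const subset = obj.subset;
    if (typeof split === "string") source.split = split;
    if (typeof subset === "string") source.subset = subset;
    return source;
  }
  if (kind === "blob") {
    const url = obj.url;
    if (typeof url !== "string") throw new Error('blob source needs a string "url"');
    const source: BlobSource = { type: "blob", url };
    const token = obj.token;
    if (typeof token === "string") source.token = token;
    return source;
  }
  if (kind === "file") {
    // Local files are not in DatasetSource; suggest the supported alternatives.
    throw new Error('type "file" is not supported: host the file as a blob URL or upload it to a HuggingFace repo');
  }
  throw new Error(`unknown dataset type: ${String(kind)}`);
}
```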