`DatasetSource`

`createTrainer` accepts a single dataset, expressed as a discriminated union on `type`:

type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob";        url: string;  token?: string };
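
Because `type` is the discriminant, a `switch` on it lets TypeScript narrow each branch to the matching variant. The following is a minimal sketch of that pattern; `describeSource` is a hypothetical helper written for illustration, not an SDK export, and the type is repeated only so the snippet is self-contained.

```typescript
// Mirrors the union defined above so the sketch stands alone.
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob"; url: string; token?: string };

// Hypothetical helper (not part of the SDK): builds a log-friendly label.
function describeSource(source: DatasetSource): string {
  switch (source.type) {
    case "huggingface": {
      // Narrowed: `name`, `split`, and `subset` are available here.
      const subset = source.subset ? `/${source.subset}` : "";
      const split = source.split ? ` [${source.split}]` : "";
      return `hf:${source.name}${subset}${split}`;
    }
    case "blob":
      // Narrowed: `url` and `token` are available here.
      // The token is deliberately left out so it never reaches logs.
      return `blob:${source.url}`;
    default: {
      // Exhaustiveness check: compilation fails if a variant is added
      // to the union without a matching case above.
      const unreachable: never = source;
      throw new Error(`Unknown dataset source: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

With this sketch, `describeSource({ type: "huggingface", name: "arkorlab/triage-demo", split: "train", subset: "v1" })` returns `hf:arkorlab/triage-demo/v1 [train]`.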

HuggingFace

dataset: {
  type: "huggingface",
  name: "arkorlab/triage-demo",
  // split: "train",
  // subset: "v1",
}

| Field | Type | Notes |
| --- | --- | --- |
| `type` | `"huggingface"` | Discriminant. |
| `name` | `string` | Repository name (e.g. `arkorlab/triage-demo`). Public repos work without further auth. |
| `split` | `string?` | Override the default split. Optional. |
| `subset` | `string?` | For datasets that publish multiple subsets. Optional. |

This is the form the bundled templates (triage / translate / redaction) use. Most projects start here.

Blob URL

dataset: {
  type: "blob",
  url: "https://example.com/data.jsonl",
  // token: process.env.DATASET_TOKEN,
}

| Field | Type | Notes |
| --- | --- | --- |
| `type` | `"blob"` | Discriminant. |
| `url` | `string` | HTTPS URL the backend can fetch. |
| `token` | `string?` | Forwarded to the cloud-api in the job config; the backend uses it when fetching the blob. The exact HTTP wire format (header, scheme, etc.) is backend-defined and not part of the SDK contract. |

Use this for datasets you host yourself (signed S3 URL, internal CDN, etc.). The backend pulls the URL once at the start of the run.
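
Since the backend fetches the URL only once, at the start of the run, it can help to catch an obviously wrong URL before submitting the job. The sketch below assumes you want to reject non-HTTPS URLs on the client side; `blobSource` is a hypothetical helper, not an SDK function, and the SDK itself may or may not perform a similar check.

```typescript
// The blob variant of DatasetSource, repeated so the sketch stands alone.
type BlobSource = { type: "blob"; url: string; token?: string };

// Hypothetical helper (not part of the SDK): builds a blob source and
// rejects URLs that do not use the https scheme.
function blobSource(url: string, token?: string): BlobSource {
  if (!/^https:\/\//i.test(url)) {
    throw new Error(`Dataset URL must start with https://, got "${url}"`);
  }
  // Include the `token` key only when a non-empty token was provided.
  return token ? { type: "blob", url, token } : { type: "blob", url };
}
```

A call such as `blobSource("https://example.com/data.jsonl", token)` yields an object that can be passed as `dataset`, where `token` would typically come from an environment variable such as the `DATASET_TOKEN` shown in the example above.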

Picking a form

- Reach for `huggingface` when the dataset is already on the Hub. It is the most-tested path.
- Reach for `blob` when you need a dataset that cannot live on the Hub (proprietary content, signed URL, internal-only).

Local files (`{ type: "file", path: "./data.jsonl" }`) are not part of `DatasetSource` today. To use one, host it somewhere the backend can fetch it and pass it as a blob URL, or upload it to a private HuggingFace repo first.
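
When the dataset config is loaded from JSON or another untyped source, the compiler cannot reject an unsupported `type: "file"` entry for you. The following is a minimal sketch of a runtime guard that accepts only the two documented forms; `isDatasetSource` is a hypothetical helper for illustration, not an SDK export, and it checks only the required fields.

```typescript
// Mirrors the union from the top of this page so the sketch stands alone.
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob"; url: string; token?: string };

// Hypothetical runtime guard (not part of the SDK).
function isDatasetSource(value: unknown): value is DatasetSource {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (v.type === "huggingface") return typeof v.name === "string";
  if (v.type === "blob") return typeof v.url === "string";
  // Any other discriminant, including "file", is not supported.
  return false;
}
```

Running a parsed config through a guard like this before calling `createTrainer` surfaces an unsupported local-file entry early rather than at job submission.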
