A practical end-to-end recipe for shipping recorded sessions into a model training run.
## 1. Pick the workflow + filter
Decide what you’re training on. The most common shapes:
- All successful runs of a workflow — for behavior cloning a reliable agent.
- Failure → recovery pairs — for training a recovery policy.
- Outliers — to broaden coverage past the dominant path.
```ts
const filter = {
  outcome: "success",
  since: "2026-01-01",
  actor: { kind: "human" }, // exclude scripted runs from the imitation set
  min_duration_ms: 5_000,   // drop trivially short runs
};
```
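The failure → recovery shape from the list above needs a different filter. A minimal sketch, assuming a `recovered_error` boolean exists in the filter schema (the field name is an assumption, not confirmed by the SDK; check your workspace's filter reference):

```ts
// Hypothetical filter for failure -> recovery pairs: sessions that ended
// in success but contain at least one error the operator recovered from.
// `recovered_error` is an assumed field name.
const recoveryFilter = {
  outcome: "success",
  since: "2026-01-01",
  actor: { kind: "human" },
  recovered_error: true, // keep only runs with a mid-session failure
};
```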
## 2. Choose frame sampling
| Goal | Sampling |
|---|---|
| Cheap, action-only training | `event_only` (default). One frame per action. |
| Vision-fluency / continuous video | `every_n_ms: 100` (10 fps) |
| Mid-range | `keyframes_only`: only frames with material screen change |
```ts
const frame_sampling = { mode: "event_only", keep_action_frames: true };
```
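For the vision-fluency row, the config would look something like the sketch below. The exact shape of the `every_n_ms` mode is extrapolated from the table, so verify it against the export API reference before relying on it:

```ts
// Assumed shape: fixed-interval sampling every 100 ms (10 fps).
// Keeping action frames guarantees each action still has its own frame.
const video_sampling = { mode: "every_n_ms", every_n_ms: 100, keep_action_frames: true };
```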
## 3. Pick a destination
S3 with cross-account assume-role is the cleanest pattern:
```ts
const destination = {
  kind: "s3",
  bucket: "acme-training",
  prefix: "nusomi/process_invoice/v1/",
  region: "us-east-1",
  role_arn: "arn:aws:iam::123456789012:role/NusomiExport",
};
```
Provision the role with this trust policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::609385282459:root" },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": { "sts:ExternalId": "<your-workspace-id>" }
    }
  }]
}
```
And this permission policy on the destination bucket prefix:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:AbortMultipartUpload"],
      "Resource": "arn:aws:s3:::acme-training/nusomi/*"
    }
  ]
}
```
## 4. Trigger the export
```ts
import { Nusomi } from "@nusomi/sdk";

const nusomi = new Nusomi({ apiKey: process.env.NUSOMI_API_KEY });

const exp = await nusomi.exports.create({
  workflow: "process_invoice",
  filter,
  format: "webdataset",
  frame_sampling,
  destination,
  tag: `process_invoice@${new Date().toISOString().slice(0, 10)}`,
});

await exp.wait();
console.log(exp.manifest_url);
```
A few minutes later (depending on export size), the manifest lands at `manifest_url`.
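The dated `tag` expression above can be factored into a small helper. This is a plain function, not part of the Nusomi SDK:

```ts
// Build a dated export tag like "process_invoice@2026-01-05".
// Plain helper; the naming convention is a suggestion, not an SDK rule.
function datedTag(workflow: string, date: Date = new Date()): string {
  return `${workflow}@${date.toISOString().slice(0, 10)}`;
}
```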
## 5. Load it
WebDataset loads cleanly into PyTorch:
```python
import webdataset as wds

url = "pipe:aws s3 cp s3://acme-training/nusomi/process_invoice/v1/shard-{000000..000023}.tar -"
dataset = (
    wds.WebDataset(url)
    .decode("pil")
    .to_tuple("frame.webp", "action.json")
    .shuffle(1000)
    .batched(32)
)
```
For HuggingFace datasets:
```python
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files="s3://acme-training/nusomi/process_invoice/v1/*.parquet",
)
```
## 6. Version it
Always tag exports. Train against a tag, not a moving filter:
```ts
const v1 = await nusomi.exports.create({
  // ...same workflow, filter, format, and destination as above
  tag: "process_invoice@v1",
});

// Later, retrain on the same data:
const replay = await nusomi.exports.list({ tag: "process_invoice@v1" });
console.log(replay[0].manifest_url); // same shards, same hashes
```
## 7. Schedule it
Most teams cut a fresh dataset weekly:
```yaml
# .github/workflows/nusomi-export.yml
on:
  schedule:
    - cron: "0 7 * * 1" # Mondays 07:00 UTC
jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx tsx scripts/export-process-invoice.ts
        env:
          NUSOMI_API_KEY: ${{ secrets.NUSOMI_API_KEY }}
```
The script bumps the tag, kicks off the export, and posts the manifest URL to your training queue. Reliability comes from the manifest — if it lands, the dataset is whole.
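One way to bump the tag on a weekly cadence is an ISO-week naming scheme that matches the Monday cron. This is a naming suggestion, not an SDK convention; the script would compute it and pass the result as `tag` to `nusomi.exports.create`:

```ts
// ISO week number (UTC) -> tag like "process_invoice@2026-W02".
// Plain helper; nothing in the export API requires this scheme.
function weeklyTag(workflow: string, date: Date = new Date()): string {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const day = d.getUTCDay() || 7;          // Mon=1 .. Sun=7
  d.setUTCDate(d.getUTCDate() + 4 - day);  // shift to the week's Thursday
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86_400_000 + 1) / 7);
  return `${workflow}@${d.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}
```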
## Common knobs
- Drop low-confidence events. `filter.min_event_confidence: 0.9` gates out vision-only events from RDP / Citrix surfaces if you only want high-fidelity supervision.
- Limit by path. Use `memory.paths` to find the dominant path id, then pass `filter.path: "pth_..."` to train only on that path.
- Exclude specific runs. `filter.exclude_session_ids: [...]` for known-bad recordings flagged by review.
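Putting the knobs together in one filter. The id strings below are illustrative placeholders, not real values; substitute the path id from `memory.paths` and the session ids from your review queue:

```ts
// All three knobs combined. Ids are placeholders for illustration only.
const tunedFilter = {
  outcome: "success",
  min_event_confidence: 0.9,                // drop vision-only events
  path: "pth_PLACEHOLDER",                  // dominant path id (placeholder)
  exclude_session_ids: ["ses_PLACEHOLDER"], // flagged by review (placeholder)
};
```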