# Finetuning on Captured Episodes
This guide explains how to finetune the agent's ShowUI/Qwen2VL model on the
episodes recorded by `src/capture.py` (MosaicML Streaming, aka MDS).
All commands assume you are inside the Nix development shell (`nix develop`).
## What It Trains
The training script turns each annotated action in an episode into a supervised example:
- Prompt: the same chat template the agent uses (system + task + image, plus an optional short action history).
- Target: the JSON action string (e.g., `{"action": "CLICK", ...}`) observed in chat as `Action: {...}` during capture.
This directly teaches the model to emit the agent’s expected action schema.
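As a rough sketch of this conversion (the function and field names here are illustrative, not the script's actual API), one annotated capture step might become a supervised pair like so:

```python
import json


def build_example(task: str, action: dict, history: list[dict],
                  history_steps: int = 0) -> dict:
    """Turn one annotated capture step into a (prompt, target) pair.

    Illustrative only: the real prompt template lives in the agent code.
    """
    lines = [f"Task: {task}"]
    # TRAIN_HISTORY_STEPS controls how many prior actions appear in the prompt.
    for past in (history[-history_steps:] if history_steps else []):
        lines.append(f"Action: {json.dumps(past)}")
    prompt = "\n".join(lines)
    # The target is exactly the JSON action string the agent must learn to emit.
    target = f"Action: {json.dumps(action)}"
    return {"prompt": prompt, "target": target}


ex = build_example(
    task="Open the settings menu",
    action={"action": "CLICK", "value": None, "position": [0.42, 0.13]},
    history=[{"action": "SCROLL", "value": "down"}],
    history_steps=1,
)
print(ex["target"])
```

Training then minimizes cross-entropy on the target tokens only, with the prompt (including the image) as context.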
## Environment Variables

Set these in `.env` (already scaffolded in `.env.example`):
```shell
# Data sources (MDS)
TRAIN_LOCAL="./data/mds"            # defaults to CAPTURE_OUT
# TRAIN_REMOTE="s3://bucket/prefix" # optional remote MDS
TRAIN_CACHE="./data/cache"          # local cache for StreamingDataset

# Output
TRAIN_OUTPUT="./checkpoints"        # where to save model + processor

# Logging
TRAIN_LOGLEVEL=INFO                 # DEBUG|INFO|WARNING|ERROR
TRAIN_LOG_FORMAT=text               # text|json

# Optimization
TRAIN_EPOCHS=1
TRAIN_BATCH=1
TRAIN_ACCUM=1
TRAIN_LR=5e-6
TRAIN_WD=0.0
TRAIN_MAX_STEPS=0                   # 0 disables cap
TRAIN_MAX_SAMPLES_PER_EPOCH=0       # 0 disables per-epoch cap
TRAIN_HISTORY_STEPS=0               # include N prior actions in prompt
SEED=1337
```
The model ID and image sizes inherit from the agent defaults:
```shell
REPO_ID="showlab/ShowUI-2B"
SIZE_SHORTEST_EDGE=224
SIZE_LONGEST_EDGE=1344
```
## Running Training
```shell
# Basic run (uses .env)
python src/train.py

# With overrides
TRAIN_BATCH=2 TRAIN_EPOCHS=1 python src/train.py --output ./ckpt/showui-ft

# Using just
just train
```
Checkpoints are written under `TRAIN_OUTPUT` (`epoch-*` and `final/`). Use them
with the agent by pointing `REPO_ID` to the checkpoint folder.
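For example, assuming a run that saved its final weights to the default output location, the agent picks up the finetuned model via `.env` (the path shown is illustrative):

```shell
# .env — load the finetuned checkpoint instead of the upstream hub model
REPO_ID="./checkpoints/final"
```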
## Notes
- The script streams episodes with `streaming.StreamingDataset` (MDS). It can read from a local directory and/or a remote (S3/R2/MinIO) when configured.
- Mixed precision is enabled automatically on CUDA. On Apple MPS the default precision is used.
- This is a minimal reference trainer for reproducible finetuning. For large jobs, consider adding gradient checkpointing, DeepSpeed, or PEFT adapters in your environment.