Neko Agent Documentation
Welcome to the Neko Agent documentation! This project provides an AI-powered automation agent that interfaces with Neko servers to perform GUI automation tasks through WebRTC connections.
What is Neko Agent?
Neko Agent is a Python-based automation system that:
- Connects to Neko servers via WebRTC for real-time GUI interaction
- Uses AI vision models (ShowUI-2B/Qwen2VL) for visual reasoning and action planning
- Captures training data automatically during automation sessions
- Provides voice synthesis capabilities via F5-TTS and WebRTC audio
- Supports various actions like clicking, typing, scrolling, and navigation
Key Components
- src/agent.py - Core automation agent with WebRTC integration
- src/capture.py - Training data capture service using MosaicML Streaming
- src/yap.py - Text-to-speech service with F5-TTS and voice management
Getting Started
To get started with Neko Agent:
- Set up the development environment using Nix flakes
- Configure a Neko server for GUI automation targets
- Run the agent to begin automation tasks
- Optionally enable capture for training data collection
See the Getting Started guide for detailed setup instructions.
Architecture
The system is designed around WebRTC-based communication with Neko containers, allowing for:
- Real-time video frame analysis and action execution
- Automated training data collection in MosaicML format
- Voice feedback through WebRTC audio streams
- Scalable deployment patterns for production use
For technical details, see the Architecture Overview.
Getting Started
This guide will help you set up and run the Neko Agent system for AI-powered GUI automation.
Prerequisites
- Nix with flakes enabled - for development environment
- Neko server - target environment for automation
- NVIDIA GPU (optional) - for optimal AI model performance
Development Environment Setup
The project uses Nix flakes for reproducible development environments. Choose the appropriate shell based on your needs:
Available Development Shells
# Basic CPU environment
nix develop
# GPU-accelerated environment (recommended)
nix develop .#gpu
# Shell with additional AI CLI tools
nix develop .#ai
# Docker-compose helper environment for the bundled Neko stack
nix develop .#neko
# Documentation tooling
nix develop .#docs
# Optimised variants
nix develop .#cpu-opt
nix develop .#gpu-opt
# Trusted Execution Environment tooling
nix develop .#tee
GPU Environment (Recommended)
For the best performance with AI models:
nix develop .#gpu
This provides:
- CUDA 12.8 toolkit and libraries
- PyTorch with CUDA support
- All required Python dependencies
- Pre-configured environment variables
Neko Server Setup
The agent requires a running Neko server to connect to. You can either:
Option 1: Use Docker Compose (Included)
# Enter the Neko environment
nix develop .#neko
# Start Neko services
neko-services up
# View logs
neko-services logs
# Stop services
neko-services down
This starts a Neko server at http://localhost:8080 with Chrome ready for automation.
Option 2: External Neko Server
Configure connection to an existing Neko server by setting environment variables:
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"
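These same variables drive the agent's REST login handshake. A minimal sketch of that handshake in Python, assuming the `/api/login` endpoint shown in the troubleshooting sections of these docs and a `token` field in the response:

```python
import json
import os
import urllib.request

def login_request(url, user, password):
    # Build the REST login endpoint and JSON body (path taken from the
    # troubleshooting examples elsewhere in these docs).
    return f"{url.rstrip('/')}/api/login", {"username": user, "password": password}

def neko_login():
    # Read credentials from the environment, as the agent does, and
    # return the token field of the response (field name is an assumption).
    endpoint, body = login_request(
        os.environ["NEKO_URL"], os.environ["NEKO_USER"], os.environ["NEKO_PASS"]
    )
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp).get("token")
```
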
Running the Agent
Basic Usage
# Run with a simple task
uv run src/agent.py --task "Navigate to google.com and search for 'AI automation'"
# Run with REST authentication (the agent performs the login handshake)
uv run src/agent.py \
--task "Your automation task" \
--neko-url "http://localhost:8080" \
--username "user" \
--password "password"
# Keep the agent online and accept new tasks from chat
uv run src/agent.py --online --neko-url "http://localhost:8080" --username user --password password
With Training Data Capture
To enable training data collection during automation:
# Terminal 1: Start capture service
uv run src/capture.py
# Terminal 2: Run agent (capture watches chat messages for /start and /stop)
uv run src/agent.py --task "Your task description"
See Training Data Capture for detailed capture configuration.
With Voice Output
For voice feedback during automation:
# Terminal 1: Start TTS service (F5-TTS + WebRTC audio)
uv run src/yap.py
# Terminal 2: Run the agent – leave audio enabled (default) so the browser hears YAP
uv run src/agent.py --task "Describe what you see"
Configuration
Environment Variables
Create a .env file in the project root (copy .env.example first):
# Neko server connection
NEKO_URL=http://localhost:8080
NEKO_USER=user
NEKO_PASS=password
# Agent behaviour
NEKO_TASK="Default task description"
NEKO_MODE=web
NEKO_MAX_STEPS=8
NEKO_AUDIO=1
REFINEMENT_STEPS=5
NEKO_ICE_POLICY=strict
NEKO_RTCP_KEEPALIVE=0
NEKO_SKIP_INITIAL_FRAMES=5
NEKO_FORCE_EXIT_GUARD_MS=0
NEKO_LOGLEVEL=INFO
NEKO_LOG_FORMAT=text
NEKO_METRICS_PORT=9000
# FRAME_SAVE_PATH="./tmp/frame.png"
# CLICK_SAVE_PATH="./tmp/actions"
# Capture settings (optional)
CAPTURE_OUT=./data/mds
CAPTURE_FPS=2.0
CAPTURE_REMOTE=s3://your-bucket/training-data
# TTS settings (optional)
YAP_VOICES_DIR=./voices
YAP_SR=48000
YAP_PARALLEL=2
YAP_MAX_CHARS=350
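Inside the services these variables are typically read with typed defaults; a small hypothetical helper illustrating the pattern (names and defaults mirror the listing above, but the real parsing lives in the service scripts):

```python
import os

def env(name, default, cast=str):
    # Read an environment variable, casting it; fall back to the default.
    raw = os.getenv(name)
    return cast(raw) if raw is not None else default

# A few of the settings from the listing above, with their defaults.
NEKO_URL = env("NEKO_URL", "http://localhost:8080")
NEKO_MAX_STEPS = env("NEKO_MAX_STEPS", 8, int)
CAPTURE_FPS = env("CAPTURE_FPS", 2.0, float)
```
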
GPU Configuration
For optimal GPU usage:
# Set specific GPU
export CUDA_VISIBLE_DEVICES=0
# Configure memory allocation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Set target architecture
export TORCH_CUDA_ARCH_LIST=8.6
Verification
Test your setup:
# Check dependencies
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
# Test agent connection
uv run src/agent.py --healthcheck
# Explore capture options (the script exits when interrupted)
uv run src/capture.py --help
# Test TTS (if enabled)
uv run src/yap.py --healthcheck
Next Steps
- Read the Architecture Overview to understand the system design
- Explore Training Data Capture to collect data for model improvement
- Check the Developer Guide for technical details
- Review Neko Integration for advanced server configuration
Troubleshooting
Common Issues
CUDA not detected:
# Verify NVIDIA drivers
nvidia-smi
# Check CUDA installation in shell
echo $CUDA_HOME
Neko connection failed:
# Test Neko server accessibility
curl http://localhost:8080/health
# Check WebSocket endpoint
curl -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" \
http://localhost:8080/api/ws
Python import errors:
# Regenerate environment
nix develop .#gpu --command python -c "import streaming; import torch; print('OK')"
For more help, see the Developer Guide or check the project issues.
Manual Control Guide
The Manual Control CLI is an interactive tool that lets you remotely control any Neko server through your terminal. Think of it as a command-line remote desktop that you can use for testing, administration, or automating tasks on hosted desktop environments.
Quick Start
Prerequisites
- Python 3.8 or newer
- Access to a Neko server (either local or hosted)
- Network connectivity to the Neko server
Launching the CLI
The manual control tool ships as src/manual.py
inside this repository. From a Nix shell you can run:
uv run src/manual.py --help
First Connection
The easiest way to connect is using your Neko server credentials:
uv run src/manual.py \
--neko-url "https://your-neko-server.com" \
--username "your-username" \
--password "your-password"
If successful, you'll see:
Starting Neko manual CLI. Type 'help' for commands. Ctrl+D or 'quit' to exit.
neko>
Basic Usage
Getting Help
Type help at any time to see all available commands:
neko> help
This displays a comprehensive list of all commands with their syntax.
Essential Commands
Mouse Control
# Move mouse cursor to specific coordinates
neko> move 100 200
# Click at current mouse position
neko> click
# Move and click in one command
neko> tap 300 400
# Right-click
neko> rclick
# Double-click
neko> dblclick
Keyboard Input
# Type text
neko> text "Hello, World!"
# Press specific keys
neko> key Enter
neko> key Escape
neko> key F5
# Click somewhere and then type
neko> input 500 300 "username@example.com"
Navigation
# Scroll in different directions
neko> scroll down 3
neko> scroll up
neko> scroll left 2
# Drag/swipe gestures
neko> swipe 100 100 300 300
CLI switches
manual.py mirrors the agent’s authentication behaviour: provide either a full WebSocket URL (--ws) or REST credentials (--neko-url, --username, --password). Additional flags include:
Flag | Purpose |
---|---|
--norm | Interpret coordinates as 0.0–1.0 floats instead of pixels |
--size 1920x1080 | Override the reported screen size when using pixel coordinates |
--no-auto-host | Do not automatically request host control on connect |
--no-media | Skip WebRTC negotiation (only for debugging signalling) |
--no-audio | Disable audio subscription |
By default the CLI learns the keyboard layout advertised by the server and stores logs under NEKO_LOGFILE if that environment variable is set.
Normalised coordinates example
uv run src/manual.py --norm \
--neko-url "https://your-server.com" \
--username "admin" --password "secret"
# Now coordinates are between 0 and 1
neko> move 0.5 0.5 # Center of screen
neko> tap 0.1 0.9 # Bottom-left area
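The mapping from normalised to pixel coordinates can be sketched as follows (the clamping and rounding behaviour here are assumptions, not the CLI's exact code):

```python
def to_pixels(nx, ny, width, height):
    # Map 0.0-1.0 coordinates onto a width x height pixel grid,
    # clamping so that 1.0 stays on-screen.
    px = min(width - 1, max(0, round(nx * width)))
    py = min(height - 1, max(0, round(ny * height)))
    return px, py
```

So with a 1920x1080 screen, `move 0.5 0.5` lands on the centre pixel and `1.0` maps to the last column or row rather than falling off the edge.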
Common Use Cases
Website Testing
Test web applications by automating browser interactions:
# Navigate to a website
neko> tap 400 60 # Click address bar
neko> text "https://example.com"
neko> key Enter
# Fill out a form
neko> tap 300 200 # Click username field
neko> text "testuser"
neko> key Tab # Move to next field
neko> text "password123"
neko> key Enter # Submit form
# Test navigation
neko> tap 500 300 # Click a link
neko> key F5 # Refresh page
neko> key Alt+Left # Go back
Application Testing
Test desktop applications:
# Open application menu
neko> key Super # Windows/Super key
neko> text "calculator"
neko> key Enter
# Use the application
neko> tap 200 300 # Click number 5
neko> tap 250 350 # Click plus
neko> tap 200 300 # Click number 5
neko> tap 300 400 # Click equals
System Administration
Perform administrative tasks:
# Open terminal
neko> key Ctrl+Alt+t
# Run system commands
neko> text "sudo apt update"
neko> key Enter
neko> text "your-password" # If prompted
neko> key Enter
# Navigate file system
neko> text "cd /var/log"
neko> key Enter
neko> text "ls -la"
neko> key Enter
File Management
Work with files and folders:
# Open file manager
neko> key Super
neko> text "files"
neko> key Enter
# Navigate and create files
neko> key Ctrl+n # New folder
neko> text "TestFolder"
neko> key Enter
neko> dblclick # Enter folder
neko> rclick # Right-click for context menu
neko> text "n" # New file (if available)
Advanced Features
Clipboard Operations
# Copy, cut, and paste
neko> select_all # Select all text
neko> copy # Copy selection
neko> tap 400 500 # Click elsewhere
neko> paste # Paste content
# Paste specific text
neko> paste "This is custom text to paste"
Session Management
If you have admin rights, you can manage other users:
# List all connected users
neko> sessions
# Force take control from another user
neko> force-take
# Kick a specific user (use session ID from 'sessions' command)
neko> kick abc123-def456-789
# Release control
neko> force-release
Screen Size Management
# Check current screen size
neko> size
# Set virtual screen size (affects coordinate scaling)
neko> size 1920x1080
Raw Protocol Access
For advanced users, send custom commands:
# Send custom JSON messages to the server
neko> raw '{"event":"system/stats"}'
neko> raw '{"event":"control/scroll","payload":{"delta_x":0,"delta_y":240}}'
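These raw messages follow a simple envelope: an `event` name plus an optional `payload` object. A helper for composing them (only the two events shown above come from these docs; other event names depend on the server):

```python
import json

def raw_event(event, **payload):
    # Compose the JSON envelope expected by the `raw` command.
    msg = {"event": event}
    if payload:
        msg["payload"] = payload
    return json.dumps(msg)

# Example: raw_event("control/scroll", delta_x=0, delta_y=240)
```
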
Configuration
Environment Variables
Set these to avoid typing credentials each time:
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"
export NEKO_SIZE="1920x1080"
# Now you can just run:
uv run src/manual.py
Command Line Options
Option | Description | Example |
---|---|---|
--ws | Direct WebSocket URL | wss://neko.example.com/api/ws?token=... |
--neko-url | Neko server URL (for REST login) | https://neko.example.com |
--username | Login username | admin |
--password | Login password | secretpass |
--norm | Use 0-1 coordinates | (flag only) |
--size | Virtual screen size | 1920x1080 |
--no-auto-host | Don't auto-request control | (flag only) |
--no-media | Skip video/audio setup | (flag only) |
--no-audio | Disable audio | (flag only) |
Logging
Enable detailed logging for troubleshooting:
export NEKO_LOGLEVEL="DEBUG"
export NEKO_LOGFILE="/tmp/neko-manual.log"
uv run src/manual.py --neko-url "..." --username "..." --password "..."
# In another terminal, watch the logs:
tail -f /tmp/neko-manual.log
Tips and Best Practices
Timing and Delays
For automation scripts, you may need to add delays between actions:
# The tool doesn't have built-in delays, but you can use shell scripting:
echo -e "tap 300 200\ntext hello\nkey Enter" | uv run src/manual.py --neko-url "..." --username "..." --password "..."
Screen Resolution
Always check the screen size when starting:
neko> size
size 1920x1080 normalized=false
This helps you understand the coordinate system you're working with.
Connection Issues
If you have connection problems:
- Check credentials: Make sure username/password are correct
- Verify URL: Ensure the Neko server URL is accessible
- Network connectivity: Test basic HTTP access to the server
- Enable debug logging: Use NEKO_LOGLEVEL=DEBUG for detailed logs
Multiple Users
Be aware that Neko servers can have multiple users connected:
- Only one user can have "host" control at a time
- Use the sessions command to see who else is connected
- Use host and unhost to request/release control
- Admin users can use force-take and force-release
Automation Scripts
For repetitive tasks, consider creating shell scripts:
#!/bin/bash
# test-website.sh
export NEKO_URL="https://neko.example.com"
export NEKO_USER="admin"
export NEKO_PASS="password"
{
echo "tap 400 60" # Click address bar
echo "text https://example.com" # Type URL
echo "key Enter" # Navigate
sleep 3 # Wait for page load
echo "tap 300 200" # Click login button
echo "text testuser" # Username
echo "key Tab" # Next field
echo "text testpass" # Password
echo "key Enter" # Submit
} | uv run src/manual.py
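The same piping pattern works from Python; a sketch using `subprocess` (the `uv run src/manual.py` invocation is carried over from above and assumes the repository layout, with credentials in NEKO_URL / NEKO_USER / NEKO_PASS):

```python
import subprocess

def build_script(commands):
    # Join CLI commands into the newline-separated stdin stream
    # that manual.py reads, one command per line.
    return "\n".join(commands) + "\n"

def run_script(commands, timeout=120):
    # Feed the script to the manual CLI and capture its output.
    return subprocess.run(
        ["uv", "run", "src/manual.py"],
        input=build_script(commands),
        text=True, capture_output=True, timeout=timeout,
    )
```

Unlike the shell version, this makes it easy to insert Python-side waits (`time.sleep`) or generate command lists programmatically.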
Troubleshooting
Common Issues
"Connection failed"
- Verify the Neko server is running and accessible
- Check that the URL, username, and password are correct
- Ensure no firewall is blocking the connection
"Command not found: manual.py"
- Make sure you're in the correct directory
- Use the full path: python /path/to/src/manual.py
"No host control"
- Another user may have control; use sessions to check
- Try the host command to request control
- Admin users can use force-take
Commands not working
- Verify you have host control
- Check if coordinates are correct for the screen size
- Use the size command to verify resolution
High latency
- Network delay between you and the Neko server
- Try reducing the number of rapid commands
- Consider using the server closer to your location
Getting Help
If you encounter issues:
- Enable debug logging with NEKO_LOGLEVEL=DEBUG
- Check the log output for error messages
- Verify basic connectivity to the Neko server
- Review the Developer Guide for technical details
The Manual Control CLI is a powerful tool for remote desktop automation and testing. With practice, you can efficiently control any Neko server environment from your terminal.
Training Data Capture User Guide
The Neko capture tool records user interface interactions as training data for machine learning models. It captures screenshots and action sequences from Neko browser sessions, packaging them into datasets ready for training computer vision and UI automation models.
Quick Start
Prerequisites
- Running Neko server with WebSocket API enabled
- Python environment with the mosaicml-streaming package installed
- (Optional) S3-compatible storage credentials for remote upload
Basic Setup
1. Install dependencies:
pip install mosaicml-streaming requests websockets
2. Set your Neko connection:
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"
3. Start capturing:
python src/capture.py
4. Record an episode in Neko chat:
/start Navigate to login page and enter credentials
Action: {"type": "click", "coordinate": [150, 200]}
Action: {"type": "type", "text": "username"}
/stop
That's it! Your episode is now saved as training data in ./data/mds/.
What Gets Captured
The capture tool records complete "episodes" of UI interaction:
- Screenshots: JPEG images captured at regular intervals (default: 2 FPS)
- Actions: Structured annotations of user interactions (clicks, typing, etc.)
- Metadata: Task descriptions, timestamps, screen dimensions
- Context: Complete session information for reproducible training
Each episode becomes a single training sample packaged as:
episode.zip
├── meta.json # Episode metadata and schema
├── frames/
│ ├── 000000.jpg # Sequential screenshots
│ ├── 000001.jpg
│ └── ...
├── frames.ndjson # Frame timing information
└── actions.ndjson # Action annotations
Architecture
Data Flow
- Connect to the Neko WebSocket API (/api/ws) for chat monitoring
- Listen for task boundaries (/start and /stop commands) and action annotations
- Capture screenshots at a configurable FPS via the HTTP endpoint (/api/room/screen/shot.jpg)
- Package episodes as ZIP archives containing metadata, frames, and action sequences
- Write to MDS shards with automatic S3 upload and shard rotation
Configuration
All configuration is handled via environment variables for 12-factor compliance:
Connection Settings
# REST login (preferred method)
export NEKO_URL="https://neko.example.com"
export NEKO_USER="username"
export NEKO_PASS="password"
# OR direct WebSocket URL (bypasses REST)
export NEKO_WS="wss://neko.example.com/api/ws?token=..."
Output Configuration
# Local MDS directory
export CAPTURE_OUT="./data/mds"
# Remote storage (optional)
export CAPTURE_REMOTE="s3://bucket/prefix"
export CAPTURE_KEEP_LOCAL=0 # 0=delete local after upload, 1=keep
# MDS shard settings
export CAPTURE_COMPRESSION="zstd:6" # Compression algorithm
export CAPTURE_SHARD_SIZE="512mb" # Size before shard rotation
export CAPTURE_HASHES="sha1" # Integrity checking
Capture Parameters
export CAPTURE_FPS=2 # Screenshots per second
export CAPTURE_JPEG_QUALITY=85 # JPEG quality (0-100)
export CAPTURE_MIN_FRAMES=4 # Minimum frames to save episode
export CAPTURE_EPISODE_TIMEOUT=900 # Auto-end after N seconds
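These knobs bound episode size, so a quick back-of-envelope check is worth doing before long sessions. A sketch (the 120 kB average JPEG size is an assumption; measure your own frames):

```python
def episode_budget(fps=2.0, timeout_s=900, avg_jpeg_kb=120):
    # Worst-case frames per episode, and the raw screenshot volume
    # in MiB before MDS compression.
    frames = int(fps * timeout_s)
    return frames, frames * avg_jpeg_kb / 1024
```

At the defaults above (2 FPS, 900 s timeout) an episode tops out at 1800 frames, roughly 200 MiB of raw JPEGs.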
S3 Authentication
For S3 or S3-compatible storage:
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
export AWS_SESSION_TOKEN="..." # Optional for temporary credentials
# For S3-compatible endpoints (MinIO, R2, etc.)
export S3_ENDPOINT_URL="https://s3.example.com"
Common Use Cases
Training UI Automation Models
Capture complete workflows for training models to automate repetitive tasks:
# Set up for high-quality capture
export CAPTURE_FPS=3
export CAPTURE_JPEG_QUALITY=90
Then in Neko chat:
/start Fill out customer registration form
Action: {"type": "click", "coordinate": [245, 156], "element": "first_name_field"}
Action: {"type": "type", "text": "John"}
Action: {"type": "click", "coordinate": [245, 186], "element": "last_name_field"}
Action: {"type": "type", "text": "Doe"}
Action: {"type": "click", "coordinate": [245, 216], "element": "email_field"}
Action: {"type": "type", "text": "john.doe@example.com"}
Action: {"type": "click", "coordinate": [300, 350], "element": "submit_button"}
/stop
Collecting Web Navigation Data
Record browsing sessions for training navigation models:
# Lower FPS for longer sessions
export CAPTURE_FPS=1
export CAPTURE_EPISODE_TIMEOUT=1800 # 30 minutes
Then in Neko chat:
/start Research product reviews for laptop purchase
Action: {"type": "navigate", "url": "https://amazon.com"}
Action: {"type": "type", "text": "gaming laptop", "element": "search_box"}
Action: {"type": "click", "coordinate": [400, 45], "element": "search_button"}
Action: {"type": "click", "coordinate": [200, 180], "element": "product_link"}
Action: {"type": "scroll", "direction": "down", "amount": 500}
/stop
Capturing Error Recovery Workflows
Document how to handle and recover from errors:
In Neko chat:
/start Handle login failure and password reset
Action: {"type": "type", "text": "wrong_password", "element": "password_field"}
Action: {"type": "click", "coordinate": [300, 200], "element": "login_button"}
Action: {"type": "click", "coordinate": [250, 150], "element": "forgot_password_link"}
Action: {"type": "type", "text": "user@example.com", "element": "email_field"}
Action: {"type": "click", "coordinate": [200, 180], "element": "send_reset_button"}
/stop
Configuration Guide
Basic Configuration
For most users, these environment variables are sufficient:
# Required: Neko connection
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"
# Optional: Local storage (defaults to ./data/mds)
export CAPTURE_OUT="/path/to/training/data"
# Optional: Capture quality
export CAPTURE_FPS=2 # Screenshots per second
export CAPTURE_JPEG_QUALITY=85 # Image quality (0-100)
Cloud Storage Setup
AWS S3
export CAPTURE_REMOTE="s3://my-training-bucket/ui-episodes"
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
MinIO/Self-hosted S3
export CAPTURE_REMOTE="s3://training-data/episodes"
export S3_ENDPOINT_URL="https://minio.mycompany.com"
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"
Cloudflare R2
export CAPTURE_REMOTE="s3://my-r2-bucket/training"
export S3_ENDPOINT_URL="https://your-account.r2.cloudflarestorage.com"
export AWS_ACCESS_KEY_ID="your-r2-token"
export AWS_SECRET_ACCESS_KEY="your-r2-secret"
Advanced Configuration
Fine-tune capture behavior for specific use cases:
# Episode management
export CAPTURE_MIN_FRAMES=10 # Minimum frames to save episode
export CAPTURE_EPISODE_TIMEOUT=600 # Auto-stop after 10 minutes
export CAPTURE_KEEP_LOCAL=1 # Keep local copies when uploading
# Storage optimization
export CAPTURE_COMPRESSION="zstd:9" # Maximum compression
export CAPTURE_SHARD_SIZE="1gb" # Larger shards for fewer files
export CAPTURE_HASHES="sha1,md5" # Multiple integrity checks
# Debugging
export CAPTURE_LOGLEVEL="DEBUG"
Running the Capture Tool
Command Line Usage
Basic usage with environment variables:
python src/capture.py
Override settings with command line flags:
python src/capture.py \
--neko-url https://neko.example.com \
--username myuser \
--password mypass \
--out ./custom/data/path \
--remote s3://mybucket/training-data \
--fps 3.0 \
--jpeg-quality 95 \
--episode-timeout 1200
Direct WebSocket Connection
Skip REST authentication with a direct WebSocket URL:
export NEKO_WS="wss://neko.example.com/api/ws?token=your-jwt-token"
python src/capture.py
Implementation Details
Core Classes
EpisodeBuffer (src/capture.py:161)
- Manages temporary storage for a single episode
- Handles frame and action storage during capture
- Finalizes episode data into ZIP archive format
NekoSession (src/capture.py:342)
- WebSocket client for Neko server communication
- Handles authentication via REST API
- Manages screenshot polling and message queuing
Capture (src/capture.py:542)
- Main orchestrator coordinating the capture workflow
- Processes chat events for episode boundaries
- Manages episode lifecycle and MDS writing
Episode Lifecycle
- Start Detection: Chat message matching the /start <task> pattern
- Frame Capture: Screenshot polling thread captures frames at the specified FPS
- Action Parsing: Chat messages matching Action: {...} are parsed and stored
- End Conditions:
  - Manual /stop command
  - Episode timeout (default 900 seconds)
  - Application shutdown
- Finalization: Episode packaged and written to MDS if the minimum frame count is met
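The boundary detection above can be sketched as a small classifier over chat lines (the regexes here are illustrative; the authoritative patterns live in src/capture.py):

```python
import json
import re

START_RE = re.compile(r"^/start\s+(.+)$")
ACTION_RE = re.compile(r"^Action:\s*(\{.*\})\s*$")

def parse_chat(line):
    # Classify a chat line into one of the lifecycle events described above.
    if m := START_RE.match(line):
        return ("start", m.group(1))
    if line.strip() == "/stop":
        return ("stop", None)
    if m := ACTION_RE.match(line):
        try:
            return ("action", json.loads(m.group(1)))
        except json.JSONDecodeError:
            return ("malformed", line)  # handled gracefully, not fatal
    return (None, None)
```
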
MDS Schema
Each MDS record contains:
Column | Type | Description |
---|---|---|
episode_id | str | Unique episode identifier |
task | str | Task description from /start command |
payload | bytes | ZIP archive containing episode data |
num_frames | int | Number of screenshot frames captured |
num_actions | int | Number of action annotations |
started_at | str | Episode start timestamp (ISO 8601) |
ended_at | str | Episode end timestamp (ISO 8601) |
screen_w | int | Screen width in pixels |
screen_h | int | Screen height in pixels |
agent | json | Agent metadata and configuration |
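Writing records with this schema uses MosaicML's `MDSWriter`; a sketch with the column types from the table (the authoritative schema lives in src/capture.py, and the import is deferred so the snippet only needs `mosaicml-streaming` when actually writing):

```python
# Column schema mirroring the table above.
COLUMNS = {
    "episode_id": "str",
    "task": "str",
    "payload": "bytes",
    "num_frames": "int",
    "num_actions": "int",
    "started_at": "str",
    "ended_at": "str",
    "screen_w": "int",
    "screen_h": "int",
    "agent": "json",
}

def open_writer(out="./data/mds", remote=None):
    # Deferred import: requires the mosaicml-streaming package.
    from streaming import MDSWriter
    dest = (out, remote) if remote else out
    return MDSWriter(out=dest, columns=COLUMNS,
                     compression="zstd:6", hashes=["sha1"],
                     size_limit="512mb")
```

The writer is used as a context manager, with one `writer.write(record)` call per finished episode.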
Error Handling and Resilience
Network Resilience
- Automatic WebSocket reconnection with exponential backoff
- Screenshot polling continues through temporary network issues
- MDS writer handles partial uploads and resumes interrupted transfers
Data Integrity
- SHA1 hashing for shard integrity verification
- Episode validation before MDS writing
- Graceful handling of malformed action annotations
Resource Management
- Temporary episode directories cleaned up after packaging
- Queue overflow protection prevents memory exhaustion
- Configurable episode timeouts prevent runaway captures
Integration with Training Pipelines
Loading Data
from streaming import StreamingDataset
import io
import json
import zipfile
# Create streaming dataset
dataset = StreamingDataset(
remote="s3://bucket/training-data",
local="./cache",
shuffle=True
)
# Process episodes
for sample in dataset:
episode_id = sample['episode_id']
task = sample['task']
payload = sample['payload']
# Extract episode contents
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
meta = json.loads(zf.read('meta.json'))
frames_index = [json.loads(line) for line in zf.read('frames.ndjson').decode().splitlines()]
actions = [json.loads(line) for line in zf.read('actions.ndjson').decode().splitlines()]
# Load frame images
for frame_info in frames_index:
frame_data = zf.read(frame_info['file'])
# Process frame...
Random Window Sampling
The episode-as-record format enables efficient random window sampling for sequence models:
import io
import json
import random
import zipfile

def sample_windows(episode_payload, window_size=32):
with zipfile.ZipFile(io.BytesIO(episode_payload)) as zf:
frames_index = [json.loads(line) for line in zf.read('frames.ndjson').decode().splitlines()]
if len(frames_index) < window_size:
return None
start_idx = random.randint(0, len(frames_index) - window_size)
window = frames_index[start_idx:start_idx + window_size]
# Load frames for this window
frames = []
for frame_info in window:
frame_data = zf.read(frame_info['file'])
frames.append(frame_data)
return frames
Troubleshooting
Quick Diagnostics
First, check if the capture tool can connect to your Neko server:
# Test basic connectivity
curl -I https://your-neko-server.com/api/health
# Test WebSocket endpoint (if available)
python -c "
import asyncio, websockets
async def main():
    async with websockets.connect('wss://your-neko-server.com/api/ws'):
        print('WebSocket connection successful')
asyncio.run(main())
"
Common Issues & Solutions
❌ Connection Issues
Problem: Connection refused or Unable to connect to Neko server
Solutions:
- Verify the Neko server is running: curl https://your-neko-server.com
- Check firewall settings - ensure ports 80/443 and WebSocket ports are open
- Verify the URL format: use https:// (not http://) for secure connections
- Test from the same network as your Neko server first
Problem: SSL certificate verification failed
Solutions:
- For self-signed certificates, set export PYTHONHTTPSVERIFY=0 (development only)
- Or add the certificate to the system trust store
- Use the IP address instead of the hostname if DNS is the issue
❌ Authentication Issues
Problem: Authentication failed or 401 Unauthorized
Solutions:
- Double-check username/password: echo $NEKO_USER $NEKO_PASS
- Verify the user has WebSocket API access in the Neko admin panel
- Try a direct WebSocket token if available:
# Get a token via the REST API first
curl -X POST https://your-neko-server.com/api/login \
  -H "Content-Type: application/json" \
  -d '{"username":"user","password":"pass"}'
# Use the token directly
export NEKO_WS="wss://your-neko-server.com/api/ws?token=YOUR_TOKEN"
❌ Episode Recording Issues
Problem: Episodes not being saved/captured
Solutions:
1. Check command format: Commands must be exact:
/start task description here ✅
/Start task description ❌ (wrong case)
/ start task description ❌ (extra space)
2. Verify action format: Actions must be valid JSON:
Action: {"type": "click", "coordinate": [100, 200]} ✅
Action: {type: "click", coordinate: [100, 200]} ❌ (missing quotes)
Action {"type": "click"} ❌ (missing colon)
3. Check minimum frames: Episodes with fewer than CAPTURE_MIN_FRAMES frames are discarded:
export CAPTURE_MIN_FRAMES=1  # Save all episodes
4. Enable debug logging:
export CAPTURE_LOGLEVEL=DEBUG
python src/capture.py
❌ Storage Issues
Problem: Permission denied writing to the local directory
Solutions:
- Check directory permissions: ls -la ./data/
- Create the directory manually: mkdir -p ./data/mds
- Use a different path: export CAPTURE_OUT=/tmp/capture-data
Problem: S3 upload failures
Solutions:
1. Verify credentials:
aws sts get-caller-identity  # Test AWS credentials
2. Check bucket permissions: Ensure your credentials can write to the bucket
3. Test endpoint connectivity:
# For MinIO/R2
curl -I $S3_ENDPOINT_URL
4. Debug with minimal config:
unset CAPTURE_REMOTE  # Disable S3, use local only
python src/capture.py
❌ Performance Issues
Problem: High memory usage or slow performance
Solutions:
1. Reduce the capture rate:
export CAPTURE_FPS=1             # Lower frame rate
export CAPTURE_JPEG_QUALITY=70   # Lower image quality
2. Use shorter episodes:
export CAPTURE_EPISODE_TIMEOUT=300  # 5 minutes max
3. Monitor resources:
# Watch memory usage
watch 'ps aux | grep capture.py'
# Check disk space
df -h ./data/mds/
Debugging Commands
View Current Configuration
python src/capture.py --help # See all options
env | grep CAPTURE # Show capture settings
env | grep NEKO # Show connection settings
Test Components Individually
Test REST Authentication:
python -c "
import requests, os
r = requests.post(f'{os.getenv(\"NEKO_URL\")}/api/login',
json={'username': os.getenv('NEKO_USER'),
'password': os.getenv('NEKO_PASS')})
print(f'Status: {r.status_code}')
print(f'Response: {r.text}')
"
Test Screenshot Endpoint:
python -c "
import requests, os
r = requests.get(f'{os.getenv(\"NEKO_URL\")}/api/room/screen/shot.jpg')
print(f'Status: {r.status_code}')
print(f'Content-Type: {r.headers.get(\"content-type\")}')
with open('/tmp/test_screenshot.jpg', 'wb') as f:
f.write(r.content)
print('Screenshot saved to /tmp/test_screenshot.jpg')
"
Log Analysis
Enable detailed logging to diagnose issues:
export CAPTURE_LOGLEVEL=DEBUG
python src/capture.py 2>&1 | tee capture.log
Key log messages to look for:
- REST login ok - Authentication successful
- Screen size: 1920x1080 - WebSocket connection established
- Episode [id] started - Episode recording began
- Episode [id] committed - Episode saved successfully
- WS loop error - WebSocket connection issues
- Shot.jpg non-OK - Screenshot endpoint problems
Getting Help
If you're still having issues:
- Check the logs with CAPTURE_LOGLEVEL=DEBUG
- Verify your environment with env | grep -E "(NEKO|CAPTURE|AWS)"
- Test components separately using the debug commands above
- Create a minimal test case:
# Simplest possible configuration
export NEKO_URL="https://your-server.com"
export NEKO_USER="testuser"
export NEKO_PASS="testpass"
export CAPTURE_OUT="/tmp/test-capture"
export CAPTURE_LOGLEVEL="DEBUG"
python src/capture.py
Performance Tuning
For optimal performance in different scenarios:
High-Quality Training Data
export CAPTURE_FPS=3
export CAPTURE_JPEG_QUALITY=95
export CAPTURE_COMPRESSION="zstd:3" # Faster compression
export CAPTURE_SHARD_SIZE="256mb" # Smaller shards for faster upload
Long Sessions/Low Memory
export CAPTURE_FPS=1
export CAPTURE_JPEG_QUALITY=70
export CAPTURE_EPISODE_TIMEOUT=1800
export CAPTURE_COMPRESSION="zstd:9" # Maximum compression
Local Development
unset CAPTURE_REMOTE # No S3 upload
export CAPTURE_COMPRESSION="none" # Faster local writes
export CAPTURE_LOGLEVEL="INFO"
Best Practices
Episode Design
Keep episodes focused: Record single, complete tasks rather than mixing multiple workflows:
# Good: focused task
/start Complete user registration process
# ... perform registration steps ...
/stop
# Poor: mixed tasks
/start Do registration then check email then browse products
Use descriptive task names: Help your training pipeline understand the data:
/start Handle payment form validation errors
/start Navigate product catalog with filters
/start Recover from session timeout during checkout
Include error scenarios: Capture both success and failure paths:
/start Login with invalid credentials and recover
/start Handle network interruption during file upload
/start Deal with form validation errors
Action Annotation Guidelines
Be consistent with action types: Use standardized action schemas:
{"type": "click", "coordinate": [x, y], "element": "button_id"}
{"type": "type", "text": "input text", "element": "field_id"}
{"type": "scroll", "direction": "down", "amount": 300}
{"type": "navigate", "url": "https://example.com"}
{"type": "wait", "duration": 2.0, "reason": "page_load"}
Include context in actions: Add semantic information when possible:
{"type": "click", "coordinate": [200, 100], "element": "login_button", "intent": "submit_form"}
{"type": "type", "text": "user@example.com", "element": "email_field", "intent": "enter_credentials"}
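A lightweight validator can enforce these schemas before annotated actions enter a training set. The sketch below is illustrative only: the `REQUIRED_FIELDS` mapping and `validate_action` helper are assumptions derived from the examples above, not part of `capture.py`.

```python
import json

# Required fields per action type, following the example schemas above.
REQUIRED_FIELDS = {
    "click": {"coordinate"},
    "type": {"text"},
    "scroll": {"direction"},
    "navigate": {"url"},
    "wait": {"duration"},
}

def validate_action(action):
    """Return True if the action has a known type and all required fields."""
    required = REQUIRED_FIELDS.get(action.get("type"))
    if required is None:
        return False
    return required.issubset(action.keys())

actions = [json.loads(line) for line in [
    '{"type": "click", "coordinate": [200, 100], "element": "login_button"}',
    '{"type": "wait", "reason": "page_load"}',  # missing "duration"
]]
print([validate_action(a) for a in actions])  # -> [True, False]
```

Extra fields such as `element` or `intent` pass through untouched, so the semantic context described above is preserved.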
Storage Management
Choose appropriate compression: Balance speed vs storage:
- Development: CAPTURE_COMPRESSION="none" (fastest)
- Production: CAPTURE_COMPRESSION="zstd:6" (balanced)
- Archive: CAPTURE_COMPRESSION="zstd:9" (smallest)
Optimize shard sizes for your infrastructure:
- Fast networks: CAPTURE_SHARD_SIZE="1gb" (fewer files)
- Slow networks: CAPTURE_SHARD_SIZE="128mb" (faster uploads)
- Mobile/edge: CAPTURE_SHARD_SIZE="64mb" (reliable transfers)
Use appropriate retention policies:
# Keep local copies for immediate access
export CAPTURE_KEEP_LOCAL=1
# Or upload and clean for space efficiency
export CAPTURE_KEEP_LOCAL=0
Advanced Usage
Continuous Capture Workflows
For long-running capture sessions, run the service under systemd or Docker for reliability:
Systemd service (/etc/systemd/system/neko-capture.service):
[Unit]
Description=Neko Training Data Capture
After=network.target
[Service]
Type=simple
User=capture
WorkingDirectory=/opt/neko-capture
Environment=NEKO_URL=https://neko.example.com
Environment=NEKO_USER=capture-bot
EnvironmentFile=/etc/neko-capture/config
ExecStart=/usr/bin/python3 src/capture.py
Restart=always
RestartSec=30
[Install]
WantedBy=multi-user.target
Docker deployment:
FROM python:3.11-slim
RUN pip install streaming requests websockets
COPY src/capture.py /app/
ENV NEKO_URL=https://neko.example.com
CMD ["python", "/app/capture.py"]
Multi-instance Scaling
Run multiple capture instances for parallel data collection:
# Instance 1: UI automation tasks
export CAPTURE_OUT="./data/ui-automation"
export CAPTURE_REMOTE="s3://training/ui-automation"
python src/capture.py &
# Instance 2: Navigation tasks
export CAPTURE_OUT="./data/navigation"
export CAPTURE_REMOTE="s3://training/navigation"
python src/capture.py &
# Instance 3: Error handling
export CAPTURE_OUT="./data/error-recovery"
export CAPTURE_REMOTE="s3://training/error-recovery"
python src/capture.py &
Integration with Training Pipelines
Streaming data loader example:
from streaming import StreamingDataset
import torch
from torch.utils.data import DataLoader
class UIDataset(StreamingDataset):
    def __getitem__(self, idx):
        sample = super().__getitem__(idx)
        # Extract and process the episode payload (process_episode is user-defined)
        payload = sample['payload']
        episode = self.process_episode(payload)
        return {
            'frames': episode['frames'],
            'actions': episode['actions'],
            'task': sample['task'],
        }

# Use with PyTorch
dataset = UIDataset(remote="s3://training/episodes")
loader = DataLoader(dataset, batch_size=32, num_workers=4)
Quality Assurance
Validate episodes before training:
def validate_episode(episode_data):
    """Check episode quality before using for training."""
    frames = episode_data['frames']
    actions = episode_data['actions']
    # Check minimum length
    if len(frames) < 10:
        return False, "Too few frames"
    # Check action/frame ratio
    if len(actions) / len(frames) < 0.1:
        return False, "Too few actions relative to frames"
    # Check for valid click coordinates
    for action in actions:
        if action.get('type') == 'click':
            coord = action.get('coordinate', [])
            if len(coord) != 2:
                return False, "Malformed click coordinates"
            if not (0 <= coord[0] <= 1920 and 0 <= coord[1] <= 1080):
                return False, "Invalid click coordinates"
    return True, "Valid episode"
Monitor capture quality:
# Check recent episodes
python -c "
from streaming import StreamingDataset
ds = StreamingDataset(local='./data/mds')
for i, sample in enumerate(ds):
    if i >= 10: break
    print(f'Episode {sample[\"episode_id\"]}: {sample[\"num_frames\"]} frames, {sample[\"num_actions\"]} actions')
"
Finetuning on Captured Episodes
This guide explains how to finetune the agent's ShowUI/Qwen2VL model using the episodes recorded by src/capture.py (MosaicML Streaming, aka MDS). All commands assume you are inside the Nix development shell (nix develop).
What It Trains
The training script turns each annotated action in an episode into a supervised example:
- Prompt: the same chat template used by the agent (system + task + image and optional short action history).
- Target: the JSON action string (e.g., {"action":"CLICK", ...}) observed in chat as Action: {...} during capture.
This directly teaches the model to emit the agent’s expected action schema.
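As a rough sketch of that conversion, the snippet below shows the shape of one supervised example. It is illustrative only: the real trainer builds the prompt through the agent's chat template and the Hugging Face processor, so `to_training_pair` and its exact formatting are assumptions.

```python
import json

def to_training_pair(task, action, history):
    """Build an illustrative (prompt, target) pair from one captured action.

    The actual trainer uses the agent's chat template via the HF processor;
    this sketch only shows the shape of the supervision signal.
    """
    lines = [f"Task: {task}"]
    if history:
        lines.append("Previous actions: " + "; ".join(history))
    lines.append("<image>")  # placeholder for the frame at this step
    prompt = "\n".join(lines)
    # Target is the JSON action string, exactly as logged in chat.
    target = "Action: " + json.dumps(action, separators=(",", ":"))
    return prompt, target

prompt, target = to_training_pair(
    "Search for AI news",
    {"action": "CLICK", "position": [0.5, 0.3]},
    ["INPUT", "CLICK"],
)
print(target)  # -> Action: {"action":"CLICK","position":[0.5,0.3]}
```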
Environment Variables
Set these in .env (already scaffolded in .env.example):
# Data sources (MDS)
TRAIN_LOCAL="./data/mds" # defaults to CAPTURE_OUT
# TRAIN_REMOTE="s3://bucket/prefix" # optional remote MDS
TRAIN_CACHE="./data/cache" # local cache for StreamingDataset
# Output
TRAIN_OUTPUT="./checkpoints" # where to save model + processor
# Logging
TRAIN_LOGLEVEL=INFO # DEBUG|INFO|WARNING|ERROR
TRAIN_LOG_FORMAT=text # text|json
# Optimization
TRAIN_EPOCHS=1
TRAIN_BATCH=1
TRAIN_ACCUM=1
TRAIN_LR=5e-6
TRAIN_WD=0.0
TRAIN_MAX_STEPS=0 # 0 disables cap
TRAIN_MAX_SAMPLES_PER_EPOCH=0 # 0 disables per-epoch cap
TRAIN_HISTORY_STEPS=0 # include N prior actions in prompt
SEED=1337
The model ID and image sizes inherit from the agent defaults:
REPO_ID="showlab/ShowUI-2B"
SIZE_SHORTEST_EDGE=224
SIZE_LONGEST_EDGE=1344
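A shortest/longest-edge constraint is typically resolved as below. This is a hedged sketch of the geometry only; the actual Qwen2VL processor implements its own variant and additionally rounds to patch-size multiples.

```python
def fit_edges(w, h, shortest=224, longest=1344):
    """Scale (w, h) so the short edge meets `shortest` and the long edge
    never exceeds `longest`, preserving aspect ratio."""
    scale = shortest / min(w, h)      # hit the shortest-edge floor
    if max(w, h) * scale > longest:   # but cap the longest edge
        scale = longest / max(w, h)
    return round(w * scale), round(h * scale)

print(fit_edges(1920, 1080))  # 16:9 frame -> (398, 224)
```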
Running Training
# Basic run (uses .env)
python src/train.py
# With overrides
TRAIN_BATCH=2 TRAIN_EPOCHS=1 python src/train.py --output ./ckpt/showui-ft
# Using just
just train
Checkpoints are written under TRAIN_OUTPUT (epoch-* and final/). Use them with the agent by pointing REPO_ID to the checkpoint folder.
Notes
- The script streams episodes with streaming.StreamingDataset (MDS). It can read from a local directory and/or a remote (S3/R2/MinIO) when configured.
- Mixed precision is enabled on CUDA automatically. On Apple MPS it uses the default precision.
- This is a minimal reference trainer for reproducible fine-tuning. For large jobs, consider adding gradient checkpointing, DeepSpeed, or PEFT adapters in your environment.
Voice Synthesis (YAP)
YAP (Yet Another Presenter) provides real-time text-to-speech capabilities for your Neko automation sessions. Convert text messages into natural-sounding speech that plays directly in the browser.
What is YAP?
YAP transforms text into speech using advanced AI voice synthesis (F5-TTS) and streams the audio through WebRTC to your Neko browser session. This enables:
- Voice announcements during automation tasks
- Interactive conversations through chat commands
- Live narration of automation steps
- Multiple voice personas with custom characteristics
Quick Start
1. Prerequisites
Before using YAP, ensure you have:
- A running Neko server (see Getting Started)
- GPU environment for optimal performance: nix develop .#gpu
- Basic voice files in the ./voices directory
2. Start YAP Service
# Connect to local Neko server
export NEKO_URL="http://localhost:8080"
export NEKO_USER="user"
export NEKO_PASS="password"
uv run src/yap.py
You should see:
[12:34:56] yap INFO - WS connected
[12:34:56] yap INFO - RTC answer sent; audio track live.
[12:34:56] yap INFO - Voices reloaded (1 entries)
3. Test Voice Output
In your Neko browser chat, type:
/yap Hello! I am your voice assistant.
You should hear the text spoken through the browser audio.
Voice Commands
YAP responds to commands in the Neko chat interface:
Immediate Speech
Speak text immediately:
/yap Good morning! Ready to start automation.
/yap The task has been completed successfully.
Streaming Mode
For longer conversations or live narration:
/yap:begin
I'm starting the automation task now...
Navigating to the website...
Filling out the form...
Submitting the data...
/yap:end
In streaming mode, YAP processes text incrementally as you type, enabling natural conversation flow.
Stop/Clear Queue
Cancel current speech and clear the queue:
/yap:stop
Voice Management
Default Voice Setup
YAP needs at least one voice configured. Create a basic setup:
1. Create voices directory:
mkdir -p voices
2. Add a reference audio file:
# Record or copy a 3-10 second WAV file
cp your-voice-sample.wav voices/default.wav
3. YAP will auto-create voices.json on first run with default settings.
Adding New Voices
Add voices through chat commands:
/yap:voice add --spk alice --ref ./voices/alice.wav --ref-text "Hello, my name is Alice" --styles "friendly,calm"
Parameters:
- --spk: Voice ID/name
- --ref: Path to reference audio file (WAV format, 3-10 seconds)
- --ref-text: Transcript of the reference audio
- --styles: Comma-separated style tags
- --rate: Speech speed (0.5-2.0, default 1.0)
- --pitch: Pitch shift in semitones (-12 to +12, default 0.0)
Switching Voices
Change the active voice and parameters:
/yap:voice set --spk alice
/yap:voice set --spk bob --rate 1.2 --pitch -0.5
/yap:voice set --rate 0.8
Reload Voice Configuration
After manually editing voices/voices.json:
/yap:voice reload
Configuration
Basic Settings
Set these environment variables before starting YAP:
# Connection (choose one method)
export YAP_WS="wss://demo.neko.com/api/ws?token=your_token"
# OR
export NEKO_URL="https://demo.neko.com"
export NEKO_USER="username"
export NEKO_PASS="password"
# Voice directory
export YAP_VOICES_DIR="./voices"
export YAP_SPK_DEFAULT="default"
Audio Quality
# Audio format (recommended settings)
export YAP_SR=48000 # Sample rate (Hz)
export YAP_AUDIO_CHANNELS=1 # Channels (1=mono, 2=stereo)
export YAP_FRAME_MS=20 # WebRTC frame size
# Processing
export YAP_PARALLEL=2 # TTS worker threads
export YAP_MAX_CHARS=350 # Max characters per chunk
export YAP_OVERLAP_MS=30 # Audio crossfade overlap
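YAP_MAX_CHARS bounds the text handed to each TTS worker. The sketch below shows one way punctuation-aware segmentation can work; it is illustrative, and yap.py's actual segmenter may differ (for example, by also hard-splitting single sentences longer than the limit).

```python
import re

def segment(text, max_chars=350):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Split on sentence-ending punctuation, keeping the punctuation attached.
    sentences = re.findall(r"[^.!?]+[.!?]*\s*", text)
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += s
    if current.strip():
        chunks.append(current.strip())
    return chunks

print(segment("First sentence. Second one! A third?", max_chars=20))
# -> ['First sentence.', 'Second one! A third?']
```

Smaller chunks reach the workers sooner (lower latency); larger chunks give the synthesizer more context (better prosody), which is the trade-off the tuning settings above expose.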
Performance Tuning
# For faster response (lower quality)
export YAP_MAX_CHARS=200
export YAP_PARALLEL=4
# For better quality (higher latency)
export YAP_MAX_CHARS=500
export YAP_OVERLAP_MS=50
# Buffer management
export YAP_JITTER_MAX_SEC=6.0 # Audio buffer size
Voice Configuration File
YAP stores voice settings in voices/voices.json:
{
  "default": {
    "ref_audio": "./voices/default.wav",
    "ref_text": "This is my default voice sample.",
    "styles": ["neutral"],
    "params": {
      "rate": 1.0,
      "pitch": 0.0
    }
  },
  "alice": {
    "ref_audio": "./voices/alice.wav",
    "ref_text": "Hello, my name is Alice and I sound friendly.",
    "styles": ["friendly", "energetic"],
    "params": {
      "rate": 1.1,
      "pitch": 0.2
    }
  },
  "narrator": {
    "ref_audio": "./voices/narrator.wav",
    "ref_text": "I will be narrating the automation process.",
    "styles": ["professional", "clear"],
    "params": {
      "rate": 0.9,
      "pitch": -0.3
    }
  }
}
Voice Parameters
- ref_audio: Path to reference WAV file (3-10 seconds, clear speech)
- ref_text: Exact transcript of the reference audio
- styles: Descriptive tags (friendly, professional, calm, energetic)
- rate: Speech speed multiplier (0.5=slow, 1.0=normal, 2.0=fast)
- pitch: Pitch adjustment in semitones (negative=lower, positive=higher)
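Because the registry is plain JSON, it can be loaded and defaulted with the standard library alone. A minimal sketch (`load_voices` is a hypothetical helper, not yap.py's actual loader):

```python
import json

def load_voices(registry_text):
    """Parse a voices.json payload and apply defaults for missing params."""
    voices = json.loads(registry_text)
    for name, cfg in voices.items():
        params = cfg.setdefault("params", {})
        params.setdefault("rate", 1.0)   # normal speed
        params.setdefault("pitch", 0.0)  # no shift
        cfg.setdefault("styles", [])
        if "ref_audio" not in cfg:
            raise ValueError(f"voice {name!r} is missing ref_audio")
    return voices

raw = '{"default": {"ref_audio": "./voices/default.wav", "ref_text": "sample"}}'
voices = load_voices(raw)
print(voices["default"]["params"])  # -> {'rate': 1.0, 'pitch': 0.0}
```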
Usage Scenarios
Automation Announcements
Start YAP alongside your automation agent:
# Terminal 1: Start YAP
uv run src/yap.py
# Terminal 2: Run automation with voice (audio is enabled by default)
uv run src/agent.py --task "Fill out contact form"
The agent can announce progress:
/yap Starting automation task: Fill out contact form
/yap Navigating to the website...
/yap Form submitted successfully!
Interactive Sessions
Use YAP for live interaction during manual control:
# Terminal 1: Start YAP
uv run src/yap.py
# Terminal 2: Manual control
python src/manual.py
Then control both automation and voice through chat:
!click 100 200
/yap Clicked on the submit button
!type "Hello World"
/yap Entered text in the field
Multi-Voice Conversations
Set up different voices for different purposes:
# Set up voices
/yap:voice add --spk system --ref ./voices/system.wav --ref-text "System notification" --rate 0.9
/yap:voice add --spk user --ref ./voices/user.wav --ref-text "User interaction" --rate 1.1
# Use in conversation
/yap:voice set --spk system
/yap System initialization complete.
/yap:voice set --spk user
/yap Thank you for the update!
Troubleshooting
No Audio Output
Check browser audio permissions:
- Click the browser's address bar lock icon
- Ensure "Sound" is allowed
- Check browser volume settings
Verify WebRTC connection:
- Open browser developer tools (F12)
- Go to Console tab
- Look for WebRTC connection messages
- Check for audio stream indicators
Test connection:
uv run src/yap.py --healthcheck
Poor Audio Quality
Check reference audio:
- Use high-quality WAV files (16-bit, 22kHz+)
- 3-10 second samples with clear speech
- No background noise or music
- Single speaker only
Adjust processing:
# Increase overlap for smoother transitions
export YAP_OVERLAP_MS=50
# Reduce chunk size for lower latency
export YAP_MAX_CHARS=250
High Latency
Optimize for speed:
# Increase parallel workers
export YAP_PARALLEL=4
# Reduce chunk size
export YAP_MAX_CHARS=200
# Reduce buffer size
export YAP_JITTER_MAX_SEC=3.0
Check GPU usage:
# Monitor GPU usage while YAP is running
nvidia-smi -l 1
Connection Issues
WebSocket connection failed:
# Test WebSocket endpoint
curl -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" \
http://localhost:8080/api/ws
Authentication failed:
# Test REST login
curl -X POST http://localhost:8080/api/login \
-H "Content-Type: application/json" \
-d '{"username":"user","password":"password"}'
Check firewall/network:
- Ensure ports 8080 (HTTP) and WebRTC ports are accessible
- Test with STUN server connectivity
- Check for corporate proxy/firewall blocking WebRTC
Debug Mode
Enable detailed logging:
export YAP_LOGLEVEL=DEBUG
export YAP_LOG_FORMAT=json
uv run src/yap.py 2>&1 | jq .
Look for:
- WebSocket connection status
- WebRTC negotiation progress
- TTS processing times
- Audio buffer status
Advanced Usage
Custom Voice Training
For better voice quality, record multiple reference samples:
1. Record varied samples:
# Different emotions/styles
voices/alice-happy.wav
voices/alice-serious.wav
voices/alice-excited.wav
2. Test different samples:
/yap:voice set --spk alice-happy
/yap I'm so excited about this automation!
/yap:voice set --spk alice-serious
/yap Please review these results carefully.
Integration with Automation
Modify automation scripts to include voice feedback:
# In your automation script
import requests

def announce(text):
    """Send voice announcement to YAP"""
    requests.post(f"{neko_url}/api/chat", json={
        "message": f"/yap {text}"
    })

# Use in automation
announce("Starting login process")
agent.click_login_button()
announce("Login successful, proceeding to dashboard")
Docker Deployment
Deploy YAP as a container service:
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y ffmpeg
# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application and voices
COPY src/ /app/src/
COPY voices/ /app/voices/
WORKDIR /app
# Configuration
ENV YAP_VOICES_DIR=/app/voices
ENV YAP_SR=48000
ENV YAP_PARALLEL=2
ENTRYPOINT ["python", "src/yap.py"]
Next Steps
- Learn about Training Data Capture to improve voice models
- Explore Core Agent for automation integration
- Read TTS Service Technical Details for advanced configuration
- Check Neko Integration for server setup options
Related Guides
- Getting Started - Initial system setup
- Training Data Capture - Data collection for improvements
- Manual Control CLI - Interactive testing
- Architecture Overview - System design
Architecture Overview
This document provides a comprehensive overview of the Neko Agent system architecture, covering the core components, data flow, and integration patterns.
System Overview
graph TB
    subgraph "Development Environment"
        Agent["`**Neko Agent** (src/agent.py) • WebRTC Client • AI Vision Model • Action Execution`"]
        Capture["`**Capture Service** (src/capture.py) • Training Data Collection • MosaicML Streaming • S3 Upload`"]
        TTS["`**TTS Service** (src/yap.py) • F5-TTS Synthesis • Voice Management • WebRTC Audio`"]
    end
    subgraph "Neko Server"
        Chrome["`**Chrome Container** • GUI Target • WebRTC Server • Screen Sharing`"]
        NekoAPI["`**Neko API** • Authentication • WebSocket Signaling • Session Management`"]
    end
    subgraph "External Services"
        S3["`**S3 Storage** • Training Data • Model Artifacts • Voice Assets`"]
        Models["`**AI Models** • ShowUI-2B (Vision) • F5-TTS (Audio) • Custom Voices`"]
    end
    Agent <-->|WebRTC Data| Chrome
    Agent <-->|Control API| NekoAPI
    Agent <-->|Screenshots| Capture
    Agent <-->|Voice Commands| TTS
    Capture -->|MDS Shards| S3
    TTS <-->|Audio Stream| Chrome
    TTS <-->|Voice Models| Models
    Agent <-->|Vision Inference| Models
Core Components
1. Neko Agent (src/agent.py)
The main automation agent that orchestrates GUI interactions through AI-powered decision making.
Key Features:
- WebRTC Integration: Real-time connection to Neko Chrome containers
- AI Vision: ShowUI-2B (Qwen2VL) model for visual reasoning and action planning
- Action Execution: Supports CLICK, INPUT, SCROLL, SWIPE, TAP, ANSWER actions
- Prometheus Metrics: Built-in observability and performance monitoring
Architecture:
class NekoAgent:
    def __init__(self):
        self.model = ShowUI2B()
        self.webrtc = WebRTCClient()
        self.metrics = PrometheusMetrics()

    async def run(self, task: str):
        # 1. Connect to Neko via WebRTC
        await self.webrtc.connect()
        # 2. Capture current screen
        frame = await self.webrtc.get_frame()
        # 3. AI model analyzes screen and plans action
        action = await self.model.analyze(frame, task)
        # 4. Execute action through WebRTC
        await self.webrtc.execute_action(action)
        # 5. Repeat until task complete
2. Capture Service (src/capture.py)
Automated training data collection system that records agent sessions for model improvement.
Key Features:
- Episode Recording: Packages complete automation sessions as training samples
- MosaicML Streaming: Generates MDS shards for efficient distributed training
- S3 Integration: Direct upload to cloud storage with configurable backends
- WebSocket Monitoring: Real-time capture of agent actions and screen states
Data Flow:
sequenceDiagram
    participant User
    participant Agent
    participant Capture
    participant Neko
    participant S3
    User->>Agent: Start automation task
    Agent->>Capture: Begin episode recording
    loop Every Action
        Agent->>Neko: Execute action
        Neko->>Agent: Screen update
        Agent->>Capture: Log action + timestamp
        Capture->>Capture: Store frame + metadata
    end
    Agent->>Capture: End episode
    Capture->>Capture: Generate ZIP archive
    Capture->>S3: Upload MDS shard
3. TTS Service (src/yap.py)
Real-time text-to-speech system providing voice feedback during automation.
Key Features:
- F5-TTS Integration: High-quality voice synthesis with custom voice models
- WebRTC Audio: Direct audio streaming to Neko sessions
- Voice Management: Hot-reloadable voice registry with style parameters
- Streaming Support: Both immediate and incremental text processing
Audio Pipeline:
graph LR
    Text["`Text Input /yap commands`"] --> Segment["`Segmentation Punctuation-aware chunks`"]
    Segment --> Workers["`Parallel Workers F5-TTS synthesis Rate/pitch control`"]
    Workers --> Splicer["`Audio Splicer Crossfade blending Jitter buffering`"]
    Splicer --> WebRTC["`WebRTC Output 48kHz audio stream Real-time delivery`"]
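The splicer's crossfade step can be sketched in a few lines. This linear-ramp version over plain sample lists is illustrative only; the real service operates on PCM buffers and applies jitter buffering as well.

```python
def crossfade(a, b, overlap):
    """Splice two audio chunks with a linear crossfade over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    out = a[:len(a) - overlap]
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # ramp 0 -> 1 across the overlap
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

# 48 kHz * 30 ms = 1440 samples of overlap, as with YAP_OVERLAP_MS=30
print(len(crossfade([0.0] * 4800, [1.0] * 4800, 1440)))  # -> 8160
```

The overlap trades a little total duration for smooth chunk boundaries, which is why YAP_OVERLAP_MS above defaults to a few tens of milliseconds.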
Data Flow and Integration
WebRTC Communication
The system uses WebRTC for low-latency communication with Neko containers:
# WebRTC connection establishment
async def connect_to_neko():
# 1. REST API authentication
token = await authenticate(neko_url, username, password)
# 2. WebSocket signaling
ws_url = f"wss://{host}/api/ws?token={token}"
signaling = await websockets.connect(ws_url)
# 3. WebRTC peer connection
pc = RTCPeerConnection()
# 4. Add video track (for screen capture)
@pc.on("track")
async def on_track(track):
if track.kind == "video":
frame = await track.recv()
# Process frame for AI analysis
# 5. Exchange SDP offer/answer
await exchange_sdp(pc, signaling)
Training Data Format
The capture service generates training samples in this format:
episode_xyz789.zip
├── meta.json # Episode metadata
├── frames/ # Screenshot sequence
│ ├── 000000.jpg # Frame 0 (t=0.0s)
│ ├── 000001.jpg # Frame 1 (t=0.5s)
│ └── ...
├── frames.ndjson # Frame index with timestamps
└── actions.ndjson # Action sequence with timestamps
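Given that layout, an episode payload can be unpacked with the standard library alone. The sketch below is illustrative (`read_episode` is not part of the codebase, and the demo builds a minimal in-memory archive rather than a real capture):

```python
import io
import json
import zipfile

def read_episode(payload):
    """Unpack an episode archive laid out as in the tree above."""
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        meta = json.loads(zf.read("meta.json"))
        actions = [json.loads(l) for l in zf.read("actions.ndjson").splitlines() if l.strip()]
        frames = sorted(n for n in zf.namelist()
                        if n.startswith("frames/") and n.endswith(".jpg"))
    return {"meta": meta, "actions": actions, "frames": frames}

# Build a tiny in-memory episode to demonstrate the layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("meta.json", json.dumps({"episode_id": "xyz789"}))
    zf.writestr("frames/000000.jpg", b"\xff\xd8fake")
    zf.writestr("actions.ndjson", '{"type": "click"}\n')

ep = read_episode(buf.getvalue())
print(ep["meta"]["episode_id"], len(ep["frames"]), len(ep["actions"]))  # -> xyz789 1 1
```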
MDS Shard Structure:
# MosaicML Streaming format
columns = {
    "episode_id": "str",
    "task": "str",
    "payload": "bytes",  # ZIP archive
    "num_frames": "int",
    "num_actions": "int",
    "started_at": "str",
    "ended_at": "str",
    "screen_w": "int",
    "screen_h": "int",
}
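Each record written to a shard must match these columns exactly. A sketch of assembling one sample (`make_sample` and the field values are illustrative; the actual writer is `streaming.MDSWriter(out=..., columns=columns)`):

```python
def make_sample(episode_id, task, payload, meta):
    """Assemble one record matching the MDS columns above."""
    return {
        "episode_id": episode_id,
        "task": task,
        "payload": payload,  # the episode ZIP bytes
        "num_frames": int(meta["num_frames"]),
        "num_actions": int(meta["num_actions"]),
        "started_at": meta["started_at"],
        "ended_at": meta["ended_at"],
        "screen_w": int(meta["screen_w"]),
        "screen_h": int(meta["screen_h"]),
    }

# Illustrative values only; a real sample carries the actual archive bytes.
sample = make_sample("xyz789", "Fill out contact form", b"PK-zip-bytes", {
    "num_frames": 42, "num_actions": 7,
    "started_at": "2024-01-01T00:00:00Z", "ended_at": "2024-01-01T00:01:00Z",
    "screen_w": 1920, "screen_h": 1080,
})
print(sorted(sample))  # column names match the schema above
```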
Voice Model Management
The TTS service manages voices through a JSON registry:
{
  "instructor": {
    "ref_audio": "/voices/instructor.wav",
    "ref_text": "This is an instructional voice for tutorials",
    "styles": ["calm", "explanatory"],
    "params": {
      "rate": 0.95,
      "pitch": -1.0
    }
  }
}
Deployment Patterns
Development Mode
# Terminal 1: Neko server
nix develop .#neko
neko-services up
# Terminal 2: Agent
nix develop .#gpu
uv run src/agent.py --task "automation task"
# Terminal 3: Capture (optional)
uv run src/capture.py
# Terminal 4: TTS (optional)
uv run src/yap.py
Production Considerations
For production deployments, consider:
- Container Orchestration: Deploy components as separate containers
- GPU Resource Management: Efficient sharing between agent and TTS
- Storage Strategy: S3-compatible backends for training data
- Monitoring: Prometheus metrics and health checks
- Scaling: Multiple agent instances with load balancing
See the API Mode Design for detailed production architecture plans.
Error Handling and Resilience
Connection Resilience
# Exponential backoff for WebRTC reconnection
async def connect_with_retry():
    backoff = 1.0
    max_retries = 5
    for attempt in range(max_retries):
        try:
            return await establish_webrtc_connection()
        except ConnectionError:
            await asyncio.sleep(backoff)
            backoff = min(30.0, backoff * 2)
    raise MaxRetriesExceeded()
Graceful Degradation
- Capture failures: Agent continues without training data collection
- TTS unavailable: Silent operation mode
- Model errors: Fallback to basic automation patterns
- Network issues: Local caching and offline capabilities
Performance Characteristics
Latency Profile
- Frame processing: ~100-200ms (GPU inference)
- Action execution: ~50-100ms (WebRTC round-trip)
- Voice synthesis: ~200-500ms (F5-TTS generation)
- Training data write: ~10-50ms (local storage)
Resource Usage
- GPU Memory: 2-4GB (ShowUI-2B + F5-TTS)
- System Memory: 1-2GB (Python runtime + buffers)
- Network: 5-10 Mbps (WebRTC video + audio)
- Storage: Variable (training data accumulation)
Security Considerations
Authentication
- JWT tokens for Neko API access
- WebSocket secure connections (WSS)
- S3 IAM roles for training data upload
Data Privacy
- Local frame processing (no external API calls)
- Configurable training data retention
- Optional data encryption at rest
Network Security
- TLS/SSL for all external connections
- WebRTC DTLS encryption
- Firewall-friendly STUN/TURN protocols
Future Architecture Evolution
The current architecture is designed to support future enhancements:
- API Mode: RESTful service with horizontal scaling
- Multi-Modal: Vision + audio + text understanding
- Distributed Training: Cloud-native ML pipelines
- Edge Deployment: Lightweight inference containers
See API Mode Design for the complete next-generation architecture.
Component Documentation
This section provides detailed documentation for each component of the Neko Agent system.
Overview
The Neko Agent system consists of four main components:
- Core Agent (src/agent.py) - Main automation engine
- Capture Service (src/capture.py) - Training data collection
- Manual Control CLI (src/manual.py) - Interactive remote control interface
- TTS Service (src/yap.py) - Voice synthesis and audio
Each component is designed to work independently while providing seamless integration when used together.
Component Interaction
graph TD
    Agent[Core Agent] --> Neko[Neko Server]
    Agent -.->|Optional| Capture[Capture Service]
    Agent -.->|Optional| TTS[TTS Service]
    Manual[Manual Control CLI] --> Neko
    Manual -.->|Admin API| Sessions[Session Management]
    Capture --> Storage[Training Data Storage]
    TTS --> Audio[WebRTC Audio Stream]
    Neko --> Chrome[Chrome Container]
    Audio --> Chrome
Development Workflow
When developing with multiple components:
- Start Neko Server (if using local setup)
- Launch Core Agent for basic automation
- Add Capture Service for training data collection
- Use Manual Control CLI for testing and debugging
- Add TTS Service for voice feedback
Each component has its own configuration and can be enabled/disabled as needed.
Next Steps
- Review individual component documentation in the subsections
- See Development Setup for configuration details
- Check Architecture Overview for system design
Core Agent (src/agent.py)
The core agent (src/agent.py) is the main automation engine of the Neko Agent system. It provides AI-powered GUI automation through WebRTC connections to Neko servers, using the ShowUI-2B vision model for intelligent action planning and execution.
Overview
The agent operates as a production-ready, 12-Factor App compliant service that:
- Connects to Neko servers via WebRTC for real-time GUI interaction
- Uses ShowUI-2B (Qwen2VL) for visual reasoning and action planning
- Executes precise actions with iterative refinement for improved accuracy
- Supports multiple modes including web automation and mobile device control
- Provides comprehensive metrics via Prometheus for monitoring and observability
- Handles graceful shutdown with proper resource cleanup
Architecture
graph TB
    subgraph "Agent Core"
        Agent[NekoAgent]
        Settings[Settings]
        Signaler[Signaler]
    end
    subgraph "Frame Sources"
        WebRTC[WebRTCFrameSource]
        Lite[LiteFrameSource]
    end
    subgraph "AI Processing"
        Model[ShowUI-2B Model]
        Processor[AutoProcessor]
        Executor[ThreadPoolExecutor]
    end
    subgraph "External Services"
        Neko[Neko Server]
        Metrics[Prometheus Metrics]
        Storage[Frame/Click Storage]
    end
    Agent --> Signaler
    Agent --> WebRTC
    Agent --> Lite
    Agent --> Model
    Agent --> Processor
    Signaler <-->|WebSocket| Neko
    WebRTC <-->|WebRTC| Neko
    Lite <-->|HTTP Polling| Neko
    Agent --> Metrics
    Agent --> Storage
    Model --> Executor
Core Classes
1. Settings
Purpose: Centralized configuration management following 12-Factor App principles.
Key Features:
- Environment variable-driven configuration
- Validation with error reporting
- Support for both development and production deployments
- Port priority handling ($PORT overrides NEKO_METRICS_PORT)
Configuration Options:
| Category | Environment Variable | Default | Description |
|---|---|---|---|
| Model | REPO_ID | showlab/ShowUI-2B | Hugging Face model repository |
| | SIZE_SHORTEST_EDGE | 224 | Image resize shortest edge |
| | SIZE_LONGEST_EDGE | 1344 | Image resize longest edge |
| Network | NEKO_WS | wss://neko.example.com/api/ws | WebSocket URL |
| | NEKO_ICE_POLICY | strict | ICE candidate policy |
| | NEKO_STUN_URL | stun:stun.l.google.com:19302 | STUN server |
| | NEKO_TURN_URL | - | TURN server (optional) |
| Behavior | NEKO_MAX_STEPS | 8 | Maximum automation steps |
| | REFINEMENT_STEPS | 5 | Click refinement iterations |
| | NEKO_AUDIO | 1 | Enable audio streaming |
| Logging | NEKO_LOGLEVEL | INFO | Log level |
| | NEKO_LOG_FORMAT | text | Log format (text/json) |
| | NEKO_LOGFILE | - | Log file path (optional) |
| Storage | FRAME_SAVE_PATH | - | Frame screenshot storage |
| | CLICK_SAVE_PATH | - | Action visualization storage |
| | OFFLOAD_FOLDER | ./offload | Model cache directory |
| Metrics | PORT / NEKO_METRICS_PORT | 9000 | Prometheus metrics port |
Example Usage:
# Load configuration from environment
settings = Settings.from_env()

# Validate configuration
errors = settings.validate()
if errors:
    for error in errors:
        print(f"Configuration error: {error}")
    sys.exit(1)
2. NekoAgent
Purpose: Main orchestration class that manages WebRTC connections, AI processing, and action execution.
Initialization:
agent = NekoAgent(
    model=model,              # ShowUI-2B model instance
    processor=processor,      # AutoProcessor for model inputs
    ws_url="wss://neko.example.com/api/ws",
    nav_task="Navigate to google.com and search for AI",
    nav_mode="web",           # "web" or "phone"
    settings=settings,
    logger=logger,
    max_steps=10,
    refinement_steps=3,
    audio=True,
    online=False,             # False=single task, True=multi-task chat mode
)
Key Methods:
async def run()
Main execution loop that handles connection lifecycle, reconnection logic, and task processing.
Flow:
- Signal handling - Register SIGINT/SIGTERM handlers
- Connection establishment - WebSocket connection with backoff
- Media negotiation - WebRTC SDP offer/answer exchange
- Frame processing - Continuous screen capture and AI analysis
- Action execution - Translate AI decisions to control commands
- Graceful shutdown - Clean resource cleanup
async def _navigate_once(img, history, step)
Performs single navigation step using AI model inference.
Parameters:
- img: Current screen image (PIL Image)
- history: List of previous actions for context
- step: Current step number for logging
Process:
- Prepare inputs - Format image and chat template for model
- Model inference - Generate action prediction using ShowUI-2B
- Action parsing - Extract structured action from model output
- Iterative refinement - Crop and re-infer for precise click coordinates
- Action execution - Send control commands via WebSocket
Refinement Logic:
for i in range(self.refinement_steps):
    # Initial inference on full image or cropped region
    action = await model_inference(current_img)
    if action.type == "CLICK":
        # Crop around predicted click location
        current_img, crop_box = self._crop_image(original_img, action.position)
        # Re-infer on cropped image for better precision
    else:
        break  # No refinement for non-click actions
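The crop-and-re-infer arithmetic can be sketched as below. `crop_box` and `refine` are illustrative stand-ins for the agent's `_crop_image` logic, and the 50% crop fraction is a hypothetical choice.

```python
def crop_box(pos, w, h, frac=0.5):
    """Return a pixel crop box of size frac*(w, h) centered on a normalized point,
    clamped to stay inside the image."""
    cw, ch = int(w * frac), int(h * frac)
    left = min(max(int(pos[0] * w - cw / 2), 0), w - cw)
    top = min(max(int(pos[1] * h - ch / 2), 0), h - ch)
    return left, top, left + cw, top + ch

def refine(pos_in_crop, box, w, h):
    """Map a click predicted inside the crop back to full-image normalized coords."""
    left, top, right, bottom = box
    x = (left + pos_in_crop[0] * (right - left)) / w
    y = (top + pos_in_crop[1] * (bottom - top)) / h
    return x, y

box = crop_box([0.5, 0.3], 1920, 1080)
print(box, refine([0.5, 0.5], box, 1920, 1080))
# A centered re-prediction maps back to the original point (0.5, 0.3).
```

Each refinement round shrinks the region the model must localize within, which is why a few iterations improve click precision.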
async def _execute_action(action, size)
Translates high-level actions into low-level control commands.
Supported Actions:
| Action | Description | Parameters |
|---|---|---|
| CLICK | Mouse click at coordinates | position: [x, y] (normalized) |
| INPUT | Type text string | value: str |
| SCROLL | Scroll in direction | direction: "up/down/left/right" |
| SWIPE | Touch swipe gesture | startPosition: [x, y], endPosition: [x, y] |
| TAP | Mobile tap action | position: [x, y] |
| ANSWER | Task completion signal | value: "final answer" |
Example Action Execution:
# CLICK action
action = {
    "action": "CLICK",
    "position": [0.5, 0.3],  # Normalized coordinates (50%, 30%)
    "value": None,
}

# Converts to pixel coordinates and sends a control command
x, y = int(0.5 * screen_width), int(0.3 * screen_height)
await signaler.send({
    "event": "control/buttonpress",
    "payload": {"x": x, "y": y, "code": 1},  # Left click
})
3. Frame Sources
The agent supports multiple frame capture mechanisms:
WebRTCFrameSource
Purpose: Real-time frame capture via WebRTC video streams.
Features:
- Low latency - Direct WebRTC video track access
- High quality - Full resolution screen capture
- Automatic buffering - Handles frame timing and delivery
- Error resilience - Graceful handling of stream interruptions
LiteFrameSource
Purpose: HTTP-based frame capture for environments where WebRTC is unavailable.
Features:
- Simple protocol - HTTP GET requests for screenshots
- Fallback mode - Works when WebRTC fails or is blocked
- Configurable polling - Adjustable frame rate
- Resource efficient - Lower bandwidth than video streams
4. Signaler
Purpose: WebSocket communication layer for control and signaling.
Features:
- Event-driven messaging - Pub/sub pattern with topic routing
- Automatic reconnection - Exponential backoff strategy
- Concurrent I/O - Separate read/write loops
- Message queuing - Buffered sending with backpressure handling
Event Types:
- signal/* - WebRTC signaling (offer/answer/ICE)
- control/* - GUI control commands (mouse/keyboard)
- chat/* - Text messaging for online mode
- system/* - Session management and status
Action Space and Navigation Modes
Web Mode (nav_mode="web")
Optimized for: Desktop web browsers, web applications
Action Space:
Available Actions:
1. CLICK(position): Click at normalized coordinates [x, y] where 0 <= x, y <= 1
2. INPUT(value): Type the specified text string
3. SCROLL(direction): Scroll in direction (up, down, left, right)
4. ANSWER(value): Provide final answer when task is complete
Examples:
- CLICK(position=[0.5, 0.3]): Click at center-left of screen
- INPUT(value="search query"): Type text into focused element
- SCROLL(direction="down"): Scroll page downward
- ANSWER(value="Task completed successfully"): Signal completion
Phone Mode (nav_mode="phone")
Optimized for: Mobile device interfaces, touch interactions
Action Space:
Available Actions:
1. TAP(position): Tap at normalized coordinates [x, y] where 0 <= x, y <= 1
2. INPUT(value): Enter text using virtual keyboard
3. SWIPE(startPosition, endPosition): Swipe from start to end coordinates
4. SCROLL(direction): Scroll content in specified direction
5. ANSWER(value): Provide final answer when task is complete
Examples:
- TAP(position=[0.8, 0.1]): Tap near top-right corner
- SWIPE(startPosition=[0.5, 0.8], endPosition=[0.5, 0.2]): Swipe up
- SCROLL(direction="down"): Scroll content downward
AI Model Integration
ShowUI-2B (Qwen2VL)
Model Details:
- Architecture: Qwen2VL-based multimodal model specialized for GUI understanding
- Input: RGB images + text instructions
- Output: Structured action commands
- Performance: ~100-200ms inference time on GPU
Processing Pipeline:
# 1. Image preprocessing
image = resize_and_validate_image(frame, settings, logger)

# 2. Chat template preparation
content = [
    {"type": "text", "text": system_prompt},
    {"type": "text", "text": f"Task: {nav_task}"},
    {"type": "text", "text": f"Action history: {history}"},
    {"type": "image", "image": image, "size": {...}},
]

# 3. Model inference
inputs = processor(text=[text], images=[image], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# 4. Action parsing
action = safe_parse_action(decoded_output, nav_mode="web")
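The final parsing step has to tolerate free-form model output. A minimal sketch of what a safe parser might look like (the real safe_parse_action in src/agent.py may differ; the regex-plus-literal_eval approach shown here is an assumption):

```python
import ast
import re
from typing import Any, Dict, Optional

ALLOWED_WEB_ACTIONS = {"CLICK", "INPUT", "SCROLL", "ANSWER"}

def safe_parse_action(decoded: str, nav_mode: str = "web") -> Optional[Dict[str, Any]]:
    """Extract the first dict-looking span from model output and validate it.

    Returns None (rather than raising) on any failure, so the caller can
    count it as a parse error and skip the frame.
    """
    match = re.search(r"\{.*\}", decoded, re.DOTALL)
    if not match:
        return None
    try:
        # literal_eval only accepts Python literals -- safer than eval()
        action = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    if not isinstance(action, dict):
        return None
    if action.get("action") not in ALLOWED_WEB_ACTIONS:
        return None
    return action
```

Returning None on bad output (instead of raising) is what lets the agent increment its parse-error counter and simply move on to the next frame.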
Iterative Refinement: The agent implements a novel refinement strategy for improved click accuracy:
- Initial prediction on full screen image
- Crop extraction around predicted click location (50% of original size)
- Re-inference on the cropped region for finer coordinate precision
- Coordinate transformation back to full screen space
This typically improves click accuracy by 15-25% over single-shot inference.
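The crop-and-remap arithmetic can be sketched as follows (the 50% crop fraction comes from the text above; the clamping behavior and function names are assumptions, and the real _crop_image may differ):

```python
from typing import Tuple

def crop_around(size: Tuple[int, int], pos: Tuple[float, float],
                frac: float = 0.5) -> Tuple[int, int, int, int]:
    """Crop box (left, top, right, bottom) covering `frac` of the original
    size, centered on a normalized position and clamped to the image."""
    w, h = size
    cw, ch = int(w * frac), int(h * frac)
    left = min(max(int(pos[0] * w) - cw // 2, 0), w - cw)
    top = min(max(int(pos[1] * h) - ch // 2, 0), h - ch)
    return left, top, left + cw, top + ch

def to_full_coords(pos_in_crop: Tuple[float, float],
                   box: Tuple[int, int, int, int],
                   full_size: Tuple[int, int]) -> Tuple[float, float]:
    """Map a normalized position inside the crop back to normalized
    full-image coordinates."""
    left, top, right, bottom = box
    w, h = full_size
    x = (left + pos_in_crop[0] * (right - left)) / w
    y = (top + pos_in_crop[1] * (bottom - top)) / h
    return x, y
```

For example, a click predicted at (0.5, 0.5) on a 1000x800 frame yields a 500x400 crop centered on the same point, and a refined prediction inside that crop maps back to full-screen coordinates via `to_full_coords`.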
Operation Modes
Offline Mode (Single Task)
Usage: Automated execution of a single, predefined task.
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task="Find the current weather in San Francisco",
    nav_mode="web",
    online=False,  # Single-task mode
    max_steps=10,
)
await agent.run()  # Executes the task and exits
Behavior:
- Immediate control - Requests host control on connection
- Task execution - Runs navigation loop until completion or max steps
- Automatic exit - Terminates after task completion
- Host control maintained - Keeps control throughout session
Online Mode (Multi-Task)
Usage: Interactive agent that responds to chat commands for multiple tasks.
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task="",   # No initial task
    nav_mode="web",
    online=True,   # Multi-task chat mode
    max_steps=15,
)
await agent.run()  # Runs indefinitely, responding to chat
Behavior:
- On-demand control - Only requests control when active task starts
- Chat monitoring - Listens for task commands in chat messages
- Task queuing - Handles multiple sequential tasks
- Persistent session - Remains connected between tasks
- Host control release - Releases control when idle
Chat Commands:
Start a new task:
User: Navigate to amazon.com and find wireless headphones under $100
The agent will:
1. Request host control
2. Execute the navigation task
3. Release host control when complete
4. Wait for next command
Metrics and Observability
Prometheus Metrics
The agent exports comprehensive metrics for monitoring and alerting:
# Connection metrics
reconnects = Counter('neko_agent_reconnects_total', 'WebSocket reconnection attempts')
frames_processed = Counter('neko_agent_frames_processed_total', 'Video frames processed')
# Performance metrics
inference_latency = Histogram('neko_agent_inference_seconds', 'AI model inference time')
action_execution_time = Histogram('neko_agent_action_execution_seconds', 'Action execution time')
# Quality metrics
actions_executed = Counter('neko_agent_actions_executed_total', 'Actions executed by type', ['action_type'])
parse_errors = Counter('neko_agent_parse_errors_total', 'Action parsing failures')
Example Queries:
# Average inference latency over 5 minutes
rate(neko_agent_inference_seconds_sum[5m]) / rate(neko_agent_inference_seconds_count[5m])
# Action success rate
1 - (rate(neko_agent_parse_errors_total[5m]) / rate(neko_agent_frames_processed_total[5m]))
# Actions per minute by type
rate(neko_agent_actions_executed_total[1m]) * 60
Logging
Structured Logging: Supports both text and JSON formats for different environments.
# Text format (development)
export NEKO_LOG_FORMAT=text
# Output: 2024-01-15 10:30:00 INFO Model inference completed (step=3) | Duration: 0.15s | Device: GPU
# JSON format (production)
export NEKO_LOG_FORMAT=json
# Output: {"timestamp":"2024-01-15T10:30:00Z","level":"INFO","message":"Model inference completed","step":3,"duration":0.15,"device":"GPU"}
Log Levels:
- DEBUG - Detailed execution flow, frame processing details
- INFO - Task progress, action execution, performance metrics
- WARNING - Recoverable errors, fallback activations
- ERROR - Critical failures, connection issues
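The JSON output shown above can be produced with a small logging.Formatter. This is an illustrative sketch, not the agent's actual logging setup; the fixed set of extra fields is an assumption:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, mirroring the format above."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Include fields passed via logger.info(..., extra={...})
        for key in ("step", "duration", "device"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("neko-agent-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Model inference completed", extra={"step": 3, "duration": 0.15})
```

One-object-per-line output like this is what makes the production logs directly consumable by tools such as jq or a log shipper.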
Configuration Examples
Development Environment
# Basic development setup
export NEKO_WS="ws://localhost:8080/api/ws"
export NEKO_LOGLEVEL="DEBUG"
export NEKO_LOG_FORMAT="text"
export FRAME_SAVE_PATH="./debug/frames"
export CLICK_SAVE_PATH="./debug/actions"
export NEKO_MAX_STEPS="15"
export REFINEMENT_STEPS="3"
Production Environment
# Production configuration
export NEKO_WS="wss://neko.prod.example.com/api/ws"
export NEKO_LOGLEVEL="INFO"
export NEKO_LOG_FORMAT="json"
export NEKO_LOGFILE="/var/log/neko-agent.log"
export PORT="8080" # Metrics server port
export NEKO_MAX_STEPS="20"
export REFINEMENT_STEPS="5"
export NEKO_RTCP_KEEPALIVE="1" # For NAT traversal
export NEKO_FORCE_EXIT_GUARD_MS="5000" # Force exit after 5s
GPU Optimization
# GPU-specific settings
export CUDA_VISIBLE_DEVICES="0"
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export TORCH_CUDA_ARCH_LIST="8.6" # For RTX 30xx series
export OFFLOAD_FOLDER="/fast-ssd/model-cache" # SSD for model loading
Error Handling and Recovery
Connection Resilience
# Automatic reconnection with exponential backoff
async def connect_with_backoff(self):
    backoff = 1.0
    max_backoff = 60.0
    while not self.shutdown.is_set():
        try:
            await self._establish_connection()
            return
        except Exception as e:
            logger.warning(f"Connection failed: {e}, retrying in {backoff}s")
            await asyncio.sleep(backoff)
            backoff = min(max_backoff, backoff * 1.5)
Inference Timeout Handling
# Model inference with timeout and cancellation
try:
    async with asyncio.timeout(120.0):  # 2-minute timeout (asyncio.timeout requires Python 3.11+)
        outputs = await model_inference_task
except asyncio.TimeoutError:
    model_inference_task.cancel()
    logger.error("Model inference timeout, skipping frame")
    return None
Frame Source Fallback
# Automatic fallback from WebRTC to HTTP polling
if webrtc_frame_source.failed:
    logger.warning("WebRTC frame source failed, falling back to HTTP polling")
    self.frame_source = LiteFrameSource(settings, logger)
    await self.frame_source.start(session_id)
Performance Optimization
GPU Memory Management
# Efficient GPU memory usage
torch.cuda.empty_cache() # Clear cache between tasks
model.half() # Use FP16 for faster inference
torch.backends.cudnn.benchmark = True # Optimize for fixed input sizes
Frame Processing Pipeline
# Asynchronous frame processing
async def process_frames():
    async for frame in frame_source:
        # Non-blocking processing
        task = asyncio.create_task(self._navigate_once(frame, history, step))
        # Rate limiting to avoid overwhelming the model
        await asyncio.sleep(0.1)  # 10 FPS maximum
Model Loading Optimization
# Efficient model initialization
model = Qwen2VLForConditionalGeneration.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,               # Half precision
    device_map="auto",                       # Automatic device placement
    offload_folder=settings.offload_folder,  # Disk offloading for large models
    trust_remote_code=True,
)
Usage Examples
Basic Web Automation
# Simple web navigation task
import asyncio

from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from agent import NekoAgent, Settings, setup_logging

async def main():
    settings = Settings.from_env()
    logger = setup_logging(settings)

    # Load the AI model
    model = Qwen2VLForConditionalGeneration.from_pretrained("showlab/ShowUI-2B")
    processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")

    # Create and run the agent
    agent = NekoAgent(
        model=model,
        processor=processor,
        ws_url="ws://localhost:8080/api/ws",
        nav_task="Go to google.com and search for 'machine learning tutorials'",
        nav_mode="web",
        settings=settings,
        logger=logger,
    )
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())
Interactive Online Mode
# Multi-task interactive agent
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url="wss://neko.example.com/api/ws",
    nav_task="",   # Start without a task
    nav_mode="web",
    online=True,   # Enable chat-based task assignment
    settings=settings,
    logger=logger,
)
# The agent will respond to chat messages like:
#   "Please navigate to amazon.com and find wireless keyboards under $50"
#   "Go to the weather website and check the forecast for tomorrow"
await agent.run()
Custom Action Refinement
# Agent with custom refinement settings
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task=task,
    nav_mode="web",
    refinement_steps=5,  # More refinement passes for better accuracy
    max_steps=20,        # Allow longer task sequences
    settings=settings,
    logger=logger,
)
Testing and Debugging
Health Check
# Validate configuration
uv run src/agent.py --healthcheck
# Expected output:
# Configuration validation: PASSED
# Model loading: PASSED
# WebSocket connectivity: PASSED
# All systems ready
Debug Mode
# Enable comprehensive debugging
export NEKO_LOGLEVEL="DEBUG"
export FRAME_SAVE_PATH="./debug/frames"
export CLICK_SAVE_PATH="./debug/actions"
uv run src/agent.py --task "test navigation" --max-steps 3
Frame Analysis
The agent can save frames and action visualizations for debugging:
# Saved frame: ./debug/frames/frame_001_1642261234.567.png
# Saved action: ./debug/actions/action_step_001_1642261234.567_CLICK.png
Action visualizations include:
- Red circle - Predicted click location
- Green overlay - Text input areas
- Blue arrows - Scroll directions
- Purple box - Crop regions during refinement
Integration with Other Components
With Capture Service
# The agent automatically integrates with capture service
# No additional configuration needed - capture service monitors WebSocket
export CAPTURE_OUT="./training-data"
export CAPTURE_REMOTE="s3://training-bucket/episodes"
# Start capture in background
uv run src/capture.py &
# Run agent (capture will automatically record)
uv run src/agent.py --task "automation task"
With TTS Service
# Enable voice announcements during automation
export YAP_VOICES_DIR="./voices"
# Start TTS service
uv run src/yap.py &
# Run agent with voice enabled (audio is negotiated automatically)
uv run src/agent.py --task "automation task"
Future Enhancements
The agent architecture is designed to support planned enhancements:
- API Mode - RESTful service with horizontal scaling
- Multi-Modal Input - Voice commands and text instructions
- Advanced Vision Models - Support for newer GUI understanding models
- Distributed Inference - Model sharding across multiple GPUs
- Custom Action Types - Extensible action framework
- Real-Time Learning - Online adaptation from user feedback
See API Mode Design for detailed future architecture plans.
Capture Service
Manual Control CLI
The Manual Control CLI (src/manual.py) provides a production-ready command-line interface for manually interacting with Neko v3 servers. This tool implements the complete WebRTC signaling handshake required by modern Neko servers and offers a comprehensive REPL for remote desktop control.
Overview
The Manual CLI is designed for:
- Manual testing of Neko server functionality
- Remote administration tasks on hosted desktops
- Development and debugging of Neko integrations
- Quality assurance testing of desktop applications
Key Features
- Full WebRTC Signaling: Complete SDP/ICE handshake for control acceptance
- Robust Authentication: REST login with automatic token management
- Auto-reconnection: Intelligent reconnection with exponential backoff
- REPL Interface: Interactive command-line with comprehensive controls
- Multiple Input Methods: Mouse, keyboard, scrolling, and gesture support
- Admin Commands: Session management and force control operations
- Coordinate Systems: Support for both pixel and normalized (0..1) coordinates
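The normalized-coordinate mode (--norm) simply scales 0..1 fractions by the virtual screen size before a control event is sent. A sketch of the assumed conversion (the helper name is hypothetical):

```python
from typing import Tuple

def to_pixels(x: float, y: float, width: int, height: int,
              normalized: bool) -> Tuple[int, int]:
    """Convert a REPL coordinate pair to integer pixels.

    With --norm, inputs are fractions of the virtual screen size;
    otherwise they are assumed to already be pixels.
    """
    if normalized:
        return int(round(x * width)), int(round(y * height))
    return int(x), int(y)
```

So on a 1920x1080 screen, `move 0.5 0.25` in normalized mode targets the same point as `move 960 270` in pixel mode.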
Architecture
Core Components
graph TD
  CLI[ManualCLI] --> Signaler[WebSocket Signaler]
  CLI --> Broker[Event Broker]
  Signaler --> WS[WebSocket Connection]
  Signaler --> Tasks[Background Tasks]
  Broker --> SysQ[System Queue]
  Broker --> SigQ[Signal Queue]
  Broker --> MiscQ[Misc Queue]
  CLI --> PC[RTCPeerConnection]
  PC --> ICE[ICE Negotiation]
  PC --> SDP[SDP Exchange]
  WS --> Neko[Neko Server]
  ICE --> Neko
  SDP --> Neko
Class Hierarchy
- Broker - Event routing and message distribution
- Signaler - WebSocket connection management
- ManualCLI - Main application controller and REPL
Usage
Basic Connection
# Direct WebSocket connection
python manual.py --ws "wss://neko.example.com/api/ws?token=abc123"
# REST authentication
python manual.py --neko-url "https://neko.example.com" \
--username "admin" --password "password"
Configuration Options
Flag | Environment | Description |
---|---|---|
--ws | NEKO_WS | Direct WebSocket URL with token |
--neko-url | NEKO_URL | Base URL for REST authentication |
--username | NEKO_USER | Login username |
--password | NEKO_PASS | Login password |
--size | NEKO_SIZE | Virtual screen size (e.g., "1920x1080") |
--norm | - | Use normalized 0..1 coordinates |
--no-auto-host | - | Disable automatic host control requests |
--no-media | - | Skip WebRTC signaling (may break control) |
--no-audio | - | Disable audio streaming |
Environment Variables
# Logging configuration
export NEKO_LOGLEVEL=DEBUG
export NEKO_LOGFILE=/tmp/neko-manual.log
# Connection parameters
export NEKO_URL=https://neko.example.com
export NEKO_USER=admin
export NEKO_PASS=secretpassword
export NEKO_SIZE=1920x1080
REPL Commands
Navigation and Control
Mouse Operations
move X Y # Move cursor to coordinates
click [button] # Click at current position (left/right/middle)
lclick # Left click shortcut
rclick # Right click shortcut
dblclick # Double click with proper timing
hover X Y # Move to coordinates without clicking
tap X Y [button] # Move and click in one command
swipe x1 y1 x2 y2 # Drag gesture between two points
Keyboard Input
key <KeyName> # Press a specific key (Escape, F5, Control, etc.)
text "some text" # Type text at current cursor focus
input X Y "text" # Click at coordinates and type text
enter # Press Enter/Return key
Scrolling
scroll <direction> [amount] # Scroll up/down/left/right (amount defaults to 1)
Clipboard Operations
copy # Send copy command
cut # Send cut command
select_all # Select all content
paste [text] # Paste clipboard or specific text
Connection Management
host # Request host control
unhost # Release host control
size [WxH] # Show or set virtual screen size
raw '{"event":"...","payload":{}}' # Send raw JSON message
Administrative Commands
force-take # Force take host control (admin only)
force-release # Force release host control (admin only)
kick <sessionId> # Force disconnect a session (admin only)
sessions # List all connected users and session IDs
Utility Commands
help # Show command reference
quit / exit # Close connection and exit
API Reference
Core Classes
Broker
Event routing system that distributes incoming WebSocket messages to appropriate topic queues.
class Broker:
    def __init__(self)
    def topic_queue(self, topic: str, maxsize: int = 512) -> asyncio.Queue
    def publish(self, msg: Dict[str, Any]) -> None
Topic Routing:
- signal/ events → signal queue (WebRTC signaling)
- system/, control/, screen/, keyboard/, session/, error/ events → system queue
- All others → misc queue
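The routing rules above can be sketched as follows (a minimal illustration; the drop-oldest backpressure policy is an assumption, not necessarily what manual.py does):

```python
import asyncio
from typing import Any, Dict

SYSTEM_PREFIXES = ("system/", "control/", "screen/",
                   "keyboard/", "session/", "error/")

class Broker:
    """Route incoming WebSocket messages to per-topic asyncio queues."""

    def __init__(self) -> None:
        self._queues: Dict[str, asyncio.Queue] = {}

    def topic_queue(self, topic: str, maxsize: int = 512) -> asyncio.Queue:
        if topic not in self._queues:
            self._queues[topic] = asyncio.Queue(maxsize=maxsize)
        return self._queues[topic]

    def publish(self, msg: Dict[str, Any]) -> None:
        event = msg.get("event", "")
        if event.startswith("signal/"):
            topic = "signal"
        elif event.startswith(SYSTEM_PREFIXES):
            topic = "system"
        else:
            topic = "misc"
        queue = self.topic_queue(topic)
        if queue.full():          # drop the oldest message on backpressure
            queue.get_nowait()
        queue.put_nowait(msg)
```

Keeping publish() synchronous means the WebSocket read loop never blocks on a slow consumer; each consumer awaits its own topic queue.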
Signaler
WebSocket connection manager with automatic reconnection and background message processing.
class Signaler:
    def __init__(self, url: str, **wsopts)
    async def connect_with_backoff(self) -> "Signaler"
    async def close(self) -> None
    async def send(self, msg: Dict[str, Any]) -> None
Key Features:
- Exponential backoff reconnection (1s → 30s max)
- Separate read/write loops for optimal performance
- Graceful shutdown with task cleanup
- Message queuing during disconnection
ManualCLI
Main application controller that orchestrates connection management, WebRTC signaling, and user interaction.
class ManualCLI:
    def __init__(self, ws: str, width: int, height: int,
                 normalized: bool, auto_host: bool,
                 request_media: bool, base_url: Optional[str] = None,
                 token: Optional[str] = None, audio: bool = True)
    async def start(self) -> None
    async def handle(self, line: str) -> None
Utility Functions
parse_size(s: str) -> Tuple[int, int]
Parse screen size string (e.g., "1920x1080") into width/height tuple.
ws_from_rest_login(neko_url: str, username: str, password: str, *, timeout: float = 10.0) -> Tuple[str, str, str]
Perform REST authentication and derive WebSocket URL with token.
name_keysym(name: str) -> int
Map key names to X11 keysym codes for keyboard events.
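As an illustration, parse_size might be implemented roughly like this (a sketch; the exact error messages are assumptions):

```python
from typing import Tuple

def parse_size(s: str) -> Tuple[int, int]:
    """Parse a 'WIDTHxHEIGHT' string (e.g. '1920x1080') into (width, height)."""
    try:
        w, h = s.lower().split("x", 1)
        width, height = int(w), int(h)
    except ValueError:
        raise ValueError(f"invalid size {s!r}, expected WIDTHxHEIGHT")
    if width <= 0 or height <= 0:
        raise ValueError(f"size must be positive, got {s!r}")
    return width, height
```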
Protocol Implementation
WebRTC Signaling Flow
1. Initial Request: Send signal/request with video/audio preferences
2. Offer Reception: Wait for signal/offer or signal/provide from the server
3. ICE Configuration: Parse and validate ICE servers (strict mapping)
4. SDP Exchange: Set the remote description, create an answer, send the local description
5. ICE Candidates: Process incoming candidates and send local candidates
6. Media Handling: Log received tracks without processing them (control-only)
Connection Lifecycle
sequenceDiagram
  participant C as Client
  participant S as Signaler
  participant N as Neko Server
  participant R as RTC Connection
  C->>S: connect_with_backoff()
  S->>N: WebSocket connection
  S->>C: Connection established
  C->>N: signal/request (media preferences)
  C->>N: control/request (host control)
  N->>S: signal/offer (SDP + ICE servers)
  S->>R: Create RTCPeerConnection
  R->>N: signal/answer (SDP response)
  loop ICE Negotiation
    N->>R: signal/candidate
    R->>N: signal/candidate
  end
  R->>C: Connection ready
  loop Event Processing
    N->>S: system/heartbeat
    S->>N: client/heartbeat
    C->>N: control/move, control/click, etc.
    N->>S: control/host, screen/updated, etc.
  end
Error Handling and Reconnection
The CLI implements robust error handling:
- WebSocket Errors: Automatic reconnection with exponential backoff
- Authentication Failures: Clear error messages and exit
- Protocol Errors: Graceful degradation and retry
- Resource Cleanup: Proper task cancellation and connection shutdown
Integration Examples
Basic Automation Script
import asyncio

from manual import ManualCLI, ws_from_rest_login

async def automate_task():
    # Get the WebSocket URL via REST login
    ws_url, base_url, token = ws_from_rest_login(
        "https://neko.example.com",
        "admin",
        "password",
    )

    # Create the CLI instance
    cli = ManualCLI(
        ws=ws_url,
        width=1920,
        height=1080,
        normalized=False,
        auto_host=True,
        request_media=True,
        base_url=base_url,
        token=token,
    )

    # Start the connection and wait for ready
    await cli.start()

    # Perform automation tasks
    await cli.handle("move 100 100")
    await cli.handle("click")
    await cli.handle('text "Hello World"')
    await cli.handle("key Enter")

if __name__ == "__main__":
    asyncio.run(automate_task())
Custom Event Handler
class CustomCLI(ManualCLI):
    async def _event_logger(self):
        """Override the event logger with custom handling."""
        q = self.signaler.broker.topic_queue("system")
        while self.running and self.signaler and not self.signaler._closed.is_set():
            try:
                msg = await q.get()
                ev = msg.get("event", "")
                payload = msg.get("payload", {})
                # Custom event processing
                if ev == "control/host":
                    print(f"Host control changed: {payload}")
                elif ev == "session/created":
                    print(f"New session: {payload}")
            except asyncio.CancelledError:
                break
Security Considerations
Authentication
- Always use HTTPS/WSS in production
- Store credentials securely (environment variables, not code)
- Implement proper token rotation for long-running sessions
Network Security
- Validate SSL certificates in production
- Use VPN or private networks for sensitive operations
- Monitor connection logs for suspicious activity
Access Control
- Limit admin command usage to authorized users
- Implement session timeouts for inactive connections
- Log all control operations for audit trails
Troubleshooting
Common Issues
Connection Failures
# Enable debug logging
export NEKO_LOGLEVEL=DEBUG
export NEKO_LOGFILE=/tmp/debug.log
# Check WebSocket connectivity
python manual.py --ws "wss://neko.example.com/api/ws?token=test"
Authentication Errors
# Verify REST API access
curl -X POST https://neko.example.com/api/login \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"password"}'
WebRTC Issues
- ICE Failures: Check firewall and NAT configuration
- SDP Errors: Verify codec compatibility
- Media Problems: Use the --no-audio flag for control-only mode
Performance Issues
- High Latency: Check network connectivity and server load
- Dropped Connections: Reduce ping_interval in the WebSocket options
- Memory Leaks: Monitor task cleanup during reconnection
Debug Commands
# Raw protocol inspection
await cli.handle('raw \'{"event":"system/stats"}\'')
# Connection diagnostics
await cli.handle('raw \'{"event":"signal/stats"}\'')
# Session information
await cli.handle('sessions')
Development
Testing
# Unit tests
python -m pytest tests/test_manual.py
# Integration tests with local Neko server
docker-compose up neko
python manual.py --neko-url http://localhost:8080 --username admin --password neko
# Load testing
python scripts/stress_test_manual.py --connections 10 --duration 300
Contributing
When modifying manual.py:
- Maintain Protocol Compatibility: Don't break existing WebRTC signaling
- Add Comprehensive Tests: Cover new REPL commands and edge cases
- Update Documentation: Keep this guide and docstrings current
- Follow PEP Standards: Code must comply with PEP 257/287 for docstrings
- Handle Errors Gracefully: All new features must have proper error handling
Extending Functionality
Adding New Commands
# In the ManualCLI.handle() method
if cmd == "my_command":
    await self._handle_my_command(rest)
    return

# Handler method implementation
async def _handle_my_command(self, args) -> None:
    """Handle the 'my_command' REPL command.

    :param args: Command arguments from user input
    :type args: list
    """
    if not args:
        print("usage: my_command <parameter>")
        return
    # Command implementation
    await self._safe_send({
        "event": "custom/my_event",
        "payload": {"param": args[0]},
    })
Custom Event Processing
# In the _event_logger() method
elif ev == "custom/my_event":
    self._handle_custom_event(payload)

def _handle_custom_event(self, payload: dict) -> None:
    """Process custom server events.

    :param payload: Event payload from server
    :type payload: dict
    """
    # Custom event handling logic
    pass
This documentation provides comprehensive coverage of the Manual Control CLI, from basic usage to advanced development scenarios. The tool serves as both a practical utility for Neko server interaction and a reference implementation for WebRTC signaling protocols.
TTS Service (yap.py)
The YAP (Yet Another Presenter) service provides real-time text-to-speech functionality through WebRTC audio streaming. It connects to Neko servers and responds to chat commands for immediate or streaming speech synthesis.
Overview
YAP transforms text messages into high-quality speech using F5-TTS and broadcasts the audio through WebRTC. It supports both immediate synthesis and streaming mode for real-time conversation scenarios.
Key Features
- Real-time TTS: Low-latency speech synthesis using F5-TTS
- WebRTC Integration: Direct audio streaming to Neko browser sessions
- Voice Management: Hot-reloadable voice configurations with custom parameters
- Streaming Mode: Incremental text processing for live conversations
- Chat Commands: Interactive control through Neko chat interface
- Parallel Processing: Multi-threaded TTS workers for improved performance
- Audio Pipeline: Crossfade splicing and jitter buffering for smooth playback
Architecture
graph TD
  A[Chat Commands] --> B[Command Parser]
  B --> C[Stream Assembler]
  C --> D[Text Segmentation]
  D --> E[TTS Pipeline]
  E --> F[Parallel F5-TTS Workers]
  F --> G[Audio Splicer]
  G --> H[PCM Queue]
  H --> I[WebRTC Audio Track]
  I --> J[Neko Browser]
  K[Voice Manager] --> E
  L[WebSocket Signaler] --> A
  L --> M[WebRTC Peer Connection]
  M --> I
Installation
Dependencies
# Core dependencies
pip install aiortc av websockets numpy
# F5-TTS (required for speech synthesis)
pip install git+https://github.com/SWivid/F5-TTS.git
# Optional resampling backends (first available will be used)
pip install torchaudio # Preferred
# OR
pip install scipy # Alternative
# Linear fallback is built-in
System Requirements
- Python 3.8+
- CUDA-capable GPU (recommended for F5-TTS)
- WebRTC-compatible environment
Configuration
YAP follows 12-factor app principles with environment-based configuration:
Connection Settings
Variable | Default | Description |
---|---|---|
YAP_WS / NEKO_WS | - | Direct WebSocket URL |
NEKO_URL | - | REST API base URL |
NEKO_USER | - | Username for REST authentication |
NEKO_PASS | - | Password for REST authentication |
Audio Settings
Variable | Default | Description |
---|---|---|
YAP_SR | 48000 | Output sample rate (Hz) |
YAP_AUDIO_CHANNELS | 1 | Audio channels (1=mono, 2=stereo) |
YAP_FRAME_MS | 20 | WebRTC frame size (10/20/30/40/60ms) |
YAP_JITTER_MAX_SEC | 6.0 | PCM buffer maximum duration |
TTS Pipeline Settings
Variable | Default | Description |
---|---|---|
YAP_PARALLEL | 2 | Parallel TTS worker threads |
YAP_CHUNK_TARGET_SEC | 3.0 | Target chunk duration hint |
YAP_MAX_CHARS | 350 | Maximum characters per chunk |
YAP_OVERLAP_MS | 30 | Audio crossfade overlap |
Voice Settings
Variable | Default | Description |
---|---|---|
YAP_VOICES_DIR | ./voices | Voice configuration directory |
YAP_SPK_DEFAULT | default | Default speaker ID |
Logging & ICE Settings
Variable | Default | Description |
---|---|---|
YAP_LOGLEVEL | INFO | Log level (DEBUG/INFO/WARNING/ERROR) |
YAP_LOG_FORMAT | text | Log format (text/json) |
YAP_STUN_URL | stun:stun.l.google.com:19302 | STUN server |
YAP_ICE_POLICY | strict | ICE server policy (strict/all) |
Usage
Basic Usage
# Direct WebSocket connection
export YAP_WS="wss://demo.neko.com/api/ws?token=your_token"
python src/yap.py
# REST API authentication
export NEKO_URL="https://demo.neko.com"
export NEKO_USER="username"
export NEKO_PASS="password"
python src/yap.py
Command Line Options
python src/yap.py --help
# Common options
python src/yap.py \
--ws "wss://host/api/ws?token=..." \
--sr 48000 \
--channels 1 \
--parallel 4 \
--loglevel DEBUG
Health Check
# Validate configuration
python src/yap.py --healthcheck
Chat Commands
YAP responds to commands in Neko chat:
Immediate Speech
/yap Hello, this is immediate speech synthesis!
Streaming Mode
/yap:begin
This text will be processed incrementally...
More text gets added and synthesized in chunks...
/yap:end
Voice Control
# Switch active voice and parameters
/yap:voice set --spk alice --rate 1.2 --pitch 0.5
# Add new voice
/yap:voice add --spk bob --ref /path/to/reference.wav --ref-text "Reference text" --styles "calm,friendly"
# Reload voice configurations
/yap:voice reload
Queue Management
# Stop current synthesis and clear queue
/yap:stop
Voice Management
Voice Configuration (voices.json)
{
  "default": {
    "ref_audio": "./voices/default.wav",
    "ref_text": "This is a default reference recording.",
    "styles": ["calm", "neutral"],
    "params": {
      "rate": 1.0,
      "pitch": 0.0
    }
  },
  "alice": {
    "ref_audio": "./voices/alice_sample.wav",
    "ref_text": "Hello, I'm Alice and this is my voice.",
    "styles": ["friendly", "energetic"],
    "params": {
      "rate": 1.1,
      "pitch": 0.2
    }
  }
}
Voice Parameters
- rate: Speech speed multiplier (0.5-2.0)
- pitch: Pitch shift in semitones (-12 to +12)
- styles: Descriptive tags for voice characteristics
- ref_audio: Path to reference audio file (WAV format)
- ref_text: Transcript of reference audio
Hot Reloading
Voice configurations can be updated without restarting:
1. Edit voices.json in the voices directory
2. Send the /yap:voice reload command in chat
3. Changes take effect immediately
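A hot-reloadable registry of this kind can be sketched in a few lines (the real VoiceManager in src/yap.py may differ; the fallback-to-default behavior for unknown speakers is an assumption):

```python
import json
import os
from typing import Any, Dict

class VoiceRegistry:
    """Minimal hot-reloadable voice registry backed by voices.json."""

    def __init__(self, voices_dir: str, default_speaker: str = "default"):
        self.path = os.path.join(voices_dir, "voices.json")
        self.default_speaker = default_speaker
        self.voices: Dict[str, Any] = {}
        self.reload()

    def reload(self) -> None:
        """Re-read voices.json; invoked on /yap:voice reload."""
        with open(self.path, "r", encoding="utf-8") as f:
            self.voices = json.load(f)

    def get(self, speaker: str) -> Dict[str, Any]:
        """Return the voice config, falling back to the default speaker."""
        return self.voices.get(speaker, self.voices[self.default_speaker])
```

Because reload() swaps the whole dict at once, in-flight synthesis keeps its old config while new requests pick up the edited file.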
Technical Implementation
Audio Pipeline
- Text Segmentation: Smart punctuation-aware chunking
- Parallel Synthesis: Multi-threaded F5-TTS workers
- Audio Splicer: Crossfade blending between chunks
- PCM Queue: Jitter-buffered audio streaming
- WebRTC Track: Real-time audio transmission
Text Processing
# Punctuation-aware segmentation
chunks = segment_text("Hello world! How are you today?", max_chars=50)
# Result: ["Hello world!", "How are you today?"]
# Streaming assembly with opportunistic emission
assembler = StreamAssembler(max_chars=100)
ready_chunks = assembler.feed("Partial text...")
final_chunks = assembler.flush()
Audio Formats
- Input: F5-TTS generated audio (variable sample rate)
- Processing: Float32 waveforms with resampling
- Output: 16-bit PCM at configured sample rate
- WebRTC: Opus-encoded frames for transmission
Resampling Backends
YAP automatically selects the best available resampling backend:
- torchaudio (preferred): High-quality GPU acceleration
- scipy: CPU-based signal processing
- linear (fallback): Simple interpolation
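The backend selection chain might be implemented like this (a sketch; only the linear fallback is spelled out in full, and the torchaudio/scipy paths assume those packages' standard resampling APIs):

```python
import numpy as np

def _linear_resample(wave: np.ndarray, sr_from: int, sr_to: int) -> np.ndarray:
    """Fallback: simple linear-interpolation resampler."""
    if sr_from == sr_to:
        return wave
    n_out = int(round(len(wave) * sr_to / sr_from))
    x_out = np.linspace(0.0, len(wave) - 1, n_out)
    return np.interp(x_out, np.arange(len(wave)), wave).astype(wave.dtype)

def resample(wave: np.ndarray, sr_from: int, sr_to: int) -> np.ndarray:
    """Try torchaudio, then scipy, then linear interpolation."""
    try:
        import torch
        import torchaudio.functional as AF
        t = torch.from_numpy(wave.astype(np.float32)).unsqueeze(0)
        return AF.resample(t, sr_from, sr_to).squeeze(0).numpy()
    except ImportError:
        pass
    try:
        from math import gcd
        from scipy.signal import resample_poly
        g = gcd(sr_from, sr_to)
        return resample_poly(wave, sr_to // g, sr_from // g).astype(np.float32)
    except ImportError:
        pass
    return _linear_resample(wave, sr_from, sr_to)
```

Catching only ImportError keeps genuine processing errors visible while still degrading gracefully when an optional backend is absent.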
Performance Tuning
Parallel Workers
# Increase for better throughput (GPU memory permitting)
export YAP_PARALLEL=4
Chunk Sizing
# Smaller chunks = lower latency, higher overhead
export YAP_MAX_CHARS=200
# Larger chunks = higher latency, better efficiency
export YAP_MAX_CHARS=500
Audio Buffer
# Reduce for lower latency (risk of underruns)
export YAP_JITTER_MAX_SEC=3.0
# Increase for stability (higher latency)
export YAP_JITTER_MAX_SEC=10.0
Troubleshooting
Common Issues
No audio output
- Check WebRTC connection status in browser developer tools
- Verify audio permissions in browser
- Confirm STUN/TURN server connectivity
High latency
- Reduce YAP_MAX_CHARS for smaller chunks
- Decrease the YAP_JITTER_MAX_SEC buffer size
- Increase YAP_PARALLEL workers (if GPU memory permits)
Audio artifacts
- Increase YAP_OVERLAP_MS for smoother crossfades
- Check F5-TTS reference audio quality
- Verify sample rate consistency
Connection failures
- Validate WebSocket URL and authentication
- Check firewall settings for WebRTC ports
- Test with simpler STUN configuration
Debug Logging
export YAP_LOGLEVEL=DEBUG
export YAP_LOG_FORMAT=json
python src/yap.py 2>&1 | jq .
Health Validation
# Check configuration and dependencies
python src/yap.py --healthcheck
API Reference
Core Classes
YapApp
Main application coordinator handling WebSocket signaling, WebRTC setup, and command processing.
app = YapApp(settings, logger)
await app.run() # Main event loop
TTSPipeline
Manages parallel TTS synthesis with crossfade splicing.
pipeline = TTSPipeline(voices, tts, pcmq, sr_out, ch_out, overlap_ms, parallel, logger, max_chars)
await pipeline.speak_text("Hello world", speaker="alice")
VoiceManager
Hot-reloadable voice configuration registry.
voices = VoiceManager(voices_dir, default_speaker, logger)
voice = voices.get("alice") # Get voice config
voices.reload() # Hot reload from JSON
PCMQueue
Thread-safe jitter buffer for WebRTC audio streaming.
pcmq = PCMQueue(sr=48000, channels=1, max_sec=6.0, logger=logger)
pcmq.push(audio_samples) # Producer
samples = pcmq.pull(frame_size) # Consumer
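The push/pull contract can be illustrated with a stripped-down jitter buffer. This sketch (which omits the logger and metrics of the real class) shows the two behaviors that matter: underruns are padded with silence, and the buffer is capped at max_sec to bound latency:

```python
import threading
import numpy as np

class PCMQueue:
    """Sketch: producers push samples; the audio track pulls fixed frames."""

    def __init__(self, sr=48000, channels=1, max_sec=6.0):
        self.sr, self.channels = sr, channels
        self.max_samples = int(sr * max_sec)
        self.buf = np.zeros(0, dtype=np.int16)
        self.lock = threading.Lock()

    def push(self, samples: np.ndarray) -> None:
        with self.lock:
            self.buf = np.concatenate([self.buf, samples.astype(np.int16)])
            if len(self.buf) > self.max_samples:  # cap latency: drop oldest
                self.buf = self.buf[-self.max_samples:]

    def pull(self, frame_size: int) -> np.ndarray:
        with self.lock:
            out = self.buf[:frame_size]
            self.buf = self.buf[frame_size:]
        if len(out) < frame_size:  # underrun: pad with silence
            out = np.pad(out, (0, frame_size - len(out)))
        return out

pcmq = PCMQueue(sr=48000, channels=1, max_sec=6.0)
pcmq.push(np.ones(1000, dtype=np.int16))
frame = pcmq.pull(960)  # one 20 ms frame at 48 kHz
```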
Key Functions
segment_text(text, max_chars)
Intelligent text segmentation respecting punctuation boundaries.
apply_rate_pitch(wave, sr, rate, pitch)
Apply naive prosody transformations (rate/pitch adjustments).
_resample(wave, sr_from, sr_to)
Multi-backend audio resampling with automatic fallback.
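As a rough illustration of what punctuation-aware segmentation looks like, here is a greedy sentence-packing sketch (the real segment_text in src/yap.py may differ in its boundary rules):

```python
import re

def segment_text(text: str, max_chars: int) -> list[str]:
    """Split text into sentences, then greedily pack sentences into
    chunks no longer than max_chars (oversize sentences pass through)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + 1 + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks

print(segment_text("One. Two is longer. Three!", max_chars=12))
# → ['One.', 'Two is longer.', 'Three!']
```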
Integration Examples
Docker Deployment
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
        ffmpeg \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application
COPY src/ /app/src/
COPY voices/ /app/voices/
WORKDIR /app
# Configuration via environment
ENV YAP_SR=48000
ENV YAP_PARALLEL=2
ENV YAP_VOICES_DIR=/app/voices
ENTRYPOINT ["python", "src/yap.py"]
Docker Compose
version: '3.8'
services:
  yap:
    build: .
    environment:
      - NEKO_URL=http://neko:8080
      - NEKO_USER=admin
      - NEKO_PASS=password
      - YAP_PARALLEL=4
      - YAP_LOGLEVEL=INFO
    volumes:
      - ./voices:/app/voices
    depends_on:
      - neko
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yap-tts
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yap-tts
  template:
    metadata:
      labels:
        app: yap-tts
    spec:
      containers:
        - name: yap
          image: neko-agent/yap:latest
          env:
            - name: NEKO_URL
              valueFrom:
                secretKeyRef:
                  name: neko-config
                  key: url
            - name: NEKO_USER
              valueFrom:
                secretKeyRef:
                  name: neko-config
                  key: username
            - name: NEKO_PASS
              valueFrom:
                secretKeyRef:
                  name: neko-config
                  key: password
            - name: YAP_PARALLEL
              value: "4"
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          volumeMounts:
            - name: voices
              mountPath: /app/voices
      volumes:
        - name: voices
          persistentVolumeClaim:
            claimName: yap-voices
Future Enhancements
- Voice Cloning: Real-time voice learning from user samples
- Emotion Control: Dynamic emotional expression parameters
- SSML Support: Advanced speech markup for prosody control
- Multi-language: Automatic language detection and synthesis
- Audio Effects: Real-time audio processing (reverb, filters)
- Metrics: Prometheus/OpenTelemetry integration for monitoring
Related Components
- Core Agent: Main automation engine that can trigger TTS
- Capture Service: Training data collection for voice models
- Neko Integration: Browser environment and WebRTC setup
Source Reference
File: src/yap.py:1
Lines of Code: ~1550
Dependencies: aiortc, av, websockets, numpy, F5-TTS
Development Setup
This guide covers the development environment setup, coding standards, and contribution workflow for the Neko Agent project.
Development Environment
Prerequisites
- Nix with flakes - Package management and development shells
- Git - Version control
- NVIDIA GPU (optional but recommended) - For AI model acceleration
Setting Up
1. Clone the repository:
git clone <repository-url>
cd neko-agent
2. Enter development environment:
# For GPU development (recommended)
nix develop .#gpu
# For CPU-only development
nix develop
3. Verify setup:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
Available Development Shells
Shell | Purpose | Key Packages |
---|---|---|
default | Basic Python development | PyTorch CPU, transformers, websockets |
gpu | GPU-accelerated development | CUDA 12.8, PyTorch GPU, all dependencies |
docs | Documentation development | mdBook, Sphinx, preprocessing tools |
neko | Neko server management | Docker, Colima, compose tools |
ai | AI tool integration | Claude Code CLI, OpenAI tools |
Project Structure
├── src/ # Source code
│ ├── agent.py # Core automation agent
│ ├── capture.py # Training data capture
│ └── yap.py # TTS service
├── docs/ # Documentation
│ ├── src/ # mdBook source files
│ └── book.toml # Documentation configuration
├── voices/ # Voice models and assets
├── data/ # Training data output
├── overlays/ # Nix package overlays
├── nix/ # Nix configuration
└── flake.nix # Development environment
Coding Standards
Python Style
- Follow PEP 8 for code style
- Use type hints for all function signatures
- Write docstrings for public functions and classes
- Use async/await for I/O operations
Example:
async def process_frame(frame: np.ndarray, task: str) -> Optional[Action]:
    """
    Process a video frame and determine the next action.

    :param frame: RGB frame data as numpy array
    :param task: Natural language task description
    :return: Action to execute, or None if task complete
    """
    # Implementation here
    pass
Error Handling
- Use specific exceptions rather than generic Exception
- Log errors with context for debugging
- Implement graceful degradation where possible
try:
    result = await risky_operation()
except SpecificError as e:
    logger.warning(f"Operation failed: {e}, falling back to default")
    result = default_fallback()
Configuration Management
- Use environment variables for configuration
- Provide sensible defaults in code
- Document all configuration options
NEKO_URL = os.environ.get("NEKO_URL", "http://localhost:8080")
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", "3"))
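One way to keep this pattern consistent across settings is a small typed accessor that validates as it reads. The env_int helper below is a hypothetical example, not part of the codebase:

```python
import os

def env_int(name: str, default: int, minimum: int = 0) -> int:
    """Read an integer setting from the environment, falling back to a
    documented default and rejecting out-of-range values."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}")
    if value < minimum:
        raise ValueError(f"{name} must be >= {minimum}, got {value}")
    return value

MAX_RETRIES = env_int("MAX_RETRIES", default=3, minimum=1)
```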
Testing
Running Tests
# Run all tests
python -m pytest
# Run with coverage
python -m pytest --cov=src
# Run specific test file
python -m pytest tests/test_agent.py
Test Structure
- Unit tests for individual functions
- Integration tests for component interaction
- End-to-end tests for full workflows
Writing Tests
import pytest
from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_agent_action_execution():
    """Test that agent executes actions correctly."""
    agent = NekoAgent()
    with patch.object(agent, 'webrtc_client') as mock_client:
        mock_client.execute_action = AsyncMock()
        action = ClickAction(x=100, y=200)
        await agent.execute_action(action)
        mock_client.execute_action.assert_called_once_with(action)
Documentation
Building Documentation
# Enter docs environment
nix develop .#docs
# Serve documentation locally
cd docs && mdbook serve --open
# Build static documentation
cd docs && mdbook build
Documentation Standards
- Write in Markdown using mdBook extensions
- Include code examples for all APIs
- Use mermaid diagrams for architecture
- Keep docs up-to-date with code changes
Git Workflow
Branch Strategy
- main - Stable release branch
- dev - Development integration branch
- feature/* - Feature development branches
- fix/* - Bug fix branches
Commit Guidelines
- Use conventional commits format
- Write descriptive messages explaining the why
- Keep commits atomic - one logical change per commit
Examples:
feat(agent): add support for swipe actions
fix(capture): handle missing frames gracefully
docs(api): update WebRTC connection examples
refactor(tts): improve voice loading performance
Pull Request Process
- Create feature branch from dev
- Implement changes following coding standards
- Add/update tests for new functionality
- Update documentation if needed
- Submit PR with clear description
- Address review feedback
- Merge after approval
Debugging
Logging Configuration
# Set log levels
export AGENT_LOGLEVEL=DEBUG
export CAPTURE_LOGLEVEL=INFO
export YAP_LOGLEVEL=WARNING
Common Debug Tasks
WebRTC connection issues:
# Check Neko server status
curl http://localhost:8080/health
# Test WebSocket endpoint
wscat -c ws://localhost:8080/api/ws?token=<token>
AI model problems:
# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"
# Test model loading
python -c "from transformers import Qwen2VLForConditionalGeneration; print('OK')"
Training data capture:
# Verify MDS output
python -c "from streaming import StreamingDataset; print('MDS OK')"
# Check S3 connectivity
aws s3 ls s3://your-bucket/
Performance Optimization
Profiling
# Profile with cProfile
python -m cProfile -o profile.stats src/agent.py
# Analyze with snakeviz
snakeviz profile.stats
GPU Memory Management
# Monitor GPU usage
nvidia-smi -l 1
# Set memory allocation strategy
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
WebRTC Optimization
- Reduce frame rate for lower bandwidth
- Adjust video quality based on task complexity
- Use hardware encoding when available
Contributing
Getting Started
- Read the documentation to understand the system
- Set up development environment
- Pick an issue from the project board
- Ask questions if anything is unclear
Areas for Contribution
- New action types (drag, hover, etc.)
- Additional AI models integration
- Performance improvements
- Documentation enhancements
- Test coverage expansion
Code Review Guidelines
- Be constructive in feedback
- Explain the reasoning behind suggestions
- Test the changes locally when possible
- Approve promptly for good contributions
Release Process
Version Management
- Semantic versioning (MAJOR.MINOR.PATCH)
- Tag releases in Git
- Update changelog for each release
Deployment
- Build Docker images for production
- Test in staging environment
- Deploy with rolling updates
- Monitor metrics post-deployment
Support
Getting Help
- Check documentation first
- Search existing issues on GitHub
- Ask in discussions for general questions
- Open issues for bugs or feature requests
Community Guidelines
- Be respectful and inclusive
- Help others when you can
- Share knowledge and experiences
- Follow the code of conduct
Nix Flake Development Environment
The Neko Agent project uses a sophisticated Nix flake to provide reproducible, cross-platform development environments with specialized configurations for different use cases. This document provides comprehensive documentation of all flake features, development shells, and usage patterns.
Overview
The flake (flake.nix) is designed around multiple specialized development environments that cater to different aspects of the project:
- AI/ML Development - GPU-accelerated environments with CUDA support
- Documentation - Publishing and development of project documentation
- Container Operations - Docker and Neko server management
- Performance Optimization - CPU-optimized builds with architecture-specific flags
- TEE Deployment - Trusted Execution Environment deployment with attestation
- Registry Management - Multi-registry container deployment support
- Cross-Platform Support - Works on x86_64-linux and aarch64-darwin (Apple Silicon)
graph TB
  subgraph "Nix Flake Architecture"
    Flake[flake.nix]
    Inputs[External Inputs]
    Overlays[Custom Overlays]
    Shells[Development Shells]
    Packages[Docker Images]
    Apps[Utility Apps]
  end
  subgraph "External Dependencies"
    Nixpkgs[nixpkgs/nixos-unstable]
    MLPkgs[nixvital/ml-pkgs]
  end
  subgraph "Custom Overlays"
    WebRTC[WebRTC Stack]
    ML[ML Libraries]
    Audio[Audio Processing]
    Optimization[CPU Optimization]
  end
  subgraph "Development Shells"
    Default[default]
    GPU[gpu]
    AI[ai]
    Neko[neko]
    Docs[docs]
    CPUOpt[cpu-opt]
    GPUOpt[gpu-opt]
    TEE[tee]
  end
  Flake --> Inputs
  Flake --> Overlays
  Flake --> Shells
  Flake --> Packages
  Flake --> Apps
  Inputs --> Nixpkgs
  Inputs --> MLPkgs
  Overlays --> WebRTC
  Overlays --> ML
  Overlays --> Audio
  Overlays --> Optimization
  Shells --> Default
  Shells --> GPU
  Shells --> AI
  Shells --> Neko
  Shells --> Docs
  Shells --> CPUOpt
  Shells --> GPUOpt
  Shells --> TEE
Flake Inputs
External Dependencies
inputs = {
  nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  ml-pkgs.url = "github:nixvital/ml-pkgs";
};
Input | Source | Purpose |
---|---|---|
nixpkgs | nixos-unstable | Latest packages and system libraries |
ml-pkgs | nixvital/ml-pkgs | Specialized ML/AI packages (PyTorch, CUDA) |
Why nixos-unstable?
- Latest packages - Access to newest versions of AI/ML libraries
- CUDA support - Most recent NVIDIA driver and toolkit support (CUDA 12.8)
- Python ecosystem - Up-to-date Python packages for transformers and WebRTC
- Security updates - Timely security patches for all dependencies
Build Metadata and Reproducibility
The flake includes comprehensive build metadata for reproducible builds and attestation:
buildInfo = rec {
  timestamp = "${year}-${month}-${day}T${hour}:${minute}:${second}Z";
  revision = self.rev or self.dirtyRev or "unknown";
  shortRev = builtins.substring 0 8 revision;
  version = if (self ? rev) then shortRev else "${shortRev}-dirty";
  nixpkgsRev = nixpkgs.rev or "unknown";

  imageMetadata = {
    "org.opencontainers.image.title" = "Neko Agent";
    "org.opencontainers.image.created" = timestamp;
    "org.opencontainers.image.revision" = revision;
    "dev.neko.build.reproducible" = "true";
  };
};
Custom Overlay System
The flake uses a comprehensive overlay system to provide packages not available in standard Nixpkgs:
WebRTC and Media Stack
nekoOverlays = [
  (import ./overlays/pylibsrtp.nix)  # Secure RTP protocol
  (import ./overlays/aioice.nix)     # Async ICE implementation
  (import ./overlays/aiortc.nix)     # WebRTC for Python
  # ... more overlays
];
Overlay | Package | Purpose |
---|---|---|
pylibsrtp.nix | pylibsrtp | Secure Real-time Transport Protocol for WebRTC |
aioice.nix | aioice | Asynchronous ICE (Interactive Connectivity Establishment) |
aiortc.nix | aiortc | WebRTC implementation for Python with media support |
AI/ML and Audio Processing
Overlay | Package | Purpose |
---|---|---|
streaming.nix | streaming | MosaicML Streaming for training data |
f5-tts.nix | f5-tts | F5-TTS voice synthesis model |
vocos.nix | vocos | Neural vocoder for audio generation |
ema-pytorch.nix | ema-pytorch | Exponential Moving Average for PyTorch |
transformers-stream-generator.nix | transformers-stream-generator | Streaming text generation |
bitsandbytes.nix | bitsandbytes | 8-bit optimizers for PyTorch |
Pi-Zero PyTorch Dependencies
The flake includes comprehensive packaging for pi-zero-pytorch and its dependencies:
Overlay | Package | Purpose |
---|---|---|
pi-zero-pytorch/pi-zero-pytorch.nix | pi-zero-pytorch | Main π0 implementation in PyTorch |
pi-zero-pytorch/einx.nix | einx | Universal tensor operations with Einstein notation |
pi-zero-pytorch/x-transformers.nix | x-transformers | Transformer architectures library |
pi-zero-pytorch/rotary-embedding-torch.nix | rotary-embedding-torch | Rotary positional embeddings |
pi-zero-pytorch/accelerated-scan.nix | accelerated-scan | Accelerated scan operations |
pi-zero-pytorch/bidirectional-cross-attention.nix | bidirectional-cross-attention | Cross-attention mechanisms |
pi-zero-pytorch/hl-gauss-pytorch.nix | hl-gauss-pytorch | Gaussian operations for ML |
pi-zero-pytorch/evolutionary-policy-optimization.nix | evolutionary-policy-optimization | Evolution strategies |
Performance Optimization
Overlay | Package | Purpose |
---|---|---|
cached-path.nix | cached-path | Efficient file caching utilities |
znver2-flags.nix | nekoZnver2Env | AMD Zen2 CPU optimization flags |
vmm-cli.nix | vmm-cli | Virtual machine management CLI |
Example Znver2 Optimization:
# Generated environment variables for AMD Zen2 CPUs
export NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"
export RUSTFLAGS="-C target-cpu=znver2 -C target-feature=+sse2,+sse4.2,+avx,+avx2,+fma,+bmi1,+bmi2"
External ML Packages
ml-pkgs.overlays.torch-family # Provides torch-bin, torchvision-bin, etc.
Benefits:
- Pre-compiled binaries - Faster setup without compilation
- CUDA integration - Proper CUDA toolkit linkage
- Consistent versions - Matching PyTorch ecosystem versions
Development Shells
1. Default Shell (default)
Purpose: Basic Python development with CPU-only PyTorch.
Usage:
nix develop
# or
nix develop .#default
Includes:
- Python Environment: PyTorch CPU, Transformers, WebRTC stack
- System Tools: FFmpeg, Git, Curl, Just, pkg-config
- Node.js Ecosystem: Node 20, NPM for AI tools
- AI CLI Tools: OpenAI Codex, Anthropic Claude Code (auto-installed)
Python Packages:
# Core ML/AI
transformers
torch (CPU)
torchvision
pillow
accelerate
# WebRTC and networking
websockets
av (PyAV for video processing)
pylibsrtp
aioice
aiortc
# Data and streaming
streaming (MosaicML)
f5-tts
numpy
scipy
zstandard
xxhash
tqdm
# Monitoring
prometheus-client
When to Use:
- Initial project setup and exploration
- Development on systems without NVIDIA GPUs
- Testing compatibility with CPU-only environments
- CI/CD pipelines where GPU access is unavailable
2. GPU Shell (gpu)
Purpose: GPU-accelerated development with CUDA 12.8 support.
Usage:
nix develop .#gpu
NVIDIA hosts: When running outside NixOS you will typically need nixGL to expose the system GPU. Use:
NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:nix-community/nixGL#nixGLNvidia -- nix develop .#gpu
This wraps the GPU shell with the right OpenGL/EGL libraries from the host driver.
Additional Features over Default:
- CUDA Toolkit 12.8 - Complete CUDA development environment
- cuDNN and NCCL - Optimized neural network and communication libraries
- GPU-enabled PyTorch - Tensor operations on NVIDIA GPUs
- Environment Variables - Automatic CUDA path and library configuration
CUDA Environment Setup:
# Automatically configured
export CUDA_HOME=/nix/store/.../cuda-12.8
export CUDA_PATH=$CUDA_HOME
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# GPU control
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
export CUDA_MODULE_LOADING=LAZY
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Verification Commands:
# Check CUDA installation
nvidia-smi
nvcc --version
# Test PyTorch GPU support
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"
When to Use:
- AI model inference and training
- GPU-accelerated image/video processing
- Development requiring CUDA libraries
- Performance-critical workloads
3. AI Shell (ai)
Purpose: Lightweight environment focused on AI development tools.
Usage:
nix develop .#ai
Includes:
- Core System Tools - FFmpeg, Git, networking utilities
- Node.js Environment - Node 20, NPM
- AI CLI Tools - Automatic installation of OpenAI and Anthropic CLIs
- Minimal Footprint - No heavy ML libraries, faster startup
AI Tools Installed:
# OpenAI Codex CLI
npm install -g @openai/codex
# Anthropic Claude Code CLI
npm install -g @anthropic-ai/claude-code
Environment Setup:
# NPM global packages in project directory
export NPM_CONFIG_PREFIX=$PWD/.npm-global
export PATH=$NPM_CONFIG_PREFIX/bin:$PATH
When to Use:
- AI-assisted development workflows
- Code generation and review tasks
- Integration with AI development services
- Quick environment for AI tool testing
4. Neko Shell (neko)
Purpose: Container and Neko server management.
Usage:
nix develop .#neko
Container Stack:
- Colima - Lightweight Docker runtime for macOS/Linux
- Docker & Docker Compose - Container orchestration
- Docker Buildx - Multi-platform image building
- Networking Tools - curl, jq for API interaction
Custom Scripts:
# Neko service management script
neko-services up # Start Neko server
neko-services down # Stop services
neko-services logs # View container logs
neko-services status # Check service status
neko-services restart # Restart services
neko-services update # Pull latest images and restart
Colima Configuration:
# Automatically configured VM
colima start --vm-type vz --cpu 2 --memory 4 \
--mount-type sshfs --mount "~:w"
Docker Environment:
# Automatic Docker socket configuration
export DOCKER_HOST="unix://$HOME/.colima/default/docker.sock"
When to Use:
- Neko server development and testing
- Container image building and deployment
- Docker-based development workflows
- Local testing of production deployments
5. Documentation Shell (docs)
Purpose: Documentation development, building, and publishing.
Usage:
nix develop .#docs
Documentation Stack:
- mdBook - Rust-based documentation generator
- mdBook Extensions:
- mdbook-mermaid - Diagram support
- mdbook-linkcheck - Link validation
- mdbook-toc - Table of contents generation
- Sphinx - Python documentation with reStructuredText support
- Node.js - For additional tooling and preprocessing
Python Documentation Tools:
sphinx # Documentation generator
sphinx-rtd-theme # Read the Docs theme
myst-parser # Markdown support for Sphinx
sphinxcontrib-mermaid # Mermaid diagrams in Sphinx
Available Commands:
# From inside docs/
mdbook serve --open # Development server with live reload
mdbook build # Build static documentation
mdbook test # Test code examples and links
# Sphinx alternative
sphinx-build -b html source build/
When to Use:
- Writing and editing project documentation
- Building documentation for deployment
- Testing documentation changes locally
- Contributing to API reference and guides
6. CPU-Optimized Shell (cpu-opt)
Purpose: Performance-optimized CPU development.
Usage:
nix develop .#cpu-opt
Optimization Features:
- Architecture-Specific Compilation - Znver2 flags for AMD CPUs
- Optimized Python Environment - Performance-tuned package builds
- Compiler Optimizations - -O3, -march=znver2, -mtune=znver2
Generated Optimization Flags (Linux only):
# Compiler flags
export NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"
# Rust flags
export RUSTFLAGS="-C target-cpu=znver2 -C target-feature=+sse2,+sse4.2,+avx,+avx2,+fma,+bmi1,+bmi2 -C link-arg=-Wl,-O1 -C link-arg=--as-needed"
When to Use:
- Performance-critical CPU workloads
- Benchmarking and optimization work
- Production builds targeting specific CPU architectures
- Environments where every bit of CPU performance matters
7. GPU-Optimized Shell (gpu-opt)
Purpose: Maximum performance GPU development with optimizations.
Usage:
nix develop .#gpu-opt
Combined Optimizations:
- All GPU features - CUDA 12.8, cuDNN, NCCL
- CPU optimizations - Znver2 flags for host code
- PyTorch optimizations - Optimized builds with CPU and GPU acceleration
- Memory optimizations - Advanced CUDA memory management
GPU-Specific Optimizations:
# Target specific GPU architecture (configurable)
export TORCH_CUDA_ARCH_LIST=8.6 # RTX 30xx series
# Memory allocation strategy
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Performance Verification:
# Check optimizations are active
echo $NIX_CFLAGS_COMPILE # Should show znver2 flags
echo $TORCH_CUDA_ARCH_LIST # Should show target GPU architecture
# Benchmark performance
python -c "
import torch
import time
x = torch.randn(1000, 1000, device='cuda')
torch.cuda.synchronize()  # CUDA kernels launch asynchronously
start = time.time()
torch.mm(x, x)
torch.cuda.synchronize()  # wait for the kernel before stopping the clock
print(f'GPU matrix multiply: {time.time() - start:.4f}s')
"
When to Use:
- Maximum performance AI inference
- GPU-accelerated training workloads
- Performance benchmarking and optimization
- Production deployments requiring peak performance
8. TEE Shell (tee)
Purpose: Trusted Execution Environment deployment and attestation.
Usage:
nix develop .#tee
TEE Deployment Stack:
- Phala Cloud CLI - Modern CLI for TEE deployments
- Legacy VMM CLI - Compatible with older dstack systems
- Docker & Docker Compose - Container orchestration
- Bun Runtime - Fast JavaScript runtime
- Reproducible Image Builder - Attestation-ready container building
Available Commands:
# Modern Phala CLI
phala auth login <api-key> # Authenticate with Phala Cloud
phala status # Check authentication status
phala cvms list # List Confidential VMs
phala nodes # List available TEE nodes
# Legacy VMM CLI (if needed)
vmm-cli lsvm # List virtual machines
vmm-cli lsimage # List available images
vmm-cli lsgpu # List available GPUs
# Reproducible builds
nix run .#build-images # Build reproducible images
nix run .#deploy-to-tee # Deploy with attestation metadata
nix run .#verify-attestation # Verify TEE attestation
Multi-Registry Support:
# Deploy to ttl.sh (ephemeral registry)
NEKO_REGISTRY=ttl.sh NEKO_TTL=1h nix run .#deploy-to-tee
nix run .#deploy-to-ttl 24h
# Deploy to GitHub Container Registry
NEKO_REGISTRY=ghcr.io/your-org nix run .#deploy-to-tee
# Deploy to Docker Hub
NEKO_REGISTRY=docker.io/your-org nix run .#deploy-to-tee
# Deploy to local registry
NEKO_REGISTRY=localhost:5000/neko nix run .#deploy-to-tee
When to Use:
- Deploying to Trusted Execution Environments
- Creating attestable, reproducible deployments
- Multi-registry container management
- TEE-based inference deployments
- Confidential computing workloads
Docker Images and Packages
The flake builds optimized Docker images for production deployment:
Available Images
The flake now builds multiple specialized images for different components:
# Build all images
nix run .#build-images
# Agent images
nix build .#neko-agent-docker-generic
nix build .#neko-agent-docker-opt
# Capture images
nix build .#neko-capture-docker-generic
nix build .#neko-capture-docker-opt
# YAP (TTS) images
nix build .#neko-yap-docker-generic
nix build .#neko-yap-docker-opt
# Train images
nix build .#neko-train-docker-generic
nix build .#neko-train-docker-opt
1. Generic CUDA Image (neko-agent-docker-generic)
Target: neko-agent:cuda12.8-generic
Features:
- Portable CUDA - Includes PTX for forward compatibility
- CUDA 12.8 - Full toolkit and libraries
- Python Environment - All dependencies with torch-bin
- Broad GPU Support - Works on any CUDA 8.6+ GPU
Configuration:
# Environment variables
CUDA_HOME=/nix/store/.../cuda-12.8
LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib
CUDA_MODULE_LOADING=LAZY
TORCH_CUDA_ARCH_LIST=8.6+PTX # Forward compatibility
Use Cases:
- Multi-GPU deployment environments
- Cloud platforms with varying GPU types
- Development and testing across different hardware
2. Optimized Image (neko-agent-docker-opt)
Target: neko-agent:cuda12.8-sm86-v3
Features:
- Specific GPU targeting - Optimized for RTX 30xx series (sm_86)
- CPU optimizations - Znver2 architecture flags
- Smaller size - No PTX, specific architecture only
- Maximum performance - All available optimizations enabled
Configuration:
# Optimized environment
TORCH_CUDA_ARCH_LIST=8.6 # Specific architecture only
NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"
RUSTFLAGS="-C target-cpu=znver2 ..." # Rust optimizations
Use Cases:
- Production deployments with known hardware
- Performance-critical applications
- Cost-optimized cloud instances
Image Building System
# Helper function for consistent container structure
mkRoot = paths: pkgs.buildEnv {
  name = "image-root";
  inherit paths;
  pathsToLink = [ "/bin" ];
};

# Generic image build
neko-agent-docker-generic = pkgs.dockerTools.buildImage {
  name = "neko-agent:cuda12.8-generic";
  created = "now";
  copyToRoot = mkRoot ([
    runnerGeneric
    pyEnvGeneric
    cuda.cudatoolkit
    cuda.cudnn
    cuda.nccl
    pkgs.bashInteractive
  ] ++ commonSystemPackages);
  config = {
    Env = baseEnv ++ [
      "CUDA_HOME=${cuda.cudatoolkit}"
      "LD_LIBRARY_PATH=${cuda.cudatoolkit}/lib64:${cuda.cudnn}/lib"
      "TORCH_CUDA_ARCH_LIST=8.6+PTX"
    ];
    WorkingDir = "/workspace";
    Entrypoint = [ "/bin/neko-agent" ];
  };
};
Utility Apps
The flake provides comprehensive utility applications for common tasks:
Documentation Apps
# Build documentation
nix run .#docs-build
# Serve documentation with live reload
nix run .#docs-serve
# Check documentation for issues
nix run .#docs-check
Build and Deployment Apps
# Build all Docker images with attestation metadata
nix run .#build-images
# TEE deployment with multi-registry support
nix run .#deploy-to-tee
nix run .#deploy-to-ttl 24h # Quick ttl.sh deployment
nix run .#push-to-ttl 1h # Just push to ttl.sh
# Attestation verification
nix run .#verify-attestation <app-id> <expected-hash>
Container Registry Apps
# Local registry management
nix run .#start-registry # HTTP registry with auth
nix run .#start-registry-https # HTTPS with Tailscale certs
nix run .#stop-registry
# Public exposure
nix run .#start-tailscale-funnel # Expose via Tailscale Funnel
nix run .#start-cloudflare-tunnel # Expose via Cloudflare Tunnel
Registry Configuration Examples:
# Environment variables for registry customization
NEKO_REGISTRY_PORT=5000
NEKO_REGISTRY_USER=neko
NEKO_REGISTRY_PASSWORD=pushme
NEKO_REGISTRY_DATA_DIR=$PWD/registry-data
NEKO_REGISTRY_AUTH_DIR=$PWD/auth
NEKO_REGISTRY_CERTS_DIR=$PWD/certs
# Tailscale Funnel setup
NEKO_REGISTRY=your-device.tail-scale.ts.net/neko
# Cloudflare Tunnel setup
NEKO_CF_TUNNEL_NAME=neko-registry
NEKO_CF_HOSTNAME=registry.example.com
Common Development Workflows
Initial Setup
# Clone repository
git clone <repo-url>
cd neko-agent
# Enter development environment
nix develop .#gpu # or .#default for CPU-only
# Verify setup
python -c "import torch; print(torch.cuda.is_available())"
AI Development Workflow
# 1. Enter GPU environment
nix develop .#gpu
# 2. Load environment variables (if .env exists)
# Automatically loaded by shell hook
# 3. Test model loading
python -c "
from transformers import Qwen2VLForConditionalGeneration
model = Qwen2VLForConditionalGeneration.from_pretrained('showlab/ShowUI-2B')
print('Model loaded successfully')
"
# 4. Run agent
uv run src/agent.py --task "Navigate to google.com"
Documentation Development
# 1. Enter docs environment
nix develop .#docs
# 2. Start development server
nix run .#docs-serve
# Opens browser to http://localhost:3000
# 3. Edit files in docs/src/
# Changes automatically reload in browser
# 4. Build for deployment
nix run .#docs-build
Container Development
# 1. Enter container environment
nix develop .#neko
# 2. Start Neko server
neko-services up
# 3. Check status
neko-services status
# 4. View logs
neko-services logs neko
# 5. Test connection
curl http://localhost:8080/health
Performance Optimization
# 1. Use optimized environment
nix develop .#gpu-opt
# 2. Verify optimizations
echo $NIX_CFLAGS_COMPILE
echo $TORCH_CUDA_ARCH_LIST
# 3. Run performance benchmarks
python benchmarks/inference_speed.py
# 4. Build optimized container
nix build .#neko-agent-docker-opt
TEE Deployment Workflow
# 1. Enter TEE environment
nix develop .#tee
# 2. Build reproducible images
nix run .#build-images
# 3. Deploy to TEE (with registry choice)
# Option A: Use ttl.sh for testing
nix run .#deploy-to-ttl 1h
# Option B: Use GitHub Container Registry
NEKO_REGISTRY=ghcr.io/your-org nix run .#deploy-to-tee
# Option C: Use local registry (start it first)
nix run .#start-registry # In another terminal
NEKO_REGISTRY=localhost:5000/neko nix run .#deploy-to-tee
# 4. Verify attestation (inside TEE)
nix run .#verify-attestation <app-id> <compose-hash>
# 5. Check deployment status
phala cvms list # Modern CLI
# or
vmm-cli lsvm # Legacy CLI
Multi-Registry Development
# Setup local registry for testing
nix run .#start-registry
# Push images to multiple registries
docker tag neko-agent:latest localhost:5000/neko/agent:v1
docker push localhost:5000/neko/agent:v1
# Use Tailscale for team access
nix run .#start-tailscale-funnel
# Use Cloudflare for public access
nix run .#start-cloudflare-tunnel
Environment Variables and Configuration
Automatic .env Loading
All development shells automatically load .env files:
# .env file example
NEKO_WS=ws://localhost:8080/api/ws
NEKO_LOGLEVEL=DEBUG
CUDA_VISIBLE_DEVICES=0
TORCH_CUDA_ARCH_LIST=8.6
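For scripts run outside the Nix shells, the same behavior can be approximated with a minimal loader. load_dotenv here is a hypothetical helper assuming simple KEY=VALUE lines, not part of the flake:

```python
import os

def load_dotenv(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines (comments and blanks skipped) and
    export them without overriding variables already set in the environment."""
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip().strip('"').strip("'")
            loaded[key] = value
            os.environ.setdefault(key, value)
    return loaded
```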
Common Environment Variables
Variable | Purpose | Default | Set By |
---|---|---|---|
CUDA_HOME | CUDA installation path | Auto-detected | GPU shells |
CUDA_VISIBLE_DEVICES | GPU selection | all | User configurable |
PYTORCH_CUDA_ALLOC_CONF | Memory strategy | expandable_segments:True | GPU shells |
NPM_CONFIG_PREFIX | NPM global location | $PWD/.npm-global | All shells |
NIX_CFLAGS_COMPILE | Compiler optimizations | Znver2 flags | Optimized shells |
Shell-Specific Variables
GPU Shells:
export CUDA_MODULE_LOADING=LAZY
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib
Documentation Shell:
# No specific variables, uses standard tool defaults
Container Shell:
export DOCKER_HOST="unix://$HOME/.colima/default/docker.sock"
Cross-Platform Support
Supported Systems
supportedSystems = [ "x86_64-linux" "aarch64-darwin" ];
Platform-Specific Features
x86_64-Linux:
- Full GPU support - NVIDIA CUDA, Docker GPU passthrough
- CPU optimizations - Znver2, Intel architecture targeting
- Container building - Docker images with CUDA support
aarch64-Darwin (Apple Silicon):
- Metal Performance Shaders - GPU acceleration via MPS
- Rosetta compatibility - x86_64 dependencies when needed
- Native performance - ARM64-optimized packages
Platform Detection
# Conditional features based on platform
${pkgs.lib.optionalString pkgs.stdenv.isLinux ''
source ${znver2File}
echo "[cpu-opt] Using znver2 flags: $NIX_CFLAGS_COMPILE"
''}
Troubleshooting
Common Issues
CUDA Not Detected:
# Check NVIDIA drivers
nvidia-smi
# Verify CUDA environment
echo $CUDA_HOME
echo $LD_LIBRARY_PATH
# Test PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"
Solution: Ensure NVIDIA drivers are installed and compatible with CUDA 12.8.
Docker Issues on macOS:
# Check Colima status
colima status
# Restart if needed
colima stop
colima start --vm-type vz --cpu 2 --memory 4
Slow Package Installation:
# Use binary cache
echo "substituters = https://cache.nixos.org https://cuda-maintainers.cachix.org" >> ~/.config/nix/nix.conf
echo "trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY= cuda-maintainers.cachix.org-1:0dq3bujKpuEPiCgBv7/11NEBpCcEKUzZzUNjRgPTOOA=" >> ~/.config/nix/nix.conf
Memory Issues
GPU Memory:
# Monitor GPU memory
nvidia-smi -l 1
# Optimize PyTorch memory
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,expandable_segments:True
System Memory:
# Check available memory
free -h
# Monitor during development
htop
Performance Issues
Check Optimizations:
# Verify CPU flags
cat /proc/cpuinfo | grep flags
# Check compiler optimizations
echo $NIX_CFLAGS_COMPILE
# Benchmark inference
python -c "
import torch
import time
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(1000, 1000, device=device)
start = time.time()
result = torch.mm(x, x)
print(f'{device} time: {time.time() - start:.4f}s')
"
Advanced Usage
Custom Overlays
Create project-specific overlays in `overlays/`:
# overlays/custom-package.nix
final: prev: {
custom-package = prev.python3Packages.buildPythonPackage {
pname = "custom-package";
version = "1.0.0";
src = prev.fetchFromGitHub {
owner = "owner";
repo = "repo";
rev = "v1.0.0";
sha256 = "...";
};
propagatedBuildInputs = with prev.python3Packages; [
numpy
torch
];
};
}
Custom Development Shells
Add new shells to the flake:
# Add to devShells
experimental = pkgs.mkShell {
buildInputs = commonSystemPackages ++ [
# Custom packages
];
shellHook = ''
echo "Experimental environment loaded"
# Custom setup
'';
};
Environment Specialization
Create environment-specific configurations:
# .env.gpu
CUDA_VISIBLE_DEVICES=0
TORCH_CUDA_ARCH_LIST=8.6
# .env.multi-gpu
CUDA_VISIBLE_DEVICES=0,1,2,3
NCCL_DEBUG=INFO
# Load specific environment
cp .env.gpu .env
nix develop .#gpu
Contributing to the Flake
Adding New Packages
- Create overlay in `overlays/new-package.nix`
- Add to overlay list in `nekoOverlays`
- Include in appropriate shells
- Test across platforms
- Update documentation
Testing Changes
# Test specific shell
nix develop .#shell-name --command python -c "import new_package"
# Test all shells
for shell in default gpu ai neko docs cpu-opt gpu-opt; do
echo "Testing $shell..."
nix develop .#$shell --command echo "✓ $shell loads successfully"
done
# Test image builds
nix build .#neko-agent-docker-generic
nix build .#neko-agent-docker-opt
Performance Considerations
- Binary caches - Use Cachix for custom packages
- Layer optimization - Minimize Docker image layers
- Dependency management - Avoid unnecessary dependencies
- Build reproducibility - Pin package versions when needed
This flake system provides a robust, reproducible development environment that scales from local development to production deployment while staying consistent across platforms and use cases.
Nix Flake Python Packaging Guide
This guide explains the nuances and patterns used for packaging Python dependencies in the Neko Agent project's Nix flake. The project uses a sophisticated overlay system to package complex Python libraries that aren't available in standard Nixpkgs, with special attention to PyTorch ecosystem integration and CUDA support.
Overview
The Neko Agent project requires numerous specialized Python packages for AI/ML, WebRTC, and audio processing that either don't exist in Nixpkgs or need custom configurations. The flake uses a comprehensive overlay system to package these dependencies while maintaining compatibility with the PyTorch ecosystem.
Key Packaging Principles
- Torch-bin Integration - Use pre-compiled PyTorch wheels for consistency
- CUDA Compatibility - Ensure proper CUDA toolkit integration
- Dependency Override - Override transitive dependencies to use `torch-bin`
- Test Skipping - Disable problematic tests that require GPU access
- Version Flexibility - Patch restrictive version constraints when needed
Core Patterns
1. Basic Python Package Overlay
self: super: {
python3Packages = super.python3Packages // {
example-package = super.python3Packages.buildPythonPackage rec {
pname = "example-package";
version = "1.0.0";
format = "pyproject";
src = super.fetchFromGitHub {
owner = "owner";
repo = "example-package";
rev = "v${version}";
sha256 = "...";
};
nativeBuildInputs = with super.python3Packages; [ setuptools wheel ];
propagatedBuildInputs = with super.python3Packages; [ numpy scipy ];
doCheck = false;
meta = with super.lib; {
description = "Package description";
homepage = "https://github.com/owner/example-package";
license = licenses.mit;
};
};
};
}
2. PyTorch Integration Pattern
Always use `torch-bin` and override transitive dependencies:
self: super: {
python3Packages = super.python3Packages // {
ml-package = super.python3Packages.buildPythonPackage rec {
# ... basic fields ...
propagatedBuildInputs = with super.python3Packages; [
super.python3Packages."torch-bin"
super.python3Packages."torchvision-bin"
# Override packages that depend on torch
(accelerate.override { torch = super.python3Packages."torch-bin"; })
(torchdiffeq.override { torch = super.python3Packages."torch-bin"; })
transformers
numpy
];
doCheck = false; # Skip GPU-dependent tests
};
};
}
3. CUDA-Aware Package Pattern
For packages needing CUDA compilation:
self: super:
let
cudaPackages = super.cudaPackages_12_8;
cuda-redist = super.symlinkJoin {
name = "cuda-redist-${cudaPackages.cudaMajorMinorVersion}";
paths = with cudaPackages; [
(super.lib.getDev cuda_cccl)
(super.lib.getDev libcublas)
# ... more CUDA libs
];
};
in
{
python3Packages = super.python3Packages // {
cuda-package = super.python3Packages.buildPythonPackage rec {
# ... basic fields ...
postPatch = ''
substituteInPlace setup.py \
--replace-fail "find_cuda_libs()" "['${cuda-redist}/lib/libcudart.so']"
'';
nativeBuildInputs = [ super.cmake cudaPackages.cuda_nvcc ];
buildInputs = [ cuda-redist ];
CUDA_HOME = "${cuda-redist}";
NVCC_PREPEND_FLAGS = [ "-I${cuda-redist}/include" ];
doCheck = false;
};
};
}
4. Version Constraint Patching
Relax overly restrictive version constraints:
postPatch = ''
substituteInPlace pyproject.toml \
--replace "numpy<=1.24.0" "numpy" \
--replace "pydantic>=2.0,<2.5" "pydantic>=2.0"
'';
Real-World Examples
F5-TTS Package
Demonstrates torch overrides, version patching, and platform-specific dependencies:
self: super: {
python3Packages = super.python3Packages // {
f5-tts = super.python3Packages.buildPythonPackage rec {
pname = "f5-tts";
version = "1.1.7";
src = super.fetchFromGitHub {
owner = "SWivid";
repo = "F5-TTS";
rev = "main";
sha256 = "sha256-MtPyqS5aNrq929pHMlDp3HFUSf+i9xYDb5xMA0Eqh9Y=";
};
propagatedBuildInputs = with super.python3Packages; [
# Torch overrides
(accelerate.override { torch = super.python3Packages."torch-bin"; })
(x-transformers.override { torch = super.python3Packages."torch-bin"; })
# Custom overlays
cached-path ema-pytorch vocos
# Standard deps
transformers gradio librosa soundfile
] ++ super.lib.optionals (!super.stdenv.isAarch64) [
bitsandbytes # x86_64 only
];
# Version constraint fixes
postPatch = ''
substituteInPlace pyproject.toml \
--replace "numpy<=1.26.4" "numpy" \
--replace "pydantic<=2.10.6" "pydantic"
'';
doCheck = false;
};
};
}
Pi-Zero PyTorch
Shows complex dependency chains and overlay usage:
self: super: {
python3Packages = super.python3Packages // {
pi-zero-pytorch = super.python3Packages.buildPythonPackage rec {
pname = "pi-zero-pytorch";
version = "0.2.5";
src = super.fetchFromGitHub {
owner = "lucidrains";
repo = "pi-zero-pytorch";
rev = "1c13cbbcc236c3cb4fd18213c543188fdf083b33";
sha256 = "1xl38c5spv798xydljahq8fw9nglwcw95s95zpn11cm9ra4rg4ib";
};
nativeBuildInputs = [ super.python3Packages.hatchling ];
propagatedBuildInputs = with super.python3Packages; [
# Torch overrides
(accelerate.override { torch = super.python3Packages."torch-bin"; })
# Custom overlays (all from overlays/)
einx x-transformers rotary-embedding-torch
accelerated-scan bidirectional-cross-attention
# Use patched einops from self
(self.python3Packages.einops)
beartype jaxtyping scipy tqdm
super.python3Packages."torch-bin"
];
doCheck = false;
};
};
}
Advanced Techniques
Using `self` vs `super`
- `super`: Access packages before overlay modifications
- `self`: Access final overlaid packages (use carefully to avoid infinite recursion)
propagatedBuildInputs = with super.python3Packages; [
numpy # Use super for standard packages
(self.python3Packages.custom-overlay-package) # Use self for overlay deps
(accelerate.override { torch = super.python3Packages."torch-bin"; })
];
Conditional Dependencies
propagatedBuildInputs = with super.python3Packages; [
numpy scipy
] ++ super.lib.optionals (!super.stdenv.isAarch64) [
bitsandbytes # x86_64 only
] ++ super.lib.optionals super.stdenv.isLinux [
nvidia-ml-py # Linux only
];
Source Fetching
# Preferred: Specific tag/commit
src = super.fetchFromGitHub {
owner = "owner"; repo = "repo";
tag = "v${version}"; # or rev = "commit-hash";
hash = "sha256-..."; # Use nix-prefetch-github
};
# PyPI when available
src = super.fetchPypi {
inherit pname version;
hash = "sha256-...";
};
Build Systems
# Modern pyproject.toml
format = "pyproject";
nativeBuildInputs = [ super.python3Packages.hatchling ];
# Legacy setup.py
format = "setuptools";
nativeBuildInputs = [ super.python3Packages.setuptools ];
Best Practices
Hash Management
# Get hashes for sources
nix-prefetch-github owner repo --rev v1.0.0
nix hash to-sri --type sha256 abc123... # Convert to SRI format
Testing Overlays
# Test single package
nix-shell -p '(python3.withPackages (ps: [ ps.my-package ]))'
nix-shell --run 'python -c "import my_package"'
Dependency Management
- Keep dependencies minimal and explicit
- Always use `torch-bin` for the PyTorch ecosystem
- Override transitive torch dependencies
Quality Metadata
meta = with super.lib; {
description = "Clear, concise description";
homepage = "https://github.com/owner/repo";
license = licenses.mit;
platforms = platforms.unix;
};
Troubleshooting
Common Issues
Import Errors: Use `super.python3Packages."torch-bin"`, not plain `torch`.
Version Conflicts: Patch constraints with `postPatch`:
postPatch = ''
substituteInPlace setup.py --replace "torch==2.0.0" "torch>=2.0.0"
'';
CUDA Issues: Set proper CUDA environment
buildInputs = [ cuda-redist ];
CUDA_HOME = "${cuda-redist}";
Test Failures: Disable with `doCheck = false;`
Integration with Flake
Adding New Overlays
- Create overlay file in the `overlays/` directory
- Add to the `nekoOverlays` list in `flake.nix`
- Include in Python environments:
# In flake.nix
nekoOverlays = [
(import ./overlays/new-package.nix)
];
# Usage in environments
pyEnvExample = pkgs.python3.withPackages (ps: [
ps.new-package
]);
Overlay Ordering
Dependencies must come first:
nekoOverlays = [
# Base packages first
(import ./overlays/cached-path.nix)
(import ./overlays/ema-pytorch.nix)
# Dependent packages after
(import ./overlays/f5-tts.nix) # Uses ema-pytorch
# Complex packages last
(import ./overlays/pi-zero-pytorch/pi-zero-pytorch.nix)
# Patches to existing packages last
(import ./overlays/einops.nix) # Overrides existing einops
];
Handling GPU-Heavy Test Suites
Some upstream projects (for example, einops) ship notebook-based test suites that start long-running CUDA benchmarks. In CI or local builds these checks routinely exceed the sandbox timeout. Our `overlays/einops.nix` patch disables the notebook runner and its specific `test_notebook_3` case, while still allowing the rest of the package to build reproducibly. Always prefer pinpointing the slow tests (via `disabledTestPaths`/`disabledTests`) instead of globally disabling checks, so that lightweight import tests keep running.
This guide covers the key patterns for packaging Python dependencies in the Neko Agent project with proper PyTorch ecosystem integration and CUDA support.
Neko Repository Cheat Sheet (Current Codebase)
- Project Structure:
  - `server/`: Go backend (`cmd/`, `internal/`, `pkg/`, optional `plugins/`). Build via `server/build` → `server/bin/neko`.
  - `client/`: Vue 2 + TypeScript SPA (`src/`, `public/`), built to `client/dist`.
  - `apps/`: Docker app images (e.g., `firefox/`, `chromium/`, `kde/`, `vlc/`).
  - `runtime/`: Base image and Xorg/driver configs; basis for final images.
  - `utils/`: Tooling; Dockerfile generator at `utils/docker/main.go`.
  - `webpage/`: Docusaurus docs site.
  - Root: `build` (image builder), `docker-compose.yaml` (local run), `config.yml` (sample config).
- Dev Commands:
  - Client dev: `cd client && npm ci && npm run serve` (hot-reload dev server).
  - Client build: `cd client && npm run build` → `client/dist`.
  - Client lint: `cd client && npm run lint`.
  - Server build: `cd server && ./build` (use `./build core` to skip plugins) → `server/bin/neko`.
  - Docker images: base `./build`, app `./build -a firefox`, flavor `./build -f nvidia -a chromium`.
  - Local run: `docker compose up -d` (serves `ghcr.io/m1k1o/neko/firefox:latest` on `:8080`).
- Coding Style:
- Indent 2 spaces (LF). Client: Prettier single quotes, no semicolons, trailing commas. Go: gofmt/go vet; package names short/lowercase; files snake_case.
- Architecture & Entry Points:
  - SPA served as static assets from the container; talks HTTP/WS on `:8080`.
  - `server/cmd/neko/main.go` → `cmd.Execute()`; `server/cmd/root.go` handles config/logging init; `server/cmd/serve.go` wires managers in order: session → member → desktop → capture → webrtc → websocket → api → http.
  - HTTP: `server/internal/http/manager.go` registers `/api`, `/api/ws`, `/health`, `/metrics`, static files, and optional pprof.
  - REST router: `server/internal/api/router.go`. WebSocket: `server/internal/websocket/manager.go`. WebRTC: `server/internal/webrtc/manager.go`. Session/auth: `server/internal/session`, `server/internal/member/*`. Capture/Xorg: `server/internal/capture`, `server/internal/desktop`.
- API Surface (summary; see `server/openapi.yaml`):
  - Auth: `POST /api/login`, `POST /api/logout`, `GET /api/whoami`, `POST /api/profile`.
  - Sessions: `GET /api/sessions`, `GET/DELETE /api/sessions/{id}`, `POST /api/sessions/{id}/disconnect`.
  - Room: `GET/POST /api/room/settings`, broadcast `GET /api/room/broadcast`, `POST /api/room/broadcast/start|stop`.
  - Clipboard: `GET/POST /api/room/clipboard`, image `GET /api/room/clipboard/image.png`.
  - Keyboard: `GET/POST /api/room/keyboard/map`, `GET/POST /api/room/keyboard/modifiers`.
  - Control: `GET /api/room/control`, `POST /api/room/control/{request|release|take|give/{sessionId}|reset}`.
  - Screen: `GET/POST /api/room/screen`, `GET /api/room/screen/configurations`, `GET /api/room/screen/{cast.jpg|shot.jpg}`.
  - Upload: `POST /api/room/upload/{drop|dialog}`, `DELETE /api/room/upload/dialog`.
  - Utility: `POST /api/batch`, `GET /health`, `GET /metrics`.
- WebSocket:
  - Connect to `ws(s)://<host>/api/ws` with a cookie, `Authorization: Bearer <token>`, or `?token=<token>`.
  - Envelope `{ "event": "<string>", "payload": <JSON> }` with events defined under `server/pkg/types/event/events.go` (e.g., `system/init`, `signal/request`, `control/move`, `clipboard/set`, `keyboard/map`, `broadcast/status`, `file_chooser_dialog/opened|closed`).
- Configuration Model:
  - Defaults in `config.yml`; override via `NEKO_*` envs; legacy v2 envs supported behind `NEKO_LEGACY=1` and have v3 equivalents (see `server/internal/config/*`).
Complete REST API (v3)
Authentication
- Methods: Cookie (`NEKO_SESSION`), Bearer token (`Authorization: Bearer <token>`), or query token (`?token=<token>`).
- Default security applies to all endpoints unless explicitly noted; `/health`, `/metrics`, and `/api/login` do not require auth.
General
- GET `/health`: Liveness probe. 200 text/plain.
- GET `/metrics`: Prometheus metrics. 200 text/plain.
- POST `/api/batch`: Execute multiple API requests in one call.
  - Request: `[{ path: string, method: 'GET'|'POST'|'DELETE', body?: any }]`.
  - Response: `[{ path, method, status: number, body?: any }]`.
- GET `/api/stats`: Server/session statistics.
  - Response: `{ has_host, host_id, server_started_at, total_users, last_user_left_at, total_admins, last_admin_left_at }`.
Current Session
- POST `/api/login`: Authenticate and start a session. No auth required.
  - Request: `{ username, password }`.
  - Response: `SessionLoginResponse` = `SessionData` plus optional `{ token }` if cookies are disabled.
- POST `/api/logout`: Terminate current session. 200.
- GET `/api/whoami`: Retrieve current session info.
  - Response: `SessionData`.
- POST `/api/profile`: Update current session's runtime profile (no member sync).
  - Request: `MemberProfile`. 204.
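A minimal login sketch using only the standard library, assuming the endpoint shapes above (the `login` and `bearer_headers` helpers are illustrative, not part of any Neko SDK):

```python
import json
import urllib.request

def bearer_headers(token):
    """Auth headers for token-based requests, per the Authentication section."""
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

def login(base_url, username, password):
    """POST /api/login and return the parsed SessionLoginResponse.

    Performs a network call; run it against a real Neko server.
    """
    req = urllib.request.Request(
        f"{base_url}/api/login",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# session = login("http://localhost:8080", "admin", "neko")
# token = session.get("token")  # present when cookies are disabled
print(bearer_headers("example-token")["Authorization"])
```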
Sessions
- GET `/api/sessions`: List active sessions.
  - Response: `SessionData[]`.
- GET `/api/sessions/{sessionId}`: Get a session by id.
  - Response: `SessionData`. 404 if not found.
- DELETE `/api/sessions/{sessionId}`: Remove a session. 204.
- POST `/api/sessions/{sessionId}/disconnect`: Force disconnect a session. 204.
Room Settings
- GET `/api/room/settings`: Get room settings.
  - Response: `Settings`.
- POST `/api/room/settings`: Update room settings. 204.
  - Request: `Settings`.
Room Broadcast
- GET `/api/room/broadcast`: Get broadcast status.
  - Response: `{ url?: string, is_active: boolean }`.
- POST `/api/room/broadcast/start`: Start RTMP broadcast.
  - Request: `{ url: string }`. 204. Errors: 400 missing URL; 422 already broadcasting; 500 start failure.
- POST `/api/room/broadcast/stop`: Stop broadcast. 204. Error: 422 not broadcasting.
Room Clipboard
- GET `/api/room/clipboard`: Get clipboard text/HTML.
  - Response: `{ text?: string, html?: string }`.
- POST `/api/room/clipboard`: Set clipboard text/HTML. 204.
  - Request: `{ text?: string, html?: string }`.
- GET `/api/room/clipboard/image.png`: Get clipboard image.
  - Response: `image/png` binary.
Room Keyboard
- GET `/api/room/keyboard/map`: Get keyboard map.
  - Response: `{ layout?: string, variant?: string }`.
- POST `/api/room/keyboard/map`: Set keyboard map. 204.
  - Request: `{ layout?: string, variant?: string }`.
- GET `/api/room/keyboard/modifiers`: Get keyboard modifiers.
  - Response: `{ shift?, capslock?, control?, alt?, numlock?, meta?, super?, altgr? }` (all booleans).
- POST `/api/room/keyboard/modifiers`: Set modifiers. 204.
  - Request: same shape as response.
Room Control
- GET `/api/room/control`: Get control status.
  - Response: `{ has_host: boolean, host_id?: string }`.
- POST `/api/room/control/request`: Request host control. 204. Error: 422 already has host.
- POST `/api/room/control/release`: Release control. 204.
- POST `/api/room/control/take`: Take control (admin/host override). 204.
- POST `/api/room/control/give/{sessionId}`: Give control to session. 204.
  - Path: `sessionId` string. Errors: 400 target can't host; 404 unknown session.
- POST `/api/room/control/reset`: Reset control state. 204.
Room Screen
- GET `/api/room/screen`: Get current screen config.
  - Response: `ScreenConfiguration`.
- POST `/api/room/screen`: Change screen config.
  - Request/Response: `ScreenConfiguration`. Errors: 422 invalid config.
- GET `/api/room/screen/configurations`: List available configurations.
  - Response: `ScreenConfiguration[]`.
- GET `/api/room/screen/cast.jpg`: Current screencast JPEG.
  - Response: `image/jpeg`. Errors: 400 screencast disabled; 500 fetch error.
- GET `/api/room/screen/shot.jpg`: On-demand screenshot JPEG.
  - Query: `quality` integer (0–100). Response: `image/jpeg`. Errors: 500 create image.
Room Upload
- POST `/api/room/upload/drop`: Upload files and drop at coordinates. 204.
  - multipart/form-data: `x: number`, `y: number`, `files: file[]`.
  - Errors: 400 upload error; 500 processing error.
- POST `/api/room/upload/dialog`: Upload files to active dialog. 204.
  - multipart/form-data: `files: file[]`. Errors: 422 no dialog.
- DELETE `/api/room/upload/dialog`: Close file chooser dialog. 204. Error: 422 no dialog.
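A sketch of assembling the drop-upload form fields before sending (the `drop_form_fields` helper is hypothetical; only the field names `x`, `y`, and `files` come from the endpoint description above):

```python
def drop_form_fields(x, y, paths):
    """Form fields for POST /api/room/upload/drop (multipart/form-data).

    Field names per the endpoint description; coordinates are stringified
    as form values usually are.
    """
    if not paths:
        raise ValueError("at least one file is required")
    return {"x": str(int(x)), "y": str(int(y)), "files": list(paths)}

# With the third-party `requests` library this could be sent roughly as:
# requests.post(f"{base}/api/room/upload/drop",
#               data={"x": fields["x"], "y": fields["y"]},
#               files=[("files", open(p, "rb")) for p in fields["files"]])
fields = drop_form_fields(120, 240, ["screenshot.png"])
print(fields)
```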
Members
- GET `/api/members`: List members.
  - Query: `limit?: number`, `offset?: number`.
  - Response: `MemberData[]`.
- POST `/api/members`: Create member.
  - Request: `MemberCreate`.
  - Response: `MemberData`. Error: 422 ID exists.
- GET `/api/members/{memberId}`: Get member profile.
  - Response: `MemberProfile`. 404 if not found.
- POST `/api/members/{memberId}`: Update member profile. 204.
  - Request: `MemberProfile`.
- DELETE `/api/members/{memberId}`: Remove member. 204.
- POST `/api/members/{memberId}/password`: Update member password. 204.
  - Request: `{ password: string }`.
- POST `/api/members_bulk/update`: Bulk update member profiles. 204.
  - Request: `{ ids: string[], profile: MemberProfile }`.
- POST `/api/members_bulk/delete`: Bulk delete members. 204.
  - Request: `{ ids: string[] }`.
Schemas (Shapes)
- `SessionData`: `{ id: string, profile: MemberProfile, state: { is_connected: boolean, is_watching: boolean } }`.
- `MemberProfile`: `{ name?: string, is_admin?: boolean, can_login?: boolean, can_connect?: boolean, can_watch?: boolean, can_host?: boolean, can_share_media?: boolean, can_access_clipboard?: boolean, sends_inactive_cursor?: boolean, can_see_inactive_cursors?: boolean, plugins?: object }`.
- `MemberData`: `{ id: string, profile: MemberProfile }`.
- `MemberCreate`: `{ username: string, password: string, profile?: MemberProfile }`.
- `Settings`: `{ private_mode?, locked_controls?, implicit_hosting?, inactive_cursors?, merciful_reconnect?, plugins?: object }`.
- `ScreenConfiguration`: `{ width: number, height: number, rate: number }`.
- `KeyboardMap`: `{ layout?: string, variant?: string }`; `KeyboardModifiers`: all booleans listed above.
- `BroadcastStatus`: `{ url?: string, is_active: boolean }`.
- `ClipboardText`: `{ text?: string, html?: string }`.
- `ErrorMessage`: `{ message: string }` (for 4xx/5xx bodies).
Notes
- Unless stated, successful POSTs return 204 No Content; GETs return JSON bodies per schema.
- Image endpoints return binary JPEG/PNG with appropriate content types.
- Security schemes in effect globally: Bearer, Cookie, or token query; `/api/login` has security disabled.
Complete WebSocket API
- Envelope: `{ "event": string, "payload": any }`. Connect to `ws(s)://<host>/api/ws` with cookie, Bearer, or `?token=`.
- Heartbeats: server sends `system/heartbeat` every ~10s plus low-level pings; clients may send `client/heartbeat` (no payloads).
System
- Server → Client `system/init`: `{ session_id: string, control_host: { id?: string, has_host: boolean, host_id?: string }, screen_size: { width, height, rate }, sessions: { [id]: { id, profile: MemberProfile, state: SessionState } }, settings: Settings, touch_events: boolean, screencast_enabled: boolean, webrtc: { videos: string[] } }`.
- Server → Client (admins) `system/admin`: `{ screen_sizes_list: ScreenSize[], broadcast_status: { is_active: boolean, url?: string } }`.
- Server → Client `system/settings`: `{ id: string, ...Settings }` (actor `id`, new settings).
- Client → Server `system/logs`: `[{ level: 'debug'|'info'|'warn'|'error', fields?: object, message: string }]`.
- Server → Client `system/disconnect`: `{ message: string }`.
Signaling (WebRTC)
- Client → Server `signal/request`: `{ video: { disabled?: boolean, selector?: { id: string, type: 'exact'|'best' }, auto?: boolean }, audio: { disabled?: boolean }, auto?: boolean }`.
- Server → Client `signal/provide`: `{ sdp: string, iceservers: [{ urls: string[], username?: string, credential?: string }], video: { disabled: boolean, id: string, video?: string, auto: boolean }, audio: { disabled: boolean } }`.
- Client ↔ Server `signal/offer` | `signal/answer`: `{ sdp: string }` (direction depends on negotiation; Neko often offers first via `signal/provide`).
- Client → Server `signal/candidate`: `{ candidate: string, sdpMid?: string, sdpMLineIndex?: number }`.
- Client → Server `signal/video`: `{ disabled?: boolean, selector?: { id: string, type: 'exact'|'best' }, auto?: boolean }`.
- Client → Server `signal/audio`: `{ disabled?: boolean }`.
- Server → Client `signal/restart`: `{ sdp: string }` (ICE restart offer).
- Server → Client `signal/close`: null payload (peer disconnected).
Sessions
- Server → Client `session/created`: `{ id, profile: MemberProfile, state: SessionState }`.
- Server → Client `session/deleted`: `{ id }`.
- Server → Client `session/profile`: `{ id, ...MemberProfile }`.
- Server → Client `session/state`: `{ id, is_connected, connected_since?, not_connected_since?, is_watching, watching_since?, not_watching_since? }`.
- Server → Client `session/cursors`: `[{ id: string, cursors: { x: number, y: number }[] }]` (only when `settings.inactive_cursors` is enabled).
Control & Input
- Server ↔ Client `control/request`:
  - Client → Server: no payload (request host control).
  - Server → Client (to current host): `{ id: string }` (requester id).
- Server → Client `control/host`: `{ id: string, has_host: boolean, host_id?: string }`.
- Client → Server `control/release`: no payload (only host; requires `can_host`).
- Client → Server `control/move`: `{ x, y }`.
- Client → Server `control/scroll`: `{ delta_x?: number, delta_y?: number, control_key?: boolean }` (legacy `{ x, y }` supported).
- Client → Server `control/buttonpress|buttondown|buttonup`: `{ code: uint32, x?: number, y?: number }`.
- Client → Server `control/keypress|keydown|keyup`: `{ keysym: uint32, x?: number, y?: number }`.
- Client → Server `control/touchbegin|touchupdate|touchend`: `{ touch_id: uint32, x: number, y: number, pressure: uint8 }`.
- Client → Server `control/cut|control/copy|control/select_all`: no payload.
- Client → Server `control/paste`: `{ text?: string }` (optionally seeds clipboard first).
- Notes: Control requires `profile.can_host` and either being host or implicit hosting; clipboard ops also require `profile.can_access_clipboard`.
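The input events above are plain JSON messages, so an automation client can build them with small constructors. A sketch under the payload shapes listed above (the builder function names are hypothetical; a real sender must also hold host control):

```python
import json

def control_move(x, y):
    """control/move message, per the event list above."""
    return {"event": "control/move", "payload": {"x": x, "y": y}}

def control_keypress(keysym, x=None, y=None):
    """control/keypress message; keysym is an X11 keysym (uint32)."""
    payload = {"keysym": keysym}
    if x is not None:
        payload["x"] = x
    if y is not None:
        payload["y"] = y
    return {"event": "control/keypress", "payload": payload}

def control_scroll(delta_x=0, delta_y=0, control_key=False):
    """control/scroll message using the non-legacy delta fields."""
    return {"event": "control/scroll",
            "payload": {"delta_x": delta_x, "delta_y": delta_y,
                        "control_key": control_key}}

# Each dict would be json.dumps()'d and sent over the /api/ws socket.
print(json.dumps(control_move(100, 200)))
```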
Screen
- Client → Server `screen/set`: `{ width, height, rate }` (admin only).
- Server → Client `screen/updated`: `{ id: string, width, height, rate }`.
Clipboard
- Client → Server `clipboard/set`: `{ text: string }` (host with clipboard permission).
- Server → Client `clipboard/updated`: `{ text: string }` (sent to host on OS clipboard change).
Keyboard
- Client → Server `keyboard/map`: `{ layout?: string, variant?: string }` (host only).
- Client → Server `keyboard/modifiers`: `{ shift?, capslock?, control?, alt?, numlock?, meta?, super?, altgr? }` (host only; omitted fields unchanged).
Broadcast
- Server → Client (admins) `broadcast/status`: `{ is_active: boolean, url?: string }`.
Opaque Send Channel
- Client → Server `send/unicast`: `{ receiver: string, subject: string, body: any }` → forwarded to receiver as `{ sender, receiver, subject, body }`.
- Client → Server `send/broadcast`: `{ subject: string, body: any }` → broadcast to others as `{ sender, subject, body }`.
File Chooser Dialog
- Server → Client `file_chooser_dialog/opened`: `{ id: string }` (host holding dialog).
- Server → Client `file_chooser_dialog/closed`: `{}`.
Shared Types (payload shapes)
- `MemberProfile`: `{ name?: string, is_admin?, can_login?, can_connect?, can_watch?, can_host?, can_share_media?, can_access_clipboard?, sends_inactive_cursor?, can_see_inactive_cursors?, plugins?: object }`.
- `SessionState`: `{ is_connected: boolean, connected_since?: string, not_connected_since?: string, is_watching: boolean, watching_since?: string, not_watching_since?: string }` (ISO timestamps).
- `Settings` (WS): `{ private_mode, locked_logins, locked_controls, control_protection, implicit_hosting, inactive_cursors, merciful_reconnect, heartbeat_interval, plugins?: object }`.
- `ScreenSize`: `{ width: number, height: number, rate: number }`.
- `KeyboardMap`: `{ layout: string, variant: string }`; `KeyboardModifiers`: booleans listed above.
- `Cursor`: `{ x: number, y: number }`.
Heartbeat & Reconnect Behavior
- Layers of liveness:
  - WebSocket ping/pong: the server sends a WS Ping every ~10s and also emits `system/heartbeat` in-band. Browsers auto-reply with Pong; no client code is needed for WS Pong.
  - Client heartbeat event: clients may send `client/heartbeat` at a cadence derived from `settings.heartbeat_interval` (default 120s). The server accepts it but does not strictly require it for liveness; it is useful for analytics and legacy bridges.
- Settings and defaults:
  - `session.heartbeat_interval` (default 120s) is included in `system/init.settings.heartbeat_interval` for the client to display/use.
  - `session.merciful_reconnect` (default true): if a session is "already connected" and a new WS arrives, the server replaces the old connection instead of rejecting it. With this off, a second connection is rejected with reason "already connected".
- Timeouts and proxies:
  - Ensure reverse proxies have read timeouts comfortably above both the WS Ping cadence (~10s) and the client heartbeat interval (≥120s). Recommended: `proxy_read_timeout ≥ 300s`, and forward the Upgrade/Connection headers for WebSocket.
  - Avoid aggressive idle timeouts on L4/L7 load balancers that terminate idle TCP flows under a few minutes; set ≥5 minutes where possible.
- Disconnect semantics:
  - If a WS Ping fails or the socket errors, the server closes the connection and sends `system/disconnect {message}` when possible. The session is marked disconnected; clients should auto-reconnect and renegotiate WebRTC.
  - On WebRTC transport loss, the server emits `signal/close`; clients should re-issue `signal/request` or handle `signal/restart`.
- Troubleshooting frequent drops:
  - Blackouts every N seconds: increase reverse proxy `proxy_read_timeout`/keep-alive timeouts; confirm no CDN or web application firewall is in the WS path.
  - "already connected" on reconnect: enable/keep `session.merciful_reconnect=true` (default), or ensure clients close prior tabs before reconnecting.
  - Mobile/background tabs: some OSes suspend WS/Pong; expect disconnects when backgrounded. The client must reconnect on resume.
Legacy WebSocket Events (v2 Compatibility Proxy)
- Enabled when legacy mode is active (see `server/internal/http/manager.go`, which reads the `legacy` setting via viper). The legacy bridge maps v3 events to v2 event names for old clients and also emits some additional compatibility messages.
- Envelope: legacy messages also use `{event, ...payload}`.
- System: `system/init`: `{ locks: {login?, control?, file_transfer?}: string map, implicit_hosting: boolean, file_transfer: boolean, heartbeat_interval: number }`. `system/disconnect`: `{ title?: string, message: string }`. `system/error`: historical; emitted by legacy paths on errors.
- Members: `member/list`: `{ members: [{ id, name, admin: boolean, muted: boolean }] }`. `member/connected`: `{ id, name, admin, muted }`. `member/disconnected`: `{ id }`.
- Signaling: `signal/provide`: `{ id: string, sdp: string, lite: boolean, ice: [{ urls, username?, credential? }] }`. `signal/offer`: `{ sdp: string }`. `signal/answer`: `{ displayname: string, sdp: string }`. `signal/candidate`: `{ data: string }` (raw ICE JSON as string).
- Control/Admin (compatibility notifications): `control/locked`: `{ id: string }` (lock established). `control/release`: `{ id: string }`. `control/request`: client-originated; `control/requesting`: notification to host; `control/give`: `{ id, target }`. `control/clipboard`: `{ text: string }`. `control/keyboard`: `{ layout?: string, capsLock?: boolean, numLock?: boolean, scrollLock?: boolean }`. `admin/lock|unlock`: `{ resource: 'control'|'login'|'file_transfer', id: string }`. `admin/control|release|give`: `{ id: string }` or `{ id, target }`. `admin/ban|kick|mute|unmute`: `{ id: string }` or `{ target: string, id: string }`.
- Chat (legacy plugin messages): `chat/message`: send/receive `{ id?: string, content: string }`. `chat/emote`: send/receive `{ id?: string, emote: string }`.
- Filetransfer (legacy plugin messages): `filetransfer/list`: `{ cwd: string, files: [{ name, size, is_dir, mtime, perms, ... }] }`. `filetransfer/refresh`: same shape as list (triggered refresh).
- Screen/Broadcast: `screen/configurations`: `{ configurations: { [idx]: { width, height, rates: { [idx]: rate } } } }`. `screen/resolution`: `{ id?: string, width, height, rate }`. `screen/set`: `{ width, height, rate }`. `broadcast/status`: `{ url: string, isActive: boolean }`. `broadcast/create`: `{ url: string }`; `broadcast/destroy`: no payload.
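All legacy traffic shares the `{event, ...payload}` envelope, so a thin client shim can split event routing from payload handling. A minimal sketch (the helper names are mine, not Neko's):

```python
import json

def parse_envelope(raw: str) -> tuple[str, dict]:
    """Split a legacy message {event, ...payload} into (event, payload)."""
    msg = json.loads(raw)
    event = msg.pop("event")  # everything else is the payload
    return event, msg

def make_envelope(event: str, payload: dict) -> str:
    """Build an outbound legacy message with the same flat envelope shape."""
    return json.dumps({"event": event, **payload})
```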
1. What Neko Is (Concept & Origin)
Neko (often styled n.eko) is an open‑source, self‑hosted virtual browser / remote desktop environment: you run a containerized Linux desktop with a preinstalled browser (Firefox, Chromium, etc.) on your own infrastructure; Neko streams the interactive desktop (video, audio, input) to remote clients via WebRTC, so multiple participants can watch and even take control in real time. GitHub neko.m1k1o.net heise online
The project was started by its author after the shutdown of Rabb.it; needing a reliable way to watch anime remotely with friends over limited bandwidth + unstable Discord streaming, he built a WebRTC‑based Dockerized environment so everyone could share a single browser session. This collaborative genesis still shapes Neko’s multi‑user design (shared control queue, watch‑party friendliness). GitHub heise online
Neko targets privacy, isolation, and portability: browsing happens in the container, not on the viewer’s device; host fingerprints/cookies stay server‑side; nothing persistent need touch the client unless you configure it. This “shielded browser” model is highlighted in both the docs and independent coverage (Heise), which also frames Neko as a lightweight VPN alternative for accessing internal resources without distributing full desktop access. neko.m1k1o.net heise online fossengineer.com
2. Primary Use Cases
- Collaborative browsing & watch parties: All participants see the same live browser; host control can be passed; synchronized media playback works well because WebRTC streams the rendered video/audio from the container. GitHub neko.m1k1o.net fossengineer.com
- Interactive presentations, workshops, remote support: Presenter drives a shared browser/desktop; participants can be granted temporary control for demos or troubleshooting. Heise specifically calls out company trainings and support scenarios. GitHub heise online fossengineer.com
- Privacy / throwaway browsing / firewall bypass: Because traffic originates from the Neko host, users can browse sites blocked locally (subject to policy/ethics); community reports note using Neko to get around locked‑down work networks. heise online fossengineer.com Reddit
- Web dev & cross‑browser testing in controlled envs: Spin up specific browser versions (incl. Waterfox, Tor, Chromium variants) to test sites without polluting local machines. neko.m1k1o.net neko.m1k1o.net heise online
- Remote application streaming beyond browsers: Official images include full desktop environments (KDE, Xfce), Remmina (RDP/VNC client), VLC, and more; you can install arbitrary Linux GUI apps, turning Neko into a general remote app delivery layer. GitHub neko.m1k1o.net heise online
- Embedding into other web properties / programmatic rooms: Docs and community guides show URL query param auth for frictionless embedding; REST API + Neko Rooms enable dynamic, ephemeral shareable sessions. neko.m1k1o.net GitHub fossengineer.com
3. High‑Level Architecture
At a high level, a Neko deployment comprises:
- Server container(s): Run the Linux desktop + target browser/application; capture Xorg display frames + PulseAudio; encode via GStreamer; feed into WebRTC pipeline (Pion stack). GitHub neko.m1k1o.net GitHub
- Signaling / control plane: HTTP + WebSocket endpoints manage sessions, auth, and host‑control; periodic ping/heartbeat maintain liveness (esp. behind proxies). neko.m1k1o.net neko.m1k1o.net GitHub
- WebRTC media plane: ICE negotiation (STUN/TURN) to establish peer link(s); selectable port strategy (ephemeral range vs. UDP/TCP mux single port); optional Coturn relay for NAT‑restricted environments. neko.m1k1o.net neko.m1k1o.net GitHub
- Client UI (served over HTTPS): Browser front‑end page that renders the stream in a canvas/video element, sends input events (mouse/keyboard), displays participant cursors, chat/plugins, and exposes host‑control queue. GitHub neko.m1k1o.net neko.m1k1o.net
- Optional ecosystem services: REST API, Prometheus metrics exporter, plugin hooks (chat, file upload), and higher‑level orchestration projects (Neko Rooms / Apps / VPN). GitHub neko.m1k1o.net neko.m1k1o.net
4. Feature Inventory (v3 era)
- Multi‑user concurrent session w/ host handoff + inactive cursors: Participants can join; privileges (watch / host / share media / clipboard) governed per‑member profile. neko.m1k1o.net GitHub neko.m1k1o.net
- Audio + video streaming w/ low latency: WebRTC transport from container to clients; GStreamer capture; stream selector to adjust quality. neko.m1k1o.net GitHub neko.m1k1o.net
- GPU acceleration modes (Intel/Nvidia flavors) & CPU builds: Select appropriate image flavor to offload encoding & improve responsiveness; GPU support maturity varies—docs caution focus currently on CPU images. neko.m1k1o.net heise online neko.m1k1o.net
- Granular auth/authorization (admin vs user; fine‑grained caps): Role bits include can_login, can_connect, can_watch, can_host, can_share_media, can_access_clipboard, etc.; supports multiuser password split, file‑backed users, in‑memory object sets, and no‑auth (dev only). neko.m1k1o.net GitHub
- REST API + API token (admin programmatic control) & batch HTTP: Added in v3; enables external orchestration, dynamic user provisioning, and admin operations without interactive login; API token should be short‑lived in ephemeral rooms. GitHub neko.m1k1o.net
- Prometheus metrics & pprof profiling: Expose runtime health / performance metrics; integrate into observability stacks; profiling hooks assist tuning. GitHub neko.m1k1o.net
- Desktop quality‑of‑life: Clipboard reworked via xclip; drag‑and‑drop & file chooser upload; touchscreen input driver; dynamic resolution via xrandr; cursor image events. GitHub neko.m1k1o.net
- Screencast endpoints + webcam/mic passthrough (experimental): HTTP/JPEG screencast endpoints for snapshots/casting; optional upstream of user webcam/mic. GitHub neko.m1k1o.net
- Plugin system (chat, file upload, user‑scoped plugin config map). GitHub neko.m1k1o.net neko.m1k1o.net
5. Supported Browsers / Apps / Desktops
Neko ships many tagged images; availability varies by architecture and GPU flavor. Current matrix (AMD64 strongest support): Firefox, Waterfox, Tor Browser; Chromium family incl. Google Chrome, Microsoft Edge, Brave, Vivaldi, Opera; plus Ungoogled Chromium. Additional desktop/media apps: KDE, Xfce, Remmina, VLC. ARM support exists for subsets (e.g., Brave & Vivaldi on ARM64; some lack DRM). neko.m1k1o.net heise online GitHub
Community packages (Umbrel) surface a streamlined install for home servers; Umbrel metadata shows current packaged version (3.0.4 at capture) and highlights collaboration + tunneling access patterns. apps.umbrel.com neko.m1k1o.net
6. Deployment Overview (Minimal to Advanced)
6.1 Quick Minimal Docker Run
Pull an image (e.g., the Firefox flavor) and run it mapping the HTTP and WebRTC ports; provide screen size and user/admin passwords via env vars; size shared memory for modern browsers (e.g., 2 GB). A community docker‑compose example (FOSS Engineer) maps `8888:8080` plus a `52000-52100/udp` EPR range and sets `NEKO_MEMBER_MULTIUSER_*` passwords. fossengineer.com neko.m1k1o.net neko.m1k1o.net
6.2 Choosing Registry & Tags
Prefer GitHub Container Registry (GHCR) for stable, flavor‑specific version tags; Docker Hub hosts latest dev (amd64) convenience builds. Semantic versioning (MAJOR.MINOR.PATCH) is supported; `latest` points at the most recent stable—pin explicit tags for reproducibility. neko.m1k1o.net Docker Hub
6.3 Selecting Flavors (CPU vs GPU)
The image suffix selects the hardware‑acceleration stack: `nvidia-*` for CUDA GPUs (AMD64), `intel-*` for VA‑API/QuickSync paths, or base CPU images. Docs caution GPU support may lag; verify in your environment. neko.m1k1o.net heise online
6.4 Architecture Match & Resource Planning
Images published for linux/amd64, arm64, arm/v7; not every browser builds on all arches; some Chromium‑derived variants require ≥2GB RAM (Heise). Check the docs availability matrix before pulling. neko.m1k1o.net heise online
6.5 Persistent State (Data Volumes)
While Neko can be run “throwaway,” you may bind‑mount config, member files, and persistent browser profiles to retain bookmarks, extensions (if policy permits), and user lists; the docs show file/member providers referencing host paths (e.g., `/opt/neko/members.json`). neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net
7. Networking & WebRTC Ports
7.1 Why Ports Matter
WebRTC media does not traverse your HTTP reverse proxy; you must expose the negotiated media ports (or provide a TURN relay). If you only open 443, connections will fail unless multiplexing or a relay is used. neko.m1k1o.net neko.m1k1o.net Reddit
7.2 Ephemeral UDP Port Range (EPR)
Configure `NEKO_WEBRTC_EPR` (e.g., `59000-59100`) and expose an identical host:container UDP range; don’t remap—ICE candidates must match reachable ports. neko.m1k1o.net
7.3 UDP/TCP Multiplexing
Alternatively, specify single `udpmux`/`tcpmux` ports when firewall pinholes are scarce; open both protocols for fallback where UDP is blocked. neko.m1k1o.net
7.4 Public vs NAT’d IPs
Set `nat1to1` when advertising a different reachable address (NAT hairpin caveats apply), or provide an IP retrieval URL to auto‑detect the public address; otherwise ICE may hand out unroutable candidates. neko.m1k1o.net
7.5 TURN Integration
Provide STUN/TURN server JSON (frontend/back‑end separation) via env vars; example Coturn compose snippet in docs; TURN recommended when clients sit behind strict NAT/firewalls. neko.m1k1o.net neko.m1k1o.net
7.6 Real‑World Gotchas
Community reverse‑proxy thread shows mis‑set X‑Forwarded headers and missing additional port exposures leading to 502s; verifying correct WebRTC ports resolved issues for some users. Reddit neko.m1k1o.net
8. Reverse Proxy Patterns (HTTP Plane)
8.1 Enable Proxy Trust
Set `server.proxy=true` so Neko honors `X-Forwarded-*` headers (important for logging, CSRF, and cookie domain/path). The docs warn to adjust WebSocket timeouts because Neko pings every ~10s and expects a client heartbeat every ~120s. neko.m1k1o.net neko.m1k1o.net
8.2 Traefik v2 Example
Label‑driven routing to backend port `8080`; integrate a TLS cert resolver; ensure UDP media ports are separately exposed. neko.m1k1o.net neko.m1k1o.net
8.3 Nginx Example & Header Hygiene
A minimal conf proxies HTTP plus the WebSocket upgrade; you may add X‑Forwarded‑For/Proto, cache bypass, and long read timeouts—legacy v2 docs show an extended header set; community notes correct the `X-Forwarded-Proto` spelling vs “Protocol.” neko.m1k1o.net neko.m1k1o.net Reddit
8.4 Apache, Caddy, HAProxy Templates
Docs provide working snippets, including the WebSocket rewrite for Apache, a one‑liner `reverse_proxy` for Caddy with automatic HTTPS, and an HAProxy ACL routing recipe with timeout‑tuning guidance. neko.m1k1o.net neko.m1k1o.net
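For reference, the Caddy pattern really is a near one‑liner. A sketch with a placeholder hostname and upstream — Caddy provisions TLS and handles the WebSocket upgrade automatically, but WebRTC media ports still bypass it:

```caddyfile
neko.example.com {
    reverse_proxy 127.0.0.1:8080
}
```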
9. Authentication & Authorization
9.1 Member vs Session Providers
Auth split: Member Provider validates credentials + returns capability profile; Session Provider persists session state (memory/file). Single member provider active at a time. neko.m1k1o.net
9.2 Capability Flags (Granular Rights)
Per‑user profile booleans drive UI & backend enforcement: admin status; login/API; connect vs watch; host control; share media; clipboard access; send inactive cursor; see inactive cursors; plugin‑specific keys. neko.m1k1o.net GitHub
9.3 Provider Types
- Multiuser: Two shared passwords (admin/user) generate ephemeral usernames; mirrors legacy v2 behavior.
- File: Persistent JSON map of users → (optionally hashed) passwords + profiles.
- Object: In‑memory static list; duplicate logins are not allowed.
- No‑Auth: Open guest access (testing only—dangerous). neko.m1k1o.net
9.4 API User Token
Separate non‑interactive admin identity for HTTP API calls; cannot join media; recommend ephemeral/rotated tokens (avoid long‑lived static in exposed rooms). neko.m1k1o.net GitHub
9.5 Cookie Controls
Session cookie name, expiry, secure, httpOnly, domain/path configurable; disabling cookies falls back to token in client local storage—less secure (XSS risk). Keep cookies enabled for production. neko.m1k1o.net
10. Security Considerations
- Surface reduction via containerization: Browsing occurs inside an isolated container; you can discard state or run read‑only images for throwaway sessions; community privacy guides emphasize non‑retention setups. fossengineer.com GitHub
- Transport security & certs: Terminate TLS at your reverse proxy (Traefik/Caddy/Certbot etc.); ensure WebSocket upgrades & long timeouts; see official reverse proxy examples. neko.m1k1o.net neko.m1k1o.net
- Auth hardening: Use strong unique admin/user passwords (or file/object providers w/ hashed credentials); avoid enabling no‑auth in public deployments; scope API tokens tightly. neko.m1k1o.net neko.m1k1o.net
- Cookie vs token leakage: Leaving cookies enabled (secure, httpOnly) prevents script access to session; disabling pushes token into JS‑accessible storage increasing exfiltration risk. neko.m1k1o.net
- Firewalling media ports: Only expose required UDP/TCP ranges; where possible, restrict source IPs or require authenticated TURN; community reports of leaving ports closed manifest as connection failures rather than leaks—but mis‑config can open broad EPR ranges; plan network policy. neko.m1k1o.net Reddit
- Extension install policy: Browser policies in images may block arbitrary extension installs; you must explicitly allow if you need them—reduces attack surface by default. neko.m1k1o.net neko.m1k1o.net
11. Performance & Tuning
- Screen resolution & frame rate: The `NEKO_DESKTOP_SCREEN`/`NEKO_SCREEN` env controls the virtual display mode (e.g., `1920x1080@30`); higher rates mean more bandwidth/CPU/GPU; choose based on clients & uplink. fossengineer.com neko.m1k1o.net
- Shared memory size: Modern Chromium‑family browsers need a large `/dev/shm`; examples allocate `shm_size: 2gb`; undersizing leads to crashes. fossengineer.com neko.m1k1o.net
- Bandwidth estimator (experimental adaptive bitrate): An optional server‑side estimator can downgrade/upgrade encodes based on measured throughput; disabled by default; numerous thresholds/backoffs are tunable. neko.m1k1o.net GitHub
- Hardware accel vs CPU encode tradeoffs: GPU flavors reduce encode latency but add driver complexity; docs call out limited support maturity; Heise notes Neko can leverage Intel/Nvidia accelerated builds. neko.m1k1o.net heise online
- Resource guidance for Chromium variants: Heise recommends allocating ≥2GB RAM when running Chromium‑based browsers in containers; plan host sizing accordingly. heise online neko.m1k1o.net
12. Administration & Operations
- Logging & Debugging: Enable debug logging via `log.level=debug` or env `NEKO_DEBUG=1`; GStreamer verbosity via `GST_DEBUG`; Pion debug via `PION_LOG_DEBUG=all`; inspect docker logs and the browser dev console. neko.m1k1o.net GitHub
- Metrics & Profiling: The Prometheus metrics endpoint and pprof instrumentation introduced in v3 support operational monitoring and performance investigation. GitHub neko.m1k1o.net
- Upgrades / Migration from v2: v3 modularized the config; backward‑compatibility shims exist but are deprecated; consult the v3 docs plus the legacy reverse‑proxy header diffs when migrating. GitHub neko.m1k1o.net neko.m1k1o.net
- Embedding Auto‑Login: For kiosk/iframe use, append `?usr=<user>&pwd=<pwd>` to the URL to bypass the login prompt for viewers—use carefully; combine with a restricted capability profile. neko.m1k1o.net neko.m1k1o.net
- Clipboard Behavior: When accessed over HTTPS in supported host browsers (Chromium family), Neko hides its own clipboard button, deferring to native Clipboard API integration; this is not a bug. neko.m1k1o.net GitHub
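When generating embed links programmatically, build the auto‑login query safely rather than string‑concatenating. A sketch assuming the `usr`/`pwd` params described above (helper name is mine):

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def embed_url(base: str, user: str, password: str) -> str:
    """Append ?usr=...&pwd=... auto-login params to a Neko room URL.

    Pair this with a restricted (e.g., view-only) capability profile: the
    credentials are visible to anyone who can read the URL."""
    scheme, netloc, path, query, frag = urlsplit(base)
    extra = urlencode({"usr": user, "pwd": password})
    query = f"{query}&{extra}" if query else extra
    return urlunsplit((scheme, netloc, path, query, frag))
```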
13. Ecosystem Projects
- Neko Rooms: Multi‑room orchestration wrapper that spins up independent Neko instances (ephemeral or persistent) with simplified onboarding (scripts, HTTPS via Let’s Encrypt, Traefik/NGINX automation); useful when you need per‑group isolation. neko.m1k1o.net fossengineer.com Reddit
- Neko Apps: Library of containerized app bundles beyond browsers—expands use cases to general remote Linux app streaming; complements Rooms for scaling out multi‑app catalogs. fossengineer.com neko.m1k1o.net
- Neko VPN (experimental): Mentioned in docs nav as companion project enabling tunneled access paths; explore if you need integrated network overlay to reach internal apps through Neko. neko.m1k1o.net neko.m1k1o.net
- Umbrel Packaging: Curated home‑server integration; one‑click install, Umbrel tunneling for remote reachability, version tracking; good for homelab / non‑Docker‑experts. apps.umbrel.com neko.m1k1o.net
14. Comparison Touchpoints
- vs. Kasm Workspaces: Heise positions Neko as the lightweight alternative—Kasm provides full multi‑tenant workspace management & security layers but is heavier; Neko is simpler, container‑first, optimized for shared live sessions rather than individual isolated desktops (though you can run per‑user instances). heise online GitHub
- vs. Hyperbeam API (hosted embeddable co‑browse): Neko offers a similar embeddable shared browser experience but is self‑hosted, giving you data control & on‑prem compliance; Heise explicitly calls out analogous embedding. heise online neko.m1k1o.net
- vs. Generic Remote Desktop (VNC/NoVNC/Guacamole): WebRTC yields smoother video + audio sync and lower interactive latency compared to image‑diff or poll‑based remotes; community commentary and docs emphasize superior streaming for media/watch usage. fossengineer.com GitHub
15. Practical Config Snippets
```yaml
version: "3.4"
services:
  neko:
    image: "ghcr.io/m1k1o/neko/firefox:latest"  # pick flavor/tag
    restart: unless-stopped
    shm_size: 2gb
    ports:
      - "8080:8080"                     # HTTP / signaling
      - "59000-59100:59000-59100/udp"   # WebRTC EPR
    environment:
      NEKO_DESKTOP_SCREEN: 1920x1080@30
      NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD: ${NEKO_ADMIN:?err}
      NEKO_MEMBER_MULTIUSER_USER_PASSWORD: ${NEKO_USER:?err}
      NEKO_WEBRTC_EPR: 59000-59100
      NEKO_WEBRTC_ICELITE: 1
      NEKO_DEBUG: 0
      # optionally front/back STUN/TURN JSON:
      # NEKO_WEBRTC_ICESERVERS_FRONTEND: '[{"urls":["stun:stun.l.google.com:19302"]}]'
      # NEKO_WEBRTC_ICESERVERS_BACKEND: '[]'
    # volumes:  # mount persistent member/session files if using the file provider
    #   - ./data:/opt/neko
```
fossengineer.com neko.m1k1o.net neko.m1k1o.net
```nginx
server {
    listen 443 ssl http2;
    server_name neko.example.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
    }
}
```
[neko.m1k1o.net](https://neko.m1k1o.net/docs/v3/reverse-proxy-setup) [neko.m1k1o.net](https://neko.m1k1o.net/docs/v2/reverse-proxy) [Reddit](https://www.reddit.com/r/nginxproxymanager/comments/ut8zyu/help_with_setting_up_reverse_proxy_custom_headers/)
```yaml
# Minimal member provider switch to file‑backed users with hashed passwords
member:
  provider: file
  file:
    path: "/opt/neko/members.json"
    hash: true
session:
  file: "/opt/neko/sessions.json"
  api_token: "<short-lived-random-hex>"
```
16. Operational Runbook Checklist
- Preflight: Pick image flavor + arch; allocate ≥2GB RAM (Chromium); set shm_size; open media ports (EPR or mux); decide auth provider; create strong creds. neko.m1k1o.net heise online neko.m1k1o.net
- Launch: Compose up; confirm logs show listening on 8080 + WebRTC ports; test LAN client first; verify ICE candidates reachable (browser dev console). neko.m1k1o.net neko.m1k1o.net
- Secure: Put behind TLS proxy; enable proxy trust; restrict ports/firewall; rotate API tokens; store hashed passwords. neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net
- Scale / Multi‑tenant: Use Neko Rooms or orchestration (k8s, compose bundles) to spin per‑team instances; leverage REST API + metrics for automation & autoscaling triggers. GitHub neko.m1k1o.net fossengineer.com
- Troubleshoot: Turn on debug envs; inspect GStreamer logs for encode issues; validate reverse proxy headers; check that WebRTC ports aren’t blocked (common 502 confusion). neko.m1k1o.net Reddit neko.m1k1o.net
17. Roadmap Glimpses & Future Directions
Recent release notes hint at additional session backends (Redis/Postgres), richer plugin ecosystem, and potential RDP/VNC relay modes where Neko acts as a WebRTC gateway rather than running the browser locally. Heise reports interest in direct protocol relay; docs flag “in the future” for expanded session providers. heise online neko.m1k1o.net GitHub
18. Community Lore / Field Notes
Homelabbers use Neko to co‑watch media, punch through restrictive corporate firewalls (when Neko host has outbound freedom), and expose full Linux desktops (KDE) to lightweight tablets. These anecdotes underscore why low‑latency WebRTC streaming and easy multi‑user control were prioritized. Reddit fossengineer.com GitHub
Reverse‑proxy misconfig (wrong header name, missing EPR exposure) is a recurring community stumbling block; always validate both HTTP and media planes. Reddit neko.m1k1o.net neko.m1k1o.net
Neko Browser + Playwright/CDP Integration Deep Dive
Neko (“n.eko”) is an open‑source, self‑hosted virtual browser that streams a full Linux desktop (not just a headless DOM) over WebRTC so multiple remote users can view and interactively control the same session in real time. It targets collaborative browsing, watch parties, remote support, embedded browser surfaces, and hardened “throwaway” cloud browsing where nothing persists locally. neko.m1k1o.net GitHub heise online
Unlike simple remote‑automation containers, Neko can run any Linux GUI application—browsers (Firefox, Chromium, Brave, Vivaldi, Waterfox, Tor, etc.), media players like VLC, full desktop environments (XFCE, KDE), and bespoke tools—because it captures an Xorg display and streams audio/video frames to clients via WebRTC. This breadth makes it viable for shared debugging sessions, interactive presentations, and as a privacy “jump box” into otherwise restricted networks. GitHub heise online Umbrel App Store
Multi‑user collaboration is first‑class: user roles, admin elevation, shared cursor visibility, host (keyboard/mouse) control arbitration, clipboard access, media sharing, and plugin‑scoped per‑user settings are governed by Neko’s v3 authentication system (Member + Session providers). This replaces v2’s simple dual‑password model and lets you express richer authorization matrices or plug in external identity sources. neko.m1k1o.net neko.m1k1o.net
Versioning: v3 vs v2 & Legacy Mode
Neko v3 reorganized configuration into modular namespaces (server, member, session, webrtc, desktop, capture, plugins, etc.) and introduced providers; however, v3 retains backward compatibility with v2 environment variables when `NEKO_LEGACY=true` is set (and some legacy features are auto‑detected). A migration table maps every major v2 var to its v3 equivalent (e.g., `NEKO_SCREEN` → `NEKO_DESKTOP_SCREEN`; `NEKO_PASSWORD` → `NEKO_MEMBER_MULTIUSER_USER_PASSWORD`; `NEKO_NAT1TO1` → `NEKO_WEBRTC_NAT1TO1`). This is critical when modernizing older compose files (like the snippet you shared) to avoid silent fallbacks and dual‑stream cursor quirks. neko.m1k1o.net neko.m1k1o.net
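The mapping lends itself to a mechanical rewrite of old compose environments. A sketch limited to variables named in this document plus the v2 admin password (the official migration table covers many more; treat the table below as illustrative, not exhaustive):

```python
# v2 → v3 variable names; NEKO_PASSWORD_ADMIN is the v2-era admin counterpart
# of NEKO_PASSWORD (an assumption worth verifying against the official table).
V2_TO_V3 = {
    "NEKO_SCREEN": "NEKO_DESKTOP_SCREEN",
    "NEKO_PASSWORD": "NEKO_MEMBER_MULTIUSER_USER_PASSWORD",
    "NEKO_PASSWORD_ADMIN": "NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD",
    "NEKO_NAT1TO1": "NEKO_WEBRTC_NAT1TO1",
    "NEKO_ICELITE": "NEKO_WEBRTC_ICELITE",
    "NEKO_EPR": "NEKO_WEBRTC_EPR",
}

def migrate_env(env: dict[str, str]) -> dict[str, str]:
    """Rewrite v2 keys to their v3 names, leaving unknown keys untouched."""
    return {V2_TO_V3.get(k, k): v for k, v in env.items()}
```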
Heise’s Neko 3.0 coverage underscores why migrating matters: new browser flavors (Waterfox, additional Chromium builds, ARM variants), GPU‑accelerated options, HTTP/JPEG screencast endpoints, plugin ecosystem growth, and structural config changes—all shipping under a maintained Apache‑2.0 project—mean staying current pays dividends in stability and capability. heise online GitHub
Community quick‑start guides still widely circulate v2 envs (e.g., `NEKO_SCREEN`, `NEKO_PASSWORD`, `NEKO_ICELITE`, `NEKO_EPR`), which “work” only because legacy support remains—but they obscure v3 tuning knobs and can yield performance or auth surprises (e.g., no granular per‑user policy). Use the migration mapping to upgrade; I’ll show a patched compose below. fossengineer.com neko.m1k1o.net
Authentication Model (v3)
Authentication splits into a Member Provider (who are you? what can you do?) and a Session Provider (state & tokens). The multiuser provider emulates v2’s “user password” + “admin password” flow; you enable it via `NEKO_MEMBER_PROVIDER=multiuser`, then supply `NEKO_MEMBER_MULTIUSER_USER_PASSWORD` and `NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD`, optionally overriding the default per‑role capability profiles (host, watch, clipboard, etc.). For tighter control, switch to the file or object providers to define fixed accounts, hashed passwords, and granular profiles; or noauth for unsecured demo setups (never production). Session storage can persist to file; API access can be separately tokenized (`NEKO_SESSION_API_TOKEN`). neko.m1k1o.net
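Put together, a compose `environment` fragment for the multiuser provider might look like this sketch (values are placeholders; this is not the full config surface):

```yaml
environment:
  NEKO_MEMBER_PROVIDER: multiuser
  NEKO_MEMBER_MULTIUSER_USER_PASSWORD: ${NEKO_USER_PW:?set me}
  NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD: ${NEKO_ADMIN_PW:?set me}
  # optional: separate token for non-interactive HTTP API access
  NEKO_SESSION_API_TOKEN: ${NEKO_API_TOKEN:-}
```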
When exposing Neko programmatically (embedding in an app, auto‑provisioning rooms, LLM agents), consider disabling cookies or providing short‑lived API tokens; but weigh the increased XSS risk if tokens leak into client JS when cookies are off. v3 exposes cookie flags (`secure`, `http_only`, domain/path scoping) so you can harden deployment behind TLS. neko.m1k1o.net
WebRTC Transport Essentials
For smooth low‑latency A/V + input streaming you must correctly expose Neko’s WebRTC ports. Three main patterns:
- Ephemeral UDP Port Range (EPR) — Specify a contiguous range (e.g., `56000-56100`) via `NEKO_WEBRTC_EPR` and map the exact same range host:container without remapping. Each new participant consumes ports; size the range accordingly. neko.m1k1o.net
- UDP/TCP Multiplexing — Collapse to a single well‑known port (e.g., `59000`) as `NEKO_WEBRTC_UDPMUX`/`NEKO_WEBRTC_TCPMUX` for NAT‑challenged environments; trades throughput. neko.m1k1o.net
- ICE Servers — Provide a STUN/TURN front/back split: `NEKO_WEBRTC_ICESERVERS_FRONTEND` (what clients see) and `..._BACKEND` (what the server dials internally); JSON‑encoded arrays. Required when clients are off‑LAN and UDP paths are blocked. neko.m1k1o.net
If you run behind NAT, set `NEKO_WEBRTC_NAT1TO1` to the public (hairpin‑reachable) address; otherwise clients may receive an ICE candidate with a private IP and fail to connect. Automatic public‑IP fetch is available, but you can override it with `NEKO_WEBRTC_IP_RETRIEVAL_URL`. neko.m1k1o.net neko.m1k1o.net
Do not rely on your HTTP reverse proxy to relay WebRTC media. Nginx/Traefik only front the signaling/control (HTTP(S)/WS) on port 8080; actual RTP/DTLS flows use the ports you expose above and must be reachable end‑to‑end or via TURN. neko.m1k1o.net neko.m1k1o.net
Reverse Proxy & Timeouts
When fronting Neko with nginx/Traefik/etc., enable proxy trust in the server config (`server.proxy=true` / `NEKO_SERVER_PROXY=1` in v3) so real client IPs from `X-Forwarded-*` are honored. Neko sends WS pings about every 10s; clients heartbeat about every 120s—so bump proxy read timeouts accordingly or users will drop during long idle automation runs. The official nginx sample shows the required `Upgrade`/`Connection` headers for the WebSocket upgrade; community Nginx Proxy Manager threads confirm these plus an extended `proxy_read_timeout` and forwarded‑IP headers to avoid 502s and broken control channels. neko.m1k1o.net Reddit
Container Security / Chromium Sandboxing
Chromium inside containers often needs elevated namespaces to run its sandbox; many headless automation images either add `--no-sandbox` (reduced isolation) or grant `--cap-add=SYS_ADMIN` and supporting kernel flags so Chrome’s sandbox works. Puppeteer’s Docker docs call out the SYS_ADMIN requirement for their hardened image; Neko’s own v2 troubleshooting notes that forgetting SYS_ADMIN yields a black screen in Chromium variants—evidence the capability remains relevant. Decide: secure host kernel + allow SYS_ADMIN (preferred for the full sandbox), or run `--no-sandbox` and accept the risk; the sample supervisord snippet you posted already includes `--no-sandbox`, so SYS_ADMIN is belt‑and‑suspenders but still recommended for stability in GPU/namespace operations. pptr.dev neko.m1k1o.net
Enabling Chrome DevTools Protocol (CDP) in Neko for Playwright
Your goal: let humans drive the streamed Neko Chromium UI and attach automation via Playwright. Playwright supports attaching to any existing Chromium instance that exposes a DevTools endpoint via `chromium.connectOverCDP(endpointURL)`, where `endpointURL` can be the HTTP JSON version URL or a direct WS endpoint; the returned `browser` exposes existing contexts/pages. Lower fidelity than the full Playwright protocol, but ideal for “co‑drive” scenarios. Playwright GitHub
Once connected, you can open a raw CDPSession per page/context to send protocol commands (e.g., `Runtime.evaluate`, `Animation.enable`), mirroring the manual WebSocket probes in your `test.js`. This is useful for diagnostics, performance metrics, and low‑level tweaks Playwright doesn’t expose natively. Playwright Playwright
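A minimal co‑drive sketch using Playwright's Python sync API, assuming the DevTools endpoint is forwarded to `localhost:9222` as in the sidecar pattern this document describes; requires `playwright` installed and a live Neko Chromium, so treat it as illustrative:

```python
def cdp_endpoint(host: str = "localhost", port: int = 9222) -> str:
    """HTTP endpoint accepted by connect_over_cdp (it discovers the WS URL itself)."""
    return f"http://{host}:{port}"

def codrive(url: str) -> str:
    """Attach to the already-running Neko Chromium, drive the visible page,
    and return its title. Humans watching the WebRTC stream see every step."""
    from playwright.sync_api import sync_playwright  # deferred: optional dependency
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_endpoint())
        context = browser.contexts[0]  # reuse the existing (streamed) context
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url)
        # Raw CDP session for protocol-level commands, as discussed above:
        session = context.new_cdp_session(page)
        session.send("Runtime.evaluate", {"expression": "document.title"})
        return page.title()

# Usage (needs a running Neko + forwarded CDP port):
#   codrive("https://example.com")
```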
Remote Debugging Flags & Port Forward Pattern
Modern Chromium removed unrestricted `--remote-debugging-address=0.0.0.0` for security; recommended practice is to bind the DevTools socket to localhost within the container (e.g., `--remote-debugging-port=9223`), then selectively forward or reverse‑proxy it to an external port (e.g., 9222) with an auth/ACL layer (nginx, socat, SSH tunnel). Your nginx‑cdp sidecar implements precisely this 9222→9223 pass‑through with WebSocket upgrade and long timeouts—aligning with guidance from the Dockerized Chromium remote‑debugging discussion. Stack Overflow neko.m1k1o.net
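That forward pattern can be sketched as an nginx server block (ports per the description above; directives are standard nginx, but add auth or allow‑listing before exposing this beyond localhost):

```nginx
# Sidecar sketch: expose container-local DevTools (9223) on 9222,
# preserving the WebSocket upgrade CDP needs.
server {
    listen 9222;
    location / {
        proxy_pass http://127.0.0.1:9223;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 3600s;   # keep long-lived CDP sockets open
        proxy_send_timeout 3600s;
    }
}
```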
Review of Your web-agent/neko-with-playwright Compose Snippet
You posted a two‑service stack: `neko` (using `m1k1o/neko:chromium`) and an `nginx-cdp` sidecar sharing its network namespace (service `network_mode`); supervisord launches Chromium with CDP flags and disables sandbox/GPU; nginx maps host 9222 to internal 9223 to front DevTools with WS keepalive/timeouts. Ports published: 52000→8080 (tcp?) and 9222 (tcp). Issues & improvements:
1. Legacy Env Vars – You’re mixing v2 (`NEKO_SCREEN`, `NEKO_PASSWORD*`, `NEKO_ICELITE`, `NEKO_NAT1TO1`) in a v3 world; while legacy support exists, you lose granular control and risk double cursor streams (cursor once in the video, once separately) plus awkward auth extension later. Upgrade to v3 vars (`NEKO_DESKTOP_SCREEN`, `NEKO_MEMBER_PROVIDER=multiuser`, `NEKO_MEMBER_MULTIUSER_*`, `NEKO_WEBRTC_ICELITE`, `NEKO_WEBRTC_NAT1TO1`). neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net
2. Missing WebRTC Ports – No UDP EPR or mux port is exposed, so remote WebRTC will fail off‑box unless clients are on the container host network and fallback mechanisms kick in. Add either an EPR range mapping with `NEKO_WEBRTC_EPR`, or a UDPMUX/TCPMUX single‑port mapping. neko.m1k1o.net neko.m1k1o.net
3. Public vs Private Subnet – Your custom Docker subnet `17.100.0.0/16` collides with publicly routed Apple allocations (17.0.0.0/8 is owned by Apple); choose RFC1918 space (e.g., `172.31.0.0/16` or `10.67.0.0/16`) to avoid confusing clients that see ICE candidates referencing real vs container ranges. Proper NAT1TO1 matters when advertising ICE addresses. neko.m1k1o.net neko.m1k1o.net
4. Proxy Headers & Timeouts – A good start; ensure `proxy_read_timeout` is comfortably above Neko’s WebSocket ping interval (~10s)—set ≥60s or higher—and that `NEKO_SERVER_PROXY=1` (or config) is set so Neko trusts forwarded IPs; align with the official reverse proxy doc and the community NPM thread. neko.m1k1o.net Reddit
5. Chromium Capability / Sandbox – You added `cap_add: SYS_ADMIN` (good) and `--no-sandbox` (less secure). Consider removing `--no-sandbox` once you confirm kernel support; Neko experiences black screens without SYS_ADMIN in Chromium images; Puppeteer’s hardened‑image docs reinforce granting SYS_ADMIN if you want the sandbox. pptr.dev neko.m1k1o.net
6. Password Hygiene – Hard‑coding `neko`/`admin` is fine for testing but never production; switch to secrets or `.env` injection; the multiuser provider makes this easy. neko.m1k1o.net neko.m1k1o.net
7. NAT Hairpin & ICE Lite – You set `NEKO_ICELITE=0` (full ICE) and NAT1TO1 to the container IP; if you actually need WAN access, supply your public IP or domain; ICE Lite mode is only appropriate when the server has a public reflexive address, and the official doc warns not to mix it with external ICE servers. neko.m1k1o.net
8. Debug Logging – When diagnosing CDP or WebRTC handshakes, enable `NEKO_DEBUG=1` and optionally `GST_DEBUG` per the FAQ; a huge time saver. neko.m1k1o.net
Hardened & Modernized Compose Example (v3 Vars, CDP Enabled)
Below is an updated docker-compose.yml (org-mode src). Key changes:
- Switched to an explicit GHCR version tag (pinned for reproducibility).
- RFC1918 subnet.
- Proper WebRTC EPR exposure.
- v3 auth vars.
- Proxy flag so Neko trusts the sidecar.
- Optional API token for automation management.
- Chromium started with localhost-bound remote debugging; the nginx sidecar terminates TLS (optional) and applies ACLs; you can env-inject the allowed upstream (e.g., an ngrok tunnel).
- Dropped --no-sandbox (left commented) to prefer the secure sandbox; toggle per your threat model.
- Added healthcheck & log volumes.
version: "3.8"
x-neko-env: &neko-env
NEKO_DESKTOP_SCREEN: 1920x1080@30
NEKO_MEMBER_PROVIDER: multiuser
NEKO_MEMBER_MULTIUSER_USER_PASSWORD: ${NEKO_USER_PASSWORD:-neko}
NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD: ${NEKO_ADMIN_PASSWORD:-admin}
NEKO_WEBRTC_EPR: 56000-56100 # match ports below
NEKO_WEBRTC_ICELITE: "false" # full ICE unless static public IP
NEKO_WEBRTC_NAT1TO1: ${NEKO_PUBLIC_IP:-auto} # set literal IP or leave unset to auto-detect
NEKO_SERVER_PROXY: "true" # trust reverse proxy headers
NEKO_SESSION_API_TOKEN: ${NEKO_API_TOKEN:-} # optional; blank disables
NEKO_DEBUG: ${NEKO_DEBUG:-0}
services:
neko:
image: ghcr.io/m1k1o/neko/chromium:3.0.4
container_name: neko
restart: unless-stopped
shm_size: 2gb
networks:
proxy:
ipv4_address: 172.31.0.3
ports:
- "8080:8080/tcp" # web / signaling
- "56000-56100:56000-56100/udp" # WebRTC EPR (must match env)
environment:
<<: *neko-env
cap_add:
- SYS_ADMIN # required if not using --no-sandbox
volumes:
- neko-data:/var/lib/neko # persistent config / sessions (bind as needed)
- neko-logs:/var/log/neko
configs:
- source: supervisord_chromium
target: /etc/neko/supervisord/chromium.conf
nginx-cdp:
image: nginx:alpine
container_name: neko-cdp
network_mode: "service:neko" # join same net & PID
depends_on:
- neko
environment:
# restrict which hosts may speak CDP (use allowlist or auth)
ALLOWED_CDP_ORIGIN: ${ALLOWED_CDP_ORIGIN:-127.0.0.1}
configs:
- source: nginx_cdp_conf
target: /etc/nginx/conf.d/cdp.conf
ports:
- "9222:9222/tcp" # exposed CDP endpoint proxied to 9223 in container
networks:
proxy:
ipam:
config:
- subnet: 172.31.0.0/16
volumes:
neko-data:
neko-logs:
configs:
supervisord_chromium:
content: |
[program:chromium]
environment=HOME="/home/%(ENV_USER)s",USER="%(ENV_USER)s",DISPLAY="%(ENV_DISPLAY)s"
command=/usr/bin/chromium \
--remote-debugging-port=9223 \
--remote-debugging-address=127.0.0.1 \
--remote-allow-origins="*" \
--disable-web-security \
--disable-features=VizDisplayCompositor \
--disable-extensions \
# --no-sandbox \ # uncomment only if you drop SYS_ADMIN
--disable-dev-shm-usage \
--enable-automation \
--disable-background-timer-throttling \
--disable-backgrounding-occluded-windows \
--disable-renderer-backgrounding \
--force-devtools-available \
--disable-features=TranslateUI \
--disable-ipc-flooding-protection \
--enable-blink-features=IdleDetection \
--headless=new \
--disable-gpu
stopsignal=INT
autorestart=true
priority=800
user=%(ENV_USER)s
stdout_logfile=/var/log/neko/chromium.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
redirect_stderr=true
[program:openbox]
environment=HOME="/home/%(ENV_USER)s",USER="%(ENV_USER)s",DISPLAY="%(ENV_DISPLAY)s"
command=/usr/bin/openbox --config-file /etc/neko/openbox.xml
autorestart=true
priority=300
user=%(ENV_USER)s
stdout_logfile=/var/log/neko/openbox.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
redirect_stderr=true
nginx_cdp_conf:
content: |
map $http_upgrade $connection_upgrade {
default upgrade;
'' close;
}
upstream chrome {
server 127.0.0.1:9223;
keepalive 32;
}
server {
listen 9222;
# Optional IP allowlist (simple example); extend w/ auth / mTLS as needed
allow 127.0.0.1;
allow ::1;
# env-subst ALLOWED_CDP_ORIGIN could template additional allow lines
deny all;
location / {
proxy_pass http://chrome;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_set_header Host $host:$server_port;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_read_timeout 86400s;
proxy_send_timeout 86400s;
proxy_connect_timeout 7200s;
proxy_cache off;
proxy_buffering off;
proxy_max_temp_file_size 0;
proxy_request_buffering off;
proxy_set_header Sec-WebSocket-Extensions $http_sec_websocket_extensions;
proxy_set_header Sec-WebSocket-Key $http_sec_websocket_key;
proxy_set_header Sec-WebSocket-Version $http_sec_websocket_version;
proxy_socket_keepalive on;
keepalive_timeout 300s;
keepalive_requests 1000;
}
}
Minimal .env Illustration (override at deploy)
NEKO_USER_PASSWORD=supersecretuser
NEKO_ADMIN_PASSWORD=supersecretadmin
NEKO_PUBLIC_IP=203.0.113.45 # example; or set DNS name in upstream LB/TURN
NEKO_API_TOKEN=$(openssl rand -hex 32)
NEKO_DEBUG=1
ALLOWED_CDP_ORIGIN=10.0.0.0/8 # example ACL range for automation runners
Playwright Attach Script (Improved)
Key best practices: discover the browser WebSocket endpoint from /json/version; create a context if none is returned (some builds start with zero pages under the new headless mode); gracefully handle targets; optionally filter to the Neko desktop window by URL. Example:
// attach-neko.js
const { chromium } = require('playwright');
(async () => {
const cdpHttp = process.env.NEKO_CDP_URL || 'http://localhost:9222';
// Attach to existing Chromium exposed by Neko's CDP proxy.
const browser = await chromium.connectOverCDP(cdpHttp);
// In many cases Neko's running Chromium already has a default context.
// If none, create one.
const [defaultContext] = browser.contexts().length
? browser.contexts()
: [await browser.newContext()];
// Reuse first existing page or open a new one.
const page = defaultContext.pages()[0] || await defaultContext.newPage();
await page.goto('https://example.com');
console.log('Neko page title:', await page.title());
// Get raw CDP session if you want low-level control.
const client = await page.context().newCDPSession(page);
const version = await client.send('Browser.getVersion').catch(() => null);
console.log('CDP Browser Version:', version);
// Keep browser open for human co-driving; do NOT close().
})();
Diagnostic CDP Ping Script (Refined from Your test.js / test4.js)
Below is a leaner diagnostic that:
- fetches /json/version;
- opens a WebSocket;
- discovers targets;
- attaches to the first non-extension page;
- evaluates an expression;
- logs failures cleanly.
// cdp-diagnostics.js
const WebSocket = require('ws');
const fetch = (...args) => import('node-fetch').then(({default: f}) => f(...args));
(async () => {
const base = process.env.NEKO_CDP_URL || 'http://localhost:9222';
const version = await (await fetch(`${base}/json/version`)).json();
const wsUrl = version.webSocketDebuggerUrl;
const ws = new WebSocket(wsUrl, { perMessageDeflate: false });
let id = 0;
function send(method, params, sessionId) {
ws.send(JSON.stringify({ id: ++id, method, params, sessionId }));
}
ws.on('open', () => {
console.log('CDP connected');
send('Target.setDiscoverTargets', { discover: true });
});
let firstSession;
ws.on('message', data => {
const msg = JSON.parse(data);
if (msg.method === 'Target.targetCreated') {
const t = msg.params.targetInfo;
if (t.type === 'page' && !t.url.startsWith('chrome-extension://')) {
send('Target.attachToTarget', { targetId: t.targetId, flatten: true });
}
} else if (msg.method === 'Target.attachedToTarget' && !firstSession) {
firstSession = msg.params.sessionId;
send('Runtime.enable', {}, firstSession);
send('Runtime.evaluate', { expression: '1+1' }, firstSession);
} else if (msg.id && msg.result) {
console.log('Result', msg.id, msg.result);
} else if (msg.error) {
console.error('CDP Error', msg.error);
}
});
})();
Operational Checklist for Playwright‑Augmented Neko
Check | Why | How to Verify |
---|---|---|
Chromium started with --remote-debugging-port (localhost) | Required for CDP attach; safer than 0.0.0.0 | curl http://<host>:9222/json/version returns JSON |
CDP proxy ACL in place | Prevent hostile takeover of your shared session | restrict IPs or auth in nginx; test from unauthorized host fails |
WebRTC ports reachable | Avoid black screens / frozen video | webrtc-internals in client; docker logs ICE candidate errors |
SYS_ADMIN vs --no-sandbox decision documented | Security posture clarity | Confirm container start flags; run chrome://sandbox |
Multiuser passwords rotated | Prevent drive‑by admin | Use secrets; verify login roles mapping |
Proxy timeout > heartbeat | Prevent surprise disconnects during long automation | Nginx proxy_read_timeout >= 120s |
Debug logging toggled for incident response | Rapid triage | NEKO_DEBUG=1 , GST_DEBUG=3 when needed |
Example Hybrid Workflow: Humans Steer, Agents Assist
A common pattern in agentic stacks:
- Human opens Neko in browser, logs in as admin (multiuser).
- Automation runner (Playwright script / LLM agent) attaches over CDP using service account limited by firewall.
- Agent performs scripted setup (login, nav, cookie seeding) then relinquishes; human sees results instantly.
- If a human takeover triggers UI state changes, the agent can poll CDP events (Target/Runtime) to resume.
This model avoids re-launching browsers and preserves the session continuity Neko already streams to participants.
Deployment Channels & Ecosystem
You can deploy via raw Docker/Compose, room orchestration stacks (neko-rooms), homelab bundles (Umbrel App Store), or community charts/templates; packaging often pre-wires reverse proxy + TLS but may lag behind env var updates, so review and migrate to v3 syntax after install.
Troubleshooting Quick Hits
- Black screen (cursor only) in the Chromium flavor → missing SYS_ADMIN or sandbox misconfiguration; confirm the capability or drop the sandbox flag.
- WebRTC connect stalls / DTLS not started → exposed UDP mismatch or firewall block; check the EPR mapping and NAT1TO1; review server logs at debug level.
- Users disconnect behind a proxy → heartbeat vs. proxy timeout mismatch; ensure proxy_read_timeout > 120s and server.proxy is enabled.
- CDP connect refused → nginx sidecar not up or ACL blocking; verify /json/version at 9222 and that upstream 9223 is reachable in the container.
- Legacy envs ignored → upgrade to v3 names or set NEKO_LEGACY=true explicitly; review the migration matrix.
Neko v3 WebRTC & WebSocket Control: Frame/State, Keyboard, Mouse (Cited)
TL;DR:
- All browser control in Neko v3 is mediated over a single /api/ws WebSocket after session authentication.
- Browser frames are not delivered directly over the WS as video; the WS carries control, signaling, events, and input (mouse/keyboard) JSON, while media frames (video, audio) are negotiated via WebRTC ICE as a peer connection.
- Full workflow: REST login → WS upgrade (/api/ws) → system/init → WebRTC signal/request → ICE handshake → frames sent to the client, controls sent from the client.
1. Authenticate (REST, Cookie, Token, Password)
Mode | REST Call | Response | WS Upgrade Auth |
---|---|---|---|
Cookie (default) | POST /api/login {username, password} | Set-Cookie: NEKO_SESSION | Cookie auto-sent |
Token (stateless) | POST /api/login → {token} | Opaque JWT/Bearer | ?token=... or Bearer header |
Legacy (query) | (multiuser only) skip REST, ?password= | — | ?password in query triggers v2 |
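As a sketch in Python with only the standard library, the token flow above reduces to one REST call plus a URL rewrite for the upgrade; the helper names (`login`, `ws_url`) and hostnames are illustrative, not part of Neko's API:

```python
import json
import urllib.request

def login(base_url: str, username: str, password: str) -> str:
    """POST /api/login and return the session token (token mode)."""
    req = urllib.request.Request(
        f"{base_url}/api/login",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["token"]

def ws_url(base_url: str, token: str) -> str:
    """Derive the /api/ws upgrade URL; preserves any path prefix."""
    scheme = "wss" if base_url.startswith("https") else "ws"
    host_and_prefix = base_url.split("://", 1)[1].rstrip("/")
    return f"{scheme}://{host_and_prefix}/api/ws?token={token}"

# Usage against a hypothetical deployment:
# token = login("https://neko.example.com", "admin", "admin")
# url = ws_url("https://neko.example.com", token)
```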
2. WebSocket Upgrade URL (With/Without Path Prefix)
- Mainline: wss://host[:port]/api/ws?token=<TOKEN>
- With a path prefix: e.g. wss://proxy.example.com/neko/api/ws?token=...
- Alternatives: cookies or Authorization: Bearer ... are supported.
- Legacy /ws endpoint: only if enabled or in legacy mode.
3. Connection Lifecycle
- Upgrade: the Gorilla WebSocket server handles /api/ws, performing token/cookie/Bearer/session checks.
- Init: the server pushes system/init (JSON: session id, settings, role).
- Heartbeat: the server sends system/heartbeat; clients may reply with client/heartbeat, but liveness relies on WebSocket ping/pong.
- All interaction now flows over the socket: control events (keyboard/mouse), signaling (ICE, SDP), system/broadcast, errors.
- All client state (host, cursor, input, session, etc.) is managed by events.
4. How Media (Frames) and Control Flow
Media
- Video and audio frames do not go over the WebSocket; they are WebRTC media streams, negotiated via signaling on the WS.
- To initiate frame streaming, send: {"event":"signal/request","payload":{"video":{},"audio":{}}}
- The server replies with ICE candidates/SDP; the client opens a WebRTC peer connection.
- The browser's actual frames (video, audio) arrive via a WebRTC MediaStream.
- Some deployments gate input on a completed WebRTC handshake:
  1. send {"event":"signal/request","payload":{"video":{},"audio":{}}}
  2. perform SDP/ICE (offer/provide → answer; exchange candidates)
  3. only then will control events be honored.
Input: Keyboard/Mouse
Input is sent from client to server as JSON events (current protocol):
- Pointer move: {"event":"control/move","payload":{"x":123,"y":456}}
- Mouse scroll: {"event":"control/scroll","payload":{"delta_x":0,"delta_y":120}}
- Mouse press: {"event":"control/buttonpress","payload":{"x":123,"y":456,"code":1}}
- Mouse down: {"event":"control/buttondown","payload":{"x":123,"y":456,"code":1}}
- Mouse up: {"event":"control/buttonup","payload":{"x":123,"y":456,"code":1}}
  - Button codes: 1=left, 2=middle, 3=right.
- Key press: {"event":"control/keypress","payload":{"keysym":65}}
- Key down: {"event":"control/keydown","payload":{"keysym":65}}
- Key up: {"event":"control/keyup","payload":{"keysym":65}}
  - Printable characters use their ASCII codes; control keys use X11 keysyms (e.g., Enter=65293, Esc=65307, Arrows=65361–65364).
Host arbitration:
- Request: {"event":"control/request"}
- Release: {"event":"control/release"}
- Note: many servers implicitly grant host on the first valid control event, but an explicit control/request is cleaner.
Coordinates expected by the server are pixels. If your client uses normalized [0..1] coordinates, convert and clamp before sending.
Remove legacy examples using control/click, control/mouse, control/key, or control/request_host; these are not recognized by current servers.
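The normalized-to-pixel conversion can be sketched as a small helper; the 1920x1080 default matches the compose example above and is otherwise an assumption:

```python
def norm_to_px(x: float, y: float, width: int = 1920, height: int = 1080) -> tuple[int, int]:
    """Convert normalized [0..1] coordinates to the integer pixel
    coordinates Neko's control events expect, clamping out-of-range input."""
    cx = min(max(x, 0.0), 1.0)
    cy = min(max(y, 0.0), 1.0)
    # Clamp to width-1 / height-1 so 1.0 maps to the last addressable pixel.
    return min(int(cx * width), width - 1), min(int(cy * height), height - 1)

# norm_to_px(0.5, 0.5) on a 1920x1080 desktop gives the screen center (960, 540).
```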
Control quick reference (current)
Action | Event | Payload schema |
---|---|---|
Move pointer | control/move | { "x": int, "y": int } |
Mouse scroll | control/scroll | { "delta_x": int, "delta_y": int } |
Mouse button press | control/buttonpress | { "x": int, "y": int, "code": 1/2/3 } |
Mouse button down/up | control/buttondown/up | { "x": int, "y": int, "code": 1/2/3 } |
Key press | control/keypress | { "keysym": int } |
Key down/up | control/keydown/up | { "keysym": int } |
Request host | control/request | {} |
Release host | control/release | {} |
Common keysyms: Enter=65293, Backspace=65288, Tab=65289, Escape=65307, Left=65361, Up=65362, Right=65363, Down=65364, Control=65507. Button codes: 1=left, 2=middle, 3=right.
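A thin builder layer over these schemas keeps client code readable; the event names, payload shapes, and keysym values come from the quick reference above, while the helper names themselves are illustrative:

```python
import json

# Non-printable keys use the X11 keysyms listed above.
SPECIAL_KEYSYMS = {"Enter": 65293, "Backspace": 65288, "Tab": 65289,
                   "Escape": 65307, "Left": 65361, "Up": 65362,
                   "Right": 65363, "Down": 65364, "Control": 65507}

def keysym_for(char_or_name: str) -> int:
    """Printable ASCII maps to its codepoint; named keys use X11 keysyms."""
    if len(char_or_name) == 1:
        return ord(char_or_name)
    return SPECIAL_KEYSYMS[char_or_name]

def move(x: int, y: int) -> str:
    return json.dumps({"event": "control/move", "payload": {"x": x, "y": y}})

def click(x: int, y: int, code: int = 1) -> str:
    """code: 1=left, 2=middle, 3=right."""
    return json.dumps({"event": "control/buttonpress",
                       "payload": {"x": x, "y": y, "code": code}})

def keypress(char_or_name: str) -> str:
    return json.dumps({"event": "control/keypress",
                       "payload": {"keysym": keysym_for(char_or_name)}})

def type_text(text: str) -> list[str]:
    """One control/keypress frame per character."""
    return [keypress(c) for c in text]
```

Each returned string is one JSON frame to ship over the /api/ws socket.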
5. Production-Grade Client Implementation Notes
Building a robust programmatic client requires handling several subtle but critical aspects of the Neko protocol beyond basic event sending.
5.1 Connection Management & Reconnection
A production client must anticipate WebSocket disconnects. Simply reconnecting is insufficient. The correct pattern is a connection manager that, upon a detected drop:
- Completely tear down the old state: cancel all background async tasks (such as loggers and signal consumers) and close the RTCPeerConnection. This prevents resource leaks and "zombie" tasks.
- Initiate a new connection attempt, often with exponential backoff to avoid spamming a downed server.
- Upon successful reconnection, rebuild all necessary components: a new Signaler, new background tasks, and a new RTCPeerConnection for the subsequent WebRTC handshake.
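The backoff step can be sketched as a jittered delay schedule; this is a generic pattern rather than anything Neko-specific, and the function name is mine:

```python
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff with full jitter: delay_n ~ U(0, min(cap, base * 2**n)).
    A reconnect loop sleeps for each delay before the next attempt, e.g.:

        for delay in backoff_delays(attempts=8):
            await teardown_old_state()      # cancel tasks, close RTCPeerConnection
            await asyncio.sleep(delay)
            if await try_connect():
                break
    """
    return [random.uniform(0.0, min(cap, base * (2 ** n))) for n in range(attempts)]
```

Full jitter avoids synchronized retry storms when many clients lose the same server at once.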
5.2 Dynamic Keyboard Mapping
While a client can maintain a static table of common X11 keysyms, the most robust approach is to listen for the keyboard/map event from the server. This event provides a JSON object mapping key names to their precise keysym values for the server's specific environment.
- Compatibility: A client should handle both the legacy ({...}) and current ({"map": {...}}) payload shapes for this event.
- Strategy: Maintain a default, built-in keysym map but dynamically update or extend it with the values received from the server at runtime. This ensures maximum compatibility across different keyboard layouts and server versions.
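A minimal sketch of that merge strategy, handling both payload shapes (the function name and default table are illustrative):

```python
# Built-in defaults; the server's keyboard/map values take precedence at runtime.
DEFAULT_KEYSYMS = {"Enter": 65293, "Escape": 65307}

def merge_keyboard_map(current: dict, payload: dict) -> dict:
    """Merge a keyboard/map event payload into the client's keysym table.
    Legacy servers send the map directly ({...}); current servers wrap it
    as {"map": {...}}."""
    server_map = payload.get("map", payload)
    merged = dict(current)
    merged.update(server_map)  # server values win over built-in defaults
    return merged
```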
5.3 Liveness and Error Handling
- Heartbeats: The server periodically sends a system/heartbeat event and also issues WebSocket pings. Clients may optionally send client/heartbeat, but connection liveness primarily depends on WebSocket ping/pong; the lack of client/heartbeat alone does not cause disconnection.
- Error Events: A client should listen for and surface all error/... events. These provide crucial feedback from the server when a sent command is malformed, rejected, or fails, which is essential for debugging.
5.4 Robust Input Handling
- Host Control: A robust client should monitor the control/host event. If auto-host functionality is desired, the client should track its own session_id (from system/init) and automatically send a control/request if it observes that the host has been dropped (has_host: false) or given to another session.
- Sequential Double-Click: To reliably simulate a double-click, a client should send two control/buttonpress (or down/up pair) events sequentially, with a small delay (e.g., 50–100ms) in between. Using asyncio.gather or sending them simultaneously can result in the server's input handler swallowing one of the events.
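A sketch of the sequential pattern, assuming `send` is the client's async WebSocket send callable (the function name is mine):

```python
import asyncio
import json

async def double_click(send, x: int, y: int, delay: float = 0.08) -> None:
    """Send two press events strictly one after the other, never concurrently,
    so the server's input handler registers both clicks."""
    frame = json.dumps({"event": "control/buttonpress",
                        "payload": {"x": x, "y": y, "code": 1}})
    await send(frame)
    await asyncio.sleep(delay)  # 50-100 ms keeps the clicks distinct
    await send(frame)
```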
5.5 Defensive WebRTC Handshake
When parsing the signal/provide or signal/offer events, a client should be defensive about the iceservers array.
- Strict Field Mapping: Only parse and use the fields relevant to aiortc or your target WebRTC library (typically urls, username, and credential). Ignore any extra fields to avoid errors.
- URL Flexibility: Be prepared to handle both a single url key and a urls array.
6. References and Further Reading
- Neko GitHub
- Neko v3 Docs: Configuration
- aiortc Documentation
- Python Websockets Library
- WS Event Types (Source)
MosaicML Streaming (MDS) — Practical Guide for Codex Agents
This document explains what the MosaicML Streaming library is, why it’s useful for large, sharded datasets, and how to write and read MDS shards locally or to an S3-compatible store (MinIO, R2, etc.). It includes runnable patterns you can copy into data pipelines.
Why Streaming/MDS?
- Sharded, resumable, deterministic data loading across workers/nodes, with built-in shuffling and fault tolerance. Useful when the dataset is bigger than local disk/RAM or lives in object storage.
- MDS is the on-disk format used by Streaming: a set of shard files (e.g., shard.00000.mds[.zstd]) plus an index.json that records metadata/offsets for fast random access.
Key Concepts (10,000-ft view)
- MDSWriter: builds shards from Python dict samples (with a columnar schema you define); supports compression and direct upload to remote storage.
- StreamingDataset: reads from a local cache and/or directly from remote storage (S3, GCS, etc.), handling shuffling, ordering, and resumption deterministically.
- S3-compatible endpoints: use standard AWS credentials plus S3_ENDPOINT_URL to point at MinIO, Cloudflare R2, etc.
Minimal: Write an MDS dataset
from streaming import MDSWriter
import json, io, numpy as np
# 1) Define the schema: map column -> storage type.
# Supported logical types include 'bytes', 'str', numeric scalars, etc. (see docs)
columns = {
"frame_table": "bytes", # packed npy/npz/etc.
"meta_json": "str", # utf-8 JSON string
}
# 2) Writer: local out + optional remote (s3://bucket/prefix).
# Compression & shard size are critical for throughput.
with MDSWriter(
    out=("data/mds/train",                     # local staging/cache
         "s3://my-bucket/datasets/ui-trajs"),  # optional remote: shards upload as they close
    columns=columns,
    compression="zstd",   # fast decode in training
    hashes=["sha1"],      # integrity (takes a list of hash algorithms)
    size_limit="128MB",   # shard target
) as w:
    for episode in make_episodes():  # make_episodes() is your own episode generator
        # Pack the frame table as bytes (e.g., np.save to a BytesIO)
        buf = io.BytesIO()
        np.save(buf, episode["frame_table"].astype("float32"))
        sample = {
            "frame_table": buf.getvalue(),
            "meta_json": json.dumps(episode["meta"], ensure_ascii=False),
        }
        w.write(sample)
Notes
• Passing a (local, remote) tuple as out= enables automatic upload of completed shards to object storage while writing.
• Prefer compression="zstd" for read-time throughput; it's commonly used in training loops.
⸻
Configure S3 / S3-Compatible Remotes
Set standard AWS creds and (for non-AWS) an endpoint override:
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
# For MinIO / Cloudflare R2 / etc.:
export S3_ENDPOINT_URL="https://<your-endpoint>" # e.g., https://<accountid>.r2.cloudflarestorage.com
S3_ENDPOINT_URL is the knob for S3-compatible providers.
⸻
Read an MDS dataset (local cache + remote)
from streaming import StreamingDataset
from torch.utils.data import DataLoader
ds = StreamingDataset(
remote="s3://my-bucket/datasets/ui-trajs", # remote is optional if data is fully local
local="data/cache/ui-trajs", # local cache directory
shuffle=True, shuffle_seed=1337, # deterministic across workers
)
def collate(batch):
# Each item is a dict with your columns.
return batch
loader = DataLoader(ds, batch_size=8, num_workers=8, collate_fn=collate, persistent_workers=True)
for batch in loader:
...
• StreamingDataset handles shuffling/resume deterministically across workers and epochs.
• Local cache fills on demand; you can pin/cache shards for repeated jobs.
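Reading samples back is the mirror image of the writer's packing; this helper assumes the np.save/JSON scheme used in the writer example above:

```python
import io
import json

import numpy as np

def decode_sample(sample: dict) -> tuple[np.ndarray, dict]:
    """Unpack one StreamingDataset sample written with the example schema:
    frame_table was serialized with np.save into bytes, meta_json is a
    UTF-8 JSON string."""
    frames = np.load(io.BytesIO(sample["frame_table"]))
    meta = json.loads(sample["meta_json"])
    return frames, meta

# In a collate_fn or training loop:
#   frames, meta = decode_sample(ds[i])
```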
⸻
Practical Tips
• Shard size: 64–256 MB is a good default for training throughput; too small increases open/seek overhead, too large hurts parallelism (tune per infra). This is general best practice aligned with the MDS design.
• Compression: zstd balances ratio & decode speed for training (common choice in large-scale pipelines).
• Schema: store big blobs under bytes (e.g., numpy npz) and small metadata as str/scalars—keeps readers simple.
• Determinism: set shuffle_seed once for the run; Streaming takes care of global ordering across ranks.
⸻
FAQ
Q: Can I mix multiple datasets?
Yes—construct multiple StreamingDataset streams or concatenate datasets; Streaming supports mixing sources and remote stores. See the docs’ “main concepts” and examples.
Q: How do I resume mid-epoch?
The index and shard metadata let Streaming resume deterministically; you don’t need custom offset logic.
Q: Non-AWS S3?
Use S3_ENDPOINT_URL with your provider’s endpoint (R2/MinIO/etc.).
---
Neko Agent API Mode - Architecture Design Document
Executive Summary
This document outlines the design for a new --api mode for the Neko Agent system that exposes the existing WebRTC-based GUI automation agent as a RESTful API service. The design includes containerization, orchestration with Kubernetes, load balancing, and auto-scaling strategies to handle production workloads.
Current Architecture Analysis
Existing System (src/agent.py)
The current Neko Agent is a WebRTC-based GUI automation system with the following key components:
- NekoAgent Class: Core agent that manages WebRTC connections and performs GUI automation
- WebRTC Integration: Uses aiortc for peer-to-peer connections with Neko Chrome containers
- AI Model: ShowUI-2B (Qwen2VL) for visual reasoning and action planning
- Action Execution: Supports CLICK, INPUT, SCROLL, SWIPE, TAP, ANSWER actions
- Signaling: WebSocket-based communication with Neko server
- Frame Processing: Real-time video frame capture and analysis
- Metrics: Prometheus integration for monitoring
Current Limitations for API Mode
- Single Session: Designed for one agent per WebRTC connection
- Blocking Operations: Synchronous task execution model
- No Request Queuing: No API endpoint abstraction
- Resource Management: No dynamic resource allocation
- Scaling Constraints: Manual deployment and management
Proposed API Mode Architecture
1. System Overview
graph TB
    subgraph "Infrastructure Layer"
        LB[Load Balancer<br/>Kubernetes]
        GW[API Gateway<br/>FastAPI]
        AM[Agent Manager<br/>Orchestrator]
    end
    subgraph "Processing Layer"
        TQ[Task Queue<br/>Redis/RQ]
        AP[Agent Pods<br/>Multi-Container]
    end
    subgraph "Execution Layer"
        NI[Neko Instances<br/>Chrome/Edge]
        GPU[GPU Resources<br/>CUDA/MIG]
    end
    subgraph "Agent Pod Layout"
        subgraph "Pod Components"
            NA[Neko Agent API<br/>Core Automation]
            CS[Capture Sidecar<br/>Training Data MDS]
            TS[TTS Sidecar<br/>F5-TTS + WebRTC]
        end
        SR[Shared Resources:<br/>GPU, Storage, Voices, Network]
    end
    LB --> GW
    GW --> AM
    GW --> TQ
    AM --> AP
    TQ --> AP
    AP --> NI
    AP --> GPU
    GPU --> AP
    AP -.-> NA
    AP -.-> CS
    AP -.-> TS
    NA --- SR
    CS --- SR
    TS --- SR
2. Core Components
2.1 Training Data Capture Service (src/capture.py)
Technology: Python with MosaicML Streaming, WebSocket client, HTTP polling
Purpose: Captures training data from Neko sessions for model improvement and fine-tuning
Responsibilities:
- Real-time episode recording during agent sessions
- Screenshot capture at configurable FPS via HTTP polling
- Action annotation parsing from chat messages
- MDS (MosaicML Streaming) shard generation with S3 upload
- Episode lifecycle management with automatic timeout handling
Integration with API Mode:
- Runs as a sidecar container alongside agent instances
- Monitors the WebSocket chat for /start, /stop, and Action: commands
- Captures synchronized video frames and action annotations
- Outputs training episodes as ZIP archives in MDS format
- Supports remote storage (S3) for distributed training pipelines
Key Features:
- Episode-based Recording: Packages complete task sessions as single training samples
- Streaming Upload: Direct-to-S3 capability with local/remote storage options
- Frame Synchronization: Precise timestamp alignment between actions and screenshots
- Compression: ZSTD compression for efficient storage and transfer
- 12-Factor Compliance: Configuration via environment variables only
2.2 Text-to-Speech Service (src/yap.py)
Technology: F5-TTS, aiortc WebRTC audio, PyAV, asyncio
Purpose: Provides real-time voice synthesis and audio streaming to Neko sessions
Responsibilities:
- F5-TTS voice synthesis with custom voice models
- Real-time audio streaming via WebRTC
- Voice management and hot-reloading
- Punctuation-aware text segmentation
- Low-latency audio pipeline with jitter buffering
Integration with API Mode:
- Deploys as companion service for agent instances
- WebRTC audio track injection into Neko sessions
- Chat command interface for voice control
- Streaming and batch TTS modes
- Voice configuration management via JSON registry
Key Features:
- Multi-Voice Support: Hot-reloadable voice registry with custom models
- Streaming TTS: /yap:begin ... /yap:end for incremental synthesis
- Low Latency: Parallel TTS workers with crossfade audio splicing
- Voice Customization: Live rate/pitch adjustments and style parameters
- Production Audio: 48kHz WebRTC-compatible output with jitter buffering
2.3 API Gateway Service
Technology: FastAPI with async/await support
Responsibilities:
- HTTP request handling and validation
- Authentication and authorization
- Rate limiting and quotas
- Request/response transformation
- API documentation (OpenAPI/Swagger)
- Health checks and metrics
Key Endpoints:
# Core automation tasks
POST /api/v1/tasks # Create new automation task
GET /api/v1/tasks/{task_id} # Get task status and results
PUT /api/v1/tasks/{task_id} # Update task parameters
DELETE /api/v1/tasks/{task_id} # Cancel running task
GET /api/v1/tasks # List tasks with filtering
POST /api/v1/sessions # Create persistent session
# Training data capture
POST /api/v1/capture/episodes # Start episode recording
GET /api/v1/capture/episodes/{episode_id} # Get episode status
DELETE /api/v1/capture/episodes/{episode_id} # Stop episode recording
GET /api/v1/capture/datasets # List available training datasets
POST /api/v1/capture/datasets/{dataset_id}/upload # Upload to remote storage
GET /api/v1/capture/config # Get capture configuration
PUT /api/v1/capture/config # Update capture settings
# Text-to-Speech and voice
POST /api/v1/tts/speak # Immediate text-to-speech
POST /api/v1/tts/stream/start # Begin streaming TTS session
POST /api/v1/tts/stream/text # Add text to streaming session
POST /api/v1/tts/stream/stop # End streaming TTS session
DELETE /api/v1/tts/stop # Stop all TTS and clear queue
# Voice management
GET /api/v1/voices # List available voices
POST /api/v1/voices # Add new voice
GET /api/v1/voices/{voice_id} # Get voice details
PUT /api/v1/voices/{voice_id} # Update voice configuration
DELETE /api/v1/voices/{voice_id} # Remove voice
POST /api/v1/voices/reload # Hot-reload voice registry
# System
GET /api/v1/health # Health check endpoint
GET /api/v1/metrics # Prometheus metrics
Request/Response Format:
// POST /api/v1/tasks - Core automation task
{
"task": "Search for Python tutorials on YouTube",
"mode": "web",
"max_steps": 10,
"timeout": 300,
"callback_url": "https://app.example.com/webhook",
"options": {
"screen_resolution": "1920x1080",
"save_screenshots": true,
"refinement_steps": 5,
"enable_capture": true, // Enable training data capture
"enable_tts": true, // Enable voice output
"voice_id": "default" // Voice for TTS
}
}
// Response
{
"task_id": "task_123456",
"status": "queued",
"created_at": "2024-01-15T10:30:00Z",
"estimated_completion": "2024-01-15T10:35:00Z"
}
// POST /api/v1/capture/episodes - Start training capture
{
"task_description": "Navigate to YouTube and search for tutorials",
"neko_session_id": "session_abc123",
"capture_options": {
"fps": 2.0,
"jpeg_quality": 85,
"min_frames": 4,
"timeout_seconds": 900
}
}
// Response
{
"episode_id": "episode_xyz789",
"status": "recording",
"started_at": "2024-01-15T10:30:00Z",
"estimated_size_mb": 120
}
// POST /api/v1/tts/speak - Immediate TTS
{
"text": "I'm now searching for Python tutorials on YouTube",
"voice_id": "instructor",
"params": {
"rate": 1.0,
"pitch": 0.0
}
}
// Response
{
"request_id": "tts_req_456",
"status": "speaking",
"estimated_duration_ms": 3500
}
// POST /api/v1/voices - Add new voice
{
"voice_id": "instructor",
"ref_audio": "/voices/instructor.wav",
"ref_text": "This is an instructional voice for tutorials",
"styles": ["calm", "explanatory"],
"params": {
"rate": 0.95,
"pitch": -1.0
}
}
// Response
{
"voice_id": "instructor",
"status": "registered",
"created_at": "2024-01-15T10:30:00Z"
}
2.2 Agent Manager (Orchestrator)
Technology: Python with asyncio, Kubernetes Python client
Responsibilities:
- Agent lifecycle management
- Resource allocation and cleanup
- Task routing to available agents
- Agent health monitoring
- Scaling decisions based on queue depth
Architecture:
class AgentManager:
    def __init__(self):
        self.k8s_client = kubernetes.client.ApiClient()
        self.agent_pools = {}
        self.task_queue = RedisQueue()
        self.metrics = PrometheusMetrics()

    async def provision_agent(self, requirements: AgentRequirements) -> AgentInstance: ...
    async def scale_agent_pool(self, pool_name: str, replicas: int) -> None: ...
    async def route_task(self, task: Task) -> AgentInstance: ...
    async def cleanup_completed_agents(self) -> None: ...
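`route_task` must pick an agent with spare capacity; one simple policy is least-relative-load. A dependency-free sketch (the `AgentInstance` fields here are assumptions for illustration, not the manager's real class):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AgentInstance:
    """Illustrative stand-in; the real AgentInstance lives in the manager."""
    name: str
    active_tasks: int
    max_tasks: int

def pick_least_loaded(agents: List[AgentInstance]) -> Optional[AgentInstance]:
    """Least-relative-load routing: favor the agent with the most free capacity."""
    candidates = [a for a in agents if a.active_tasks < a.max_tasks]
    if not candidates:
        return None  # leave the task queued; a scale-up decision follows
    return min(candidates, key=lambda a: a.active_tasks / a.max_tasks)

pool = [
    AgentInstance("agent-0", active_tasks=2, max_tasks=2),  # saturated
    AgentInstance("agent-1", active_tasks=1, max_tasks=2),
]
```

Returning `None` on a saturated pool is the signal the manager uses to keep the task queued and trigger scaling.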
2.3 Modified Agent Core
Changes to src/agent.py:
class APINekoAgent(NekoAgent):
    def __init__(self, *args, api_mode=True, **kwargs):
        super().__init__(*args, **kwargs)
        self.api_mode = api_mode
        self.current_task = None
        self.task_queue = asyncio.Queue()

    async def api_run(self) -> None:
        """API-mode main loop: process tasks from the queue until shutdown."""
        while not self.shutdown.is_set():
            try:
                task = await self.task_queue.get()
                self.current_task = task
                result = await self.execute_task(task)
                await self.report_result(result)
            except Exception as e:
                await self.report_error(e)
            finally:
                self.current_task = None

    async def execute_task(self, task: APITask) -> TaskResult:
        """Execute a single API task and return its result."""
        self.nav_task = task.description
        # Existing navigation logic with API adaptations
        return TaskResult(task_id=task.id, status="completed", ...)
3. Containerization Strategy
3.1 Multi-Stage Docker Build
# Base image with CUDA support for GPU inference
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 as base
# Python environment setup
RUN apt-get update && apt-get install -y \
python3.11 python3-pip \
libgl1-mesa-glx libglib2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# Dependencies stage
FROM base as deps
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Model download stage (optional optimization)
FROM deps as models
RUN python3 -c "from transformers import Qwen2VLForConditionalGeneration; \
Qwen2VLForConditionalGeneration.from_pretrained('showlab/ShowUI-2B')"
# Application stage
FROM models as app
WORKDIR /app
COPY src/ ./src/
COPY pyproject.toml ./
RUN pip install -e .
# API mode entrypoint
CMD ["python3", "-m", "agent", "--api", "--port", "8000"]
Capture Service Dockerfile:
FROM python:3.11-slim as base
# System dependencies
RUN apt-get update && apt-get install -y \
libgl1-mesa-glx libglib2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# Python dependencies
COPY requirements-capture.txt .
RUN pip install --no-cache-dir -r requirements-capture.txt
# Application
WORKDIR /app
COPY src/capture.py ./src/
COPY pyproject.toml ./
RUN pip install -e .
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
CMD python -c "import capture; print('ok')"
# Entry point
CMD ["python", "src/capture.py"]
TTS Service Dockerfile:
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 as base
# System dependencies for audio processing
RUN apt-get update && apt-get install -y \
python3.11 python3-pip \
libsndfile1 libasound2-dev \
ffmpeg libavcodec-dev libavformat-dev \
&& rm -rf /var/lib/apt/lists/*
# Python dependencies including F5-TTS
COPY requirements-tts.txt .
RUN pip install --no-cache-dir -r requirements-tts.txt
# Download F5-TTS models (optional optimization)
RUN python3 -c "from f5_tts.api import F5TTS; F5TTS()"
# Application
WORKDIR /app
COPY src/yap.py ./src/
COPY voices/ ./voices/
COPY pyproject.toml ./
RUN pip install -e .
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
CMD python3 -c "from yap import Settings; s=Settings(); print('ok' if s.validate()==[] else 'fail')"
# Entry point
CMD ["python3", "src/yap.py"]
3.2 Container Images
Agent Image: neko-agent-api:latest
- Contains the modified agent with API capabilities
- GPU-enabled for model inference
- Health checks and graceful shutdown
- Prometheus metrics export
Gateway Image: neko-api-gateway:latest
- FastAPI application
- Authentication middleware
- Rate limiting and request validation
- OpenAPI documentation
Manager Image: neko-agent-manager:latest
- Kubernetes orchestration logic
- Agent lifecycle management
- Auto-scaling algorithms
Capture Image: neko-capture:latest
- Training data capture service (src/capture.py)
- MosaicML Streaming for MDS shard generation
- S3-compatible storage support for remote upload
- WebSocket client for Neko session monitoring
- Screenshot polling and episode management
TTS Image: neko-yap:latest
- Text-to-speech service (src/yap.py)
- F5-TTS model inference with GPU acceleration
- WebRTC audio streaming capabilities
- Voice management and hot-reloading
- Real-time audio pipeline with jitter buffering
Voice Registry Sidecar: neko-voices:latest
- Shared voice model storage
- Voice file hosting and management
- Hot-reloadable voice configurations
- Persistent volume for voice assets
4. Kubernetes Deployment
4.1 Namespace Organization
apiVersion: v1
kind: Namespace
metadata:
name: neko-agents
labels:
app: neko-agent-system
---
# Resource quotas and limits
apiVersion: v1
kind: ResourceQuota
metadata:
name: neko-agents-quota
namespace: neko-agents
spec:
hard:
requests.nvidia.com/gpu: "20"
limits.memory: "200Gi"
persistentvolumeclaims: "10"
4.2 Enhanced Agent Deployment with Sidecars
apiVersion: apps/v1
kind: Deployment
metadata:
name: neko-agent-pool
namespace: neko-agents
spec:
replicas: 3
selector:
matchLabels:
app: neko-agent
template:
metadata:
labels:
app: neko-agent
spec:
nodeSelector:
accelerator: nvidia-tesla-t4
volumes:
- name: voices-storage
persistentVolumeClaim:
claimName: neko-voices-pvc
- name: capture-storage
emptyDir: {}
containers:
# Main agent container
- name: neko-agent
image: neko-agent-api:latest
resources:
requests:
nvidia.com/gpu: 1
memory: "4Gi"
cpu: "2"
limits:
nvidia.com/gpu: 1
memory: "8Gi"
cpu: "4"
env:
- name: NEKO_API_MODE
value: "true"
- name: CUDA_VISIBLE_DEVICES
value: "0"
ports:
- containerPort: 8000
name: api
- containerPort: 9000
name: metrics
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
# Training data capture sidecar
- name: neko-capture
image: neko-capture:latest
env:
- name: NEKO_WS
value: "ws://localhost:8000/api/ws"
- name: CAPTURE_OUT
value: "/capture/mds"
- name: CAPTURE_REMOTE
valueFrom:
secretKeyRef:
name: neko-s3-config
key: remote_uri
optional: true
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: neko-s3-config
key: access_key_id
optional: true
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: neko-s3-config
key: secret_access_key
optional: true
volumeMounts:
- name: capture-storage
mountPath: /capture
resources:
requests:
memory: "512Mi"
cpu: "0.5"
limits:
memory: "1Gi"
cpu: "1"
# TTS sidecar
- name: neko-yap
image: neko-yap:latest
env:
- name: YAP_WS
value: "ws://localhost:8000/api/ws"
- name: YAP_VOICES_DIR
value: "/voices"
- name: YAP_SR
value: "48000"
- name: YAP_PARALLEL
value: "2"
volumeMounts:
- name: voices-storage
mountPath: /voices
        resources:
          requests:
            # Kubernetes rejects fractional GPU requests such as 0.5; share the
            # GPU with the main agent via NVIDIA time-slicing or MIG instead
            nvidia.com/gpu: 1
            memory: "2Gi"
            cpu: "1"
          limits:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "2"
livenessProbe:
exec:
command:
- python3
- -c
- "from yap import Settings; s=Settings(); exit(0 if s.validate()==[] else 1)"
initialDelaySeconds: 45
periodSeconds: 30
4.3 Persistent Storage for Voices
# Persistent Volume Claim for voice assets
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: neko-voices-pvc
namespace: neko-agents
spec:
accessModes:
- ReadWriteMany # Multiple pods can mount
resources:
requests:
storage: 10Gi
storageClassName: nfs-client # Use NFS or similar for shared access
---
# Secret for S3 configuration (training data upload)
apiVersion: v1
kind: Secret
metadata:
name: neko-s3-config
namespace: neko-agents
type: Opaque
stringData:
remote_uri: "s3://neko-training-data/episodes"
access_key_id: "AKIA..."
secret_access_key: "..."
region: "us-west-2"
4.4 Service Configuration
# Service for agent API endpoints
apiVersion: v1
kind: Service
metadata:
name: neko-agent-service
namespace: neko-agents
spec:
selector:
app: neko-agent
ports:
- name: api
port: 8000
targetPort: 8000
- name: metrics
port: 9000
targetPort: 9000
type: ClusterIP
---
# Headless service for TTS WebRTC (if needed)
apiVersion: v1
kind: Service
metadata:
name: neko-tts-service
namespace: neko-agents
spec:
selector:
app: neko-agent
ports:
- name: webrtc
port: 8080
targetPort: 8080
protocol: UDP
clusterIP: None
4.5 Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: neko-agent-hpa
namespace: neko-agents
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: neko-agent-pool
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: neko_gpu_utilization_percent  # GPU is not a native HPA resource metric; expose it via DCGM exporter + Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "80"
- type: Pods
pods:
metric:
name: task_queue_length
target:
type: AverageValue
averageValue: "5"
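The `task_queue_length` Pods metric above is not built into Kubernetes: each pod has to expose it on a metrics endpoint, and an adapter (e.g. Prometheus Adapter) has to surface it to the HPA. A minimal sketch of the exposition text such an endpoint would serve, following the Prometheus text exposition format:

```python
def render_queue_metric(queue_length: int, deployment: str = "neko-agent-pool") -> str:
    """Render task_queue_length in the Prometheus text exposition format,
    as a /metrics endpoint would serve it for scraping."""
    return (
        "# HELP task_queue_length Number of queued automation tasks\n"
        "# TYPE task_queue_length gauge\n"
        f'task_queue_length{{deployment="{deployment}"}} {queue_length}\n'
    )
```

In production this would come from a metrics library rather than hand-rolled strings; the sketch only pins down the name and label the HPA manifest refers to.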
5. Load Balancing and Auto-Scaling Strategy
5.1 Multi-Tier Load Balancing
Tier 1: External Load Balancer
- Cloud provider load balancer (ALB/GKE/AKS)
- SSL termination and geographical routing
- Health checks and failover
Tier 2: Ingress Controller
- NGINX Ingress with session affinity
- Rate limiting and request buffering
- WebSocket upgrade support for agent communication
Tier 3: Service Mesh (Optional)
- Istio for advanced traffic management
- Circuit breaker patterns
- Observability and tracing
5.2 Intelligent Auto-Scaling
Queue-Based Scaling:
class QueueBasedScaler:
    def __init__(self, min_replicas: int = 2, max_replicas: int = 20):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.queue_threshold_scale_up = 10
        self.queue_threshold_scale_down = 2
        self.avg_task_duration = 30  # seconds

    def calculate_desired_replicas(self, queue_length: int, current_replicas: int) -> int:
        if queue_length > self.queue_threshold_scale_up:
            # Double capacity, bounded by the pool maximum
            desired = min(current_replicas * 2, self.max_replicas)
        elif queue_length < self.queue_threshold_scale_down:
            desired = max(current_replicas // 2, self.min_replicas)
        else:
            desired = current_replicas
        return desired
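The doubling/halving policy can be restated as a pure function for quick what-if checks; the `min_replicas`/`max_replicas` defaults below are assumptions matching the HPA's 2-20 range:

```python
def desired_replicas(queue_length: int, current: int,
                     min_replicas: int = 2, max_replicas: int = 20,
                     scale_up_at: int = 10, scale_down_at: int = 2) -> int:
    """Doubling/halving queue-based policy with pool bounds applied."""
    if queue_length > scale_up_at:
        return min(current * 2, max_replicas)   # double, capped at the pool max
    if queue_length < scale_down_at:
        return max(current // 2, min_replicas)  # halve, floored at the pool min
    return current
```

With 4 replicas and 15 queued tasks the policy doubles to 8; with 8 replicas and an empty queue it halves to 4.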
GPU Utilization Scaling:
- Monitor GPU memory usage and compute utilization
- Scale up when average GPU utilization > 80%
- Scale down when GPU utilization < 30% for 5 minutes
Cost-Optimized Node Selection:
- Prefer spot instances for batch workloads
- Use GPU node pools with automatic provisioning
- Implement preemption handling for graceful task migration
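Preemption handling boils down to treating the provider's SIGTERM as "stop accepting, drain, exit." A sketch under the assumption that the worker consumes an asyncio queue (`install_preemption_handler` must be called from within the running event loop):

```python
import asyncio
import signal

def install_preemption_handler(stop: asyncio.Event) -> None:
    """Spot nodes deliver SIGTERM before reclaim; flip a stop flag so the
    worker stops accepting new tasks. Call from inside the running loop."""
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)

async def worker_loop(queue: asyncio.Queue, stop: asyncio.Event) -> int:
    """Process tasks until stopped, then drain whatever is already queued."""
    completed = 0
    while not stop.is_set() or not queue.empty():
        try:
            task = queue.get_nowait()
        except asyncio.QueueEmpty:
            if stop.is_set():
                break
            await asyncio.sleep(0.01)  # idle wait for new work
            continue
        await asyncio.sleep(0)  # placeholder for actually executing `task`
        completed += 1
    return completed
```

The drain must finish inside the node's termination grace period; anything left over is re-queued by the manager for another pod.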
5.3 Advanced Scaling Features
Predictive Scaling (2024 Best Practice):
Model Selection Rationale: For Neko Agent workloads, we recommend a Prophet + XGBoost ensemble approach due to:
- Variable task durations (30s to 30+ minutes)
- Bursty request patterns with business hour seasonality
- Multi-dimensional resource requirements (CPU, GPU, memory)
- Need for 5-15 minute prediction horizon to account for container startup times
Core Implementation:
import numpy as np
import pandas as pd
from prophet import Prophet
import xgboost as xgb
from sklearn.ensemble import VotingRegressor
from typing import Dict, List, Tuple, Optional
import asyncio
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta
@dataclass
class LoadPrediction:
timestamp: datetime
predicted_requests: float
predicted_gpu_hours: float
predicted_memory_gb: float
confidence_interval: Tuple[float, float]
recommendation: str # 'scale_up', 'scale_down', 'maintain'
class NekoLoadPredictor:
"""
Ensemble model for predicting Neko Agent resource demands.
Uses Prophet for seasonality/trends + XGBoost for complex patterns.
Optimized for multi-container pods with GPU sharing constraints.
"""
def __init__(self, prediction_horizon_minutes: int = 10):
self.horizon = prediction_horizon_minutes
self.prophet_model = None
self.xgb_model = None
self.ensemble_model = None
self.feature_columns = [
'hour_of_day', 'day_of_week', 'is_weekend', 'is_holiday',
'queue_depth', 'avg_task_duration', 'active_pods',
'gpu_utilization_pct', 'memory_utilization_pct',
'requests_last_5min', 'requests_last_15min', 'requests_last_hour',
'tts_requests_ratio', 'capture_enabled_ratio'
]
self.logger = logging.getLogger(__name__)
def _engineer_features(self,
historical_metrics: pd.DataFrame,
current_time: datetime) -> pd.DataFrame:
"""
Engineer features for prediction model.
:param historical_metrics: Time series of system metrics
:param current_time: Current timestamp for prediction
:return: Engineered feature DataFrame
"""
df = historical_metrics.copy()
# Time-based features
df['hour_of_day'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_holiday'] = df['timestamp'].dt.date.isin([h.date() for h in self._get_holidays()]).astype(int)
        # Rolling window features (time-based windows require `on='timestamp'`)
        df['requests_last_5min'] = df.rolling('5min', on='timestamp')['request_count'].sum()
        df['requests_last_15min'] = df.rolling('15min', on='timestamp')['request_count'].sum()
        df['requests_last_hour'] = df.rolling('60min', on='timestamp')['request_count'].sum()
        # Task complexity indicators
        df['avg_task_duration'] = (
            df.rolling('15min', on='timestamp')['total_task_seconds'].sum()
            / df.rolling('15min', on='timestamp')['request_count'].sum()
        )
        df['tts_requests_ratio'] = df['tts_requests'] / df['request_count'].clip(lower=1)
        df['capture_enabled_ratio'] = df['capture_requests'] / df['request_count'].clip(lower=1)
# Resource utilization
df['gpu_utilization_pct'] = df['gpu_memory_used'] / df['gpu_memory_total'] * 100
df['memory_utilization_pct'] = df['memory_used'] / df['memory_total'] * 100
return df.fillna(0)
def _get_holidays(self) -> List[datetime]:
"""Get list of holidays that might affect traffic patterns."""
# Simplified - in production, use holidays library
return [
datetime(2024, 1, 1), # New Year
datetime(2024, 7, 4), # Independence Day
datetime(2024, 12, 25), # Christmas
# Add more holidays based on user base geography
]
async def train_models(self, historical_data: pd.DataFrame):
"""
Train the ensemble prediction model.
:param historical_data: Historical metrics with features and targets
"""
self.logger.info("Training predictive scaling models...")
# Prepare data
df = self._engineer_features(historical_data, datetime.now())
# Prophet for seasonality and trends
prophet_df = df[['timestamp', 'request_count']].rename(
columns={'timestamp': 'ds', 'request_count': 'y'}
)
self.prophet_model = Prophet(
daily_seasonality=True,
weekly_seasonality=True,
yearly_seasonality=False, # Need more data
changepoint_prior_scale=0.05, # Conservative for stability
interval_width=0.8
)
# Add business hours regressor
prophet_df['business_hours'] = (
(prophet_df['ds'].dt.hour >= 9) &
(prophet_df['ds'].dt.hour < 17) &
(prophet_df['ds'].dt.dayofweek < 5)
).astype(int)
self.prophet_model.add_regressor('business_hours')
self.prophet_model.fit(prophet_df)
# XGBoost for complex patterns
feature_df = df[self.feature_columns].fillna(0)
target_requests = df['request_count'].values
target_gpu_hours = df['gpu_hours_consumed'].values
target_memory = df['memory_gb_peak'].values
self.xgb_model = {
'requests': xgb.XGBRegressor(
n_estimators=100,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
random_state=42
),
'gpu_hours': xgb.XGBRegressor(
n_estimators=100,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
random_state=42
),
'memory_gb': xgb.XGBRegressor(
n_estimators=100,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
random_state=42
)
}
# Train XGBoost models
self.xgb_model['requests'].fit(feature_df, target_requests)
self.xgb_model['gpu_hours'].fit(feature_df, target_gpu_hours)
self.xgb_model['memory_gb'].fit(feature_df, target_memory)
self.logger.info("Model training completed successfully")
async def predict_load(self,
current_metrics: Dict[str, float],
prediction_time: datetime) -> LoadPrediction:
"""
Predict future load using ensemble model.
:param current_metrics: Current system metrics
:param prediction_time: Time to predict for
:return: Load prediction with confidence intervals
"""
if not (self.prophet_model and self.xgb_model):
raise ValueError("Models not trained. Call train_models() first.")
# Prophet prediction (seasonal/trend)
future_df = pd.DataFrame({
'ds': [prediction_time],
'business_hours': [
1 if (9 <= prediction_time.hour < 17 and prediction_time.weekday() < 5) else 0
]
})
prophet_forecast = self.prophet_model.predict(future_df)
prophet_requests = max(0, prophet_forecast['yhat'].iloc[0])
confidence_lower = prophet_forecast['yhat_lower'].iloc[0]
confidence_upper = prophet_forecast['yhat_upper'].iloc[0]
# XGBoost prediction (complex patterns)
current_features = pd.DataFrame([{
'hour_of_day': prediction_time.hour,
'day_of_week': prediction_time.weekday(),
'is_weekend': 1 if prediction_time.weekday() >= 5 else 0,
'is_holiday': 1 if prediction_time.date() in [h.date() for h in self._get_holidays()] else 0,
'queue_depth': current_metrics.get('queue_depth', 0),
'avg_task_duration': current_metrics.get('avg_task_duration', 180),
'active_pods': current_metrics.get('active_pods', 1),
'gpu_utilization_pct': current_metrics.get('gpu_utilization_pct', 0),
'memory_utilization_pct': current_metrics.get('memory_utilization_pct', 0),
'requests_last_5min': current_metrics.get('requests_last_5min', 0),
'requests_last_15min': current_metrics.get('requests_last_15min', 0),
'requests_last_hour': current_metrics.get('requests_last_hour', 0),
'tts_requests_ratio': current_metrics.get('tts_requests_ratio', 0.3),
'capture_enabled_ratio': current_metrics.get('capture_enabled_ratio', 0.8),
}])
xgb_requests = max(0, self.xgb_model['requests'].predict(current_features)[0])
xgb_gpu_hours = max(0, self.xgb_model['gpu_hours'].predict(current_features)[0])
xgb_memory_gb = max(0, self.xgb_model['memory_gb'].predict(current_features)[0])
        # Ensemble: weight Prophet (seasonality) against XGBoost (patterns).
        # Favor Prophet on weekdays; lean more on XGBoost for weekends.
        prophet_weight = 0.7 if current_features['is_weekend'].iloc[0] == 0 else 0.4
xgb_weight = 1 - prophet_weight
final_requests = prophet_weight * prophet_requests + xgb_weight * xgb_requests
# Generate scaling recommendation
current_capacity = current_metrics.get('active_pods', 1) * current_metrics.get('pods_max_requests_per_hour', 20)
recommendation = self._generate_recommendation(
predicted_load=final_requests,
current_capacity=current_capacity,
current_utilization=current_metrics.get('current_utilization_pct', 0)
)
return LoadPrediction(
timestamp=prediction_time,
predicted_requests=final_requests,
predicted_gpu_hours=xgb_gpu_hours,
predicted_memory_gb=xgb_memory_gb,
confidence_interval=(confidence_lower, confidence_upper),
recommendation=recommendation
)
def _generate_recommendation(self,
predicted_load: float,
current_capacity: float,
current_utilization: float) -> str:
"""Generate scaling recommendation based on prediction."""
utilization_threshold_up = 80.0 # Scale up if predicted > 80%
utilization_threshold_down = 30.0 # Scale down if predicted < 30%
        predicted_utilization = (predicted_load / max(current_capacity, 1.0)) * 100
# Conservative scaling to avoid thrashing
if predicted_utilization > utilization_threshold_up and current_utilization > 60:
return 'scale_up'
elif predicted_utilization < utilization_threshold_down and current_utilization < 40:
return 'scale_down'
else:
return 'maintain'
class PredictiveScaler:
"""
Production-ready predictive scaler for Neko Agent pods.
"""
def __init__(self,
k8s_client,
namespace: str = "neko-agents",
deployment_name: str = "neko-agent-pool"):
self.predictor = NekoLoadPredictor()
self.k8s_client = k8s_client
self.namespace = namespace
self.deployment_name = deployment_name
self.logger = logging.getLogger(__name__)
self.scaling_history = []
self.min_replicas = 2
self.max_replicas = 50
async def collect_metrics(self) -> Dict[str, float]:
"""Collect current system metrics for prediction."""
# This would integrate with Prometheus/Grafana
# Simplified example:
return {
'queue_depth': await self._get_queue_depth(),
'active_pods': await self._get_active_pods(),
'gpu_utilization_pct': await self._get_gpu_utilization(),
'memory_utilization_pct': await self._get_memory_utilization(),
'requests_last_5min': await self._get_recent_requests(5),
'requests_last_15min': await self._get_recent_requests(15),
'requests_last_hour': await self._get_recent_requests(60),
'avg_task_duration': await self._get_avg_task_duration(),
'tts_requests_ratio': await self._get_tts_ratio(),
'capture_enabled_ratio': await self._get_capture_ratio(),
}
async def predict_and_scale(self):
"""Main prediction and scaling loop."""
try:
current_metrics = await self.collect_metrics()
prediction_time = datetime.now() + timedelta(minutes=10)
prediction = await self.predictor.predict_load(current_metrics, prediction_time)
self.logger.info(
f"Load prediction: {prediction.predicted_requests:.1f} requests, "
f"recommendation: {prediction.recommendation}"
)
if prediction.recommendation == 'scale_up':
await self._scale_up(prediction)
elif prediction.recommendation == 'scale_down':
await self._scale_down(prediction)
# Record scaling decision for model improvement
self.scaling_history.append({
'timestamp': datetime.now(),
'prediction': prediction,
'action_taken': prediction.recommendation,
'current_metrics': current_metrics
})
except Exception as e:
self.logger.error(f"Predictive scaling error: {e}")
async def _scale_up(self, prediction: LoadPrediction):
"""Scale up agent pods based on prediction."""
current_replicas = await self._get_current_replicas()
# Calculate required replicas with safety margin
requests_per_pod_per_hour = 20 # Conservative estimate
required_replicas = max(
current_replicas + 1, # Minimum increment
int(np.ceil(prediction.predicted_requests / requests_per_pod_per_hour * 1.2)) # 20% safety margin
)
target_replicas = min(required_replicas, self.max_replicas)
if target_replicas > current_replicas:
await self._update_replicas(target_replicas)
self.logger.info(f"Scaled up from {current_replicas} to {target_replicas} replicas")
async def _scale_down(self, prediction: LoadPrediction):
"""Scale down agent pods based on prediction."""
current_replicas = await self._get_current_replicas()
# Conservative scale-down to avoid thrashing
if len(self.scaling_history) > 0:
recent_scale_up = any(
h['action_taken'] == 'scale_up'
for h in self.scaling_history[-3:] # Last 3 decisions
)
if recent_scale_up:
self.logger.info("Skipping scale-down due to recent scale-up")
return
requests_per_pod_per_hour = 15 # More conservative for scale-down
required_replicas = max(
self.min_replicas,
int(np.ceil(prediction.predicted_requests / requests_per_pod_per_hour))
)
target_replicas = max(required_replicas, current_replicas - 1) # Maximum decrement
if target_replicas < current_replicas:
await self._update_replicas(target_replicas)
self.logger.info(f"Scaled down from {current_replicas} to {target_replicas} replicas")
async def _get_current_replicas(self) -> int:
"""Get current number of replicas from Kubernetes."""
# Implementation would use Kubernetes API
pass
async def _update_replicas(self, target_replicas: int):
"""Update deployment replica count."""
# Implementation would use Kubernetes API
pass
# Additional helper methods for metrics collection...
async def _get_queue_depth(self) -> float:
# Redis queue length
pass
async def _get_gpu_utilization(self) -> float:
# NVIDIA DCGM metrics
pass
Model Training Pipeline:
class ModelTrainingPipeline:
"""
Automated training pipeline for predictive scaling models.
Runs weekly to incorporate new patterns and seasonal changes.
"""
def __init__(self, prometheus_client, model_storage_path: str):
self.prometheus = prometheus_client
self.storage_path = model_storage_path
self.logger = logging.getLogger(__name__)
async def collect_training_data(self, days_back: int = 30) -> pd.DataFrame:
"""Collect historical metrics from Prometheus for training."""
end_time = datetime.now()
start_time = end_time - timedelta(days=days_back)
queries = {
'request_count': 'sum(rate(neko_api_requests_total[5m])) * 300',
'gpu_hours_consumed': 'sum(neko_gpu_utilization_percent) / 100 * 5/60',
'memory_gb_peak': 'max(neko_memory_usage_bytes) / 1024/1024/1024',
'total_task_seconds': 'sum(neko_task_duration_seconds)',
'tts_requests': 'sum(rate(neko_tts_requests_total[5m])) * 300',
'capture_requests': 'sum(rate(neko_capture_episodes_total[5m])) * 300',
# Add more metrics as needed
}
historical_data = []
current_time = start_time
while current_time < end_time:
row = {'timestamp': current_time}
for metric_name, query in queries.items():
try:
result = await self.prometheus.query(query, time=current_time)
row[metric_name] = float(result[0]['value'][1]) if result else 0.0
except Exception as e:
self.logger.warning(f"Failed to get {metric_name}: {e}")
row[metric_name] = 0.0
historical_data.append(row)
current_time += timedelta(minutes=5)
return pd.DataFrame(historical_data)
async def train_and_evaluate(self):
"""Train models and evaluate performance."""
self.logger.info("Starting model training pipeline...")
# Collect data
training_data = await self.collect_training_data()
# Split train/test (last 20% for testing)
split_idx = int(len(training_data) * 0.8)
train_data = training_data[:split_idx]
test_data = training_data[split_idx:]
# Train models
predictor = NekoLoadPredictor()
await predictor.train_models(train_data)
# Evaluate performance
mae_scores = []
mape_scores = []
for _, row in test_data.iterrows():
if row.name % 12 == 0: # Every hour
current_metrics = {
'queue_depth': 0, # Simplified for evaluation
'active_pods': 3,
'gpu_utilization_pct': 50,
# ... other metrics
}
prediction = await predictor.predict_load(
current_metrics,
row['timestamp'] + timedelta(minutes=10)
)
actual = row['request_count']
predicted = prediction.predicted_requests
mae_scores.append(abs(actual - predicted))
if actual > 0:
mape_scores.append(abs(actual - predicted) / actual * 100)
mae = np.mean(mae_scores)
mape = np.mean(mape_scores)
self.logger.info(f"Model evaluation - MAE: {mae:.2f}, MAPE: {mape:.2f}%")
# Save model if performance is acceptable
if mape < 20: # 20% MAPE threshold
await self._save_model(predictor)
self.logger.info("Model saved successfully")
else:
self.logger.warning("Model performance below threshold, keeping previous version")
async def _save_model(self, predictor: NekoLoadPredictor):
"""Save trained model to persistent storage."""
import joblib
model_data = {
'prophet_model': predictor.prophet_model,
'xgb_model': predictor.xgb_model,
'trained_at': datetime.now(),
'feature_columns': predictor.feature_columns
}
joblib.dump(model_data, f"{self.storage_path}/neko_scaling_model.pkl")
Integration with Kubernetes HPA:
# Custom HPA with predictive scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: neko-agent-predictive-hpa
namespace: neko-agents
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: neko-agent-pool
minReplicas: 2
maxReplicas: 50
metrics:
- type: External
external:
metric:
name: neko_predicted_load
selector:
matchLabels:
deployment: neko-agent-pool
target:
type: Value
value: "20" # Target requests per pod
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # React quickly to spikes
policies:
- type: Percent
value: 50 # Max 50% increase per minute
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # Conservative scale-down
policies:
- type: Percent
value: 10 # Max 10% decrease per 5 minutes
periodSeconds: 300
Multi-Region Deployment:
- Deploy agent pools across multiple availability zones
- Use regional load balancers with latency-based routing
- Cross-region task replication for disaster recovery
6. Monitoring and Observability
6.1 Metrics Collection
System Metrics:
- Request rate, latency, and error rate
- Queue depth and processing time
- Resource utilization (CPU, GPU, memory)
- Agent lifecycle events
Business Metrics:
- Task success/failure rates by type
- Average task completion time
- User adoption and API usage patterns
- Cost per task execution
Custom Prometheus Metrics:
from prometheus_client import Gauge, Histogram

# Agent-specific metrics
ACTIVE_AGENTS = Gauge('neko_active_agents_total', 'Number of active agent instances')
TASK_DURATION = Histogram('neko_task_duration_seconds', 'Task execution time', ['task_type'])
QUEUE_SIZE = Gauge('neko_task_queue_size', 'Number of tasks in queue')
GPU_UTILIZATION = Gauge('neko_gpu_utilization_percent', 'GPU utilization per agent', ['agent_id'])
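For TASK_DURATION to be useful, every task execution path should be timed with a consistent label set, i.e. `TASK_DURATION.labels(task_type=...).observe(elapsed)`. The sketch below uses a dependency-free stand-in for the histogram so the timing pattern itself is visible; in production the dict would be the real prometheus_client Histogram:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Stand-in for TASK_DURATION above; prometheus_client's Histogram follows
# the same pattern via TASK_DURATION.labels(task_type=t).observe(elapsed).
observations = defaultdict(list)

@contextmanager
def observe_task(task_type: str):
    """Record the wall-clock duration of a task under its task_type label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        observations[task_type].append(time.perf_counter() - start)

with observe_task("web"):
    time.sleep(0.01)  # placeholder for real task execution
```

Wrapping execution in a context manager guarantees failed tasks are timed too, which keeps latency percentiles honest.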
6.2 Logging Strategy
Structured Logging:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "INFO",
"agent_id": "agent-pod-xyz",
"task_id": "task_123456",
"event": "task_completed",
"duration": 45.2,
"actions_executed": 8,
"success": true
}
Log Aggregation: ELK Stack (Elasticsearch, Logstash, Kibana) or equivalent
Distributed Tracing: OpenTelemetry with Jaeger for request flow tracking
7. Security Considerations
7.1 API Security
Authentication: JWT tokens with role-based access control
Authorization: Resource-level permissions and quota enforcement
Input Validation: Request schema validation and sanitization
Rate Limiting: Per-user and per-endpoint limits
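Per-user rate limiting is commonly implemented as a token bucket, which permits short bursts while capping the sustained rate. A minimal sketch with an injectable clock so the behavior is deterministic; in the gateway this would sit in FastAPI middleware keyed by user ID:

```python
import time

class TokenBucket:
    """Per-user limiter: refills `rate` tokens per second up to `capacity`."""
    def __init__(self, rate: float, capacity: int, now: float = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic() if now is None else now

    def allow(self, now: float = None) -> bool:
        """Consume one token if available; `now` is injectable for testing."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket with `rate=1.0, capacity=2` admits a burst of two requests, then one request per second thereafter.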
7.2 Container Security
Image Scanning: Vulnerability scanning in CI/CD pipeline
Runtime Security: Pod Security Standards (PSS) and AppArmor profiles
Secret Management: Kubernetes secrets with encryption at rest
Network Policies: Microsegmentation within the cluster
7.3 Model Security
Model Integrity: Checksum verification for downloaded models
Inference Security: Input validation and output filtering
Resource Isolation: GPU memory isolation between agents
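Checksum verification for model artifacts is a straight SHA-256 comparison against a digest pinned at build or publish time:

```python
import hashlib

def verify_model_checksum(path: str, expected_sha256: str) -> bool:
    """Stream the model file and compare its SHA-256 to a pinned digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

Running this at container start, before loading weights, turns a corrupted or tampered download into a hard failure rather than silent bad inference.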
8. Implementation Roadmap
Phase 1: Core API Development (4 weeks)
- Implement FastAPI gateway with basic endpoints
- Modify agent.py for API mode support
- Create Docker containers for all components
- Basic Kubernetes deployment manifests
Phase 2: Training Data & TTS Integration (3 weeks)
- Deploy capture.py as sidecar service with MDS integration
- Implement yap.py TTS service with F5-TTS and WebRTC
- Configure voice management and persistent storage
- Set up S3 integration for training data upload
- Add new API endpoints for capture and TTS
Phase 3: Scaling Infrastructure (3 weeks)
- Implement agent manager and orchestration logic
- Set up Redis task queue and job processing
- Configure HPA with custom metrics for multi-container pods
- Load testing and performance optimization with sidecars
Phase 4: Production Hardening (3 weeks)
- Implement monitoring and alerting for all services
- Security hardening and compliance (including voice data)
- Documentation and API reference (capture + TTS endpoints)
- CI/CD pipeline integration with multi-stage builds
Phase 5: Advanced Features (4 weeks)
- Predictive scaling algorithms with multi-container awareness
- Multi-region deployment with voice model replication
- Advanced monitoring and cost optimization for GPU sharing
- Performance analytics and training data insights
9. Cost Analysis
9.1 Resource Requirements
Per Enhanced Agent Pod (Main + Sidecars):
- GPU: 1x NVIDIA T4 or A10 (shared: 1.0 agent + 0.5 TTS) ($0.35-$1.10/hour)
- CPU: 3.5-7 cores (2-4 agent + 1-2 TTS + 0.5-1 capture) ($0.14-$0.28/hour)
- Memory: 6.5-13GB (4-8 agent + 2-4 TTS + 0.5-1 capture) ($0.0065-$0.013/hour)
- Storage: 60GB SSD + 10GB voice models ($0.007/hour)
Additional Components per Pod:
- Capture sidecar: Minimal overhead, primarily I/O and storage
- TTS sidecar: GPU sharing, ~50% memory overhead
- Voice storage: Shared across pods via NFS/PVC
Total Cost per Enhanced Pod/Hour: ~$0.52-$1.42 (+30% for sidecars)
Supporting Infrastructure:
- Load balancer: $0.025/hour
- Redis cluster: $0.10-$0.30/hour
- Monitoring stack: $0.05-$0.15/hour
9.2 Cost Optimization Strategies
- Spot Instances: 60-90% cost reduction for fault-tolerant workloads
- Right-sizing: Dynamic resource allocation based on workload
- Reserved Instances: 30-60% discount for predictable baseline load
- Multi-tenancy: Share GPU resources across multiple small tasks
10. Conclusion
The proposed API mode architecture transforms the existing Neko Agent from a single-session WebRTC client into a comprehensive, cloud-native AI automation platform. The integration of capture.py and yap.py creates a complete solution with training data collection and real-time voice interaction capabilities.
Key Benefits:
- Horizontal Scalability: Support for hundreds of concurrent automation tasks with integrated sidecars
- Training Data Pipeline: Automated collection of high-quality training episodes in MosaicML format
- Voice Interaction: Real-time TTS with F5-TTS and custom voice models via WebRTC
- Cost Efficiency: Dynamic scaling with GPU sharing between agent and TTS workloads
- Developer Experience: RESTful API with training and voice management endpoints
- Production Ready: Monitoring, security, and operational excellence across all services
- Future Proof: Built on modern container orchestration with ML/AI workflow patterns
Enhanced Capabilities:
- Self-Improving: Continuous training data collection enables model refinement
- Multimodal: Visual automation + audio output for richer user experiences
- Enterprise Ready: S3 integration, voice asset management, and compliance features
- Flexible Deployment: Sidecar pattern allows optional feature enablement per workload
The implementation leverages proven 2024 best practices for AI workload orchestration, GPU resource management, and microservices architecture, while adding cutting-edge training data collection and voice synthesis capabilities.
- Total Implementation Estimate: 17 weeks for full production deployment (including training/TTS features)
- Expected Performance: 100+ concurrent tasks, <2s API response time, 99.9% availability, <500ms TTS latency
- Estimated Operating Cost: $0.52-$1.42 per enhanced pod-hour, with potential 50-60% reduction through optimization
- Training Data Output: ~2-5GB per hour of operation, ready for distributed training pipelines
References
- Kubernetes AI/ML Best Practices 2024
- WebRTC Scaling Patterns with STUNner
- NVIDIA GPU Operator for Kubernetes
- KubeAI: AI Inference Operator
- Prometheus Operator for Monitoring
Glossary
General Terms
Neko Server : A containerized remote desktop solution that provides browser-based access to desktop environments via WebRTC.
WebRTC : Web Real-Time Communication protocol used for peer-to-peer audio, video, and data transmission in web browsers.
ICE (Interactive Connectivity Establishment) : Protocol for establishing peer-to-peer connections across NATs and firewalls.
SDP (Session Description Protocol) : Protocol for describing media sessions and connection parameters.
Component-Specific Terms
Manual Control CLI
REPL (Read-Eval-Print Loop) : Interactive command-line interface for real-time control and testing.
Signaler : WebSocket connection manager that handles message routing and reconnection.
Broker : Event distribution system that routes incoming messages to appropriate topic queues.
Normalized Coordinates : Coordinate system using 0.0-1.0 range instead of pixel values for resolution independence.
Host Control : Exclusive mouse and keyboard control permission on the Neko server.
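Normalized coordinates can be illustrated with a small sketch. The function name and clamping behavior here are illustrative, not taken from the CLI's implementation:

```python
def to_pixels(nx, ny, width, height):
    """Convert normalized (0.0-1.0) coordinates to integer pixel positions.

    Clamping keeps slightly out-of-range model outputs on-screen.
    """
    nx = min(max(nx, 0.0), 1.0)
    ny = min(max(ny, 0.0), 1.0)
    return round(nx * (width - 1)), round(ny * (height - 1))

# The same normalized point lands proportionally on any resolution,
# which is why normalized coordinates are resolution-independent:
print(to_pixels(0.5, 0.5, 1920, 1080))  # centre of a 1080p screen
print(to_pixels(0.5, 0.5, 1280, 720))   # centre of a 720p screen
```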
Core Agent
Action Space : Set of possible actions the agent can perform (click, type, scroll, etc.).
Observation Space : Visual and contextual information the agent receives from the environment.
Training Data : Collected sequences of observations and actions used for model training.
Capture Service
MDS (Mosaic Data Shard) : Sharded binary dataset format used by MosaicML Streaming, optimized for streaming and distributed training.
Trajectory : Sequence of state-action pairs representing a complete task execution.
Temporal Alignment : Synchronization of visual frames with corresponding action timestamps.
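Temporal alignment can be sketched as nearest-timestamp matching between actions and captured frames. The data layout below is illustrative, not the capture service's actual format:

```python
import bisect

def align_actions_to_frames(frame_ts, actions):
    """Pair each action with the nearest captured frame by timestamp.

    frame_ts: sorted list of frame capture times (seconds).
    actions:  list of (timestamp, action) tuples.
    Returns a list of (frame_index, action) pairs.
    """
    aligned = []
    for ts, action in actions:
        i = bisect.bisect_left(frame_ts, ts)
        # Compare the two neighbouring frames and keep the closer one.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_ts)]
        best = min(candidates, key=lambda j: abs(frame_ts[j] - ts))
        aligned.append((best, action))
    return aligned

frames = [0.0, 0.5, 1.0, 1.5]                  # 2 fps capture
acts = [(0.48, "click"), (1.4, "type")]
print(align_actions_to_frames(frames, acts))   # [(1, 'click'), (3, 'type')]
```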
TTS Service
Voice Synthesis : Process of generating spoken audio from text input.
Audio Streaming : Real-time transmission of audio data over WebRTC channels.
Voice Activity Detection (VAD) : Technique for detecting the presence of human speech in an audio stream.
Technical Terms
Async/Await : Python asynchronous programming pattern for concurrent execution.
Task Cancellation : Graceful shutdown mechanism for asyncio background tasks.
Exponential Backoff : Retry strategy with increasing delays between attempts.
Event Loop : Core of asyncio that manages and executes asynchronous operations.
Queue (asyncio.Queue) : Coroutine-safe queue for passing data between asyncio tasks; unlike queue.Queue, it is not thread-safe and must be used from a single event loop.
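Exponential backoff in an async context can be sketched as follows. The retry count, delays, and jitter policy are illustrative defaults, not the agent's actual configuration:

```python
import asyncio
import random

async def with_backoff(op, retries=5, base=0.5, cap=30.0):
    """Retry an async operation with exponential backoff and jitter.

    The delay doubles each attempt (base, 2*base, 4*base, ...) up to
    `cap`, with random jitter added to avoid reconnection stampedes.
    """
    for attempt in range(retries):
        try:
            return await op()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts; surface the failure
            delay = min(cap, base * 2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))

# Example: an operation that fails twice before succeeding.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "connected"

result = asyncio.run(with_backoff(flaky, base=0.01))
print(result)  # connected
```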
Protocol Terms
Event Payload : Data structure containing the actual content of WebSocket messages.
Heartbeat : Periodic messages sent to maintain connection and detect timeouts.
Session ID : Unique identifier for each client connection to the Neko server.
Control Protocol : Set of message types and formats for remote desktop interaction.
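A heartbeat sender pairs naturally with graceful task cancellation. This is a minimal sketch; the event name and payload shape are assumptions, not the Neko control protocol's actual message format:

```python
import asyncio
import json
import time

async def heartbeat(send, interval=10.0):
    """Send periodic heartbeat events until the task is cancelled."""
    try:
        while True:
            await send(json.dumps({"event": "heartbeat", "ts": time.time()}))
            await asyncio.sleep(interval)
    except asyncio.CancelledError:
        pass  # swallow cancellation so shutdown is graceful

async def demo():
    sent = []

    async def send(msg):        # stub transport standing in for a WebSocket
        sent.append(msg)

    task = asyncio.create_task(heartbeat(send, interval=0.01))
    await asyncio.sleep(0.03)   # let a few heartbeats go out
    task.cancel()
    await task                  # completes normally; cancellation was handled
    return sent

messages = asyncio.run(demo())
print(len(messages))  # at least one heartbeat was sent
```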
Development Terms
mdBook : Documentation generator that creates books from Markdown files.
PEP 257 : Python Enhancement Proposal defining docstring conventions.
PEP 287 : Python Enhancement Proposal defining reStructuredText format for docstrings.
Type Hints : Python annotations that specify expected types for function parameters and return values.