Neko Agent Documentation

Welcome to the Neko Agent documentation! This project provides an AI-powered automation agent that interfaces with Neko servers to perform GUI automation tasks through WebRTC connections.

What is Neko Agent?

Neko Agent is a Python-based automation system that:

  • Connects to Neko servers via WebRTC for real-time GUI interaction
  • Uses AI vision models (ShowUI-2B/Qwen2VL) for visual reasoning and action planning
  • Captures training data automatically during automation sessions
  • Provides voice synthesis capabilities via F5-TTS and WebRTC audio
  • Supports various actions like clicking, typing, scrolling, and navigation

Key Components

  • src/agent.py - Core automation agent with WebRTC integration
  • src/capture.py - Training data capture service using MosaicML Streaming
  • src/yap.py - Text-to-speech service with F5-TTS and voice management

Getting Started

To get started with Neko Agent:

  1. Set up the development environment using Nix flakes
  2. Configure a Neko server for GUI automation targets
  3. Run the agent to begin automation tasks
  4. Optionally enable capture for training data collection

See the Getting Started guide for detailed setup instructions.

Architecture

The system is designed around WebRTC-based communication with Neko containers, allowing for:

  • Real-time video frame analysis and action execution
  • Automated training data collection in MosaicML format
  • Voice feedback through WebRTC audio streams
  • Scalable deployment patterns for production use

For technical details, see the Architecture Overview.

Getting Started

This guide will help you set up and run the Neko Agent system for AI-powered GUI automation.

Prerequisites

  • Nix with flakes enabled - for development environment
  • Neko server - target environment for automation
  • NVIDIA GPU (optional) - for optimal AI model performance

Development Environment Setup

The project uses Nix flakes for reproducible development environments. Choose the appropriate shell based on your needs:

Available Development Shells

# Basic CPU environment
nix develop

# GPU-accelerated environment (recommended)
nix develop .#gpu

# Shell with additional AI CLI tools
nix develop .#ai

# Docker-compose helper environment for the bundled Neko stack
nix develop .#neko

# Documentation tooling
nix develop .#docs

# Optimised variants
nix develop .#cpu-opt
nix develop .#gpu-opt

# Trusted Execution Environment tooling
nix develop .#tee

For the best performance with AI models:

nix develop .#gpu

This provides:

  • CUDA 12.8 toolkit and libraries
  • PyTorch with CUDA support
  • All required Python dependencies
  • Pre-configured environment variables

Neko Server Setup

The agent requires a running Neko server to connect to. You can either:

Option 1: Use Docker Compose (Included)

# Enter the Neko environment
nix develop .#neko

# Start Neko services
neko-services up

# View logs
neko-services logs

# Stop services
neko-services down

This starts a Neko server at http://localhost:8080 with Chrome ready for automation.

Option 2: External Neko Server

Configure connection to an existing Neko server by setting environment variables:

export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"

Running the Agent

Basic Usage

# Run with a simple task
uv run src/agent.py --task "Navigate to google.com and search for 'AI automation'"

# Run with REST authentication (the agent performs the login handshake)
uv run src/agent.py \
  --task "Your automation task" \
  --neko-url "http://localhost:8080" \
  --username "user" \
  --password "password"

# Keep the agent online and accept new tasks from chat
uv run src/agent.py --online --neko-url "http://localhost:8080" --username user --password password

With Training Data Capture

To enable training data collection during automation:

# Terminal 1: Start capture service
uv run src/capture.py

# Terminal 2: Run agent (capture watches chat messages for /start and /stop)
uv run src/agent.py --task "Your task description"

See Training Data Capture for detailed capture configuration.

With Voice Output

For voice feedback during automation:

# Terminal 1: Start TTS service (F5-TTS + WebRTC audio)
uv run src/yap.py

# Terminal 2: Run the agent – leave audio enabled (default) so the browser hears YAP
uv run src/agent.py --task "Describe what you see"

Configuration

Environment Variables

Create a .env file in the project root (copy .env.example first):

# Neko server connection
NEKO_URL=http://localhost:8080
NEKO_USER=user
NEKO_PASS=password

# Agent behaviour
NEKO_TASK="Default task description"
NEKO_MODE=web
NEKO_MAX_STEPS=8
NEKO_AUDIO=1
REFINEMENT_STEPS=5
NEKO_ICE_POLICY=strict
NEKO_RTCP_KEEPALIVE=0
NEKO_SKIP_INITIAL_FRAMES=5
NEKO_FORCE_EXIT_GUARD_MS=0
NEKO_LOGLEVEL=INFO
NEKO_LOG_FORMAT=text
NEKO_METRICS_PORT=9000
# FRAME_SAVE_PATH="./tmp/frame.png"
# CLICK_SAVE_PATH="./tmp/actions"

# Capture settings (optional)
CAPTURE_OUT=./data/mds
CAPTURE_FPS=2.0
CAPTURE_REMOTE=s3://your-bucket/training-data

# TTS settings (optional)
YAP_VOICES_DIR=./voices
YAP_SR=48000
YAP_PARALLEL=2
YAP_MAX_CHARS=350

GPU Configuration

For optimal GPU usage:

# Set specific GPU
export CUDA_VISIBLE_DEVICES=0

# Configure memory allocation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Set target architecture
export TORCH_CUDA_ARCH_LIST=8.6

Verification

Test your setup:

# Check dependencies
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

# Test agent connection
uv run src/agent.py --healthcheck

# Explore capture options (the script exits when interrupted)
uv run src/capture.py --help

# Test TTS (if enabled)
uv run src/yap.py --healthcheck

Next Steps

  • See the Manual Control Guide for interactive control of a Neko server
  • See the Training Data Capture guide to record episodes for model training

Troubleshooting

Common Issues

CUDA not detected:

# Verify NVIDIA drivers
nvidia-smi

# Check CUDA installation in shell
echo $CUDA_HOME

Neko connection failed:

# Test Neko server accessibility
curl http://localhost:8080/health

# Check WebSocket endpoint
curl -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" \
  http://localhost:8080/api/ws

Python import errors:

# Verify key imports inside the GPU shell
nix develop .#gpu --command python -c "import streaming; import torch; print('OK')"

For more help, see the Developer Guide or check the project issues.

Manual Control Guide

The Manual Control CLI is an interactive tool that lets you remotely control any Neko server through your terminal. Think of it as a command-line remote desktop that you can use for testing, administration, or automating tasks on hosted desktop environments.

Quick Start

Prerequisites

  • Python 3.8 or newer
  • Access to a Neko server (either local or hosted)
  • Network connectivity to the Neko server

Launching the CLI

The manual control tool ships as src/manual.py inside this repository. From a Nix shell you can run:

uv run src/manual.py --help

First Connection

The easiest way to connect is using your Neko server credentials:

uv run src/manual.py \
  --neko-url "https://your-neko-server.com" \
  --username "your-username" \
  --password "your-password"

If successful, you'll see:

Starting Neko manual CLI. Type 'help' for commands. Ctrl+D or 'quit' to exit.
neko> 

Basic Usage

Getting Help

Type help at any time to see all available commands:

neko> help

This displays a comprehensive list of all commands with their syntax.

Essential Commands

Mouse Control

# Move mouse cursor to specific coordinates
neko> move 100 200

# Click at current mouse position
neko> click

# Move and click in one command
neko> tap 300 400

# Right-click
neko> rclick

# Double-click
neko> dblclick

Keyboard Input

# Type text
neko> text "Hello, World!"

# Press specific keys
neko> key Enter
neko> key Escape
neko> key F5

# Click somewhere and then type
neko> input 500 300 "username@example.com"

Scrolling and Gestures

# Scroll in different directions
neko> scroll down 3
neko> scroll up
neko> scroll left 2

# Drag/swipe gestures
neko> swipe 100 100 300 300

CLI switches

manual.py mirrors the agent’s authentication behaviour: provide either a full WebSocket URL (--ws) or REST credentials (--neko-url, --username, --password). Additional flags include:

  • --norm: Interpret coordinates as 0.0–1.0 floats instead of pixels
  • --size 1920x1080: Override the reported screen size when using pixel coordinates
  • --no-auto-host: Do not automatically request host control on connect
  • --no-media: Skip WebRTC negotiation (only for debugging signalling)
  • --no-audio: Disable audio subscription

By default the CLI learns the keyboard layout advertised by the server and stores logs under NEKO_LOGFILE if that environment variable is set.

Normalised coordinates example

uv run src/manual.py --norm \
  --neko-url "https://your-server.com" \
  --username "admin" --password "secret"

# Now coordinates are between 0 and 1
neko> move 0.5 0.5    # Center of screen
neko> tap 0.1 0.9     # Bottom-left area
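Under --norm the CLI performs this mapping for you, but the conversion is easy to sketch. The helper below is purely illustrative (manual.py's own scaling logic may differ) and shows how a 0.0–1.0 coordinate maps onto a concrete screen size:

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple:
    # Map normalised 0.0-1.0 coordinates onto a pixel grid, clamping so
    # rounding at the edges never lands outside the frame.
    px = min(int(round(x * width)), width - 1)
    py = min(int(round(y * height)), height - 1)
    return px, py
```

For example, `to_pixels(0.5, 0.5, 1920, 1080)` gives the centre of a 1920x1080 screen, and `to_pixels(1.0, 1.0, ...)` clamps to the bottom-right pixel rather than overshooting.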

Common Use Cases

Website Testing

Test web applications by automating browser interactions:

# Navigate to a website
neko> tap 400 60              # Click address bar
neko> text "https://example.com"
neko> key Enter

# Fill out a form
neko> tap 300 200             # Click username field
neko> text "testuser"
neko> key Tab                 # Move to next field
neko> text "password123"
neko> key Enter               # Submit form

# Test navigation
neko> tap 500 300             # Click a link
neko> key F5                  # Refresh page
neko> key Alt+Left            # Go back

Application Testing

Test desktop applications:

# Open application menu
neko> key Super               # Windows/Super key
neko> text "calculator"
neko> key Enter

# Use the application
neko> tap 200 300             # Click number 5
neko> tap 250 350             # Click plus
neko> tap 200 300             # Click number 5
neko> tap 300 400             # Click equals

System Administration

Perform administrative tasks:

# Open terminal
neko> key Ctrl+Alt+t

# Run system commands
neko> text "sudo apt update"
neko> key Enter
neko> text "your-password"    # If prompted
neko> key Enter

# Navigate file system
neko> text "cd /var/log"
neko> key Enter
neko> text "ls -la"
neko> key Enter

File Management

Work with files and folders:

# Open file manager
neko> key Super
neko> text "files"
neko> key Enter

# Navigate and create files
neko> key Ctrl+n              # New folder
neko> text "TestFolder"
neko> key Enter
neko> dblclick                # Enter folder
neko> rclick                  # Right-click for context menu
neko> text "n"                # New file (if available)

Advanced Features

Clipboard Operations

# Copy, cut, and paste
neko> select_all              # Select all text
neko> copy                    # Copy selection
neko> tap 400 500             # Click elsewhere
neko> paste                   # Paste content

# Paste specific text
neko> paste "This is custom text to paste"

Session Management

If you have admin rights, you can manage other users:

# List all connected users
neko> sessions

# Force take control from another user
neko> force-take

# Kick a specific user (use session ID from 'sessions' command)
neko> kick abc123-def456-789

# Release control
neko> force-release

Screen Size Management

# Check current screen size
neko> size

# Set virtual screen size (affects coordinate scaling)
neko> size 1920x1080

Raw Protocol Access

For advanced users, send custom commands:

# Send custom JSON messages to the server
neko> raw '{"event":"system/stats"}'
neko> raw '{"event":"control/scroll","payload":{"delta_x":0,"delta_y":240}}'

Configuration

Environment Variables

Set these to avoid typing credentials each time:

export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"  
export NEKO_PASS="your-password"
export NEKO_SIZE="1920x1080"

# Now you can just run:
uv run src/manual.py

Command Line Options

  • --ws: Direct WebSocket URL (e.g. wss://neko.example.com/api/ws?token=...)
  • --neko-url: Neko server URL for REST login (e.g. https://neko.example.com)
  • --username: Login username (e.g. admin)
  • --password: Login password (e.g. secretpass)
  • --norm: Use 0-1 coordinates (flag only)
  • --size: Virtual screen size (e.g. 1920x1080)
  • --no-auto-host: Don't auto-request control (flag only)
  • --no-media: Skip video/audio setup (flag only)
  • --no-audio: Disable audio (flag only)

Logging

Enable detailed logging for troubleshooting:

export NEKO_LOGLEVEL="DEBUG"
export NEKO_LOGFILE="/tmp/neko-manual.log"
uv run src/manual.py --neko-url "..." --username "..." --password "..."

# In another terminal, watch the logs:
tail -f /tmp/neko-manual.log

Tips and Best Practices

Timing and Delays

For automation scripts, you may need to add delays between actions:

# The tool doesn't have built-in delays, but you can use shell scripting:
echo -e "tap 300 200\ntext hello\nkey Enter" | uv run src/manual.py --neko-url "..." --username "..." --password "..."

Screen Resolution

Always check the screen size when starting:

neko> size
size 1920x1080  normalized=false

This helps you understand the coordinate system you're working with.

Connection Issues

If you have connection problems:

  1. Check credentials: Make sure username/password are correct
  2. Verify URL: Ensure the Neko server URL is accessible
  3. Network connectivity: Test basic HTTP access to the server
  4. Enable debug logging: Use NEKO_LOGLEVEL=DEBUG for detailed logs

Multiple Users

Be aware that Neko servers can have multiple users connected:

  • Only one user can have "host" control at a time
  • Use sessions command to see who else is connected
  • Use host and unhost to request/release control
  • Admin users can use force-take and force-release

Automation Scripts

For repetitive tasks, consider creating shell scripts:

#!/bin/bash
# test-website.sh

export NEKO_URL="https://neko.example.com"
export NEKO_USER="admin"
export NEKO_PASS="password"

{
    echo "tap 400 60"                    # Click address bar
    echo "text https://example.com"      # Type URL
    echo "key Enter"                     # Navigate
    sleep 3                             # Wait for page load
    echo "tap 300 200"                  # Click login button
    echo "text testuser"                # Username
    echo "key Tab"                      # Next field
    echo "text testpass"                # Password  
    echo "key Enter"                    # Submit
} | uv run src/manual.py
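If you prefer Python over a shell here-doc, the same idea can be sketched with subprocess. run_commands is a hypothetical helper, not part of this repository; it feeds commands to the CLI's stdin with a pause between each:

```python
import subprocess
import time

def run_commands(commands, argv=("uv", "run", "src/manual.py"), delay=0.5):
    # Feed CLI commands one per line, pausing between them so the remote
    # desktop has time to react; mirrors the shell script above.
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, text=True)
    for cmd in commands:
        proc.stdin.write(cmd + "\n")
        proc.stdin.flush()
        time.sleep(delay)
    proc.stdin.close()
    return proc.wait()
```

For a dry run, any line-consuming program (such as cat) can stand in for the CLI by passing a different argv.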

Troubleshooting

Common Issues

"Connection failed"

  • Verify the Neko server is running and accessible
  • Check that the URL, username, and password are correct
  • Ensure no firewall is blocking the connection

"Command not found: manual.py"

  • Make sure you're in the correct directory
  • Use the full path: python /path/to/src/manual.py

"No host control"

  • Another user may have control; use sessions to check
  • Try host command to request control
  • Admin users can use force-take

Commands not working

  • Verify you have host control
  • Check if coordinates are correct for the screen size
  • Use size command to verify resolution

High latency

  • Network delay between you and the Neko server
  • Try reducing the number of rapid commands
  • Consider using a Neko server closer to your location

Getting Help

If you encounter issues:

  1. Enable debug logging with NEKO_LOGLEVEL=DEBUG
  2. Check the log output for error messages
  3. Verify basic connectivity to the Neko server
  4. Review the Developer Guide for technical details

The Manual Control CLI is a powerful tool for remote desktop automation and testing. With practice, you can efficiently control any Neko server environment from your terminal.

Training Data Capture User Guide

The Neko capture tool records user interface interactions as training data for machine learning models. It captures screenshots and action sequences from Neko browser sessions, packaging them into datasets ready for training computer vision and UI automation models.

Quick Start

Prerequisites

  • Running Neko server with WebSocket API enabled
  • Python environment with mosaicml-streaming package installed
  • (Optional) S3-compatible storage credentials for remote upload

Basic Setup

  1. Install dependencies:

    pip install mosaicml-streaming requests websockets
    
  2. Set your Neko connection:

    export NEKO_URL="https://your-neko-server.com"
    export NEKO_USER="your-username"
    export NEKO_PASS="your-password"
    
  3. Start capturing:

    python src/capture.py
    
  4. Record an episode in Neko chat:

    /start Navigate to login page and enter credentials
    Action: {"type": "click", "coordinate": [150, 200]}
    Action: {"type": "type", "text": "username"}
    /stop
    

That's it! Your episode is now saved as training data in ./data/mds/.

What Gets Captured

The capture tool records complete "episodes" of UI interaction:

  • Screenshots: JPEG images captured at regular intervals (default: 2 FPS)
  • Actions: Structured annotations of user interactions (clicks, typing, etc.)
  • Metadata: Task descriptions, timestamps, screen dimensions
  • Context: Complete session information for reproducible training

Each episode becomes a single training sample packaged as:

episode.zip
├── meta.json           # Episode metadata and schema
├── frames/
│   ├── 000000.jpg     # Sequential screenshots
│   ├── 000001.jpg
│   └── ...
├── frames.ndjson      # Frame timing information
└── actions.ndjson     # Action annotations
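For illustration, here is roughly how such an archive could be assembled in memory. pack_episode is a hypothetical helper (the real capture.py also records timestamps in frames.ndjson and a richer schema in meta.json):

```python
import io
import json
import zipfile

def pack_episode(meta: dict, frames: list, actions: list) -> bytes:
    # Assemble the episode layout shown above into an in-memory ZIP:
    # meta.json, sequentially numbered JPEG frames, and two NDJSON indexes.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("meta.json", json.dumps(meta))
        index = []
        for i, jpeg_bytes in enumerate(frames):
            name = f"frames/{i:06d}.jpg"
            zf.writestr(name, jpeg_bytes)
            index.append({"file": name})
        zf.writestr("frames.ndjson", "\n".join(json.dumps(e) for e in index))
        zf.writestr("actions.ndjson", "\n".join(json.dumps(a) for a in actions))
    return buf.getvalue()
```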

Architecture

Data Flow

  1. Connect to Neko WebSocket API (/api/ws) for chat monitoring
  2. Listen for task boundaries (/start and /stop commands) and action annotations
  3. Capture screenshots at configurable FPS via HTTP endpoint (/api/room/screen/shot.jpg)
  4. Package episodes as ZIP archives containing metadata, frames, and action sequences
  5. Write to MDS shards with automatic S3 upload and shard rotation
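Step 3 above can be sketched as a small polling loop. Here `fetch` is a stand-in for an HTTP GET against /api/room/screen/shot.jpg, injected so the loop can run without a live server; the real capture.py's pacing and error handling may differ:

```python
import time

def frame_name(index: int) -> str:
    # Sequential zero-padded names matching the episode layout (frames/000000.jpg).
    return f"frames/{index:06d}.jpg"

def poll_screenshots(fetch, fps: float = 2.0, max_frames: int = 10):
    # Call fetch() at roughly `fps` frames per second and collect
    # (filename, jpeg_bytes) pairs for the episode buffer.
    interval = 1.0 / fps
    frames = []
    for i in range(max_frames):
        frames.append((frame_name(i), fetch()))
        time.sleep(interval)
    return frames
```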

Episode Structure

Each episode is packaged as a ZIP archive containing:

episode.zip
├── meta.json           # Episode metadata and schema
├── frames/
│   ├── 000000.jpg     # Sequential JPEG screenshots
│   ├── 000001.jpg
│   └── ...
├── frames.ndjson      # Frame index with timestamps
└── actions.ndjson     # Action annotations with timestamps

Configuration

All configuration is handled via environment variables for 12-factor compliance:

Connection Settings

# REST login (preferred method)
export NEKO_URL="https://neko.example.com"
export NEKO_USER="username"
export NEKO_PASS="password"

# OR direct WebSocket URL (bypasses REST)
export NEKO_WS="wss://neko.example.com/api/ws?token=..."

Output Configuration

# Local MDS directory
export CAPTURE_OUT="./data/mds"

# Remote storage (optional)
export CAPTURE_REMOTE="s3://bucket/prefix"
export CAPTURE_KEEP_LOCAL=0          # 0=delete local after upload, 1=keep

# MDS shard settings
export CAPTURE_COMPRESSION="zstd:6"   # Compression algorithm
export CAPTURE_SHARD_SIZE="512mb"     # Size before shard rotation
export CAPTURE_HASHES="sha1"          # Integrity checking

Capture Parameters

export CAPTURE_FPS=2                  # Screenshots per second
export CAPTURE_JPEG_QUALITY=85        # JPEG quality (0-100)
export CAPTURE_MIN_FRAMES=4           # Minimum frames to save episode
export CAPTURE_EPISODE_TIMEOUT=900    # Auto-end after N seconds

S3 Authentication

For S3 or S3-compatible storage:

export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
export AWS_SESSION_TOKEN="..."        # Optional for temporary credentials

# For S3-compatible endpoints (MinIO, R2, etc.)
export S3_ENDPOINT_URL="https://s3.example.com"

Common Use Cases

Training UI Automation Models

Capture complete workflows for training models to automate repetitive tasks:

# Set up for high-quality capture
export CAPTURE_FPS=3
export CAPTURE_JPEG_QUALITY=90

Then in Neko chat:

/start Fill out customer registration form
Action: {"type": "click", "coordinate": [245, 156], "element": "first_name_field"}
Action: {"type": "type", "text": "John"}
Action: {"type": "click", "coordinate": [245, 186], "element": "last_name_field"}
Action: {"type": "type", "text": "Doe"}
Action: {"type": "click", "coordinate": [245, 216], "element": "email_field"}
Action: {"type": "type", "text": "john.doe@example.com"}
Action: {"type": "click", "coordinate": [300, 350], "element": "submit_button"}
/stop

Collecting Web Navigation Data

Record browsing sessions for training navigation models:

# Lower FPS for longer sessions
export CAPTURE_FPS=1
export CAPTURE_EPISODE_TIMEOUT=1800  # 30 minutes

Then in Neko chat:

/start Research product reviews for laptop purchase
Action: {"type": "navigate", "url": "https://amazon.com"}
Action: {"type": "type", "text": "gaming laptop", "element": "search_box"}
Action: {"type": "click", "coordinate": [400, 45], "element": "search_button"}
Action: {"type": "click", "coordinate": [200, 180], "element": "product_link"}
Action: {"type": "scroll", "direction": "down", "amount": 500}
/stop

Capturing Error Recovery Workflows

Document how to handle and recover from errors:

In Neko chat:

/start Handle login failure and password reset
Action: {"type": "type", "text": "wrong_password", "element": "password_field"}
Action: {"type": "click", "coordinate": [300, 200], "element": "login_button"}
Action: {"type": "click", "coordinate": [250, 150], "element": "forgot_password_link"}
Action: {"type": "type", "text": "user@example.com", "element": "email_field"}
Action: {"type": "click", "coordinate": [200, 180], "element": "send_reset_button"}
/stop

Configuration Guide

Basic Configuration

For most users, these environment variables are sufficient:

# Required: Neko connection
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"  
export NEKO_PASS="your-password"

# Optional: Local storage (defaults to ./data/mds)
export CAPTURE_OUT="/path/to/training/data"

# Optional: Capture quality
export CAPTURE_FPS=2              # Screenshots per second
export CAPTURE_JPEG_QUALITY=85    # Image quality (0-100)

Cloud Storage Setup

AWS S3

export CAPTURE_REMOTE="s3://my-training-bucket/ui-episodes"
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"

MinIO/Self-hosted S3

export CAPTURE_REMOTE="s3://training-data/episodes"
export S3_ENDPOINT_URL="https://minio.mycompany.com"
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"

Cloudflare R2

export CAPTURE_REMOTE="s3://my-r2-bucket/training"
export S3_ENDPOINT_URL="https://your-account.r2.cloudflarestorage.com"
export AWS_ACCESS_KEY_ID="your-r2-token"
export AWS_SECRET_ACCESS_KEY="your-r2-secret"

Advanced Configuration

Fine-tune capture behavior for specific use cases:

# Episode management
export CAPTURE_MIN_FRAMES=10       # Minimum frames to save episode
export CAPTURE_EPISODE_TIMEOUT=600 # Auto-stop after 10 minutes
export CAPTURE_KEEP_LOCAL=1        # Keep local copies when uploading

# Storage optimization
export CAPTURE_COMPRESSION="zstd:9"  # Maximum compression
export CAPTURE_SHARD_SIZE="1gb"      # Larger shards for fewer files
export CAPTURE_HASHES="sha1,md5"     # Multiple integrity checks

# Debugging
export CAPTURE_LOGLEVEL="DEBUG"

Running the Capture Tool

Command Line Usage

Basic usage with environment variables:

python src/capture.py

Override settings with command line flags:

python src/capture.py \
  --neko-url https://neko.example.com \
  --username myuser \
  --password mypass \
  --out ./custom/data/path \
  --remote s3://mybucket/training-data \
  --fps 3.0 \
  --jpeg-quality 95 \
  --episode-timeout 1200

Direct WebSocket Connection

Skip REST authentication with a direct WebSocket URL:

export NEKO_WS="wss://neko.example.com/api/ws?token=your-jwt-token"
python src/capture.py

Implementation Details

Core Classes

EpisodeBuffer (src/capture.py:161)

  • Manages temporary storage for a single episode
  • Handles frame and action storage during capture
  • Finalizes episode data into ZIP archive format

NekoSession (src/capture.py:342)

  • WebSocket client for Neko server communication
  • Handles authentication via REST API
  • Manages screenshot polling and message queuing

Capture (src/capture.py:542)

  • Main orchestrator coordinating capture workflow
  • Processes chat events for episode boundaries
  • Manages episode lifecycle and MDS writing

Episode Lifecycle

  1. Start Detection: Chat message matching /start <task> pattern
  2. Frame Capture: Screenshot polling thread captures frames at specified FPS
  3. Action Parsing: Chat messages matching Action: {...} are parsed and stored
  4. End Conditions:
    • Manual /stop command
    • Episode timeout (default 900 seconds)
    • Application shutdown
  5. Finalization: Episode packaged and written to MDS if minimum frame count met
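The start/stop/action detection in steps 1 and 3 amounts to classifying chat lines. The patterns below are a minimal sketch and may not match capture.py's exact regular expressions:

```python
import json
import re

START_RE = re.compile(r"^/start\s+(.+)$")
ACTION_RE = re.compile(r"^Action:\s*(\{.*\})\s*$")

def classify_chat_line(line: str):
    # Return a (kind, payload) tuple for the message types the capture
    # loop cares about; anything else (including malformed JSON) is ignored.
    m = START_RE.match(line)
    if m:
        return ("start", m.group(1))
    if line.strip() == "/stop":
        return ("stop", None)
    m = ACTION_RE.match(line)
    if m:
        try:
            return ("action", json.loads(m.group(1)))
        except json.JSONDecodeError:
            return ("ignored", None)
    return ("ignored", None)
```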

MDS Schema

Each MDS record contains:

  • episode_id (str): Unique episode identifier
  • task (str): Task description from /start command
  • payload (bytes): ZIP archive containing episode data
  • num_frames (int): Number of screenshot frames captured
  • num_actions (int): Number of action annotations
  • started_at (str): Episode start timestamp (ISO 8601)
  • ended_at (str): Episode end timestamp (ISO 8601)
  • screen_w (int): Screen width in pixels
  • screen_h (int): Screen height in pixels
  • agent (json): Agent metadata and configuration
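In MosaicML Streaming terms, this schema corresponds to a columns mapping passed to MDSWriter. The spec below mirrors the table; the exact dictionary used by capture.py may differ:

```python
# Column spec mirroring the schema above, in the form MDSWriter expects.
COLUMNS = {
    "episode_id": "str",
    "task": "str",
    "payload": "bytes",
    "num_frames": "int",
    "num_actions": "int",
    "started_at": "str",
    "ended_at": "str",
    "screen_w": "int",
    "screen_h": "int",
    "agent": "json",
}

# Writing a record would look roughly like this (requires mosaicml-streaming):
# from streaming import MDSWriter
# with MDSWriter(out="./data/mds", columns=COLUMNS, compression="zstd:6") as w:
#     w.write({"episode_id": "ep-001", "task": "demo", "payload": b"...", ...})
```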

Error Handling and Resilience

Network Resilience

  • Automatic WebSocket reconnection with exponential backoff
  • Screenshot polling continues through temporary network issues
  • MDS writer handles partial uploads and resumes interrupted transfers
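A reconnect policy along these lines is typically exponential backoff with jitter; the parameters below are illustrative, not capture.py's actual values:

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 30.0, attempts: int = 6):
    # Delay doubles on each failed reconnect, capped at `cap` seconds,
    # with jitter so many clients don't retry in lockstep.
    return [min(cap, base * (2 ** n)) * random.uniform(0.5, 1.0)
            for n in range(attempts)]
```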

Data Integrity

  • SHA1 hashing for shard integrity verification
  • Episode validation before MDS writing
  • Graceful handling of malformed action annotations

Resource Management

  • Temporary episode directories cleaned up after packaging
  • Queue overflow protection prevents memory exhaustion
  • Configurable episode timeouts prevent runaway captures
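Queue overflow protection usually means a bounded queue that drops rather than blocks when the consumer falls behind. A minimal sketch follows (capture.py's actual strategy may differ):

```python
import queue

def offer_frame(q: queue.Queue, item) -> bool:
    # Non-blocking enqueue: drop the frame (and report it) when the queue
    # is full, so a slow consumer can't grow memory without bound.
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False
```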

Integration with Training Pipelines

Loading Data

from streaming import StreamingDataset
import io
import json
import zipfile

# Create streaming dataset
dataset = StreamingDataset(
    remote="s3://bucket/training-data",
    local="./cache",
    shuffle=True
)

# Process episodes
for sample in dataset:
    episode_id = sample['episode_id']
    task = sample['task']
    payload = sample['payload']
    
    # Extract episode contents
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        meta = json.loads(zf.read('meta.json'))
        frames_index = [json.loads(line) for line in zf.read('frames.ndjson').decode().splitlines()]
        actions = [json.loads(line) for line in zf.read('actions.ndjson').decode().splitlines()]
        
        # Load frame images
        for frame_info in frames_index:
            frame_data = zf.read(frame_info['file'])
            # Process frame...

Random Window Sampling

The episode-as-record format enables efficient random window sampling for sequence models:

import io
import json
import random
import zipfile

def sample_windows(episode_payload, window_size=32):
    with zipfile.ZipFile(io.BytesIO(episode_payload)) as zf:
        frames_index = [json.loads(line) for line in zf.read('frames.ndjson').decode().splitlines()]
        
        if len(frames_index) < window_size:
            return None
            
        start_idx = random.randint(0, len(frames_index) - window_size)
        window = frames_index[start_idx:start_idx + window_size]
        
        # Load frames for this window
        frames = []
        for frame_info in window:
            frame_data = zf.read(frame_info['file'])
            frames.append(frame_data)
            
        return frames

Troubleshooting

Quick Diagnostics

First, check if the capture tool can connect to your Neko server:

# Test basic connectivity
curl -I https://your-neko-server.com/api/health

# Test WebSocket endpoint (if available)
python -c "
import asyncio
import websockets

async def main():
    async with websockets.connect('wss://your-neko-server.com/api/ws'):
        print('WebSocket connection successful')

asyncio.run(main())
"

Common Issues & Solutions

❌ Connection Issues

Problem: Connection refused or Unable to connect to Neko server

Solutions:

  1. Verify Neko server is running: curl https://your-neko-server.com
  2. Check firewall settings - ensure ports 80/443 and WebSocket ports are open
  3. Verify URL format: https:// (not http://) for secure connections
  4. Test from the same network as your Neko server first

Problem: SSL certificate verification failed

Solutions:

  1. For self-signed certificates, set: export PYTHONHTTPSVERIFY=0 (development only)
  2. Or add certificate to system trust store
  3. Use IP address instead of hostname if DNS issues

❌ Authentication Issues

Problem: Authentication failed or 401 Unauthorized

Solutions:

  1. Double-check username/password: echo $NEKO_USER $NEKO_PASS
  2. Verify user has WebSocket API access in Neko admin panel
  3. Try direct WebSocket token if available:
    # Get token via REST API first
    curl -X POST https://your-neko-server.com/api/login \
      -H "Content-Type: application/json" \
      -d '{"username":"user","password":"pass"}'
    
    # Use token directly
    export NEKO_WS="wss://your-neko-server.com/api/ws?token=YOUR_TOKEN"
    

❌ Episode Recording Issues

Problem: Episodes not being saved/captured

Solutions:

  1. Check command format: Commands must be exact:

    /start task description here    ✅
    /Start task description         ❌ (wrong case)
    / start task description        ❌ (extra space)
    
  2. Verify action format: Actions must be valid JSON:

    Action: {"type": "click", "coordinate": [100, 200]}    ✅
    Action: {type: "click", coordinate: [100, 200]}        ❌ (missing quotes)
    Action {"type": "click"}                               ❌ (missing colon)
    
  3. Check minimum frames: Episodes with fewer than CAPTURE_MIN_FRAMES are discarded:

    export CAPTURE_MIN_FRAMES=1  # Save all episodes
    
  4. Enable debug logging:

    export CAPTURE_LOGLEVEL=DEBUG
    python src/capture.py
    

❌ Storage Issues

Problem: Permission denied writing to local directory

Solutions:

  1. Check directory permissions: ls -la ./data/
  2. Create directory manually: mkdir -p ./data/mds
  3. Use a different path: export CAPTURE_OUT=/tmp/capture-data

Problem: S3 upload failures

Solutions:

  1. Verify credentials:

    aws sts get-caller-identity  # Test AWS credentials
    
  2. Check bucket permissions: Ensure your credentials can write to the bucket

  3. Test endpoint connectivity:

    # For MinIO/R2
    curl -I $S3_ENDPOINT_URL
    
  4. Debug with minimal config:

    unset CAPTURE_REMOTE  # Disable S3, use local only
    python src/capture.py
    

❌ Performance Issues

Problem: High memory usage or slow performance

Solutions:

  1. Reduce capture rate:

    export CAPTURE_FPS=1              # Lower frame rate
    export CAPTURE_JPEG_QUALITY=70    # Lower image quality
    
  2. Shorter episodes:

    export CAPTURE_EPISODE_TIMEOUT=300  # 5 minutes max
    
  3. Monitor resources:

    # Watch memory usage
    watch 'ps aux | grep capture.py'
    
    # Check disk space
    df -h ./data/mds/
    

Debugging Commands

View Current Configuration

python src/capture.py --help  # See all options
env | grep CAPTURE            # Show capture settings
env | grep NEKO               # Show connection settings

Test Components Individually

Test REST Authentication:

python -c "
import requests, os
r = requests.post(f'{os.getenv(\"NEKO_URL\")}/api/login',
                 json={'username': os.getenv('NEKO_USER'),
                       'password': os.getenv('NEKO_PASS')})
print(f'Status: {r.status_code}')
print(f'Response: {r.text}')
"

Test Screenshot Endpoint:

python -c "
import requests, os
r = requests.get(f'{os.getenv(\"NEKO_URL\")}/api/room/screen/shot.jpg')
print(f'Status: {r.status_code}')
print(f'Content-Type: {r.headers.get(\"content-type\")}')
with open('/tmp/test_screenshot.jpg', 'wb') as f:
    f.write(r.content)
print('Screenshot saved to /tmp/test_screenshot.jpg')
"

Log Analysis

Enable detailed logging to diagnose issues:

export CAPTURE_LOGLEVEL=DEBUG
python src/capture.py 2>&1 | tee capture.log

Key log messages to look for:

  • REST login ok - Authentication successful
  • Screen size: 1920x1080 - WebSocket connection established
  • Episode [id] started - Episode recording began
  • Episode [id] committed - Episode saved successfully
  • WS loop error - WebSocket connection issues
  • Shot.jpg non-OK - Screenshot endpoint problems

Getting Help

If you're still having issues:

  1. Check the logs with CAPTURE_LOGLEVEL=DEBUG
  2. Verify your environment with env | grep -E "(NEKO|CAPTURE|AWS)"
  3. Test components separately using the debug commands above
  4. Create a minimal test case:
    # Simplest possible configuration
    export NEKO_URL="https://your-server.com"
    export NEKO_USER="testuser"
    export NEKO_PASS="testpass"
    export CAPTURE_OUT="/tmp/test-capture"
    export CAPTURE_LOGLEVEL="DEBUG"
    
    python src/capture.py
    

Performance Tuning

For optimal performance in different scenarios:

High-Quality Training Data

export CAPTURE_FPS=3
export CAPTURE_JPEG_QUALITY=95
export CAPTURE_COMPRESSION="zstd:3"  # Faster compression
export CAPTURE_SHARD_SIZE="256mb"    # Smaller shards for faster upload

Long Sessions/Low Memory

export CAPTURE_FPS=1
export CAPTURE_JPEG_QUALITY=70
export CAPTURE_EPISODE_TIMEOUT=1800
export CAPTURE_COMPRESSION="zstd:9"  # Maximum compression

Local Development

unset CAPTURE_REMOTE              # No S3 upload
export CAPTURE_COMPRESSION="none" # Faster local writes
export CAPTURE_LOGLEVEL="INFO"

Best Practices

Episode Design

Keep episodes focused: Record single, complete tasks rather than mixing multiple workflows:

# Good: focused task
/start Complete user registration process
# ... perform registration steps ...
/stop

# Poor: mixed tasks  
/start Do registration then check email then browse products

Use descriptive task names: Help your training pipeline understand the data:

/start Handle payment form validation errors
/start Navigate product catalog with filters
/start Recover from session timeout during checkout

Include error scenarios: Capture both success and failure paths:

/start Login with invalid credentials and recover
/start Handle network interruption during file upload
/start Deal with form validation errors

Action Annotation Guidelines

Be consistent with action types: Use standardized action schemas:

{"type": "click", "coordinate": [x, y], "element": "button_id"}
{"type": "type", "text": "input text", "element": "field_id"}  
{"type": "scroll", "direction": "down", "amount": 300}
{"type": "navigate", "url": "https://example.com"}
{"type": "wait", "duration": 2.0, "reason": "page_load"}

Include context in actions: Add semantic information when possible:

{"type": "click", "coordinate": [200, 100], "element": "login_button", "intent": "submit_form"}
{"type": "type", "text": "user@example.com", "element": "email_field", "intent": "enter_credentials"}

Storage Management

Choose appropriate compression: Balance speed vs storage:

  • Development: CAPTURE_COMPRESSION="none" (fastest)
  • Production: CAPTURE_COMPRESSION="zstd:6" (balanced)
  • Archive: CAPTURE_COMPRESSION="zstd:9" (smallest)

Optimize shard sizes for your infrastructure:

  • Fast networks: CAPTURE_SHARD_SIZE="1gb" (fewer files)
  • Slow networks: CAPTURE_SHARD_SIZE="128mb" (faster uploads)
  • Mobile/edge: CAPTURE_SHARD_SIZE="64mb" (reliable transfers)

Use appropriate retention policies:

# Keep local copies for immediate access
export CAPTURE_KEEP_LOCAL=1

# Or upload and clean for space efficiency  
export CAPTURE_KEEP_LOCAL=0
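As a sketch of what this toggle implies (an illustrative helper, not the service's actual cleanup code), a post-upload retention step might look like:

```python
# Illustrative retention step: after a successful upload, remove local
# shards unless CAPTURE_KEEP_LOCAL=1. Not the actual capture.py logic.
import os
import shutil

def apply_retention(local_dir: str) -> bool:
    """Return True if local shards were kept, False if they were removed."""
    keep = os.environ.get("CAPTURE_KEEP_LOCAL", "1") == "1"
    if not keep:
        shutil.rmtree(local_dir, ignore_errors=True)
    return keep
```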

Advanced Usage

Continuous Capture Workflows

For long-running capture sessions, use systemd or Docker for reliability:


Systemd service (/etc/systemd/system/neko-capture.service):

[Unit]
Description=Neko Training Data Capture
After=network.target

[Service]
Type=simple
User=capture
WorkingDirectory=/opt/neko-capture
Environment=NEKO_URL=https://neko.example.com
Environment=NEKO_USER=capture-bot
EnvironmentFile=/etc/neko-capture/config
ExecStart=/usr/bin/python3 src/capture.py
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target

Docker deployment:

FROM python:3.11-slim
RUN pip install mosaicml-streaming requests websockets
COPY src/capture.py /app/
ENV NEKO_URL=https://neko.example.com
CMD ["python", "/app/capture.py"]

Multi-instance Scaling

Run multiple capture instances for parallel data collection:

# Instance 1: UI automation tasks
export CAPTURE_OUT="./data/ui-automation"
export CAPTURE_REMOTE="s3://training/ui-automation"
python src/capture.py &

# Instance 2: Navigation tasks  
export CAPTURE_OUT="./data/navigation"
export CAPTURE_REMOTE="s3://training/navigation"  
python src/capture.py &

# Instance 3: Error handling
export CAPTURE_OUT="./data/error-recovery"
export CAPTURE_REMOTE="s3://training/error-recovery"
python src/capture.py &

Integration with Training Pipelines

Streaming data loader example:

from streaming import StreamingDataset
import torch
from torch.utils.data import DataLoader

class UIDataset(StreamingDataset):
    def __getitem__(self, idx):
        sample = super().__getitem__(idx)
        
        # Extract and decode the episode ZIP; process_episode is a
        # user-defined helper that unpacks frames/actions from the payload
        payload = sample['payload']
        episode = self.process_episode(payload)
        
        return {
            'frames': episode['frames'],
            'actions': episode['actions'], 
            'task': sample['task']
        }

# Use with PyTorch
dataset = UIDataset(remote="s3://training/episodes")
loader = DataLoader(dataset, batch_size=32, num_workers=4)

Quality Assurance

Validate episodes before training:

def validate_episode(episode_data):
    """Check episode quality before using for training"""
    frames = episode_data['frames']
    actions = episode_data['actions']
    
    # Check minimum length
    if len(frames) < 10:
        return False, "Too few frames"
        
    # Check action/frame ratio
    if len(actions) / len(frames) < 0.1:
        return False, "Too few actions relative to frames"
        
    # Check for valid click coordinates
    for action in actions:
        if action.get('type') == 'click':
            coord = action.get('coordinate', [])
            if len(coord) != 2 or not (0 <= coord[0] <= 1920 and 0 <= coord[1] <= 1080):
                return False, "Invalid click coordinates"
                
    return True, "Valid episode"

Monitor capture quality:

# Check recent episodes
python -c "
from streaming import StreamingDataset
ds = StreamingDataset(local='./data/mds')
for i, sample in enumerate(ds):
    if i >= 10: break
    print(f'Episode {sample[\"episode_id\"]}: {sample[\"num_frames\"]} frames, {sample[\"num_actions\"]} actions')
"

Finetuning on Captured Episodes

This guide explains how to finetune the agent’s ShowUI/Qwen2VL model using the episodes recorded by src/capture.py (MosaicML Streaming, aka MDS).

All commands assume you are inside the Nix development shell (nix develop).

What It Trains

The training script turns each annotated action in an episode into a supervised example:

  • Prompt: the same chat template used by the agent (system + task + image and optional short action history).
  • Target: the JSON action string (e.g., {"action":"CLICK", ...}) observed in chat as Action: {...} during capture.

This directly teaches the model to emit the agent’s expected action schema.
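As a sketch of this construction (the field names and template text below are illustrative, not the exact src/train.py implementation), one annotated action becomes a (prompt, target) pair like this:

```python
# Illustrative prompt/target construction; the real template lives in
# src/train.py and mirrors the agent's chat template.
import json

def build_example(task, action, history=()):
    """Turn one annotated action into a supervised (prompt, target) pair."""
    prompt = "\n".join([
        "You are a GUI automation agent.",              # system
        f"Task: {task}",                                 # task
        *[f"Action: {h}" for h in list(history)[-2:]],   # optional short history
        "<image>",                                       # frame placeholder
    ])
    target = json.dumps(action, separators=(",", ":"))   # e.g. {"action":"CLICK",...}
    return prompt, target
```

Training on pairs like these is what teaches the model to emit the compact JSON action strings the agent parses at inference time.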

Environment Variables

Set these in .env (already scaffolded in .env.example):

# Data sources (MDS)
TRAIN_LOCAL="./data/mds"              # defaults to CAPTURE_OUT
# TRAIN_REMOTE="s3://bucket/prefix"    # optional remote MDS
TRAIN_CACHE="./data/cache"            # local cache for StreamingDataset

# Output
TRAIN_OUTPUT="./checkpoints"          # where to save model + processor

# Logging
TRAIN_LOGLEVEL=INFO                    # DEBUG|INFO|WARNING|ERROR
TRAIN_LOG_FORMAT=text                  # text|json

# Optimization
TRAIN_EPOCHS=1
TRAIN_BATCH=1
TRAIN_ACCUM=1
TRAIN_LR=5e-6
TRAIN_WD=0.0
TRAIN_MAX_STEPS=0                      # 0 disables cap
TRAIN_MAX_SAMPLES_PER_EPOCH=0          # 0 disables per-epoch cap
TRAIN_HISTORY_STEPS=0                  # include N prior actions in prompt
SEED=1337

The model ID and image sizes inherit from the agent defaults:

REPO_ID="showlab/ShowUI-2B"
SIZE_SHORTEST_EDGE=224
SIZE_LONGEST_EDGE=1344

Running Training

# Basic run (uses .env)
python src/train.py

# With overrides
TRAIN_BATCH=2 TRAIN_EPOCHS=1 python src/train.py --output ./ckpt/showui-ft

# Using just
just train

Checkpoints are written under TRAIN_OUTPUT (epoch-* and final/). Use them with the agent by pointing REPO_ID to the checkpoint folder.

Notes

  • The script streams episodes with streaming.StreamingDataset (MDS). It can read from a local directory and/or a remote (S3/R2/MinIO) when configured.
  • Mixed precision is enabled on CUDA automatically. On Apple MPS it uses the default precision.
  • This is a minimal reference trainer for reproducible fine-tuning. For large jobs, consider adding gradient checkpointing, DeepSpeed, or PEFT adapters in your environment.

Voice Synthesis (YAP)

YAP (Yet Another Presenter) provides real-time text-to-speech capabilities for your Neko automation sessions. Convert text messages into natural-sounding speech that plays directly in the browser.

What is YAP?

YAP transforms text into speech using advanced AI voice synthesis (F5-TTS) and streams the audio through WebRTC to your Neko browser session. This enables:

  • Voice announcements during automation tasks
  • Interactive conversations through chat commands
  • Live narration of automation steps
  • Multiple voice personas with custom characteristics

Quick Start

1. Prerequisites

Before using YAP, ensure you have:

  • A running Neko server (see Getting Started)
  • GPU environment for optimal performance: nix develop .#gpu
  • Basic voice files in the ./voices directory

2. Start YAP Service

# Connect to local Neko server
export NEKO_URL="http://localhost:8080"
export NEKO_USER="user" 
export NEKO_PASS="password"

uv run src/yap.py

You should see:

[12:34:56] yap INFO - WS connected
[12:34:56] yap INFO - RTC answer sent; audio track live.
[12:34:56] yap INFO - Voices reloaded (1 entries)

3. Test Voice Output

In your Neko browser chat, type:

/yap Hello! I am your voice assistant.

You should hear the text spoken through the browser audio.

Voice Commands

YAP responds to commands in the Neko chat interface:

Immediate Speech

Speak text immediately:

/yap Good morning! Ready to start automation.
/yap The task has been completed successfully.

Streaming Mode

For longer conversations or live narration:

/yap:begin
I'm starting the automation task now...
Navigating to the website...
Filling out the form...
Submitting the data...
/yap:end

In streaming mode, YAP processes text incrementally as you type, enabling natural conversation flow.

Stop/Clear Queue

Cancel current speech and clear the queue:

/yap:stop

Voice Management

Default Voice Setup

YAP needs at least one voice configured. Create a basic setup:

  1. Create voices directory:

    mkdir -p voices
    
  2. Add a reference audio file:

    # Record or copy a 3-10 second WAV file
    cp your-voice-sample.wav voices/default.wav
    
  3. YAP will auto-create voices.json on first run with default settings.

Adding New Voices

Add voices through chat commands:

/yap:voice add --spk alice --ref ./voices/alice.wav --ref-text "Hello, my name is Alice" --styles "friendly,calm"

Parameters:

  • --spk: Voice ID/name
  • --ref: Path to reference audio file (WAV format, 3-10 seconds)
  • --ref-text: Transcript of the reference audio
  • --styles: Comma-separated style tags
  • --rate: Speech speed (0.5-2.0, default 1.0)
  • --pitch: Pitch shift in semitones (-12 to +12, default 0.0)

Switching Voices

Change the active voice and parameters:

/yap:voice set --spk alice
/yap:voice set --spk bob --rate 1.2 --pitch -0.5
/yap:voice set --rate 0.8

Reload Voice Configuration

After manually editing voices/voices.json:

/yap:voice reload

Configuration

Basic Settings

Set these environment variables before starting YAP:

# Connection (choose one method)
export YAP_WS="wss://demo.neko.com/api/ws?token=your_token"
# OR
export NEKO_URL="https://demo.neko.com"
export NEKO_USER="username"
export NEKO_PASS="password"

# Voice directory
export YAP_VOICES_DIR="./voices"
export YAP_SPK_DEFAULT="default"

Audio Quality

# Audio format (recommended settings)
export YAP_SR=48000              # Sample rate (Hz)
export YAP_AUDIO_CHANNELS=1      # Channels (1=mono, 2=stereo)
export YAP_FRAME_MS=20           # WebRTC frame size

# Processing
export YAP_PARALLEL=2            # TTS worker threads
export YAP_MAX_CHARS=350         # Max characters per chunk
export YAP_OVERLAP_MS=30         # Audio crossfade overlap

Performance Tuning

# For faster response (lower quality)
export YAP_MAX_CHARS=200
export YAP_PARALLEL=4

# For better quality (higher latency)  
export YAP_MAX_CHARS=500
export YAP_OVERLAP_MS=50

# Buffer management
export YAP_JITTER_MAX_SEC=6.0    # Audio buffer size

Voice Configuration File

YAP stores voice settings in voices/voices.json:

{
  "default": {
    "ref_audio": "./voices/default.wav",
    "ref_text": "This is my default voice sample.",
    "styles": ["neutral"],
    "params": {
      "rate": 1.0,
      "pitch": 0.0
    }
  },
  "alice": {
    "ref_audio": "./voices/alice.wav", 
    "ref_text": "Hello, my name is Alice and I sound friendly.",
    "styles": ["friendly", "energetic"],
    "params": {
      "rate": 1.1,
      "pitch": 0.2
    }
  },
  "narrator": {
    "ref_audio": "./voices/narrator.wav",
    "ref_text": "I will be narrating the automation process.",
    "styles": ["professional", "clear"],
    "params": {
      "rate": 0.9,
      "pitch": -0.3
    }
  }
}

Voice Parameters

  • ref_audio: Path to reference WAV file (3-10 seconds, clear speech)
  • ref_text: Exact transcript of the reference audio
  • styles: Descriptive tags (friendly, professional, calm, energetic)
  • rate: Speech speed multiplier (0.5=slow, 1.0=normal, 2.0=fast)
  • pitch: Pitch adjustment in semitones (negative=lower, positive=higher)
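Before issuing /yap:voice reload after hand-editing the registry, a small sanity check like the following (a hypothetical helper, mirroring the fields documented above) can catch mistakes:

```python
# Hypothetical sanity check for voices/voices.json; field names follow
# the registry layout documented above, not the exact yap.py loader.
import json
from pathlib import Path

def load_voices(path="voices/voices.json"):
    """Load and validate the voice registry, raising on malformed entries."""
    voices = json.loads(Path(path).read_text())
    for name, cfg in voices.items():
        for key in ("ref_audio", "ref_text"):
            if key not in cfg:
                raise ValueError(f"{name}: missing {key}")
        params = cfg.get("params", {})
        if not 0.5 <= params.get("rate", 1.0) <= 2.0:
            raise ValueError(f"{name}: rate out of range")
        if not -12 <= params.get("pitch", 0.0) <= 12:
            raise ValueError(f"{name}: pitch out of range")
    return voices
```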

Usage Scenarios

Automation Announcements

Start YAP alongside your automation agent:

# Terminal 1: Start YAP
uv run src/yap.py

# Terminal 2: Run automation with voice (audio is enabled by default)
uv run src/agent.py --task "Fill out contact form"

The agent can announce progress:

/yap Starting automation task: Fill out contact form
/yap Navigating to the website...
/yap Form submitted successfully!

Interactive Sessions

Use YAP for live interaction during manual control:

# Terminal 1: Start YAP
uv run src/yap.py

# Terminal 2: Manual control
python src/manual.py

Then control both automation and voice through chat:

!click 100 200
/yap Clicked on the submit button
!type "Hello World"
/yap Entered text in the field

Multi-Voice Conversations

Set up different voices for different purposes:

# Set up voices
/yap:voice add --spk system --ref ./voices/system.wav --ref-text "System notification" --rate 0.9
/yap:voice add --spk user --ref ./voices/user.wav --ref-text "User interaction" --rate 1.1

# Use in conversation
/yap:voice set --spk system
/yap System initialization complete.

/yap:voice set --spk user  
/yap Thank you for the update!

Troubleshooting

No Audio Output

Check browser audio permissions:

  1. Click the browser's address bar lock icon
  2. Ensure "Sound" is allowed
  3. Check browser volume settings

Verify WebRTC connection:

  • Open browser developer tools (F12)
  • Go to Console tab
  • Look for WebRTC connection messages
  • Check for audio stream indicators

Test connection:

uv run src/yap.py --healthcheck

Poor Audio Quality

Check reference audio:

  • Use high-quality WAV files (16-bit, 22kHz+)
  • 3-10 second samples with clear speech
  • No background noise or music
  • Single speaker only

Adjust processing:

# Increase overlap for smoother transitions
export YAP_OVERLAP_MS=50

# Reduce chunk size for lower latency
export YAP_MAX_CHARS=250

High Latency

Optimize for speed:

# Increase parallel workers
export YAP_PARALLEL=4

# Reduce chunk size
export YAP_MAX_CHARS=200

# Reduce buffer size
export YAP_JITTER_MAX_SEC=3.0

Check GPU usage:

# Monitor GPU usage while YAP is running
nvidia-smi -l 1

Connection Issues

WebSocket connection failed:

# Test WebSocket endpoint
curl -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" \
  http://localhost:8080/api/ws

Authentication failed:

# Test REST login
curl -X POST http://localhost:8080/api/login \
  -H "Content-Type: application/json" \
  -d '{"username":"user","password":"password"}'

Check firewall/network:

  • Ensure port 8080 (HTTP) and the WebRTC media ports are accessible
  • Test with STUN server connectivity
  • Check for corporate proxy/firewall blocking WebRTC

Debug Mode

Enable detailed logging:

export YAP_LOGLEVEL=DEBUG
export YAP_LOG_FORMAT=json
uv run src/yap.py 2>&1 | jq .

Look for:

  • WebSocket connection status
  • WebRTC negotiation progress
  • TTS processing times
  • Audio buffer status

Advanced Usage

Custom Voice Training

For better voice quality, record multiple reference samples:

  1. Record varied samples:

    # Different emotions/styles
    voices/alice-happy.wav
    voices/alice-serious.wav
    voices/alice-excited.wav
    
  2. Test different samples:

    /yap:voice set --spk alice-happy
    /yap I'm so excited about this automation!
    
    /yap:voice set --spk alice-serious  
    /yap Please review these results carefully.
    

Integration with Automation

Modify automation scripts to include voice feedback:

# In your automation script
import os
import requests

neko_url = os.environ["NEKO_URL"]  # e.g. https://neko.example.com

def announce(text):
    """Send a voice announcement to YAP via Neko chat (endpoint illustrative)"""
    requests.post(f"{neko_url}/api/chat", json={
        "message": f"/yap {text}"
    })

# Use in automation
announce("Starting login process")
agent.click_login_button()
announce("Login successful, proceeding to dashboard")

Docker Deployment

Deploy YAP as a container service:

FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y ffmpeg

# Install Python dependencies  
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy application and voices
COPY src/ /app/src/
COPY voices/ /app/voices/
WORKDIR /app

# Configuration
ENV YAP_VOICES_DIR=/app/voices
ENV YAP_SR=48000
ENV YAP_PARALLEL=2

ENTRYPOINT ["python", "src/yap.py"]


Architecture Overview

This document provides a comprehensive overview of the Neko Agent system architecture, covering the core components, data flow, and integration patterns.

System Overview

graph TB
    subgraph "Development Environment"
        Agent["`**Neko Agent**
        (src/agent.py)
        • WebRTC Client
        • AI Vision Model
        • Action Execution`"]
        
        Capture["`**Capture Service**
        (src/capture.py)
        • Training Data Collection
        • MosaicML Streaming
        • S3 Upload`"]
        
        TTS["`**TTS Service**
        (src/yap.py)
        • F5-TTS Synthesis
        • Voice Management
        • WebRTC Audio`"]
    end
    
    subgraph "Neko Server"
        Chrome["`**Chrome Container**
        • GUI Target
        • WebRTC Server
        • Screen Sharing`"]
        
        NekoAPI["`**Neko API**
        • Authentication
        • WebSocket Signaling
        • Session Management`"]
    end
    
    subgraph "External Services"
        S3["`**S3 Storage**
        • Training Data
        • Model Artifacts
        • Voice Assets`"]
        
        Models["`**AI Models**
        • ShowUI-2B (Vision)
        • F5-TTS (Audio)
        • Custom Voices`"]
    end
    
    Agent <-->|WebRTC Data| Chrome
    Agent <-->|Control API| NekoAPI
    Agent <-->|Screenshots| Capture
    Agent <-->|Voice Commands| TTS
    
    Capture -->|MDS Shards| S3
    TTS <-->|Audio Stream| Chrome
    TTS <-->|Voice Models| Models
    
    Agent <-->|Vision Inference| Models

Core Components

1. Neko Agent (src/agent.py)

The main automation agent that orchestrates GUI interactions through AI-powered decision making.

Key Features:

  • WebRTC Integration: Real-time connection to Neko Chrome containers
  • AI Vision: ShowUI-2B (Qwen2VL) model for visual reasoning and action planning
  • Action Execution: Supports CLICK, INPUT, SCROLL, SWIPE, TAP, ANSWER actions
  • Prometheus Metrics: Built-in observability and performance monitoring

Architecture:

class NekoAgent:
    def __init__(self):
        self.model = ShowUI2B()
        self.webrtc = WebRTCClient()
        self.metrics = PrometheusMetrics()
    
    async def run(self, task: str):
        # 1. Connect to Neko via WebRTC
        await self.webrtc.connect()
        
        # 2. Capture current screen
        frame = await self.webrtc.get_frame()
        
        # 3. AI model analyzes screen and plans action
        action = await self.model.analyze(frame, task)
        
        # 4. Execute action through WebRTC
        await self.webrtc.execute_action(action)
        
        # 5. Repeat until task complete

2. Capture Service (src/capture.py)

Automated training data collection system that records agent sessions for model improvement.

Key Features:

  • Episode Recording: Packages complete automation sessions as training samples
  • MosaicML Streaming: Generates MDS shards for efficient distributed training
  • S3 Integration: Direct upload to cloud storage with configurable backends
  • WebSocket Monitoring: Real-time capture of agent actions and screen states

Data Flow:

sequenceDiagram
    participant User
    participant Agent
    participant Capture
    participant Neko
    participant S3
    
    User->>Agent: Start automation task
    Agent->>Capture: Begin episode recording
    
    loop Every Action
        Agent->>Neko: Execute action
        Neko->>Agent: Screen update
        Agent->>Capture: Log action + timestamp
        Capture->>Capture: Store frame + metadata
    end
    
    Agent->>Capture: End episode
    Capture->>Capture: Generate ZIP archive
    Capture->>S3: Upload MDS shard

3. TTS Service (src/yap.py)

Real-time text-to-speech system providing voice feedback during automation.

Key Features:

  • F5-TTS Integration: High-quality voice synthesis with custom voice models
  • WebRTC Audio: Direct audio streaming to Neko sessions
  • Voice Management: Hot-reloadable voice registry with style parameters
  • Streaming Support: Both immediate and incremental text processing

Audio Pipeline:

graph LR
    Text["`Text Input
    /yap commands`"] --> Segment["`Segmentation
    Punctuation-aware
    chunks`"]
    
    Segment --> Workers["`Parallel Workers
    F5-TTS synthesis
    Rate/pitch control`"]
    
    Workers --> Splicer["`Audio Splicer
    Crossfade blending
    Jitter buffering`"]
    
    Splicer --> WebRTC["`WebRTC Output
    48kHz audio stream
    Real-time delivery`"]
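The segmentation stage can be sketched as a punctuation-aware chunker. This is illustrative only: max_chars mirrors the YAP_MAX_CHARS setting, and the real implementation lives in src/yap.py:

```python
# Illustrative punctuation-aware segmentation; not the actual yap.py code.
import re

def segment(text, max_chars=350):
    """Split text at sentence boundaries into chunks of at most max_chars."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)   # flush the full chunk
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries rather than fixed offsets keeps each TTS request prosodically coherent, which is why smaller YAP_MAX_CHARS values trade audio naturalness for latency.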

Data Flow and Integration

WebRTC Communication

The system uses WebRTC for low-latency communication with Neko containers:

# WebRTC connection establishment
async def connect_to_neko():
    # 1. REST API authentication
    token = await authenticate(neko_url, username, password)
    
    # 2. WebSocket signaling
    ws_url = f"wss://{host}/api/ws?token={token}"
    signaling = await websockets.connect(ws_url)
    
    # 3. WebRTC peer connection
    pc = RTCPeerConnection()
    
    # 4. Add video track (for screen capture)
    @pc.on("track")
    async def on_track(track):
        if track.kind == "video":
            frame = await track.recv()
            # Process frame for AI analysis
    
    # 5. Exchange SDP offer/answer
    await exchange_sdp(pc, signaling)

Training Data Format

The capture service generates training samples in this format:

episode_xyz789.zip
├── meta.json              # Episode metadata
├── frames/                # Screenshot sequence
│   ├── 000000.jpg         #   Frame 0 (t=0.0s)
│   ├── 000001.jpg         #   Frame 1 (t=0.5s)
│   └── ...
├── frames.ndjson          # Frame index with timestamps
└── actions.ndjson         # Action sequence with timestamps
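A minimal packer matching this layout might look like the following. The NDJSON field names here are assumptions for illustration; see src/capture.py for the real writer:

```python
# Illustrative episode packer for the ZIP layout above; NDJSON field
# names are assumptions, not the exact capture.py schema.
import io
import json
import zipfile

def pack_episode(meta, frames, actions, fps=2.0):
    """Pack one episode (metadata, JPEG frames, actions) into a ZIP payload."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("meta.json", json.dumps(meta))
        for i, jpg in enumerate(frames):
            z.writestr(f"frames/{i:06d}.jpg", jpg)
        z.writestr("frames.ndjson", "\n".join(
            json.dumps({"idx": i, "t": i / fps}) for i in range(len(frames))))
        z.writestr("actions.ndjson", "\n".join(json.dumps(a) for a in actions))
    return buf.getvalue()
```

The resulting bytes are what lands in the payload column of the MDS shard described next.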

MDS Shard Structure:

# MosaicML Streaming format
columns = {
    "episode_id": "str",
    "task": "str", 
    "payload": "bytes",      # ZIP archive
    "num_frames": "int",
    "num_actions": "int",
    "started_at": "str",
    "ended_at": "str",
    "screen_w": "int",
    "screen_h": "int"
}

Voice Model Management

The TTS service manages voices through a JSON registry:

{
  "instructor": {
    "ref_audio": "/voices/instructor.wav",
    "ref_text": "This is an instructional voice for tutorials",
    "styles": ["calm", "explanatory"],
    "params": {
      "rate": 0.95,
      "pitch": -1.0
    }
  }
}

Deployment Patterns

Development Mode

# Terminal 1: Neko server
nix develop .#neko
neko-services up

# Terminal 2: Agent
nix develop .#gpu  
uv run src/agent.py --task "automation task"

# Terminal 3: Capture (optional)
uv run src/capture.py

# Terminal 4: TTS (optional)
uv run src/yap.py

Production Considerations

For production deployments, consider:

  1. Container Orchestration: Deploy components as separate containers
  2. GPU Resource Management: Efficient sharing between agent and TTS
  3. Storage Strategy: S3-compatible backends for training data
  4. Monitoring: Prometheus metrics and health checks
  5. Scaling: Multiple agent instances with load balancing

See the API Mode Design for detailed production architecture plans.

Error Handling and Resilience

Connection Resilience

# Exponential backoff for WebRTC reconnection
async def connect_with_retry():
    backoff = 1.0
    max_retries = 5
    
    for attempt in range(max_retries):
        try:
            return await establish_webrtc_connection()
        except ConnectionError:
            await asyncio.sleep(backoff)
            backoff = min(30.0, backoff * 2)
    
    raise MaxRetriesExceeded()

Graceful Degradation

  • Capture failures: Agent continues without training data collection
  • TTS unavailable: Silent operation mode
  • Model errors: Fallback to basic automation patterns
  • Network issues: Local caching and offline capabilities

Performance Characteristics

Latency Profile

  • Frame processing: ~100-200ms (GPU inference)
  • Action execution: ~50-100ms (WebRTC round-trip)
  • Voice synthesis: ~200-500ms (F5-TTS generation)
  • Training data write: ~10-50ms (local storage)

Resource Usage

  • GPU Memory: 2-4GB (ShowUI-2B + F5-TTS)
  • System Memory: 1-2GB (Python runtime + buffers)
  • Network: 5-10 Mbps (WebRTC video + audio)
  • Storage: Variable (training data accumulation)

Security Considerations

Authentication

  • JWT tokens for Neko API access
  • WebSocket secure connections (WSS)
  • S3 IAM roles for training data upload

Data Privacy

  • Local frame processing (no external API calls)
  • Configurable training data retention
  • Optional data encryption at rest

Network Security

  • TLS/SSL for all external connections
  • WebRTC DTLS encryption
  • Firewall-friendly STUN/TURN protocols

Future Architecture Evolution

The current architecture is designed to support future enhancements:

  1. API Mode: RESTful service with horizontal scaling
  2. Multi-Modal: Vision + audio + text understanding
  3. Distributed Training: Cloud-native ML pipelines
  4. Edge Deployment: Lightweight inference containers

See API Mode Design for the complete next-generation architecture.

Component Documentation

This section provides detailed documentation for each component of the Neko Agent system.

Overview

The Neko Agent system consists of four main components:

  1. Core Agent (src/agent.py) - Main automation engine
  2. Capture Service (src/capture.py) - Training data collection
  3. Manual Control CLI (src/manual.py) - Interactive remote control interface
  4. TTS Service (src/yap.py) - Voice synthesis and audio

Each component is designed to work independently while providing seamless integration when used together.

Component Interaction

graph TD
    Agent[Core Agent] --> Neko[Neko Server]
    Agent -.->|Optional| Capture[Capture Service]
    Agent -.->|Optional| TTS[TTS Service]
    
    Manual[Manual Control CLI] --> Neko
    Manual -.->|Admin API| Sessions[Session Management]
    
    Capture --> Storage[Training Data Storage]
    TTS --> Audio[WebRTC Audio Stream]
    
    Neko --> Chrome[Chrome Container]
    Audio --> Chrome

Development Workflow

When developing with multiple components:

  1. Start Neko Server (if using local setup)
  2. Launch Core Agent for basic automation
  3. Add Capture Service for training data collection
  4. Use Manual Control CLI for testing and debugging
  5. Add TTS Service for voice feedback

Each component has its own configuration and can be enabled/disabled as needed.


Core Agent (src/agent.py)

The core agent (src/agent.py) is the main automation engine of the Neko Agent system. It provides AI-powered GUI automation through WebRTC connections to Neko servers, using the ShowUI-2B vision model for intelligent action planning and execution.

Overview

The agent operates as a production-ready, 12-Factor App compliant service that:

  • Connects to Neko servers via WebRTC for real-time GUI interaction
  • Uses ShowUI-2B (Qwen2VL) for visual reasoning and action planning
  • Executes precise actions with iterative refinement for improved accuracy
  • Supports multiple modes including web automation and mobile device control
  • Provides comprehensive metrics via Prometheus for monitoring and observability
  • Handles graceful shutdown with proper resource cleanup

Architecture

graph TB
    subgraph "Agent Core"
        Agent[NekoAgent]
        Settings[Settings]
        Signaler[Signaler]
    end

    subgraph "Frame Sources"
        WebRTC[WebRTCFrameSource]
        Lite[LiteFrameSource]
    end

    subgraph "AI Processing"
        Model[ShowUI-2B Model]
        Processor[AutoProcessor]
        Executor[ThreadPoolExecutor]
    end

    subgraph "External Services"
        Neko[Neko Server]
        Metrics[Prometheus Metrics]
        Storage[Frame/Click Storage]
    end

    Agent --> Signaler
    Agent --> WebRTC
    Agent --> Lite
    Agent --> Model
    Agent --> Processor

    Signaler <-->|WebSocket| Neko
    WebRTC <-->|WebRTC| Neko
    Lite <-->|HTTP Polling| Neko

    Agent --> Metrics
    Agent --> Storage

    Model --> Executor

Core Classes

1. Settings

Purpose: Centralized configuration management following 12-Factor App principles.

Key Features:

  • Environment variable-driven configuration
  • Validation with error reporting
  • Support for both development and production deployments
  • Port priority handling ($PORT overrides NEKO_METRICS_PORT)

Configuration Options:

| Category | Environment Variable | Default | Description |
|----------|----------------------|---------|-------------|
| Model | REPO_ID | showlab/ShowUI-2B | Hugging Face model repository |
| | SIZE_SHORTEST_EDGE | 224 | Image resize shortest edge |
| | SIZE_LONGEST_EDGE | 1344 | Image resize longest edge |
| Network | NEKO_WS | wss://neko.example.com/api/ws | WebSocket URL |
| | NEKO_ICE_POLICY | strict | ICE candidate policy |
| | NEKO_STUN_URL | stun:stun.l.google.com:19302 | STUN server |
| | NEKO_TURN_URL | - | TURN server (optional) |
| Behavior | NEKO_MAX_STEPS | 8 | Maximum automation steps |
| | REFINEMENT_STEPS | 5 | Click refinement iterations |
| | NEKO_AUDIO | 1 | Enable audio streaming |
| Logging | NEKO_LOGLEVEL | INFO | Log level |
| | NEKO_LOG_FORMAT | text | Log format (text/json) |
| | NEKO_LOGFILE | - | Log file path (optional) |
| Storage | FRAME_SAVE_PATH | - | Frame screenshot storage |
| | CLICK_SAVE_PATH | - | Action visualization storage |
| | OFFLOAD_FOLDER | ./offload | Model cache directory |
| Metrics | PORT / NEKO_METRICS_PORT | 9000 | Prometheus metrics port |

Example Usage:

# Load configuration from environment
settings = Settings.from_env()

# Validate configuration
errors = settings.validate()
if errors:
    for error in errors:
        print(f"Configuration error: {error}")
    sys.exit(1)

2. NekoAgent

Purpose: Main orchestration class that manages WebRTC connections, AI processing, and action execution.

Initialization:

agent = NekoAgent(
    model=model,                    # ShowUI-2B model instance
    processor=processor,            # AutoProcessor for model inputs
    ws_url="wss://neko.example.com/api/ws",
    nav_task="Navigate to google.com and search for AI",
    nav_mode="web",                 # "web" or "phone"
    settings=settings,
    logger=logger,
    max_steps=10,
    refinement_steps=3,
    audio=True,
    online=False                    # False=single task, True=multi-task chat mode
)

Key Methods:

async def run()

Main execution loop that handles connection lifecycle, reconnection logic, and task processing.

Flow:

  1. Signal handling - Register SIGINT/SIGTERM handlers
  2. Connection establishment - WebSocket connection with backoff
  3. Media negotiation - WebRTC SDP offer/answer exchange
  4. Frame processing - Continuous screen capture and AI analysis
  5. Action execution - Translate AI decisions to control commands
  6. Graceful shutdown - Clean resource cleanup
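
Step 1 can be sketched with asyncio's signal-handler API; the helper name and the shutdown event are illustrative, not the agent's actual internals:

```python
import asyncio
import signal

def install_signal_handlers(loop: asyncio.AbstractEventLoop,
                            shutdown: asyncio.Event) -> None:
    """Register SIGINT/SIGTERM so the run loop can finish its current step
    and tear down cleanly instead of dying mid-action (Unix only)."""
    def _request_shutdown(signame: str) -> None:
        print(f"received {signame}, shutting down")
        shutdown.set()

    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, _request_shutdown, sig.name)
```

The main loop then checks the shutdown event between steps, which is what makes step 6's clean teardown possible.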

async def _navigate_once(img, history, step)

Performs single navigation step using AI model inference.

Parameters:

  • img: Current screen image (PIL Image)
  • history: List of previous actions for context
  • step: Current step number for logging

Process:

  1. Prepare inputs - Format image and chat template for model
  2. Model inference - Generate action prediction using ShowUI-2B
  3. Action parsing - Extract structured action from model output
  4. Iterative refinement - Crop and re-infer for precise click coordinates
  5. Action execution - Send control commands via WebSocket

Refinement Logic:

for i in range(self.refinement_steps):
    # Initial inference on full image or cropped region
    action = await model_inference(current_img)

    if action.type == "CLICK":
        # Crop around predicted click location
        current_img, crop_box = self._crop_image(original_img, action.position)
        # Re-infer on cropped image for better precision
    else:
        break  # No refinement for non-click actions

async def _execute_action(action, size)

Translates high-level actions into low-level control commands.

Supported Actions:

| Action | Description | Parameters |
|--------|-------------|------------|
| CLICK | Mouse click at coordinates | `position: [x, y]` (normalized) |
| INPUT | Type text string | `value: str` |
| SCROLL | Scroll in direction | `direction: "up"/"down"/"left"/"right"` |
| SWIPE | Touch swipe gesture | `startPosition: [x, y]`, `endPosition: [x, y]` |
| TAP | Mobile tap action | `position: [x, y]` |
| ANSWER | Task completion signal | `value: "final answer"` |

Example Action Execution:

# CLICK action
action = {
    "action": "CLICK",
    "position": [0.5, 0.3],  # Normalized coordinates (50%, 30%)
    "value": None
}

# Converts to pixel coordinates and sends control command
x, y = int(0.5 * screen_width), int(0.3 * screen_height)
await signaler.send({
    "event": "control/buttonpress",
    "payload": {"x": x, "y": y, "code": 1}  # Left click
})

3. Frame Sources

The agent supports multiple frame capture mechanisms:

WebRTCFrameSource

Purpose: Real-time frame capture via WebRTC video streams.

Features:

  • Low latency - Direct WebRTC video track access
  • High quality - Full resolution screen capture
  • Automatic buffering - Handles frame timing and delivery
  • Error resilience - Graceful handling of stream interruptions

LiteFrameSource

Purpose: HTTP-based frame capture for environments where WebRTC is unavailable.

Features:

  • Simple protocol - HTTP GET requests for screenshots
  • Fallback mode - Works when WebRTC fails or is blocked
  • Configurable polling - Adjustable frame rate
  • Resource efficient - Lower bandwidth than video streams

4. Signaler

Purpose: WebSocket communication layer for control and signaling.

Features:

  • Event-driven messaging - Pub/sub pattern with topic routing
  • Automatic reconnection - Exponential backoff strategy
  • Concurrent I/O - Separate read/write loops
  • Message queuing - Buffered sending with backpressure handling

Event Types:

  • signal/* - WebRTC signaling (offer/answer/ICE)
  • control/* - GUI control commands (mouse/keyboard)
  • chat/* - Text messaging for online mode
  • system/* - Session management and status

Action Space and Navigation Modes

Web Mode (nav_mode="web")

Optimized for: Desktop web browsers, web applications

Action Space:

Available Actions:
1. CLICK(position): Click at normalized coordinates [x, y] where 0 <= x, y <= 1
2. INPUT(value): Type the specified text string
3. SCROLL(direction): Scroll in direction (up, down, left, right)
4. ANSWER(value): Provide final answer when task is complete

Examples:
- CLICK(position=[0.5, 0.3]): Click at the horizontal center, 30% down the screen
- INPUT(value="search query"): Type text into focused element
- SCROLL(direction="down"): Scroll page downward
- ANSWER(value="Task completed successfully"): Signal completion

Phone Mode (nav_mode="phone")

Optimized for: Mobile device interfaces, touch interactions

Action Space:

Available Actions:
1. TAP(position): Tap at normalized coordinates [x, y] where 0 <= x, y <= 1
2. INPUT(value): Enter text using virtual keyboard
3. SWIPE(startPosition, endPosition): Swipe from start to end coordinates
4. SCROLL(direction): Scroll content in specified direction
5. ANSWER(value): Provide final answer when task is complete

Examples:
- TAP(position=[0.8, 0.1]): Tap near top-right corner
- SWIPE(startPosition=[0.5, 0.8], endPosition=[0.5, 0.2]): Swipe up
- SCROLL(direction="down"): Scroll content downward
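
For illustration, a minimal parser for action strings of this shape might look like the following; the real agent's safe_parse_action may use a different grammar:

```python
import ast
import re
from typing import Optional

_ACTION_RE = re.compile(r"^(?P<name>[A-Z]+)\((?P<args>.*)\)$", re.S)
_ARG_RE = re.compile(r'(\w+)=(\[[^\]]*\]|"[^"]*"|\'[^\']*\'|[^,]+)')

def parse_action(text: str) -> Optional[dict]:
    """Parse e.g. 'CLICK(position=[0.5, 0.3])' into an action dict (sketch)."""
    m = _ACTION_RE.match(text.strip())
    if not m:
        return None
    action = {"action": m.group("name"), "position": None, "value": None}
    for key, raw in _ARG_RE.findall(m.group("args")):
        try:
            val = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError):
            val = raw.strip()
        if key in ("position", "startPosition", "endPosition"):
            action[key] = val
        else:  # value, direction, ...
            action["value"] = val
    return action
```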

AI Model Integration

ShowUI-2B (Qwen2VL)

Model Details:

  • Architecture: Qwen2VL-based multimodal model specialized for GUI understanding
  • Input: RGB images + text instructions
  • Output: Structured action commands
  • Performance: ~100-200ms inference time on GPU

Processing Pipeline:

# 1. Image preprocessing
image = resize_and_validate_image(frame, settings, logger)

# 2. Chat template preparation
content = [
    {"type": "text", "text": system_prompt},
    {"type": "text", "text": f"Task: {nav_task}"},
    {"type": "text", "text": f"Action history: {history}"},
    {"type": "image", "image": image, "size": {...}}
]

# 3. Model inference
inputs = processor(text=[text], images=[image], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# 4. Action parsing
action = safe_parse_action(decoded_output, nav_mode="web")

Iterative Refinement: The agent uses an iterative refinement strategy to improve click accuracy:

  1. Initial prediction on full screen image
  2. Crop extraction around predicted click location (50% of original size)
  3. Re-inference on cropped region for sub-pixel precision
  4. Coordinate transformation back to full screen space

This typically improves click accuracy by 15-25% over single-shot inference.
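
Steps 2 and 4 reduce to coordinate arithmetic. A sketch under the assumption of a 50% crop clamped to the image bounds (helper names are illustrative; the agent's _crop_image may differ in detail):

```python
from typing import List, Tuple

def crop_box(pos: List[float], width: int, height: int,
             scale: float = 0.5) -> Tuple[int, int, int, int]:
    """Crop region centered on a normalized click, at `scale` of the
    full size, shifted so it stays inside the image."""
    cw, ch = int(width * scale), int(height * scale)
    cx, cy = pos[0] * width, pos[1] * height
    x0 = max(0, min(int(cx - cw / 2), width - cw))
    y0 = max(0, min(int(cy - ch / 2), height - ch))
    return (x0, y0, x0 + cw, y0 + ch)

def to_full(pos_in_crop: List[float], box: Tuple[int, int, int, int],
            width: int, height: int) -> List[float]:
    """Map a normalized position inside the crop back to full-screen space."""
    x0, y0, x1, y1 = box
    return [(x0 + pos_in_crop[0] * (x1 - x0)) / width,
            (y0 + pos_in_crop[1] * (y1 - y0)) / height]
```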

Operation Modes

Offline Mode (Single Task)

Usage: Automated execution of a single, predefined task.

agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task="Find the current weather in San Francisco",
    nav_mode="web",
    online=False,  # Single task mode
    max_steps=10
)

await agent.run()  # Executes task and exits

Behavior:

  • Immediate control - Requests host control on connection
  • Task execution - Runs navigation loop until completion or max steps
  • Automatic exit - Terminates after task completion
  • Host control maintained - Keeps control throughout session

Online Mode (Multi-Task)

Usage: Interactive agent that responds to chat commands for multiple tasks.

agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task="",  # No initial task
    nav_mode="web",
    online=True,  # Multi-task chat mode
    max_steps=15
)

await agent.run()  # Runs indefinitely, responding to chat

Behavior:

  • On-demand control - Only requests control when active task starts
  • Chat monitoring - Listens for task commands in chat messages
  • Task queuing - Handles multiple sequential tasks
  • Persistent session - Remains connected between tasks
  • Host control release - Releases control when idle

Chat Commands:

Start a new task:
User: Navigate to amazon.com and find wireless headphones under $100

The agent will:
1. Request host control
2. Execute the navigation task
3. Release host control when complete
4. Wait for next command
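
Filtering chat traffic for new tasks can be sketched as below; the chat/message event name follows the Signaler's event types, while the "content" payload key is an assumption about the chat schema:

```python
from typing import Optional

def extract_task(msg: dict) -> Optional[str]:
    """Return the task text if `msg` is a chat message, else None."""
    if msg.get("event") != "chat/message":
        return None
    # "content" is an assumed payload field; the real schema may differ
    text = str((msg.get("payload") or {}).get("content", "")).strip()
    return text or None
```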

Metrics and Observability

Prometheus Metrics

The agent exports comprehensive metrics for monitoring and alerting:

# Connection metrics
reconnects = Counter('neko_agent_reconnects_total', 'WebSocket reconnection attempts')
frames_processed = Counter('neko_agent_frames_processed_total', 'Video frames processed')

# Performance metrics
inference_latency = Histogram('neko_agent_inference_seconds', 'AI model inference time')
action_execution_time = Histogram('neko_agent_action_execution_seconds', 'Action execution time')

# Quality metrics
actions_executed = Counter('neko_agent_actions_executed_total', 'Actions executed by type', ['action_type'])
parse_errors = Counter('neko_agent_parse_errors_total', 'Action parsing failures')

Example Queries:

# Average inference latency over 5 minutes
rate(neko_agent_inference_seconds_sum[5m]) / rate(neko_agent_inference_seconds_count[5m])

# Action parse success rate (share of processed frames that parsed cleanly)
1 - (rate(neko_agent_parse_errors_total[5m]) / rate(neko_agent_frames_processed_total[5m]))

# Actions per minute by type
rate(neko_agent_actions_executed_total[1m]) * 60

Logging

Structured Logging: Supports both text and JSON formats for different environments.

# Text format (development)
export NEKO_LOG_FORMAT=text
# Output: 2024-01-15 10:30:00 INFO Model inference completed (step=3) | Duration: 0.15s | Device: GPU

# JSON format (production)
export NEKO_LOG_FORMAT=json
# Output: {"timestamp":"2024-01-15T10:30:00Z","level":"INFO","message":"Model inference completed","step":3,"duration":0.15,"device":"GPU"}

Log Levels:

  • DEBUG - Detailed execution flow, frame processing details
  • INFO - Task progress, action execution, performance metrics
  • WARNING - Recoverable errors, fallback activations
  • ERROR - Critical failures, connection issues

Configuration Examples

Development Environment

# Basic development setup
export NEKO_WS="ws://localhost:8080/api/ws"
export NEKO_LOGLEVEL="DEBUG"
export NEKO_LOG_FORMAT="text"
export FRAME_SAVE_PATH="./debug/frames"
export CLICK_SAVE_PATH="./debug/actions"
export NEKO_MAX_STEPS="15"
export REFINEMENT_STEPS="3"

Production Environment

# Production configuration
export NEKO_WS="wss://neko.prod.example.com/api/ws"
export NEKO_LOGLEVEL="INFO"
export NEKO_LOG_FORMAT="json"
export NEKO_LOGFILE="/var/log/neko-agent.log"
export PORT="8080"  # Metrics server port
export NEKO_MAX_STEPS="20"
export REFINEMENT_STEPS="5"
export NEKO_RTCP_KEEPALIVE="1"  # For NAT traversal
export NEKO_FORCE_EXIT_GUARD_MS="5000"  # Force exit after 5s

GPU Optimization

# GPU-specific settings
export CUDA_VISIBLE_DEVICES="0"
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export TORCH_CUDA_ARCH_LIST="8.6"  # For RTX 30xx series
export OFFLOAD_FOLDER="/fast-ssd/model-cache"  # SSD for model loading

Error Handling and Recovery

Connection Resilience

# Automatic reconnection with exponential backoff
async def connect_with_backoff(self):
    backoff = 1.0
    max_backoff = 60.0

    while not self.shutdown.is_set():
        try:
            await self._establish_connection()
            return
        except Exception as e:
            logger.warning(f"Connection failed: {e}, retrying in {backoff}s")
            await asyncio.sleep(backoff)
            backoff = min(max_backoff, backoff * 1.5)

Inference Timeout Handling

# Model inference with timeout and cancellation
try:
    async with asyncio.timeout(120.0):  # 2 minute timeout
        outputs = await model_inference_task
except asyncio.TimeoutError:
    model_inference_task.cancel()
    logger.error("Model inference timeout, skipping frame")
    return None

Frame Source Fallback

# Automatic fallback from WebRTC to HTTP polling
if webrtc_frame_source.failed:
    logger.warning("WebRTC frame source failed, falling back to HTTP polling")
    self.frame_source = LiteFrameSource(settings, logger)
    await self.frame_source.start(session_id)

Performance Optimization

GPU Memory Management

# Efficient GPU memory usage
torch.cuda.empty_cache()  # Clear cache between tasks
model.half()  # Use FP16 for faster inference
torch.backends.cudnn.benchmark = True  # Optimize for fixed input sizes

Frame Processing Pipeline

# Asynchronous frame processing
async def process_frames():
    async for frame in frame_source:
        # Non-blocking processing
        task = asyncio.create_task(self._navigate_once(frame, history, step))

        # Rate limiting to prevent overwhelming the model
        await asyncio.sleep(0.1)  # 10 FPS maximum

Model Loading Optimization

# Efficient model initialization
model = Qwen2VLForConditionalGeneration.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,  # Half precision
    device_map="auto",          # Automatic device placement
    offload_folder=settings.offload_folder,  # Disk offloading for large models
    trust_remote_code=True
)

Usage Examples

Basic Web Automation

# Simple web navigation task
import asyncio
from agent import NekoAgent, Settings, setup_logging

async def main():
    settings = Settings.from_env()
    logger = setup_logging(settings)

    # Load AI model
    model = Qwen2VLForConditionalGeneration.from_pretrained("showlab/ShowUI-2B")
    processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")

    # Create and run agent
    agent = NekoAgent(
        model=model,
        processor=processor,
        ws_url="ws://localhost:8080/api/ws",
        nav_task="Go to google.com and search for 'machine learning tutorials'",
        nav_mode="web",
        settings=settings,
        logger=logger
    )

    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())

Interactive Online Mode

# Multi-task interactive agent
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url="wss://neko.example.com/api/ws",
    nav_task="",  # Start without a task
    nav_mode="web",
    online=True,  # Enable chat-based task assignment
    settings=settings,
    logger=logger
)

# Agent will respond to chat messages like:
# "Please navigate to amazon.com and find wireless keyboards under $50"
# "Go to the weather website and check the forecast for tomorrow"
await agent.run()

Custom Action Refinement

# Agent with custom refinement settings
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task=task,
    nav_mode="web",
    refinement_steps=5,  # More refinement for better accuracy
    max_steps=20,        # Allow longer task sequences
    settings=settings,
    logger=logger
)

Testing and Debugging

Health Check

# Validate configuration
uv run src/agent.py --healthcheck

# Expected output:
# Configuration validation: PASSED
# Model loading: PASSED
# WebSocket connectivity: PASSED
# All systems ready

Debug Mode

# Enable comprehensive debugging
export NEKO_LOGLEVEL="DEBUG"
export FRAME_SAVE_PATH="./debug/frames"
export CLICK_SAVE_PATH="./debug/actions"

uv run src/agent.py --task "test navigation" --max-steps 3

Frame Analysis

The agent can save frames and action visualizations for debugging:

# Saved frame: ./debug/frames/frame_001_1642261234.567.png
# Saved action: ./debug/actions/action_step_001_1642261234.567_CLICK.png

Action visualizations include:

  • Red circle - Predicted click location
  • Green overlay - Text input areas
  • Blue arrows - Scroll directions
  • Purple box - Crop regions during refinement

Integration with Other Components

With Capture Service

# The agent automatically integrates with capture service
# No additional configuration needed - capture service monitors WebSocket
export CAPTURE_OUT="./training-data"
export CAPTURE_REMOTE="s3://training-bucket/episodes"

# Start capture in background
uv run src/capture.py &

# Run agent (capture will automatically record)
uv run src/agent.py --task "automation task"

With TTS Service

# Enable voice announcements during automation
export YAP_VOICES_DIR="./voices"

# Start TTS service
uv run src/yap.py &

# Run agent with voice enabled (audio is negotiated automatically)
uv run src/agent.py --task "automation task"

Future Enhancements

The agent architecture is designed to support planned enhancements:

  1. API Mode - RESTful service with horizontal scaling
  2. Multi-Modal Input - Voice commands and text instructions
  3. Advanced Vision Models - Support for newer GUI understanding models
  4. Distributed Inference - Model sharding across multiple GPUs
  5. Custom Action Types - Extensible action framework
  6. Real-Time Learning - Online adaptation from user feedback

See API Mode Design for detailed future architecture plans.

Capture Service

Manual Control CLI

The Manual Control CLI (src/manual.py) provides a production-ready command-line interface for manually interacting with Neko v3 servers. This tool implements the complete WebRTC signaling handshake required by modern Neko servers and offers a comprehensive REPL for remote desktop control.

Overview

The Manual CLI is designed for:

  • Manual testing of Neko server functionality
  • Remote administration tasks on hosted desktops
  • Development and debugging of Neko integrations
  • Quality assurance testing of desktop applications

Key Features

  • Full WebRTC Signaling: Complete SDP/ICE handshake for control acceptance
  • Robust Authentication: REST login with automatic token management
  • Auto-reconnection: Intelligent reconnection with exponential backoff
  • REPL Interface: Interactive command-line with comprehensive controls
  • Multiple Input Methods: Mouse, keyboard, scrolling, and gesture support
  • Admin Commands: Session management and force control operations
  • Coordinate Systems: Support for both pixel and normalized (0..1) coordinates
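
For example, normalized input under --norm scales by the virtual screen size before commands are sent; a sketch of that conversion (helper name is illustrative):

```python
from typing import Tuple

def to_pixels(x: float, y: float, width: int, height: int,
              normalized: bool) -> Tuple[int, int]:
    """Convert a REPL coordinate pair to server pixel coordinates,
    clamped to the virtual screen."""
    if normalized:  # --norm: inputs are 0..1 fractions of the screen
        x, y = x * width, y * height
    return (max(0, min(int(round(x)), width - 1)),
            max(0, min(int(round(y)), height - 1)))
```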

Architecture

Core Components

graph TD
    CLI[ManualCLI] --> Signaler[WebSocket Signaler]
    CLI --> Broker[Event Broker]

    Signaler --> WS[WebSocket Connection]
    Signaler --> Tasks[Background Tasks]

    Broker --> SysQ[System Queue]
    Broker --> SigQ[Signal Queue]
    Broker --> MiscQ[Misc Queue]

    CLI --> PC[RTCPeerConnection]
    PC --> ICE[ICE Negotiation]
    PC --> SDP[SDP Exchange]

    WS --> Neko[Neko Server]
    ICE --> Neko
    SDP --> Neko

Class Hierarchy

  1. Broker - Event routing and message distribution
  2. Signaler - WebSocket connection management
  3. ManualCLI - Main application controller and REPL

Usage

Basic Connection

# Direct WebSocket connection
python manual.py --ws "wss://neko.example.com/api/ws?token=abc123"

# REST authentication
python manual.py --neko-url "https://neko.example.com" \
                 --username "admin" --password "password"

Configuration Options

| Flag | Environment | Description |
|------|-------------|-------------|
| `--ws` | `NEKO_WS` | Direct WebSocket URL with token |
| `--neko-url` | `NEKO_URL` | Base URL for REST authentication |
| `--username` | `NEKO_USER` | Login username |
| `--password` | `NEKO_PASS` | Login password |
| `--size` | `NEKO_SIZE` | Virtual screen size (e.g., "1920x1080") |
| `--norm` | - | Use normalized 0..1 coordinates |
| `--no-auto-host` | - | Disable automatic host control requests |
| `--no-media` | - | Skip WebRTC signaling (may break control) |
| `--no-audio` | - | Disable audio streaming |

Environment Variables

# Logging configuration
export NEKO_LOGLEVEL=DEBUG
export NEKO_LOGFILE=/tmp/neko-manual.log

# Connection parameters
export NEKO_URL=https://neko.example.com
export NEKO_USER=admin
export NEKO_PASS=secretpassword
export NEKO_SIZE=1920x1080

REPL Commands

Mouse Operations

move X Y                    # Move cursor to coordinates
click [button]              # Click at current position (left/right/middle)
lclick                      # Left click shortcut
rclick                      # Right click shortcut
dblclick                    # Double click with proper timing
hover X Y                   # Move to coordinates without clicking
tap X Y [button]            # Move and click in one command
swipe x1 y1 x2 y2          # Drag gesture between two points

Keyboard Input

key <KeyName>               # Press a specific key (Escape, F5, Control, etc.)
text "some text"            # Type text at current cursor focus
input X Y "text"            # Click at coordinates and type text
enter                       # Press Enter/Return key

Scrolling

scroll <direction> [amount] # Scroll up/down/left/right (amount defaults to 1)

Clipboard Operations

copy                        # Send copy command
cut                         # Send cut command
select_all                  # Select all content
paste [text]                # Paste clipboard or specific text

Connection Management

host                        # Request host control
unhost                      # Release host control
size [WxH]                  # Show or set virtual screen size
raw '{"event":"...","payload":{}}' # Send raw JSON message

Administrative Commands

force-take                  # Force take host control (admin only)
force-release               # Force release host control (admin only)
kick <sessionId>            # Force disconnect a session (admin only)
sessions                    # List all connected users and session IDs

Utility Commands

help                        # Show command reference
quit / exit                 # Close connection and exit

API Reference

Core Classes

Broker

Event routing system that distributes incoming WebSocket messages to appropriate topic queues.

class Broker:
    def __init__(self)
    def topic_queue(self, topic: str, maxsize: int = 512) -> asyncio.Queue
    def publish(self, msg: Dict[str, Any]) -> None

Topic Routing:

  • `signal/*` events → signal queue (WebRTC signaling)
  • `system/*`, `control/*`, `screen/*`, `keyboard/*`, `session/*`, `error/*` → system queue
  • All others → misc queue
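
A minimal sketch of publish() implementing these routing rules (a stand-in, not the real class):

```python
import asyncio
from typing import Any, Dict

class MiniBroker:
    """Illustrative router matching the rules above; the real Broker may differ."""
    _SYSTEM = ("system/", "control/", "screen/", "keyboard/", "session/", "error/")

    def __init__(self) -> None:
        self._queues: Dict[str, asyncio.Queue] = {}

    def topic_queue(self, topic: str, maxsize: int = 512) -> asyncio.Queue:
        if topic not in self._queues:
            self._queues[topic] = asyncio.Queue(maxsize=maxsize)
        return self._queues[topic]

    def publish(self, msg: Dict[str, Any]) -> None:
        ev = msg.get("event", "")
        if ev.startswith("signal/"):
            topic = "signal"
        elif ev.startswith(self._SYSTEM):
            topic = "system"
        else:
            topic = "misc"
        self.topic_queue(topic).put_nowait(msg)
```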

Signaler

WebSocket connection manager with automatic reconnection and background message processing.

class Signaler:
    def __init__(self, url: str, **wsopts)
    async def connect_with_backoff(self) -> "Signaler"
    async def close(self) -> None
    async def send(self, msg: Dict[str, Any]) -> None

Key Features:

  • Exponential backoff reconnection (1s → 30s max)
  • Separate read/write loops for optimal performance
  • Graceful shutdown with task cleanup
  • Message queuing during disconnection

ManualCLI

Main application controller that orchestrates connection management, WebRTC signaling, and user interaction.

class ManualCLI:
    def __init__(self, ws: str, width: int, height: int,
                 normalized: bool, auto_host: bool,
                 request_media: bool, base_url: Optional[str] = None,
                 token: Optional[str] = None, audio: bool = True)
    async def start(self) -> None
    async def handle(self, line: str) -> None

Utility Functions

parse_size(s: str) -> Tuple[int, int]

Parse screen size string (e.g., "1920x1080") into width/height tuple.
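
A plausible implementation, shown for illustration (the real function may validate differently):

```python
from typing import Tuple

def parse_size(s: str) -> Tuple[int, int]:
    """Parse '1920x1080' into (1920, 1080); raises ValueError on bad input."""
    w, sep, h = s.lower().partition("x")
    if not sep:
        raise ValueError(f"expected WxH, got {s!r}")
    width, height = int(w), int(h)
    if width <= 0 or height <= 0:
        raise ValueError(f"non-positive size: {s!r}")
    return width, height
```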

ws_from_rest_login(neko_url: str, username: str, password: str, *, timeout: float = 10.0) -> Tuple[str, str, str]

Perform REST authentication and derive WebSocket URL with token.

name_keysym(name: str) -> int

Map key names to X11 keysym codes for keyboard events.

Protocol Implementation

WebRTC Signaling Flow

  1. Initial Request: Send signal/request with video/audio preferences
  2. Offer Reception: Wait for signal/offer or signal/provide from server
  3. ICE Configuration: Parse and validate ICE servers (strict mapping)
  4. SDP Exchange: Set remote description, create answer, send local description
  5. ICE Candidates: Process incoming candidates and send local candidates
  6. Media Handling: Log received tracks without processing (control-only)
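
The message flow can be sketched with plain dicts; payload field names here are assumptions, and in the real CLI an aiortc RTCPeerConnection produces the actual answer SDP:

```python
from typing import Dict, List, Optional

def initial_messages(video: bool = True, audio: bool = True) -> List[Dict]:
    """Step 1: the client's opening request (payload shape is an assumption)."""
    return [{"event": "signal/request",
             "payload": {"video": video, "audio": audio}}]

def handle_signaling(msg: Dict) -> Optional[Dict]:
    """Steps 2-5 from the client's side; a placeholder string stands in
    for the SDP a peer connection would generate."""
    ev = msg.get("event", "")
    if ev in ("signal/offer", "signal/provide"):
        # set remote description, create an answer, send it back
        return {"event": "signal/answer",
                "payload": {"sdp": "<local-answer-sdp>", "type": "answer"}}
    if ev == "signal/candidate":
        # feed the remote ICE candidate into the peer connection; no reply
        return None
    return None
```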

Connection Lifecycle

sequenceDiagram
    participant C as Client
    participant S as Signaler
    participant N as Neko Server
    participant R as RTC Connection

    C->>S: connect_with_backoff()
    S->>N: WebSocket connection
    S->>C: Connection established

    C->>N: signal/request (media preferences)
    C->>N: control/request (host control)

    N->>S: signal/offer (SDP + ICE servers)
    S->>R: Create RTCPeerConnection
    R->>N: signal/answer (SDP response)

    loop ICE Negotiation
        N->>R: signal/candidate
        R->>N: signal/candidate
    end

    R->>C: Connection ready

    loop Event Processing
        N->>S: system/heartbeat
        S->>N: client/heartbeat

        C->>N: control/move, control/click, etc.
        N->>S: control/host, screen/updated, etc.
    end

Error Handling and Reconnection

The CLI implements robust error handling:

  • WebSocket Errors: Automatic reconnection with exponential backoff
  • Authentication Failures: Clear error messages and exit
  • Protocol Errors: Graceful degradation and retry
  • Resource Cleanup: Proper task cancellation and connection shutdown

Integration Examples

Basic Automation Script

import asyncio
from manual import ManualCLI, ws_from_rest_login

async def automate_task():
    # Get WebSocket URL via REST login
    ws_url, base_url, token = ws_from_rest_login(
        "https://neko.example.com",
        "admin",
        "password"
    )

    # Create CLI instance
    cli = ManualCLI(
        ws=ws_url,
        width=1920,
        height=1080,
        normalized=False,
        auto_host=True,
        request_media=True,
        base_url=base_url,
        token=token
    )

    # Start connection and wait for ready
    await cli.start()

    # Perform automation tasks
    await cli.handle("move 100 100")
    await cli.handle("click")
    await cli.handle("text 'Hello World'")
    await cli.handle("key Enter")

if __name__ == "__main__":
    asyncio.run(automate_task())

Custom Event Handler

class CustomCLI(ManualCLI):
    async def _event_logger(self):
        """Override event logger with custom handling."""
        q = self.signaler.broker.topic_queue("system")

        while self.running and self.signaler and not self.signaler._closed.is_set():
            try:
                msg = await q.get()
                ev = msg.get("event", "")
                payload = msg.get("payload", {})

                # Custom event processing
                if ev == "control/host":
                    print(f"Host control changed: {payload}")
                elif ev == "session/created":
                    print(f"New session: {payload}")

            except asyncio.CancelledError:
                break

Security Considerations

Authentication

  • Always use HTTPS/WSS in production
  • Store credentials securely (environment variables, not code)
  • Implement proper token rotation for long-running sessions

Network Security

  • Validate SSL certificates in production
  • Use VPN or private networks for sensitive operations
  • Monitor connection logs for suspicious activity

Access Control

  • Limit admin command usage to authorized users
  • Implement session timeouts for inactive connections
  • Log all control operations for audit trails

Troubleshooting

Common Issues

Connection Failures

# Enable debug logging
export NEKO_LOGLEVEL=DEBUG
export NEKO_LOGFILE=/tmp/debug.log

# Check WebSocket connectivity
python manual.py --ws "wss://neko.example.com/api/ws?token=test"

Authentication Errors

# Verify REST API access
curl -X POST https://neko.example.com/api/login \
     -H "Content-Type: application/json" \
     -d '{"username":"admin","password":"password"}'

WebRTC Issues

  • ICE Failures: Check firewall and NAT configuration
  • SDP Errors: Verify codec compatibility
  • Media Problems: Use --no-audio flag for control-only mode

Performance Issues

  • High Latency: Check network connectivity and server load
  • Dropped Connections: Reduce ping_interval in WebSocket options
  • Memory Leaks: Monitor task cleanup during reconnection

Debug Commands

# Raw protocol inspection
await cli.handle('raw \'{"event":"system/stats"}\'')

# Connection diagnostics
await cli.handle('raw \'{"event":"signal/stats"}\'')

# Session information
await cli.handle('sessions')

Development

Testing

# Unit tests
python -m pytest tests/test_manual.py

# Integration tests with local Neko server
docker-compose up neko
python manual.py --neko-url http://localhost:8080 --username admin --password neko

# Load testing
python scripts/stress_test_manual.py --connections 10 --duration 300

Contributing

When modifying manual.py:

  1. Maintain Protocol Compatibility: Don't break existing WebRTC signaling
  2. Add Comprehensive Tests: Cover new REPL commands and edge cases
  3. Update Documentation: Keep this guide and docstrings current
  4. Follow PEP Standards: Code must comply with PEP 257/287 for docstrings
  5. Handle Errors Gracefully: All new features must have proper error handling

Extending Functionality

Adding New Commands

# In ManualCLI.handle() method
if cmd == "my_command":
    await self._handle_my_command(rest)
    return

# Implement handler method
async def _handle_my_command(self, args) -> None:
    """Handle the 'my_command' REPL command.

    :param args: Command arguments from user input
    :type args: list
    """
    if not args:
        print("usage: my_command <parameter>")
        return

    # Command implementation
    await self._safe_send({
        "event": "custom/my_event",
        "payload": {"param": args[0]}
    })

Custom Event Processing

# In _event_logger() method
elif ev == "custom/my_event":
    self._handle_custom_event(payload)

def _handle_custom_event(self, payload: dict) -> None:
    """Process custom server events.

    :param payload: Event payload from server
    :type payload: dict
    """
    # Custom event handling logic
    pass

This documentation provides comprehensive coverage of the Manual Control CLI, from basic usage to advanced development scenarios. The tool serves as both a practical utility for Neko server interaction and a reference implementation for WebRTC signaling protocols.

TTS Service (yap.py)

The YAP (Yet Another Presenter) service provides real-time text-to-speech functionality through WebRTC audio streaming. It connects to Neko servers and responds to chat commands for immediate or streaming speech synthesis.

Overview

YAP transforms text messages into high-quality speech using F5-TTS and broadcasts the audio through WebRTC. It supports both immediate synthesis and streaming mode for real-time conversation scenarios.

Key Features

  • Real-time TTS: Low-latency speech synthesis using F5-TTS
  • WebRTC Integration: Direct audio streaming to Neko browser sessions
  • Voice Management: Hot-reloadable voice configurations with custom parameters
  • Streaming Mode: Incremental text processing for live conversations
  • Chat Commands: Interactive control through Neko chat interface
  • Parallel Processing: Multi-threaded TTS workers for improved performance
  • Audio Pipeline: Crossfade splicing and jitter buffering for smooth playback

Architecture

graph TD
    A[Chat Commands] --> B[Command Parser]
    B --> C[Stream Assembler]
    C --> D[Text Segmentation]
    D --> E[TTS Pipeline]
    E --> F[Parallel F5-TTS Workers]
    F --> G[Audio Splicer]
    G --> H[PCM Queue]
    H --> I[WebRTC Audio Track]
    I --> J[Neko Browser]

    K[Voice Manager] --> E
    L[WebSocket Signaler] --> A
    L --> M[WebRTC Peer Connection]
    M --> I

Installation

Dependencies

# Core dependencies
pip install aiortc av websockets numpy

# F5-TTS (required for speech synthesis)
pip install git+https://github.com/SWivid/F5-TTS.git

# Optional resampling backends (first available will be used)
pip install torchaudio  # Preferred
# OR
pip install scipy      # Alternative
# Linear fallback is built-in
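
The selection order can be expressed as a capability probe plus the built-in linear fallback; this is a behavioural sketch, not yap.py's actual code:

```python
import importlib.util
from typing import List, Sequence

def pick_resampler() -> str:
    """Probe for the preferred backends in the order listed above."""
    for name in ("torchaudio", "scipy"):
        if importlib.util.find_spec(name) is not None:
            return name
    return "linear"

def linear_resample(samples: Sequence[float], src_sr: int, dst_sr: int) -> List[float]:
    """Pure-Python linear interpolation, the last-resort fallback."""
    if src_sr == dst_sr or len(samples) < 2:
        return list(samples)
    n_out = max(2, round(len(samples) * dst_sr / src_sr))
    out = []
    for i in range(n_out):
        pos = i * (len(samples) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```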

System Requirements

  • Python 3.8+
  • CUDA-capable GPU (recommended for F5-TTS)
  • WebRTC-compatible environment

Configuration

YAP follows 12-factor app principles with environment-based configuration:

Connection Settings

Variable | Default | Description
---------|---------|------------
YAP_WS / NEKO_WS | - | Direct WebSocket URL
NEKO_URL | - | REST API base URL
NEKO_USER | - | Username for REST authentication
NEKO_PASS | - | Password for REST authentication

Audio Settings

Variable | Default | Description
---------|---------|------------
YAP_SR | 48000 | Output sample rate (Hz)
YAP_AUDIO_CHANNELS | 1 | Audio channels (1=mono, 2=stereo)
YAP_FRAME_MS | 20 | WebRTC frame size (10/20/30/40/60 ms)
YAP_JITTER_MAX_SEC | 6.0 | PCM buffer maximum duration
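As a worked example of how the frame-size setting interacts with the sample rate, the number of PCM samples per WebRTC frame follows directly from the two values:

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of PCM samples in one WebRTC audio frame."""
    return sample_rate_hz * frame_ms // 1000

# With the defaults above (48 kHz, 20 ms frames):
print(samples_per_frame(48000, 20))  # → 960
```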

TTS Pipeline Settings

Variable | Default | Description
---------|---------|------------
YAP_PARALLEL | 2 | Parallel TTS worker threads
YAP_CHUNK_TARGET_SEC | 3.0 | Target chunk duration hint
YAP_MAX_CHARS | 350 | Maximum characters per chunk
YAP_OVERLAP_MS | 30 | Audio crossfade overlap

Voice Settings

Variable | Default | Description
---------|---------|------------
YAP_VOICES_DIR | ./voices | Voice configuration directory
YAP_SPK_DEFAULT | default | Default speaker ID

Logging & ICE Settings

Variable | Default | Description
---------|---------|------------
YAP_LOGLEVEL | INFO | Log level (DEBUG/INFO/WARNING/ERROR)
YAP_LOG_FORMAT | text | Log format (text/json)
YAP_STUN_URL | stun:stun.l.google.com:19302 | STUN server
YAP_ICE_POLICY | strict | ICE server policy (strict/all)

Usage

Basic Usage

# Direct WebSocket connection
export YAP_WS="wss://demo.neko.com/api/ws?token=your_token"
python src/yap.py

# REST API authentication
export NEKO_URL="https://demo.neko.com"
export NEKO_USER="username"
export NEKO_PASS="password"
python src/yap.py

Command Line Options

python src/yap.py --help

# Common options
python src/yap.py \
  --ws "wss://host/api/ws?token=..." \
  --sr 48000 \
  --channels 1 \
  --parallel 4 \
  --loglevel DEBUG

Health Check

# Validate configuration
python src/yap.py --healthcheck

Chat Commands

YAP responds to commands in Neko chat:

Immediate Speech

/yap Hello, this is immediate speech synthesis!

Streaming Mode

/yap:begin
This text will be processed incrementally...
More text gets added and synthesized in chunks...
/yap:end

Voice Control

# Switch active voice and parameters
/yap:voice set --spk alice --rate 1.2 --pitch 0.5

# Add new voice
/yap:voice add --spk bob --ref /path/to/reference.wav --ref-text "Reference text" --styles "calm,friendly"

# Reload voice configurations
/yap:voice reload

Queue Management

# Stop current synthesis and clear queue
/yap:stop

Voice Management

Voice Configuration (voices.json)

{
  "default": {
    "ref_audio": "./voices/default.wav",
    "ref_text": "This is a default reference recording.",
    "styles": ["calm", "neutral"],
    "params": {
      "rate": 1.0,
      "pitch": 0.0
    }
  },
  "alice": {
    "ref_audio": "./voices/alice_sample.wav",
    "ref_text": "Hello, I'm Alice and this is my voice.",
    "styles": ["friendly", "energetic"],
    "params": {
      "rate": 1.1,
      "pitch": 0.2
    }
  }
}

Voice Parameters

  • rate: Speech speed multiplier (0.5-2.0)
  • pitch: Pitch shift in semitones (-12 to +12)
  • styles: Descriptive tags for voice characteristics
  • ref_audio: Path to reference audio file (WAV format)
  • ref_text: Transcript of reference audio

Hot Reloading

Voice configurations can be updated without restarting:

  1. Edit voices.json in the voices directory
  2. Send /yap:voice reload command in chat
  3. Changes take effect immediately
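The reload flow can be sketched with a minimal registry class. This is illustrative only (the hypothetical `VoiceRegistry` below is not the actual VoiceManager implementation): the key idea is that the JSON file is re-read on demand and the in-memory table is swapped wholesale.

```python
import json
from pathlib import Path

class VoiceRegistry:
    """Minimal sketch of a hot-reloadable voice registry."""

    def __init__(self, voices_dir: str, default_speaker: str = "default"):
        self.path = Path(voices_dir) / "voices.json"
        self.default_speaker = default_speaker
        self.voices: dict = {}
        self.reload()

    def reload(self) -> None:
        # Re-read voices.json; replace the whole table on success.
        if self.path.exists():
            self.voices = json.loads(self.path.read_text())

    def get(self, speaker: str) -> dict:
        # Fall back to the default speaker for unknown IDs.
        return self.voices.get(speaker, self.voices.get(self.default_speaker, {}))
```

A `/yap:voice reload` command would then simply call `reload()` on the live instance.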

Technical Implementation

Audio Pipeline

  1. Text Segmentation: Smart punctuation-aware chunking
  2. Parallel Synthesis: Multi-threaded F5-TTS workers
  3. Audio Splicer: Crossfade blending between chunks
  4. PCM Queue: Jitter-buffered audio streaming
  5. WebRTC Track: Real-time audio transmission
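The crossfade step (3) can be sketched as a linear blend over the overlap region. This is an illustrative sketch assuming float32 mono chunks; the actual splicer may use a different fade curve, and `YAP_OVERLAP_MS` maps to the `overlap` sample count via the sample rate.

```python
import numpy as np

def crossfade_splice(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Splice two float32 mono chunks with a linear crossfade
    over `overlap` samples."""
    if overlap <= 0 or len(a) < overlap or len(b) < overlap:
        return np.concatenate([a, b])
    fade_out = np.linspace(1.0, 0.0, overlap, dtype=np.float32)
    fade_in = 1.0 - fade_out
    # Blend the tail of `a` into the head of `b`.
    blended = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], blended, b[overlap:]])
```

Because the fades sum to 1.0 at every sample, splicing two chunks of equal level produces no audible step at the seam.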

Text Processing

# Punctuation-aware segmentation
chunks = segment_text("Hello world! How are you today?", max_chars=50)
# Result: ["Hello world!", "How are you today?"]

# Streaming assembly with opportunistic emission
assembler = StreamAssembler(max_chars=100)
ready_chunks = assembler.feed("Partial text...")
final_chunks = assembler.flush()

Audio Formats

  • Input: F5-TTS generated audio (variable sample rate)
  • Processing: Float32 waveforms with resampling
  • Output: 16-bit PCM at configured sample rate
  • WebRTC: Opus-encoded frames for transmission
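The float32-to-PCM conversion between the processing and output stages can be sketched as a clip-and-scale (illustrative; the real pipeline may handle dithering or headroom differently):

```python
import numpy as np

def float32_to_pcm16(wave: np.ndarray) -> np.ndarray:
    """Convert a float32 waveform in [-1.0, 1.0] to 16-bit PCM,
    clipping out-of-range samples first."""
    clipped = np.clip(wave, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)
```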

Resampling Backends

YAP automatically selects the best available resampling backend:

  1. torchaudio (preferred): High-quality GPU acceleration
  2. scipy: CPU-based signal processing
  3. linear (fallback): Simple interpolation
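The selection logic and the built-in linear fallback can be sketched as follows. Function names here are illustrative, not YAP's actual internals; the point is the try-import cascade and the interpolation fallback.

```python
import numpy as np

def resample_linear(wave: np.ndarray, sr_from: int, sr_to: int) -> np.ndarray:
    """Built-in linear-interpolation fallback (backend 3 above)."""
    if sr_from == sr_to:
        return wave
    n_out = int(round(len(wave) * sr_to / sr_from))
    x_out = np.linspace(0.0, len(wave) - 1, n_out)
    return np.interp(x_out, np.arange(len(wave)), wave).astype(wave.dtype)

def pick_resampler() -> str:
    """Probe backends in preference order."""
    try:
        import torchaudio  # noqa: F401  (preferred, GPU-capable)
        return "torchaudio"
    except ImportError:
        pass
    try:
        import scipy.signal  # noqa: F401  (CPU-based)
        return "scipy"
    except ImportError:
        return "linear"
```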

Performance Tuning

Parallel Workers

# Increase for better throughput (GPU memory permitting)
export YAP_PARALLEL=4

Chunk Sizing

# Smaller chunks = lower latency, higher overhead
export YAP_MAX_CHARS=200

# Larger chunks = higher latency, better efficiency
export YAP_MAX_CHARS=500

Audio Buffer

# Reduce for lower latency (risk of underruns)
export YAP_JITTER_MAX_SEC=3.0

# Increase for stability (higher latency)
export YAP_JITTER_MAX_SEC=10.0

Troubleshooting

Common Issues

No audio output

  • Check WebRTC connection status in browser developer tools
  • Verify audio permissions in browser
  • Confirm STUN/TURN server connectivity

High latency

  • Reduce YAP_MAX_CHARS for smaller chunks
  • Decrease YAP_JITTER_MAX_SEC buffer size
  • Increase YAP_PARALLEL workers (if GPU permits)

Audio artifacts

  • Increase YAP_OVERLAP_MS for smoother crossfades
  • Check F5-TTS reference audio quality
  • Verify sample rate consistency

Connection failures

  • Validate WebSocket URL and authentication
  • Check firewall settings for WebRTC ports
  • Test with simpler STUN configuration

Debug Logging

export YAP_LOGLEVEL=DEBUG
export YAP_LOG_FORMAT=json
python src/yap.py 2>&1 | jq .

Health Validation

# Check configuration and dependencies
python src/yap.py --healthcheck

API Reference

Core Classes

YapApp

Main application coordinator handling WebSocket signaling, WebRTC setup, and command processing.

app = YapApp(settings, logger)
await app.run()  # Main event loop

TTSPipeline

Manages parallel TTS synthesis with crossfade splicing.

pipeline = TTSPipeline(voices, tts, pcmq, sr_out, ch_out, overlap_ms, parallel, logger, max_chars)
await pipeline.speak_text("Hello world", speaker="alice")

VoiceManager

Hot-reloadable voice configuration registry.

voices = VoiceManager(voices_dir, default_speaker, logger)
voice = voices.get("alice")  # Get voice config
voices.reload()  # Hot reload from JSON

PCMQueue

Thread-safe jitter buffer for WebRTC audio streaming.

pcmq = PCMQueue(sr=48000, channels=1, max_sec=6.0, logger=logger)
pcmq.push(audio_samples)  # Producer
samples = pcmq.pull(frame_size)  # Consumer
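The producer/consumer contract can be approximated with a small sketch (a hypothetical `SimplePCMQueue`, not the real implementation): pushes beyond the cap drop the oldest audio, and underruns are padded with silence so the WebRTC track never stalls.

```python
import threading
import numpy as np

class SimplePCMQueue:
    """Illustrative thread-safe jitter buffer for int16 mono PCM."""

    def __init__(self, sr: int, max_sec: float):
        self._buf = np.zeros(0, dtype=np.int16)
        self._lock = threading.Lock()
        self._max = int(sr * max_sec)

    def push(self, samples: np.ndarray) -> None:
        with self._lock:
            # Keep only the newest max_sec seconds of audio.
            self._buf = np.concatenate([self._buf, samples])[-self._max:]

    def pull(self, frame_size: int) -> np.ndarray:
        with self._lock:
            out, self._buf = self._buf[:frame_size], self._buf[frame_size:]
        # Pad with silence on underrun.
        if len(out) < frame_size:
            out = np.concatenate([out, np.zeros(frame_size - len(out), dtype=np.int16)])
        return out
```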

Key Functions

segment_text(text, max_chars)

Intelligent text segmentation respecting punctuation boundaries.
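One possible shape of this behavior, as an illustrative sketch rather than the actual code: split on sentence-ending punctuation, then greedily pack sentences into chunks no longer than `max_chars`.

```python
import re

def segment_text_sketch(text: str, max_chars: int) -> list:
    """Illustrative punctuation-aware chunker."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Emit the packed chunk and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```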

apply_rate_pitch(wave, sr, rate, pitch)

Apply naive prosody transformations (rate/pitch adjustments).

_resample(wave, sr_from, sr_to)

Multi-backend audio resampling with automatic fallback.

Integration Examples

Docker Deployment

FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy application
COPY src/ /app/src/
COPY voices/ /app/voices/
WORKDIR /app

# Configuration via environment
ENV YAP_SR=48000
ENV YAP_PARALLEL=2
ENV YAP_VOICES_DIR=/app/voices

ENTRYPOINT ["python", "src/yap.py"]

Docker Compose

version: '3.8'
services:
  yap:
    build: .
    environment:
      - NEKO_URL=http://neko:8080
      - NEKO_USER=admin
      - NEKO_PASS=password
      - YAP_PARALLEL=4
      - YAP_LOGLEVEL=INFO
    volumes:
      - ./voices:/app/voices
    depends_on:
      - neko

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: yap-tts
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yap-tts
  template:
    metadata:
      labels:
        app: yap-tts
    spec:
      containers:
      - name: yap
        image: neko-agent/yap:latest
        env:
        - name: NEKO_URL
          valueFrom:
            secretKeyRef:
              name: neko-config
              key: url
        - name: NEKO_USER
          valueFrom:
            secretKeyRef:
              name: neko-config
              key: username
        - name: NEKO_PASS
          valueFrom:
            secretKeyRef:
              name: neko-config
              key: password
        - name: YAP_PARALLEL
          value: "4"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        volumeMounts:
        - name: voices
          mountPath: /app/voices
      volumes:
      - name: voices
        persistentVolumeClaim:
          claimName: yap-voices

Future Enhancements

  • Voice Cloning: Real-time voice learning from user samples
  • Emotion Control: Dynamic emotional expression parameters
  • SSML Support: Advanced speech markup for prosody control
  • Multi-language: Automatic language detection and synthesis
  • Audio Effects: Real-time audio processing (reverb, filters)
  • Metrics: Prometheus/OpenTelemetry integration for monitoring

Source Reference

File: src/yap.py:1
Lines of Code: ~1550
Dependencies: aiortc, av, websockets, numpy, F5-TTS

Development Setup

This guide covers the development environment setup, coding standards, and contribution workflow for the Neko Agent project.

Development Environment

Prerequisites

  • Nix with flakes - Package management and development shells
  • Git - Version control
  • NVIDIA GPU (optional but recommended) - For AI model acceleration

Setting Up

  1. Clone the repository:

    git clone <repository-url>
    cd neko-agent
    
  2. Enter development environment:

    # For GPU development (recommended)
    nix develop .#gpu
    
    # For CPU-only development
    nix develop
    
  3. Verify setup:

    python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
    

Available Development Shells

Shell | Purpose | Key Packages
------|---------|-------------
default | Basic Python development | PyTorch CPU, transformers, websockets
gpu | GPU-accelerated development | CUDA 12.8, PyTorch GPU, all dependencies
docs | Documentation development | mdBook, Sphinx, preprocessing tools
neko | Neko server management | Docker, Colima, compose tools
ai | AI tool integration | Claude Code CLI, OpenAI tools

Project Structure

├── src/                    # Source code
│   ├── agent.py           # Core automation agent
│   ├── capture.py         # Training data capture
│   └── yap.py             # TTS service
├── docs/                  # Documentation
│   ├── src/               # mdBook source files
│   └── book.toml          # Documentation configuration
├── voices/                # Voice models and assets
├── data/                  # Training data output
├── overlays/              # Nix package overlays
├── nix/                   # Nix configuration
└── flake.nix              # Development environment

Coding Standards

Python Style

  • Follow PEP 8 for code style
  • Use type hints for all function signatures
  • Write docstrings for public functions and classes
  • Use async/await for I/O operations

Example:

async def process_frame(frame: np.ndarray, task: str) -> Optional[Action]:
    """
    Process a video frame and determine the next action.
    
    :param frame: RGB frame data as numpy array
    :param task: Natural language task description
    :return: Action to execute, or None if task complete
    """
    # Implementation here
    pass

Error Handling

  • Use specific exceptions rather than generic Exception
  • Log errors with context for debugging
  • Implement graceful degradation where possible

try:
    result = await risky_operation()
except SpecificError as e:
    logger.warning(f"Operation failed: {e}, falling back to default")
    result = default_fallback()

Configuration Management

  • Use environment variables for configuration
  • Provide sensible defaults in code
  • Document all configuration options

NEKO_URL = os.environ.get("NEKO_URL", "http://localhost:8080")
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", "3"))

Testing

Running Tests

# Run all tests
python -m pytest

# Run with coverage
python -m pytest --cov=src

# Run specific test file
python -m pytest tests/test_agent.py

Test Structure

  • Unit tests for individual functions
  • Integration tests for component interaction
  • End-to-end tests for full workflows

Writing Tests

import pytest
from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_agent_action_execution():
    """Test that agent executes actions correctly."""
    agent = NekoAgent()
    
    with patch.object(agent, 'webrtc_client') as mock_client:
        mock_client.execute_action = AsyncMock()
        
        action = ClickAction(x=100, y=200)
        await agent.execute_action(action)
        
        mock_client.execute_action.assert_called_once_with(action)

Documentation

Building Documentation

# Enter docs environment
nix develop .#docs

# Serve documentation locally
cd docs && mdbook serve --open

# Build static documentation
cd docs && mdbook build

Documentation Standards

  • Write in Markdown using mdBook extensions
  • Include code examples for all APIs
  • Use mermaid diagrams for architecture
  • Keep docs up-to-date with code changes

Git Workflow

Branch Strategy

  • main - Stable release branch
  • dev - Development integration branch
  • feature/* - Feature development branches
  • fix/* - Bug fix branches

Commit Guidelines

  • Use conventional commits format
  • Write descriptive messages explaining the why
  • Keep commits atomic - one logical change per commit

Examples:

feat(agent): add support for swipe actions
fix(capture): handle missing frames gracefully  
docs(api): update WebRTC connection examples
refactor(tts): improve voice loading performance

Pull Request Process

  1. Create feature branch from dev
  2. Implement changes following coding standards
  3. Add/update tests for new functionality
  4. Update documentation if needed
  5. Submit PR with clear description
  6. Address review feedback
  7. Merge after approval

Debugging

Logging Configuration

# Set log levels
export AGENT_LOGLEVEL=DEBUG
export CAPTURE_LOGLEVEL=INFO
export YAP_LOGLEVEL=WARNING

Common Debug Tasks

WebRTC connection issues:

# Check Neko server status
curl http://localhost:8080/health

# Test WebSocket endpoint
wscat -c ws://localhost:8080/api/ws?token=<token>

AI model problems:

# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"

# Test model loading
python -c "from transformers import Qwen2VLForConditionalGeneration; print('OK')"

Training data capture:

# Verify MDS output
python -c "from streaming import StreamingDataset; print('MDS OK')"

# Check S3 connectivity
aws s3 ls s3://your-bucket/

Performance Optimization

Profiling

# Profile with cProfile
python -m cProfile -o profile.stats src/agent.py

# Analyze with snakeviz
snakeviz profile.stats

GPU Memory Management

# Monitor GPU usage
nvidia-smi -l 1

# Set memory allocation strategy
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

WebRTC Optimization

  • Reduce frame rate for lower bandwidth
  • Adjust video quality based on task complexity
  • Use hardware encoding when available

Contributing

Getting Started

  1. Read the documentation to understand the system
  2. Set up development environment
  3. Pick an issue from the project board
  4. Ask questions if anything is unclear

Areas for Contribution

  • New action types (drag, hover, etc.)
  • Additional AI models integration
  • Performance improvements
  • Documentation enhancements
  • Test coverage expansion

Code Review Guidelines

  • Be constructive in feedback
  • Explain the reasoning behind suggestions
  • Test the changes locally when possible
  • Approve promptly for good contributions

Release Process

Version Management

  • Semantic versioning (MAJOR.MINOR.PATCH)
  • Tag releases in Git
  • Update changelog for each release

Deployment

  • Build Docker images for production
  • Test in staging environment
  • Deploy with rolling updates
  • Monitor metrics post-deployment

Support

Getting Help

  • Check documentation first
  • Search existing issues on GitHub
  • Ask in discussions for general questions
  • Open issues for bugs or feature requests

Community Guidelines

  • Be respectful and inclusive
  • Help others when you can
  • Share knowledge and experiences
  • Follow the code of conduct

Nix Flake Development Environment

The Neko Agent project uses a sophisticated Nix flake to provide reproducible, cross-platform development environments with specialized configurations for different use cases. This document provides comprehensive documentation of all flake features, development shells, and usage patterns.

Overview

The flake (flake.nix) is designed around multiple specialized development environments that cater to different aspects of the project:

  • AI/ML Development - GPU-accelerated environments with CUDA support
  • Documentation - Publishing and development of project documentation
  • Container Operations - Docker and Neko server management
  • Performance Optimization - CPU-optimized builds with architecture-specific flags
  • TEE Deployment - Trusted Execution Environment deployment with attestation
  • Registry Management - Multi-registry container deployment support
  • Cross-Platform Support - Works on x86_64-linux and aarch64-darwin (Apple Silicon)

graph TB
    subgraph "Nix Flake Architecture"
        Flake[flake.nix]
        Inputs[External Inputs]
        Overlays[Custom Overlays]
        Shells[Development Shells]
        Packages[Docker Images]
        Apps[Utility Apps]
    end
    
    subgraph "External Dependencies"
        Nixpkgs[nixpkgs/nixos-unstable]
        MLPkgs[nixvital/ml-pkgs]
    end
    
    subgraph "Custom Overlays"
        WebRTC[WebRTC Stack]
        ML[ML Libraries]
        Audio[Audio Processing]
        Optimization[CPU Optimization]
    end
    
    subgraph "Development Shells"
        Default[default]
        GPU[gpu]
        AI[ai]
        Neko[neko]
        Docs[docs]
        CPUOpt[cpu-opt]
        GPUOpt[gpu-opt]
    end
    
    Flake --> Inputs
    Flake --> Overlays
    Flake --> Shells
    Flake --> Packages
    Flake --> Apps
    
    Inputs --> Nixpkgs
    Inputs --> MLPkgs
    
    Overlays --> WebRTC
    Overlays --> ML
    Overlays --> Audio
    Overlays --> Optimization
    
    Shells --> Default
    Shells --> GPU
    Shells --> AI
    Shells --> Neko
    Shells --> Docs
    Shells --> CPUOpt
    Shells --> GPUOpt
    Shells --> TEE

Flake Inputs

External Dependencies

inputs = {
  nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  ml-pkgs.url = "github:nixvital/ml-pkgs";
};

Input | Source | Purpose
------|--------|--------
nixpkgs | nixos-unstable | Latest packages and system libraries
ml-pkgs | nixvital/ml-pkgs | Specialized ML/AI packages (PyTorch, CUDA)

Why nixos-unstable?

  • Latest packages - Access to newest versions of AI/ML libraries
  • CUDA support - Most recent NVIDIA driver and toolkit support (CUDA 12.8)
  • Python ecosystem - Up-to-date Python packages for transformers and WebRTC
  • Security updates - Timely security patches for all dependencies

Build Metadata and Reproducibility

The flake includes comprehensive build metadata for reproducible builds and attestation:

buildInfo = rec {
  timestamp = "${year}-${month}-${day}T${hour}:${minute}:${second}Z";
  revision = self.rev or self.dirtyRev or "unknown";
  shortRev = builtins.substring 0 8 revision;
  version = if (self ? rev) then shortRev else "${shortRev}-dirty";
  nixpkgsRev = nixpkgs.rev or "unknown";
  
  imageMetadata = {
    "org.opencontainers.image.title" = "Neko Agent";
    "org.opencontainers.image.created" = timestamp;
    "org.opencontainers.image.revision" = revision;
    "dev.neko.build.reproducible" = "true";
  };
};

Custom Overlay System

The flake uses a comprehensive overlay system to provide packages not available in standard Nixpkgs:

WebRTC and Media Stack

nekoOverlays = [
  (import ./overlays/pylibsrtp.nix)     # Secure RTP protocol
  (import ./overlays/aioice.nix)        # Async ICE implementation  
  (import ./overlays/aiortc.nix)        # WebRTC for Python
  # ... more overlays
];

Overlay | Package | Purpose
--------|---------|--------
pylibsrtp.nix | pylibsrtp | Secure Real-time Transport Protocol for WebRTC
aioice.nix | aioice | Asynchronous ICE (Interactive Connectivity Establishment)
aiortc.nix | aiortc | WebRTC implementation for Python with media support

AI/ML and Audio Processing

Overlay | Package | Purpose
--------|---------|--------
streaming.nix | streaming | MosaicML Streaming for training data
f5-tts.nix | f5-tts | F5-TTS voice synthesis model
vocos.nix | vocos | Neural vocoder for audio generation
ema-pytorch.nix | ema-pytorch | Exponential Moving Average for PyTorch
transformers-stream-generator.nix | transformers-stream-generator | Streaming text generation
bitsandbytes.nix | bitsandbytes | 8-bit optimizers for PyTorch

Pi-Zero PyTorch Dependencies

The flake includes comprehensive packaging for pi-zero-pytorch and its dependencies:

Overlay | Package | Purpose
--------|---------|--------
pi-zero-pytorch/pi-zero-pytorch.nix | pi-zero-pytorch | Main π0 implementation in PyTorch
pi-zero-pytorch/einx.nix | einx | Universal tensor operations with Einstein notation
pi-zero-pytorch/x-transformers.nix | x-transformers | Transformer architectures library
pi-zero-pytorch/rotary-embedding-torch.nix | rotary-embedding-torch | Rotary positional embeddings
pi-zero-pytorch/accelerated-scan.nix | accelerated-scan | Accelerated scan operations
pi-zero-pytorch/bidirectional-cross-attention.nix | bidirectional-cross-attention | Cross-attention mechanisms
pi-zero-pytorch/hl-gauss-pytorch.nix | hl-gauss-pytorch | Gaussian operations for ML
pi-zero-pytorch/evolutionary-policy-optimization.nix | evolutionary-policy-optimization | Evolution strategies

Performance Optimization

Overlay | Package | Purpose
--------|---------|--------
cached-path.nix | cached-path | Efficient file caching utilities
znver2-flags.nix | nekoZnver2Env | AMD Zen2 CPU optimization flags
vmm-cli.nix | vmm-cli | Virtual machine management CLI

Example Znver2 Optimization:

# Generated environment variables for AMD Zen2 CPUs
export NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"
export RUSTFLAGS="-C target-cpu=znver2 -C target-feature=+sse2,+sse4.2,+avx,+avx2,+fma,+bmi1,+bmi2"

External ML Packages

ml-pkgs.overlays.torch-family  # Provides torch-bin, torchvision-bin, etc.

Benefits:

  • Pre-compiled binaries - Faster setup without compilation
  • CUDA integration - Proper CUDA toolkit linkage
  • Consistent versions - Matching PyTorch ecosystem versions

Development Shells

1. Default Shell (default)

Purpose: Basic Python development with CPU-only PyTorch.

Usage:

nix develop
# or
nix develop .#default

Includes:

  • Python Environment: PyTorch CPU, Transformers, WebRTC stack
  • System Tools: FFmpeg, Git, Curl, Just, pkg-config
  • Node.js Ecosystem: Node 20, NPM for AI tools
  • AI CLI Tools: OpenAI Codex, Anthropic Claude Code (auto-installed)

Python Packages:

# Core ML/AI
transformers
torch (CPU)
torchvision
pillow
accelerate

# WebRTC and networking
websockets
av (PyAV for video processing)
pylibsrtp
aioice
aiortc

# Data and streaming
streaming (MosaicML)
f5-tts
numpy
scipy
zstandard
xxhash
tqdm

# Monitoring
prometheus-client

When to Use:

  • Initial project setup and exploration
  • Development on systems without NVIDIA GPUs
  • Testing compatibility with CPU-only environments
  • CI/CD pipelines where GPU access is unavailable

2. GPU Shell (gpu)

Purpose: GPU-accelerated development with CUDA 12.8 support.

Usage:

nix develop .#gpu

NVIDIA hosts: When running outside NixOS you will typically need nixGL to expose the system GPU. Use:

NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:nix-community/nixGL#nixGLNvidia -- nix develop .#gpu

This wraps the GPU shell with the right OpenGL/EGL libraries from the host driver.

Additional Features over Default:

  • CUDA Toolkit 12.8 - Complete CUDA development environment
  • cuDNN and NCCL - Optimized neural network and communication libraries
  • GPU-enabled PyTorch - Tensor operations on NVIDIA GPUs
  • Environment Variables - Automatic CUDA path and library configuration

CUDA Environment Setup:

# Automatically configured
export CUDA_HOME=/nix/store/.../cuda-12.8
export CUDA_PATH=$CUDA_HOME
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# GPU control
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
export CUDA_MODULE_LOADING=LAZY
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Verification Commands:

# Check CUDA installation
nvidia-smi
nvcc --version

# Test PyTorch GPU support
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"

When to Use:

  • AI model inference and training
  • GPU-accelerated image/video processing
  • Development requiring CUDA libraries
  • Performance-critical workloads

3. AI Shell (ai)

Purpose: Lightweight environment focused on AI development tools.

Usage:

nix develop .#ai

Includes:

  • Core System Tools - FFmpeg, Git, networking utilities
  • Node.js Environment - Node 20, NPM
  • AI CLI Tools - Automatic installation of OpenAI and Anthropic CLIs
  • Minimal Footprint - No heavy ML libraries, faster startup

AI Tools Installed:

# OpenAI Codex CLI
npm install -g @openai/codex

# Anthropic Claude Code CLI  
npm install -g @anthropic-ai/claude-code

Environment Setup:

# NPM global packages in project directory
export NPM_CONFIG_PREFIX=$PWD/.npm-global
export PATH=$NPM_CONFIG_PREFIX/bin:$PATH

When to Use:

  • AI-assisted development workflows
  • Code generation and review tasks
  • Integration with AI development services
  • Quick environment for AI tool testing

4. Neko Shell (neko)

Purpose: Container and Neko server management.

Usage:

nix develop .#neko

Container Stack:

  • Colima - Lightweight Docker runtime for macOS/Linux
  • Docker & Docker Compose - Container orchestration
  • Docker Buildx - Multi-platform image building
  • Networking Tools - curl, jq for API interaction

Custom Scripts:

# Neko service management script
neko-services up      # Start Neko server
neko-services down    # Stop services
neko-services logs    # View container logs
neko-services status  # Check service status
neko-services restart # Restart services
neko-services update  # Pull latest images and restart

Colima Configuration:

# Automatically configured VM
colima start --vm-type vz --cpu 2 --memory 4 \
  --mount-type sshfs --mount "~:w"

Docker Environment:

# Automatic Docker socket configuration
export DOCKER_HOST="unix://$HOME/.colima/default/docker.sock"

When to Use:

  • Neko server development and testing
  • Container image building and deployment
  • Docker-based development workflows
  • Local testing of production deployments

5. Documentation Shell (docs)

Purpose: Documentation development, building, and publishing.

Usage:

nix develop .#docs

Documentation Stack:

  • mdBook - Rust-based documentation generator
  • mdBook Extensions:
    • mdbook-mermaid - Diagram support
    • mdbook-linkcheck - Link validation
    • mdbook-toc - Table of contents generation
  • Sphinx - Python documentation with reStructuredText support
  • Node.js - For additional tooling and preprocessing

Python Documentation Tools:

sphinx              # Documentation generator
sphinx-rtd-theme     # Read the Docs theme
myst-parser          # Markdown support for Sphinx
sphinxcontrib-mermaid # Mermaid diagrams in Sphinx

Available Commands:

# From inside docs/
mdbook serve --open     # Development server with live reload
mdbook build           # Build static documentation
mdbook test            # Test code examples and links

# Sphinx alternative
sphinx-build -b html source build/

When to Use:

  • Writing and editing project documentation
  • Building documentation for deployment
  • Testing documentation changes locally
  • Contributing to API reference and guides

6. CPU-Optimized Shell (cpu-opt)

Purpose: Performance-optimized CPU development.

Usage:

nix develop .#cpu-opt

Optimization Features:

  • Architecture-Specific Compilation - Znver2 flags for AMD CPUs
  • Optimized Python Environment - Performance-tuned package builds
  • Compiler Optimizations - -O3, -march=znver2, -mtune=znver2

Generated Optimization Flags (Linux only):

# Compiler flags
export NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"

# Rust flags
export RUSTFLAGS="-C target-cpu=znver2 -C target-feature=+sse2,+sse4.2,+avx,+avx2,+fma,+bmi1,+bmi2 -C link-arg=-Wl,-O1 -C link-arg=--as-needed"

When to Use:

  • Performance-critical CPU workloads
  • Benchmarking and optimization work
  • Production builds targeting specific CPU architectures
  • Environments where every bit of CPU performance matters

7. GPU-Optimized Shell (gpu-opt)

Purpose: Maximum performance GPU development with optimizations.

Usage:

nix develop .#gpu-opt

Combined Optimizations:

  • All GPU features - CUDA 12.8, cuDNN, NCCL
  • CPU optimizations - Znver2 flags for host code
  • PyTorch optimizations - Optimized builds with CPU and GPU acceleration
  • Memory optimizations - Advanced CUDA memory management

GPU-Specific Optimizations:

# Target specific GPU architecture (configurable)
export TORCH_CUDA_ARCH_LIST=8.6  # RTX 30xx series

# Memory allocation strategy
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Performance Verification:

# Check optimizations are active
echo $NIX_CFLAGS_COMPILE  # Should show znver2 flags
echo $TORCH_CUDA_ARCH_LIST  # Should show target GPU architecture

# Benchmark performance
python -c "
import torch
import time
x = torch.randn(1000, 1000, device='cuda')
start = time.time()
torch.mm(x, x)
print(f'GPU matrix multiply: {time.time() - start:.4f}s')
"

When to Use:

  • Maximum performance AI inference
  • GPU-accelerated training workloads
  • Performance benchmarking and optimization
  • Production deployments requiring peak performance

8. TEE Shell (tee)

Purpose: Trusted Execution Environment deployment and attestation.

Usage:

nix develop .#tee

TEE Deployment Stack:

  • Phala Cloud CLI - Modern CLI for TEE deployments
  • Legacy VMM CLI - Compatible with older dstack systems
  • Docker & Docker Compose - Container orchestration
  • Bun Runtime - Fast JavaScript runtime
  • Reproducible Image Builder - Attestation-ready container building

Available Commands:

# Modern Phala CLI
phala auth login <api-key>      # Authenticate with Phala Cloud
phala status                    # Check authentication status
phala cvms list                 # List Confidential VMs
phala nodes                     # List available TEE nodes

# Legacy VMM CLI (if needed)
vmm-cli lsvm                    # List virtual machines
vmm-cli lsimage                 # List available images
vmm-cli lsgpu                   # List available GPUs

# Reproducible builds
nix run .#build-images          # Build reproducible images
nix run .#deploy-to-tee         # Deploy with attestation metadata
nix run .#verify-attestation    # Verify TEE attestation

Multi-Registry Support:

# Deploy to ttl.sh (ephemeral registry)
NEKO_REGISTRY=ttl.sh NEKO_TTL=1h nix run .#deploy-to-tee
nix run .#deploy-to-ttl 24h

# Deploy to GitHub Container Registry
NEKO_REGISTRY=ghcr.io/your-org nix run .#deploy-to-tee

# Deploy to Docker Hub
NEKO_REGISTRY=docker.io/your-org nix run .#deploy-to-tee

# Deploy to local registry
NEKO_REGISTRY=localhost:5000/neko nix run .#deploy-to-tee

When to Use:

  • Deploying to Trusted Execution Environments
  • Creating attestable, reproducible deployments
  • Multi-registry container management
  • TEE-based inference deployments
  • Confidential computing workloads

Docker Images and Packages

The flake builds optimized Docker images for production deployment:

Available Images

The flake now builds multiple specialized images for different components:

# Build all images
nix run .#build-images

# Agent images
nix build .#neko-agent-docker-generic
nix build .#neko-agent-docker-opt

# Capture images  
nix build .#neko-capture-docker-generic
nix build .#neko-capture-docker-opt

# YAP (TTS) images
nix build .#neko-yap-docker-generic
nix build .#neko-yap-docker-opt

# Train images
nix build .#neko-train-docker-generic
nix build .#neko-train-docker-opt

1. Generic CUDA Image (neko-agent-docker-generic)

Target: neko-agent:cuda12.8-generic

Features:

  • Portable CUDA - Includes PTX for forward compatibility
  • CUDA 12.8 - Full toolkit and libraries
  • Python Environment - All dependencies with torch-bin
  • Broad GPU Support - Works on any CUDA 8.6+ GPU

Configuration:

# Environment variables
CUDA_HOME=/nix/store/.../cuda-12.8
LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib
CUDA_MODULE_LOADING=LAZY
TORCH_CUDA_ARCH_LIST=8.6+PTX  # Forward compatibility

Use Cases:

  • Multi-GPU deployment environments
  • Cloud platforms with varying GPU types
  • Development and testing across different hardware
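
The generic image can be run with GPU access via Docker Compose; a minimal sketch (the service name and settings here are illustrative, the image tag matches the build target above):

```yaml
services:
  neko-agent:
    image: neko-agent:cuda12.8-generic
    working_dir: /workspace
    environment:
      - CUDA_MODULE_LOADING=LAZY
    deploy:
      resources:
        reservations:
          devices:
            # Expose all NVIDIA GPUs to the container
            - driver: nvidia
              count: all
              capabilities: [gpu]
```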

2. Optimized Image (neko-agent-docker-opt)

Target: neko-agent:cuda12.8-sm86-v3

Features:

  • Specific GPU targeting - Optimized for RTX 30xx series (sm_86)
  • CPU optimizations - Znver2 architecture flags
  • Smaller size - No PTX, specific architecture only
  • Maximum performance - All available optimizations enabled

Configuration:

# Optimized environment
TORCH_CUDA_ARCH_LIST=8.6  # Specific architecture only
NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"
RUSTFLAGS="-C target-cpu=znver2 ..."  # Rust optimizations

Use Cases:

  • Production deployments with known hardware
  • Performance-critical applications
  • Cost-optimized cloud instances

Image Building System

# Helper function for consistent container structure
mkRoot = paths: pkgs.buildEnv {
  name = "image-root";
  inherit paths;
  pathsToLink = [ "/bin" ];
};

# Generic image build
neko-agent-docker-generic = pkgs.dockerTools.buildImage {
  name = "neko-agent:cuda12.8-generic";
  created = "now";
  copyToRoot = mkRoot ([
    runnerGeneric
    pyEnvGeneric
    cuda.cudatoolkit
    cuda.cudnn
    cuda.nccl
    pkgs.bashInteractive
  ] ++ commonSystemPackages);
  config = {
    Env = baseEnv ++ [
      "CUDA_HOME=${cuda.cudatoolkit}"
      "LD_LIBRARY_PATH=${cuda.cudatoolkit}/lib64:${cuda.cudnn}/lib"
      "TORCH_CUDA_ARCH_LIST=8.6+PTX"
    ];
    WorkingDir = "/workspace";
    Entrypoint = [ "/bin/neko-agent" ];
  };
};

Utility Apps

The flake provides comprehensive utility applications for common tasks:

Documentation Apps

# Build documentation
nix run .#docs-build

# Serve documentation with live reload
nix run .#docs-serve

# Check documentation for issues
nix run .#docs-check

Build and Deployment Apps

# Build all Docker images with attestation metadata
nix run .#build-images

# TEE deployment with multi-registry support
nix run .#deploy-to-tee
nix run .#deploy-to-ttl 24h              # Quick ttl.sh deployment
nix run .#push-to-ttl 1h                 # Just push to ttl.sh

# Attestation verification
nix run .#verify-attestation <app-id> <expected-hash>

Container Registry Apps

# Local registry management
nix run .#start-registry                 # HTTP registry with auth
nix run .#start-registry-https           # HTTPS with Tailscale certs
nix run .#stop-registry

# Public exposure
nix run .#start-tailscale-funnel         # Expose via Tailscale Funnel
nix run .#start-cloudflare-tunnel        # Expose via Cloudflare Tunnel

Registry Configuration Examples:

# Environment variables for registry customization
NEKO_REGISTRY_PORT=5000
NEKO_REGISTRY_USER=neko
NEKO_REGISTRY_PASSWORD=pushme
NEKO_REGISTRY_DATA_DIR=$PWD/registry-data
NEKO_REGISTRY_AUTH_DIR=$PWD/auth
NEKO_REGISTRY_CERTS_DIR=$PWD/certs

# Tailscale Funnel setup
NEKO_REGISTRY=your-device.tail-scale.ts.net/neko

# Cloudflare Tunnel setup
NEKO_CF_TUNNEL_NAME=neko-registry
NEKO_CF_HOSTNAME=registry.example.com

Common Development Workflows

Initial Setup

# Clone repository
git clone <repo-url>
cd neko-agent

# Enter development environment
nix develop .#gpu  # or .#default for CPU-only

# Verify setup
python -c "import torch; print(torch.cuda.is_available())"

AI Development Workflow

# 1. Enter GPU environment
nix develop .#gpu

# 2. Load environment variables (if .env exists)
# Automatically loaded by shell hook

# 3. Test model loading
python -c "
from transformers import Qwen2VLForConditionalGeneration
model = Qwen2VLForConditionalGeneration.from_pretrained('showlab/ShowUI-2B')
print('Model loaded successfully')
"

# 4. Run agent
uv run src/agent.py --task "Navigate to google.com"

Documentation Development

# 1. Enter docs environment
nix develop .#docs

# 2. Start development server
nix run .#docs-serve
# Opens browser to http://localhost:3000

# 3. Edit files in docs/src/
# Changes automatically reload in browser

# 4. Build for deployment
nix run .#docs-build

Container Development

# 1. Enter container environment
nix develop .#neko

# 2. Start Neko server
neko-services up

# 3. Check status
neko-services status

# 4. View logs
neko-services logs neko

# 5. Test connection
curl http://localhost:8080/health

Performance Optimization

# 1. Use optimized environment
nix develop .#gpu-opt

# 2. Verify optimizations
echo $NIX_CFLAGS_COMPILE
echo $TORCH_CUDA_ARCH_LIST

# 3. Run performance benchmarks
python benchmarks/inference_speed.py

# 4. Build optimized container
nix build .#neko-agent-docker-opt

TEE Deployment Workflow

# 1. Enter TEE environment
nix develop .#tee

# 2. Build reproducible images
nix run .#build-images

# 3. Deploy to TEE (with registry choice)
# Option A: Use ttl.sh for testing
nix run .#deploy-to-ttl 1h

# Option B: Use GitHub Container Registry
NEKO_REGISTRY=ghcr.io/your-org nix run .#deploy-to-tee

# Option C: Use local registry (start it first)
nix run .#start-registry  # In another terminal
NEKO_REGISTRY=localhost:5000/neko nix run .#deploy-to-tee

# 4. Verify attestation (inside TEE)
nix run .#verify-attestation <app-id> <compose-hash>

# 5. Check deployment status
phala cvms list  # Modern CLI
# or
vmm-cli lsvm    # Legacy CLI

Multi-Registry Development

# Setup local registry for testing
nix run .#start-registry

# Push images to multiple registries
docker tag neko-agent:latest localhost:5000/neko/agent:v1
docker push localhost:5000/neko/agent:v1

# Use Tailscale for team access
nix run .#start-tailscale-funnel

# Use Cloudflare for public access
nix run .#start-cloudflare-tunnel

Environment Variables and Configuration

Automatic .env Loading

All development shells automatically load .env files:

# .env file example
NEKO_WS=ws://localhost:8080/api/ws
NEKO_LOGLEVEL=DEBUG
CUDA_VISIBLE_DEVICES=0
TORCH_CUDA_ARCH_LIST=8.6
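
The Nix shell hooks handle this loading automatically; for scripts run outside the dev shell, a minimal loader in the same spirit might look like this (a sketch, not the project's actual mechanism):

```python
import os

def load_dotenv(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
    except FileNotFoundError:
        pass  # no .env file is fine; fall back to the ambient environment
    os.environ.update(values)
    return values
```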

Common Environment Variables

  • CUDA_HOME - CUDA installation path. Default: auto-detected. Set by: GPU shells.
  • CUDA_VISIBLE_DEVICES - GPU selection. Default: all. Set by: user configuration.
  • PYTORCH_CUDA_ALLOC_CONF - Memory allocation strategy. Default: expandable_segments:True. Set by: GPU shells.
  • NPM_CONFIG_PREFIX - NPM global install location. Default: $PWD/.npm-global. Set by: all shells.
  • NIX_CFLAGS_COMPILE - Compiler optimizations. Default: znver2 flags. Set by: optimized shells.

Shell-Specific Variables

GPU Shells:

export CUDA_MODULE_LOADING=LAZY
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib

Documentation Shell:

# No specific variables, uses standard tool defaults

Container Shell:

export DOCKER_HOST="unix://$HOME/.colima/default/docker.sock"

Cross-Platform Support

Supported Systems

supportedSystems = [ "x86_64-linux" "aarch64-darwin" ];

Platform-Specific Features

x86_64-Linux:

  • Full GPU support - NVIDIA CUDA, Docker GPU passthrough
  • CPU optimizations - Znver2, Intel architecture targeting
  • Container building - Docker images with CUDA support

aarch64-Darwin (Apple Silicon):

  • Metal Performance Shaders - GPU acceleration via MPS
  • Rosetta compatibility - x86_64 dependencies when needed
  • Native performance - ARM64-optimized packages

Platform Detection

# Conditional features based on platform
${pkgs.lib.optionalString pkgs.stdenv.isLinux ''
  source ${znver2File}
  echo "[cpu-opt] Using znver2 flags: $NIX_CFLAGS_COMPILE"
''}

Troubleshooting

Common Issues

CUDA Not Detected:

# Check NVIDIA drivers
nvidia-smi

# Verify CUDA environment
echo $CUDA_HOME
echo $LD_LIBRARY_PATH

# Test PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"

Solution: Ensure NVIDIA drivers are installed and compatible with CUDA 12.8.

Docker Issues on macOS:

# Check Colima status
colima status

# Restart if needed
colima stop
colima start --vm-type vz --cpu 2 --memory 4

Slow Package Installation:

# Use binary cache
echo "substituters = https://cache.nixos.org https://cuda-maintainers.cachix.org" >> ~/.config/nix/nix.conf
echo "trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY= cuda-maintainers.cachix.org-1:0dq3bujKpuEPiCgBv7/11NEBpCcEKUzZzUNjRgPTOOA=" >> ~/.config/nix/nix.conf

Memory Issues

GPU Memory:

# Monitor GPU memory
nvidia-smi -l 1

# Optimize PyTorch memory
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,expandable_segments:True

System Memory:

# Check available memory
free -h

# Monitor during development
htop

Performance Issues

Check Optimizations:

# Verify CPU flags
cat /proc/cpuinfo | grep flags

# Check compiler optimizations
echo $NIX_CFLAGS_COMPILE

# Benchmark inference
python -c "
import torch
import time
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(1000, 1000, device=device)
start = time.time()
result = torch.mm(x, x)
if device == 'cuda': torch.cuda.synchronize()  # GPU kernels run asynchronously; wait before timing
print(f'{device} time: {time.time() - start:.4f}s')
"

Advanced Usage

Custom Overlays

Create project-specific overlays in overlays/:

# overlays/custom-package.nix
final: prev: {
  custom-package = prev.python3Packages.buildPythonPackage {
    pname = "custom-package";
    version = "1.0.0";
    src = prev.fetchFromGitHub {
      owner = "owner";
      repo = "repo";
      rev = "v1.0.0";
      sha256 = "...";
    };
    propagatedBuildInputs = with prev.python3Packages; [
      numpy
      torch
    ];
  };
}

Custom Development Shells

Add new shells to the flake:

# Add to devShells
experimental = pkgs.mkShell {
  buildInputs = commonSystemPackages ++ [
    # Custom packages
  ];
  shellHook = ''
    echo "Experimental environment loaded"
    # Custom setup
  '';
};

Environment Specialization

Create environment-specific configurations:

# .env.gpu
CUDA_VISIBLE_DEVICES=0
TORCH_CUDA_ARCH_LIST=8.6

# .env.multi-gpu  
CUDA_VISIBLE_DEVICES=0,1,2,3
NCCL_DEBUG=INFO

# Load specific environment
cp .env.gpu .env
nix develop .#gpu

Contributing to the Flake

Adding New Packages

  1. Create overlay in overlays/new-package.nix
  2. Add to overlay list in nekoOverlays
  3. Include in appropriate shells
  4. Test across platforms
  5. Update documentation

Testing Changes

# Test specific shell
nix develop .#shell-name --command python -c "import new_package"

# Test all shells
for shell in default gpu ai neko docs cpu-opt gpu-opt; do
  echo "Testing $shell..."
  nix develop .#$shell --command echo "✓ $shell loads successfully"
done

# Test image builds
nix build .#neko-agent-docker-generic
nix build .#neko-agent-docker-opt

Performance Considerations

  • Binary caches - Use Cachix for custom packages
  • Layer optimization - Minimize Docker image layers
  • Dependency management - Avoid unnecessary dependencies
  • Build reproducibility - Pin package versions when needed

This flake system provides a robust, reproducible development environment that scales from local development to production deployment while maintaining consistency across platforms and use cases.

Nix Flake Python Packaging Guide

This guide explains the nuances and patterns used for packaging Python dependencies in the Neko Agent project's Nix flake. The project uses a sophisticated overlay system to package complex Python libraries that aren't available in standard Nixpkgs, with special attention to PyTorch ecosystem integration and CUDA support.

Overview

The Neko Agent project requires numerous specialized Python packages for AI/ML, WebRTC, and audio processing that either don't exist in Nixpkgs or need custom configurations. The flake uses a comprehensive overlay system to package these dependencies while maintaining compatibility with the PyTorch ecosystem.

Key Packaging Principles

  1. Torch-bin Integration - Use pre-compiled PyTorch wheels for consistency
  2. CUDA Compatibility - Ensure proper CUDA toolkit integration
  3. Dependency Override - Override transitive dependencies to use torch-bin
  4. Test Skipping - Disable problematic tests that require GPU access
  5. Version Flexibility - Patch restrictive version constraints when needed

Core Patterns

1. Basic Python Package Overlay

self: super: {
  python3Packages = super.python3Packages // {
    example-package = super.python3Packages.buildPythonPackage rec {
      pname = "example-package";
      version = "1.0.0";
      format = "pyproject";

      src = super.fetchFromGitHub {
        owner = "owner";
        repo = "example-package";
        rev = "v${version}";
        sha256 = "...";
      };

      nativeBuildInputs = with super.python3Packages; [ setuptools wheel ];
      propagatedBuildInputs = with super.python3Packages; [ numpy scipy ];
      doCheck = false;

      meta = with super.lib; {
        description = "Package description";
        homepage = "https://github.com/owner/example-package";
        license = licenses.mit;
      };
    };
  };
}

2. PyTorch Integration Pattern

Always use torch-bin and override transitive dependencies:

self: super: {
  python3Packages = super.python3Packages // {
    ml-package = super.python3Packages.buildPythonPackage rec {
      # ... basic fields ...
      
      propagatedBuildInputs = with super.python3Packages; [
        super.python3Packages."torch-bin"
        super.python3Packages."torchvision-bin"
        
        # Override packages that depend on torch
        (accelerate.override { torch = super.python3Packages."torch-bin"; })
        (torchdiffeq.override { torch = super.python3Packages."torch-bin"; })
        
        transformers
        numpy
      ];

      doCheck = false;  # Skip GPU-dependent tests
    };
  };
}

3. CUDA-Aware Package Pattern

For packages needing CUDA compilation:

self: super: 
let
  cudaPackages = super.cudaPackages_12_8;
  cuda-redist = super.symlinkJoin {
    name = "cuda-redist-${cudaPackages.cudaMajorMinorVersion}";
    paths = with cudaPackages; [
      (super.lib.getDev cuda_cccl)
      (super.lib.getDev libcublas)
      # ... more CUDA libs
    ];
  };
in
{
  python3Packages = super.python3Packages // {
    cuda-package = super.python3Packages.buildPythonPackage rec {
      # ... basic fields ...
      
      postPatch = ''
        substituteInPlace setup.py \
          --replace-fail "find_cuda_libs()" "['${cuda-redist}/lib/libcudart.so']"
      '';

      nativeBuildInputs = [ super.cmake cudaPackages.cuda_nvcc ];
      buildInputs = [ cuda-redist ];
      
      CUDA_HOME = "${cuda-redist}";
      NVCC_PREPEND_FLAGS = [ "-I${cuda-redist}/include" ];
      
      doCheck = false;
    };
  };
}

4. Version Constraint Patching

Relax overly restrictive version constraints:

postPatch = ''
  substituteInPlace pyproject.toml \
    --replace "numpy<=1.24.0" "numpy" \
    --replace "pydantic>=2.0,<2.5" "pydantic>=2.0"
'';

Real-World Examples

F5-TTS Package

Demonstrates torch overrides, version patching, and platform-specific dependencies:

self: super: {
  python3Packages = super.python3Packages // {
    f5-tts = super.python3Packages.buildPythonPackage rec {
      pname = "f5-tts";
      version = "1.1.7";
      
      src = super.fetchFromGitHub {
        owner = "SWivid";
        repo = "F5-TTS";
        rev = "main";  # NOTE: prefer a pinned tag/commit; a moving branch invalidates the fixed hash
        sha256 = "sha256-MtPyqS5aNrq929pHMlDp3HFUSf+i9xYDb5xMA0Eqh9Y=";
      };

      propagatedBuildInputs = with super.python3Packages; [
        # Torch overrides
        (accelerate.override { torch = super.python3Packages."torch-bin"; })
        (x-transformers.override { torch = super.python3Packages."torch-bin"; })
        
        # Custom overlays
        cached-path ema-pytorch vocos
        
        # Standard deps
        transformers gradio librosa soundfile
      ] ++ super.lib.optionals (!super.stdenv.isAarch64) [
        bitsandbytes  # x86_64 only
      ];

      # Version constraint fixes
      postPatch = ''
        substituteInPlace pyproject.toml \
          --replace "numpy<=1.26.4" "numpy" \
          --replace "pydantic<=2.10.6" "pydantic"
      '';
      
      doCheck = false;
    };
  };
}

Pi-Zero PyTorch

Shows complex dependency chains and overlay usage:

self: super: {
  python3Packages = super.python3Packages // {
    pi-zero-pytorch = super.python3Packages.buildPythonPackage rec {
      pname = "pi-zero-pytorch";
      version = "0.2.5";
      
      src = super.fetchFromGitHub {
        owner = "lucidrains";
        repo = "pi-zero-pytorch";
        rev = "1c13cbbcc236c3cb4fd18213c543188fdf083b33";
        sha256 = "1xl38c5spv798xydljahq8fw9nglwcw95s95zpn11cm9ra4rg4ib";
      };

      nativeBuildInputs = [ super.python3Packages.hatchling ];
      
      propagatedBuildInputs = with super.python3Packages; [
        # Torch overrides
        (accelerate.override { torch = super.python3Packages."torch-bin"; })
        
        # Custom overlays (all from overlays/)
        einx x-transformers rotary-embedding-torch
        accelerated-scan bidirectional-cross-attention
        
        # Use patched einops from self
        (self.python3Packages.einops)
        
        beartype jaxtyping scipy tqdm
        super.python3Packages."torch-bin"
      ];
      
      doCheck = false;
    };
  };
}

Advanced Techniques

Using self vs super

  • super: Access packages before overlay modifications
  • self: Access final overlaid packages (avoid infinite recursion)
propagatedBuildInputs = with super.python3Packages; [
  numpy  # Use super for standard packages
  (self.python3Packages.custom-overlay-package)  # Use self for overlay deps
  (accelerate.override { torch = super.python3Packages."torch-bin"; })
];

Conditional Dependencies

propagatedBuildInputs = with super.python3Packages; [
  numpy scipy
] ++ super.lib.optionals (!super.stdenv.isAarch64) [
  bitsandbytes  # x86_64 only
] ++ super.lib.optionals super.stdenv.isLinux [
  nvidia-ml-py  # Linux only
];

Source Fetching

# Preferred: Specific tag/commit
src = super.fetchFromGitHub {
  owner = "owner"; repo = "repo";
  tag = "v${version}";  # or rev = "commit-hash";
  hash = "sha256-...";  # Use nix-prefetch-github
};

# PyPI when available
src = super.fetchPypi {
  inherit pname version;
  hash = "sha256-...";
};

Build Systems

# Modern pyproject.toml
format = "pyproject";
nativeBuildInputs = [ super.python3Packages.hatchling ];

# Legacy setup.py  
format = "setuptools";
nativeBuildInputs = [ super.python3Packages.setuptools ];

Best Practices

Hash Management

# Get hashes for sources
nix-prefetch-github owner repo --rev v1.0.0
nix hash to-sri --type sha256 abc123...  # Convert to SRI format

Testing Overlays

# Test single package
nix-shell -p 'python3.withPackages (ps: [ ps.my-package ])' \
  --run 'python -c "import my_package"'

Dependency Management

  • Keep dependencies minimal and explicit
  • Always use torch-bin for PyTorch ecosystem
  • Override transitive torch dependencies

Quality Metadata

meta = with super.lib; {
  description = "Clear, concise description";
  homepage = "https://github.com/owner/repo";
  license = licenses.mit;
  platforms = platforms.unix;
};

Troubleshooting

Common Issues

Import Errors: Use super.python3Packages."torch-bin" not "torch"

Version Conflicts: Patch constraints with postPatch

postPatch = ''
  substituteInPlace setup.py --replace "torch==2.0.0" "torch>=2.0.0"
'';

CUDA Issues: Set proper CUDA environment

buildInputs = [ cuda-redist ];
CUDA_HOME = "${cuda-redist}";

Test Failures: Disable with doCheck = false;

Integration with Flake

Adding New Overlays

  1. Create overlay file in overlays/ directory
  2. Add to nekoOverlays list in flake.nix
  3. Include in Python environments
# In flake.nix
nekoOverlays = [
  (import ./overlays/new-package.nix)
];

# Usage in environments
pyEnvExample = pkgs.python3.withPackages (ps: [
  ps.new-package
]);

Overlay Ordering

Dependencies must come first:

nekoOverlays = [
  # Base packages first
  (import ./overlays/cached-path.nix)
  (import ./overlays/ema-pytorch.nix)
  
  # Dependent packages after
  (import ./overlays/f5-tts.nix)  # Uses ema-pytorch
  
  # Complex packages last
  (import ./overlays/pi-zero-pytorch/pi-zero-pytorch.nix)
  
  # Patches to existing packages last
  (import ./overlays/einops.nix)  # Overrides existing einops
];

Handling GPU-Heavy Test Suites

Some upstream projects (for example einops) ship notebook-based test suites that start long-running CUDA benchmarks. In CI or local builds these checks routinely exceed the sandbox timeout. Our overlays/einops.nix patch disables the notebook runner and its specific test_notebook_3 case, while still allowing the rest of the package to build reproducibly. Always prefer pinpointing the slow tests (via disabledTestPaths/disabledTests) instead of globally disabling checks so that lightweight import tests keep running.
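
A sketch of that approach (the attribute names follow nixpkgs' Python test hooks; the exact test paths and names here are illustrative):

```nix
self: super: {
  python3Packages = super.python3Packages // {
    einops = super.python3Packages.einops.overridePythonAttrs (old: {
      # Skip only the notebook-driven CUDA benchmarks; keep the rest of the suite
      disabledTestPaths = (old.disabledTestPaths or [ ]) ++ [ "tests/test_notebooks.py" ];
      disabledTests = (old.disabledTests or [ ]) ++ [ "test_notebook_3" ];
    });
  };
}
```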

This guide covers the key patterns for packaging Python dependencies in the Neko Agent project with proper PyTorch ecosystem integration and CUDA support.

Neko

Repository Cheat Sheet (Current Codebase)

  • Project Structure:
    • server/: Go backend (cmd/, internal/, pkg/, optional plugins/). Build via server/build → server/bin/neko.
    • client/: Vue 2 + TypeScript SPA (src/, public/), built to client/dist.
    • apps/: Docker app images (e.g., firefox/, chromium/, kde/, vlc/).
    • runtime/: Base image and Xorg/driver configs; basis for final images.
    • utils/: Tooling; Dockerfile generator at utils/docker/main.go.
    • webpage/: Docusaurus docs site.
    • Root: build (image builder), docker-compose.yaml (local run), config.yml (sample config).
  • Dev Commands:
    • Client dev: cd client && npm ci && npm run serve (hot-reload dev server).
    • Client build: cd client && npm run build → client/dist.
    • Client lint: cd client && npm run lint.
    • Server build: cd server && ./build (use ./build core to skip plugins) → server/bin/neko.
    • Docker images: base ./build, app ./build -a firefox, flavor ./build -f nvidia -a chromium.
    • Local run: docker compose up -d (serves ghcr.io/m1k1o/neko/firefox:latest on :8080).
  • Coding Style:
    • Indent 2 spaces (LF). Client: Prettier single quotes, no semicolons, trailing commas. Go: gofmt/go vet; package names short/lowercase; files snake_case.
  • Architecture & Entry Points:
    • SPA served as static assets from container; talks HTTP/WS on :8080.
    • server/cmd/neko/main.go → cmd.Execute(); server/cmd/root.go config/logging init; server/cmd/serve.go wires managers in order: session → member → desktop → capture → webrtc → websocket → api → http.
    • HTTP: server/internal/http/manager.go registers /api, /api/ws, /health, /metrics, static files, optional pprof.
    • REST router: server/internal/api/router.go. WebSocket: server/internal/websocket/manager.go. WebRTC: server/internal/webrtc/manager.go. Session/auth: server/internal/session, server/internal/member/*. Capture/Xorg: server/internal/capture, server/internal/desktop.
  • API Surface (summary; see server/openapi.yaml):
    • Auth: POST /api/login, POST /api/logout, GET /api/whoami, POST /api/profile.
    • Sessions: GET /api/sessions, GET/DELETE /api/sessions/{id}, POST /api/sessions/{id}/disconnect.
    • Room: GET/POST /api/room/settings, broadcast GET /api/room/broadcast, POST /api/room/broadcast/start|stop.
    • Clipboard: GET/POST /api/room/clipboard, image GET /api/room/clipboard/image.png.
    • Keyboard: GET/POST /api/room/keyboard/map, GET/POST /api/room/keyboard/modifiers.
    • Control: GET /api/room/control, POST /api/room/control/{request|release|take|give/{sessionId}|reset}.
    • Screen: GET/POST /api/room/screen, GET /api/room/screen/configurations, GET /api/room/screen/{cast.jpg|shot.jpg}.
    • Upload: POST /api/room/upload/{drop|dialog}, DELETE /api/room/upload/dialog.
    • Utility: POST /api/batch, GET /health, GET /metrics.
  • WebSocket:
    • Connect ws(s)://<host>/api/ws with cookie, Authorization: Bearer <token>, or ?token=<token>.
    • Envelope { "event": "<string>", "payload": <JSON> } with events defined under server/pkg/types/event/events.go (e.g., system/init, signal/request, control/move, clipboard/set, keyboard/map, broadcast/status, file_chooser_dialog/opened|closed).
  • Configuration Model:
    • Defaults in config.yml; override via NEKO_* envs; legacy v2 envs supported behind NEKO_LEGACY=1 and have v3 equivalents (see server/internal/config/*).

Complete REST API (v3)

Authentication

  • Methods: Cookie (NEKO_SESSION), Bearer token (Authorization: Bearer <token>), or query token (?token=<token>).
  • Default security applies to all endpoints unless explicitly noted; /health, /metrics, and /api/login do not require auth.
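
The three authentication styles map directly onto request parameters; a sketch of a helper that builds them (token values are placeholders):

```python
def auth_params(token, method="bearer"):
    """Return (headers, cookies, query) for one of Neko's three auth styles."""
    headers, cookies, query = {}, {}, {}
    if method == "bearer":
        headers["Authorization"] = f"Bearer {token}"
    elif method == "cookie":
        cookies["NEKO_SESSION"] = token
    elif method == "query":
        query["token"] = token
    else:
        raise ValueError(f"unknown auth method: {method}")
    return headers, cookies, query

# e.g. with requests: requests.get(url, headers=h, cookies=c, params=q)
```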

General

  • GET /health: Liveness probe. 200 text/plain.
  • GET /metrics: Prometheus metrics. 200 text/plain.
  • POST /api/batch: Execute multiple API requests in one call.
    • Request: [{ path: string, method: 'GET'|'POST'|'DELETE', body?: any }].
    • Response: [{ path, method, status: number, body?: any }].
  • GET /api/stats: Server/session statistics.
    • Response: { has_host, host_id, server_started_at, total_users, last_user_left_at, total_admins, last_admin_left_at }.
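
A batch call is just a JSON array of sub-requests; a small helper that builds one (the endpoint paths come from the sections on this page, the helper itself is a sketch):

```python
import json

def batch_payload(*reqs):
    """Validate and serialize sub-requests for POST /api/batch."""
    allowed = {"GET", "POST", "DELETE"}
    out = []
    for path, method, *body in reqs:
        if method not in allowed:
            raise ValueError(f"unsupported method: {method}")
        entry = {"path": path, "method": method}
        if body:
            entry["body"] = body[0]  # optional request body
        out.append(entry)
    return json.dumps(out)

payload = batch_payload(
    ("/api/room/control/request", "POST"),
    ("/api/room/keyboard/map", "POST", {"layout": "us"}),
    ("/api/sessions", "GET"),
)
```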

Current Session

  • POST /api/login: Authenticate and start a session. No auth required.
    • Request: { username, password }.
    • Response: SessionLoginResponse = SessionData plus optional { token } if cookies are disabled.
  • POST /api/logout: Terminate current session. 200.
  • GET /api/whoami: Retrieve current session info.
    • Response: SessionData.
  • POST /api/profile: Update current session’s runtime profile (no member sync).
    • Request: MemberProfile. 204.

Sessions

  • GET /api/sessions: List active sessions.
    • Response: SessionData[].
  • GET /api/sessions/{sessionId}: Get a session by id.
    • Response: SessionData. 404 if not found.
  • DELETE /api/sessions/{sessionId}: Remove a session. 204.
  • POST /api/sessions/{sessionId}/disconnect: Force disconnect a session. 204.

Room Settings

  • GET /api/room/settings: Get room settings.
    • Response: Settings.
  • POST /api/room/settings: Update room settings. 204.
    • Request: Settings.

Room Broadcast

  • GET /api/room/broadcast: Get broadcast status.
    • Response: { url?: string, is_active: boolean }.
  • POST /api/room/broadcast/start: Start RTMP broadcast.
    • Request: { url: string }. 204. Errors: 400 missing URL; 422 already broadcasting; 500 start failure.
  • POST /api/room/broadcast/stop: Stop broadcast. 204. Error: 422 not broadcasting.

Room Clipboard

  • GET /api/room/clipboard: Get clipboard text/HTML.
    • Response: { text?: string, html?: string }.
  • POST /api/room/clipboard: Set clipboard text/HTML. 204.
    • Request: { text?: string, html?: string }.
  • GET /api/room/clipboard/image.png: Get clipboard image.
    • Response: image/png binary.

Room Keyboard

  • GET /api/room/keyboard/map: Get keyboard map.
    • Response: { layout?: string, variant?: string }.
  • POST /api/room/keyboard/map: Set keyboard map. 204.
    • Request: { layout?: string, variant?: string }.
  • GET /api/room/keyboard/modifiers: Get keyboard modifiers.
    • Response: { shift?, capslock?, control?, alt?, numlock?, meta?, super?, altgr? }: boolean.
  • POST /api/room/keyboard/modifiers: Set modifiers. 204.
    • Request: same shape as response.

Room Control

  • GET /api/room/control: Get control status.
    • Response: { has_host: boolean, host_id?: string }.
  • POST /api/room/control/request: Request host control. 204. Error: 422 already has host.
  • POST /api/room/control/release: Release control. 204.
  • POST /api/room/control/take: Take control (admin/host override). 204.
  • POST /api/room/control/give/{sessionId}: Give control to session. 204.
    • Path: sessionId string. Errors: 400 target can’t host; 404 unknown session.
  • POST /api/room/control/reset: Reset control state. 204.

Room Screen

  • GET /api/room/screen: Get current screen config.
    • Response: ScreenConfiguration.
  • POST /api/room/screen: Change screen config.
    • Request/Response: ScreenConfiguration. Errors: 422 invalid config.
  • GET /api/room/screen/configurations: List available configurations.
    • Response: ScreenConfiguration[].
  • GET /api/room/screen/cast.jpg: Current screencast JPEG.
    • Response: image/jpeg. Errors: 400 screencast disabled; 500 fetch error.
  • GET /api/room/screen/shot.jpg: On-demand screenshot JPEG.
    • Query: quality integer (0–100). Response: image/jpeg. Errors: 500 create image.

Room Upload

  • POST /api/room/upload/drop: Upload files and drop at coordinates. 204.
    • multipart/form-data: x: number, y: number, files: file[].
    • Errors: 400 upload error; 500 processing error.
  • POST /api/room/upload/dialog: Upload files to active dialog. 204.
    • multipart/form-data: files: file[]. Errors: 422 no dialog.
  • DELETE /api/room/upload/dialog: Close file chooser dialog. 204. Error: 422 no dialog.

Members

  • GET /api/members: List members.
    • Query: limit?: number, offset?: number.
    • Response: MemberData[].
  • POST /api/members: Create member.
    • Request: MemberCreate.
    • Response: MemberData. Error: 422 ID exists.
  • GET /api/members/{memberId}: Get member profile.
    • Response: MemberProfile. 404 if not found.
  • POST /api/members/{memberId}: Update member profile. 204.
    • Request: MemberProfile.
  • DELETE /api/members/{memberId}: Remove member. 204.
  • POST /api/members/{memberId}/password: Update member password. 204.
    • Request: { password: string }.
  • POST /api/members_bulk/update: Bulk update member profiles. 204.
    • Request: { ids: string[], profile: MemberProfile }.
  • POST /api/members_bulk/delete: Bulk delete members. 204.
    • Request: { ids: string[] }.

Schemas (Shapes)

  • SessionData: { id: string, profile: MemberProfile, state: { is_connected: boolean, is_watching: boolean } }.
  • MemberProfile: { name?: string, is_admin?: boolean, can_login?: boolean, can_connect?: boolean, can_watch?: boolean, can_host?: boolean, can_share_media?: boolean, can_access_clipboard?: boolean, sends_inactive_cursor?: boolean, can_see_inactive_cursors?: boolean, plugins?: object }.
  • MemberData: { id: string, profile: MemberProfile }.
  • MemberCreate: { username: string, password: string, profile?: MemberProfile }.
  • Settings: { private_mode?, locked_controls?, implicit_hosting?, inactive_cursors?, merciful_reconnect?, plugins?: object }.
  • ScreenConfiguration: { width: number, height: number, rate: number }.
  • KeyboardMap: { layout?: string, variant?: string }; KeyboardModifiers: all booleans listed above.
  • BroadcastStatus: { url?: string, is_active: boolean }.
  • ClipboardText: { text?: string, html?: string }.
  • ErrorMessage: { message: string } (for 4xx/5xx bodies).
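
For a typed Python client, the core shapes above translate naturally to TypedDicts (a sketch; total=False mirrors the optional ? fields):

```python
from typing import TypedDict

class SessionState(TypedDict):
    is_connected: bool
    is_watching: bool

class MemberProfile(TypedDict, total=False):
    name: str
    is_admin: bool
    can_login: bool
    can_connect: bool
    can_watch: bool
    can_host: bool
    can_share_media: bool
    can_access_clipboard: bool
    sends_inactive_cursors: bool
    can_see_inactive_cursors: bool
    plugins: dict

class SessionData(TypedDict):
    id: str
    profile: MemberProfile
    state: SessionState

class ScreenConfiguration(TypedDict):
    width: int
    height: int
    rate: int
```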

Notes

  • Unless stated, successful POSTs return 204 No Content; GETs return JSON bodies per schema.
  • Image endpoints return binary JPEG/PNG with appropriate content types.
  • Security schemes in effect globally: Bearer, Cookie, or token query; /api/login has security disabled.

Complete WebSocket API

  • Envelope: { "event": string, "payload": any }. Connect to ws(s)://<host>/api/ws with cookie, Bearer, or ?token=.
  • Heartbeats: server sends system/heartbeat ~10s and low-level pings; clients may send client/heartbeat (no payloads).
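
Because every message shares the same envelope, encoding and decoding are each one step; a sketch (event names taken from this section):

```python
import json

def encode_event(event, payload=None):
    """Wrap an event name and payload in Neko's WebSocket envelope."""
    return json.dumps({"event": event, "payload": payload})

def decode_event(raw):
    """Unwrap an envelope into (event, payload)."""
    msg = json.loads(raw)
    return msg["event"], msg.get("payload")

# Heartbeats carry no payload
raw = encode_event("client/heartbeat")
```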

System

  • Server → Client system/init:
    • { session_id: string, control_host: { id?: string, has_host: boolean, host_id?: string }, screen_size: { width, height, rate }, sessions: { [id]: { id, profile: MemberProfile, state: SessionState } }, settings: Settings, touch_events: boolean, screencast_enabled: boolean, webrtc: { videos: string[] } }.
  • Server → Client (admins) system/admin:
    • { screen_sizes_list: ScreenSize[], broadcast_status: { is_active: boolean, url?: string } }.
  • Server → Client system/settings: { id: string, ...Settings } (actor id, new settings).
  • Client → Server system/logs: [{ level: 'debug'|'info'|'warn'|'error', fields?: object, message: string }].
  • Server → Client system/disconnect: { message: string }.

Signaling (WebRTC)

  • Client → Server signal/request: { video: { disabled?: boolean, selector?: { id: string, type: 'exact'|'best' }, auto?: boolean }, audio: { disabled?: boolean }, auto?: boolean }.
  • Server → Client signal/provide: { sdp: string, iceservers: [{ urls: string[], username?: string, credential?: string }], video: { disabled: boolean, id: string, video?: string, auto: boolean }, audio: { disabled: boolean } }.
  • Client ↔ Server signal/offer | signal/answer: { sdp: string } (direction depends on negotiation; Neko often offers first via signal/provide).
  • Client → Server signal/candidate: { candidate: string, sdpMid?: string, sdpMLineIndex?: number }.
  • Client → Server signal/video: { disabled?: boolean, selector?: { id: string, type: 'exact'|'best' }, auto?: boolean }.
  • Client → Server signal/audio: { disabled?: boolean }.
  • Server → Client signal/restart: { sdp: string } (ICE restart offer).
  • Server → Client signal/close: null payload (peer disconnected).
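A client typically needs to translate the `signal/provide` payload into an ICE server list for its WebRTC stack. A minimal sketch, assuming only the payload shape documented above (the helper and sample values are illustrative):

```python
def ice_servers(provide_payload: dict) -> list:
    """Normalize the iceservers array from a signal/provide payload
    into {urls, username?, credential?} dicts, dropping empty creds."""
    servers = []
    for s in provide_payload.get("iceservers", []):
        entry = {"urls": list(s["urls"])}
        if s.get("username"):
            entry["username"] = s["username"]
        if s.get("credential"):
            entry["credential"] = s["credential"]
        servers.append(entry)
    return servers

# Example payload following the documented shape (values are made up)
provide = {
    "sdp": "v=0 ...",
    "iceservers": [
        {"urls": ["stun:stun.l.google.com:19302"]},
        {"urls": ["turn:turn.example.com:3478"],
         "username": "u", "credential": "c"},
    ],
    "video": {"disabled": False, "id": "main", "auto": True},
    "audio": {"disabled": False},
}
servers = ice_servers(provide)
```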

Sessions

  • Server → Client session/created: { id, profile: MemberProfile, state: SessionState }.
  • Server → Client session/deleted: { id }.
  • Server → Client session/profile: { id, ...MemberProfile }.
  • Server → Client session/state: { id, is_connected, connected_since?, not_connected_since?, is_watching, watching_since?, not_watching_since? }.
  • Server → Client session/cursors: [{ id: string, cursors: { x: number, y: number }[] }] (only when settings.inactive_cursors enabled).

Control & Input

  • Server ↔ Client control/request:
    • Client → Server: no payload (request host control).
    • Server → Client (to current host): { id: string } (requester id).
  • Server → Client control/host: { id: string, has_host: boolean, host_id?: string }.
  • Client → Server control/release: no payload (only host; requires can_host).
  • Client → Server control/move: { x, y }.
  • Client → Server control/scroll: { delta_x?: number, delta_y?: number, control_key?: boolean } (legacy { x, y } supported).
  • Client → Server control/buttonpress|buttondown|buttonup: { code: uint32, x?: number, y?: number }.
  • Client → Server control/keypress|keydown|keyup: { keysym: uint32, x?: number, y?: number }.
  • Client → Server control/touchbegin|touchupdate|touchend: { touch_id: uint32, x: number, y: number, pressure: uint8 }.
  • Client → Server control/cut|control/copy|control/select_all: no payload.
  • Client → Server control/paste: { text?: string } (optionally seeds clipboard first).
  • Notes: Control requires profile.can_host and either being host or implicit hosting; clipboard ops also require profile.can_access_clipboard.
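A synthetic click is composed from the move/press/release events above. The sketch below builds the message sequence; `code=1` assumes the X11 left-button code, which is not stated in this document — verify against your server:

```python
def click_messages(x: int, y: int, code: int = 1) -> list:
    """Message sequence for one click at (x, y): move, press, release.

    code=1 is assumed to be the X11 left-button code (not confirmed
    by this document). Sender must hold host control (can_host).
    """
    return [
        {"event": "control/move", "payload": {"x": x, "y": y}},
        {"event": "control/buttondown",
         "payload": {"code": code, "x": x, "y": y}},
        {"event": "control/buttonup",
         "payload": {"code": code, "x": x, "y": y}},
    ]

msgs = click_messages(640, 360)
```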

Screen

  • Client → Server screen/set: { width, height, rate } (admin only).
  • Server → Client screen/updated: { id: string, width, height, rate }.

Clipboard

  • Client → Server clipboard/set: { text: string } (host with clipboard permission).
  • Server → Client clipboard/updated: { text: string } (sent to host on OS clipboard change).

Keyboard

  • Client → Server keyboard/map: { layout?: string, variant?: string } (host only).
  • Client → Server keyboard/modifiers: { shift?, capslock?, control?, alt?, numlock?, meta?, super?, altgr? } (host only; omitted fields unchanged).

Broadcast

  • Server → Client (admins) broadcast/status: { is_active: boolean, url?: string }.

Opaque Send Channel

  • Client → Server send/unicast: { receiver: string, subject: string, body: any } → forwarded to receiver as { sender, receiver, subject, body }.
  • Client → Server send/broadcast: { subject: string, body: any } → broadcast to others as { sender, subject, body }.

File Chooser Dialog

  • Server → Client file_chooser_dialog/opened: { id: string } (host holding dialog).
  • Server → Client file_chooser_dialog/closed: {}.

Shared Types (payload shapes)

  • MemberProfile: { name?: string, is_admin?, can_login?, can_connect?, can_watch?, can_host?, can_share_media?, can_access_clipboard?, sends_inactive_cursor?, can_see_inactive_cursors?, plugins?: object }.
  • SessionState: { is_connected: boolean, connected_since?: string, not_connected_since?: string, is_watching: boolean, watching_since?: string, not_watching_since?: string } (ISO timestamps).
  • Settings (WS): { private_mode, locked_logins, locked_controls, control_protection, implicit_hosting, inactive_cursors, merciful_reconnect, heartbeat_interval, plugins?: object }.
  • ScreenSize: { width: number, height: number, rate: number }.
  • KeyboardMap: { layout: string, variant: string }; KeyboardModifiers: booleans listed above.
  • Cursor: { x: number, y: number }.

Heartbeat & Reconnect Behavior

  • Layers of liveness
    • WebSocket ping/pong: Server sends a WS Ping every ~10s and also emits system/heartbeat in-band. Browsers auto‑reply with Pong; no client code is needed for WS Pong.
    • Client heartbeat event: Clients may send client/heartbeat at a cadence derived from settings.heartbeat_interval (default 120s). The server accepts it but does not strictly require it for liveness; it’s useful for analytics and legacy bridges.
  • Settings and defaults
    • session.heartbeat_interval (default 120s) is included in system/init.settings.heartbeat_interval for the client to display/use.
    • session.merciful_reconnect (default true): If a session is “already connected” and a new WS arrives, the server replaces the old connection instead of rejecting it. With this off, a second connection is rejected with reason “already connected”.
  • Timeouts and proxies
    • Ensure reverse proxies have read timeouts comfortably above both the WS Ping cadence (~10s) and the client heartbeat interval (≥120s). Recommended proxy_read_timeout ≥ 300s and to forward Upgrade/Connection headers for WebSocket.
    • Avoid aggressive idle timeouts on L4/L7 load balancers that terminate idle TCP flows under a few minutes; set ≥5 minutes where possible.
  • Disconnect semantics
    • If WS Ping fails or the socket errors, the server closes the connection and sends system/disconnect {message} when possible. The session is marked disconnected; clients should auto‑reconnect and renegotiate WebRTC.
    • On WebRTC transport loss, the server emits signal/close; clients should re‑issue signal/request or handle signal/restart.
  • Troubleshooting frequent drops
    • Blackouts every N seconds: increase reverse proxy proxy_read_timeout/keep‑alive timeouts; confirm no CDN/WebApp firewall is in the WS path.
    • “already connected” on reconnect: enable/keep session.merciful_reconnect=true (default), or ensure clients close prior tabs before reconnecting.
    • Mobile/background tabs: some OSes suspend WS/Pong; expect disconnects when backgrounded. The client must reconnect on resume.
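Since clients are expected to auto-reconnect after `system/disconnect` or socket errors, a capped exponential backoff is the usual pattern. A minimal sketch (the schedule and cap are illustrative choices, not Neko defaults):

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Capped exponential backoff schedule for WS reconnect attempts.
    In production, add random jitter so many clients don't reconnect
    in lockstep after a proxy restart."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]

delays = backoff_delays(6)  # → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

On each successful reconnect, reset the attempt counter and renegotiate WebRTC (`signal/request`), since the media transport does not survive the WS session.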

Legacy WebSocket Events (v2 Compatibility Proxy)

  • Enabled when legacy mode is active (see server/internal/http/manager.go, which reads the legacy setting via viper). The legacy bridge maps v3 events to v2 event names for old clients and also emits some additional compatibility messages.

  • Envelope: legacy messages also use {event, ...payload}.

  • System:

    • system/init: { locks: { login?: string, control?: string, file_transfer?: string } (resource → locking session id), implicit_hosting: boolean, file_transfer: boolean, heartbeat_interval: number }.
    • system/disconnect: { title?: string, message: string }.
    • system/error: historical; emitted by legacy paths on errors.
  • Members:

    • member/list: { members: [{ id, name, admin: boolean, muted: boolean }] }.
    • member/connected: { id, name, admin, muted }.
    • member/disconnected: { id }.
  • Signaling:

    • signal/provide: { id: string, sdp: string, lite: boolean, ice: [{ urls, username?, credential? }] }.
    • signal/offer: { sdp: string }.
    • signal/answer: { displayname: string, sdp: string }.
    • signal/candidate: { data: string } (raw ICE JSON as string).
  • Control/Admin (compatibility notifications):

    • control/locked: { id: string } (lock established).
    • control/release: { id: string }.
    • control/request: client-originated; control/requesting: notification to host; control/give: { id, target }.
    • control/clipboard: { text: string }.
    • control/keyboard: { layout?: string, capsLock?: boolean, numLock?: boolean, scrollLock?: boolean }.
    • admin/lock|unlock: { resource: 'control'|'login'|'file_transfer', id: string }.
    • admin/control|release|give: { id: string } or { id, target }.
    • admin/ban|kick|mute|unmute: { id: string } or { target: string, id: string }.
  • Chat (legacy plugin messages):

    • chat/message: send/receive { id?: string, content: string }.
    • chat/emote: send/receive { id?: string, emote: string }.
  • Filetransfer (legacy plugin messages):

    • filetransfer/list: { cwd: string, files: [{ name, size, is_dir, mtime, perms, ... }] }.
    • filetransfer/refresh: same shape as list (triggered refresh).
  • Screen/Broadcast:

    • screen/configurations: { configurations: { [idx]: { width, height, rates: { [idx]: rate } } } }.
    • screen/resolution: { id?: string, width, height, rate }.
    • screen/set: { width, height, rate }.
    • broadcast/status: { url: string, isActive: boolean }.
    • broadcast/create: { url: string }; broadcast/destroy: no payload.
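The legacy `screen/configurations` payload nests rates under indexed maps; a v3-style client bridging legacy events usually flattens it into a mode list. A sketch assuming only the payload shape shown above:

```python
def flatten_configurations(payload: dict) -> list:
    """Flatten the legacy screen/configurations map into sorted,
    de-duplicated (width, height, rate) tuples."""
    modes = set()
    for cfg in payload["configurations"].values():
        for rate in cfg["rates"].values():
            modes.add((cfg["width"], cfg["height"], rate))
    return sorted(modes)

payload = {"configurations": {
    "0": {"width": 1920, "height": 1080, "rates": {"0": 30, "1": 60}},
    "1": {"width": 1280, "height": 720, "rates": {"0": 30}},
}}
modes = flatten_configurations(payload)
```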

1. What Neko Is (Concept & Origin)

Neko (often styled n.eko) is an open‑source, self‑hosted virtual browser / remote desktop environment: you run a containerized Linux desktop with a preinstalled browser (Firefox, Chromium, etc.) on your own infrastructure; Neko streams the interactive desktop (video, audio, input) to remote clients via WebRTC, so multiple participants can watch and even take control in real time. GitHub neko.m1k1o.net heise online

The project was started by its author after the shutdown of Rabb.it; needing a reliable way to watch anime remotely with friends over limited bandwidth + unstable Discord streaming, he built a WebRTC‑based Dockerized environment so everyone could share a single browser session. This collaborative genesis still shapes Neko’s multi‑user design (shared control queue, watch‑party friendliness). GitHub heise online

Neko targets privacy, isolation, and portability: browsing happens in the container, not on the viewer’s device; host fingerprints/cookies stay server‑side; nothing persistent need touch the client unless you configure it. This “shielded browser” model is highlighted in both the docs and independent coverage (Heise), which also frames Neko as a lightweight VPN alternative for accessing internal resources without distributing full desktop access. neko.m1k1o.net heise online fossengineer.com

2. Primary Use Cases

  • Collaborative browsing & watch parties: All participants see the same live browser; host control can be passed; synchronized media playback works well because WebRTC streams the rendered video/audio from the container. GitHub neko.m1k1o.net fossengineer.com
  • Interactive presentations, workshops, remote support: Presenter drives a shared browser/desktop; participants can be granted temporary control for demos or troubleshooting. Heise specifically calls out company trainings and support scenarios. GitHub heise online fossengineer.com
  • Privacy / throwaway browsing / firewall bypass: Because traffic originates from the Neko host, users can browse sites blocked locally (subject to policy/ethics); community reports note using Neko to get around locked‑down work networks. heise online fossengineer.com Reddit
  • Web dev & cross‑browser testing in controlled envs: Spin up specific browser versions (incl. Waterfox, Tor, Chromium variants) to test sites without polluting local machines. neko.m1k1o.net neko.m1k1o.net heise online
  • Remote application streaming beyond browsers: Official images include full desktop environments (KDE, Xfce), Remmina (RDP/VNC client), VLC, and more; you can install arbitrary Linux GUI apps, turning Neko into a general remote app delivery layer. GitHub neko.m1k1o.net heise online
  • Embedding into other web properties / programmatic rooms: Docs and community guides show URL query param auth for frictionless embedding; REST API + Neko Rooms enable dynamic, ephemeral shareable sessions. neko.m1k1o.net GitHub fossengineer.com

3. High‑Level Architecture

At a high level, a Neko deployment comprises:

  • Server container(s): Run the Linux desktop + target browser/application; capture Xorg display frames + PulseAudio; encode via GStreamer; feed into WebRTC pipeline (Pion stack). GitHub neko.m1k1o.net GitHub
  • Signaling / control plane: HTTP + WebSocket endpoints manage sessions, auth, and host‑control; periodic ping/heartbeat maintain liveness (esp. behind proxies). neko.m1k1o.net neko.m1k1o.net GitHub
  • WebRTC media plane: ICE negotiation (STUN/TURN) to establish peer link(s); selectable port strategy (ephemeral range vs. UDP/TCP mux single port); optional Coturn relay for NAT‑restricted environments. neko.m1k1o.net neko.m1k1o.net GitHub
  • Client UI (served over HTTPS): Browser front‑end page that renders the stream in a canvas/video element, sends input events (mouse/keyboard), displays participant cursors, chat/plugins, and exposes host‑control queue. GitHub neko.m1k1o.net neko.m1k1o.net
  • Optional ecosystem services: REST API, Prometheus metrics exporter, plugin hooks (chat, file upload), and higher‑level orchestration projects (Neko Rooms / Apps / VPN). GitHub neko.m1k1o.net neko.m1k1o.net

4. Feature Inventory (v3 era)

  • Multi‑user concurrent session w/ host handoff + inactive cursors: Participants can join; privileges (watch / host / share media / clipboard) governed per‑member profile. neko.m1k1o.net GitHub neko.m1k1o.net
  • Audio + video streaming w/ low latency: WebRTC transport from container to clients; GStreamer capture; stream selector to adjust quality. neko.m1k1o.net GitHub neko.m1k1o.net
  • GPU acceleration modes (Intel/Nvidia flavors) & CPU builds: Select appropriate image flavor to offload encoding & improve responsiveness; GPU support maturity varies—docs caution focus currently on CPU images. neko.m1k1o.net heise online neko.m1k1o.net
  • Granular auth/authorization (admin vs user; fine‑grained caps): Role bits include can_login, can_connect, can_watch, can_host, can_share_media, can_access_clipboard, etc.; supports multiuser password split, file‑backed users, in‑memory object sets, and no‑auth (dev only). neko.m1k1o.net GitHub
  • REST API + API token (admin programmatic control) & batch HTTP: Added in v3; enables external orchestration, dynamic user provisioning, and admin operations without interactive login; API token should be short‑lived in ephemeral rooms. GitHub neko.m1k1o.net
  • Prometheus metrics & pprof profiling: Expose runtime health / performance metrics; integrate into observability stacks; profiling hooks assist tuning. GitHub neko.m1k1o.net
  • Desktop quality‑of‑life: Clipboard reworked via xclip; drag‑and‑drop & file chooser upload; touchscreen input driver; dynamic resolution via xrandr; cursor image events. GitHub neko.m1k1o.net
  • Screencast endpoints + webcam/mic passthrough (experimental): HTTP/JPEG screencast endpoints for snapshots/casting; optional upstream of user webcam/mic. GitHub neko.m1k1o.net
  • Plugin system (chat, file upload, user‑scoped plugin config map). GitHub neko.m1k1o.net neko.m1k1o.net

5. Supported Browsers / Apps / Desktops

Neko ships many tagged images; availability varies by architecture and GPU flavor. Current matrix (AMD64 strongest support): Firefox, Waterfox, Tor Browser; Chromium family incl. Google Chrome, Microsoft Edge, Brave, Vivaldi, Opera; plus Ungoogled Chromium. Additional desktop/media apps: KDE, Xfce, Remmina, VLC. ARM support exists for subsets (e.g., Brave & Vivaldi on ARM64; some lack DRM). neko.m1k1o.net heise online GitHub

Community packages (Umbrel) surface a streamlined install for home servers; Umbrel metadata shows current packaged version (3.0.4 at capture) and highlights collaboration + tunneling access patterns. apps.umbrel.com neko.m1k1o.net

6. Deployment Overview (Minimal to Advanced)

6.1 Quick Minimal Docker Run

Pull an image (e.g., Firefox flavor) and run mapping HTTP + WebRTC ports; provide screen size and user/admin passwords via env vars; share memory sized for modern browsers (e.g., 2GB). Community example docker‑compose (FOSS Engineer) shows mapping 8888:8080 plus 52000-52100/udp EPR range and NEKO_MEMBER_MULTIUSER_* passwords. fossengineer.com neko.m1k1o.net neko.m1k1o.net

6.2 Choosing Registry & Tags

Prefer GitHub Container Registry (GHCR) for stable, flavor‑specific version tags; Docker Hub hosts convenience builds of the latest dev branch (amd64). Semantic versioning (MAJOR.MINOR.PATCH) is supported; the latest tag tracks the most recent stable release—pin explicit tags for reproducibility. neko.m1k1o.net Docker Hub

6.3 Selecting Flavors (CPU vs GPU)

Image suffix selects hardware accel stack: nvidia-* for CUDA GPUs (AMD64), intel-* for VA‑API/QuickSync paths, or base CPU images. Docs caution GPU support may lag; verify in your environment. neko.m1k1o.net heise online

6.4 Architecture Match & Resource Planning

Images published for linux/amd64, arm64, arm/v7; not every browser builds on all arches; some Chromium‑derived variants require ≥2GB RAM (Heise). Check the docs availability matrix before pulling. neko.m1k1o.net heise online

6.5 Persistent State (Data Volumes)

While Neko can be run “throwaway,” you may bind‑mount config, member files, and persistent browser profiles to retain bookmarks, extensions (if policy permits), and user lists; docs show file/member providers referencing host paths (e.g., /opt/neko/members.json). neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net

7. Networking & WebRTC Ports

7.1 Why Ports Matter

WebRTC media does not traverse your HTTP reverse proxy; you must expose the negotiated media ports (or provide a TURN relay). If you only open 443, media connections will fail unless multiplexing or a relay is used. neko.m1k1o.net neko.m1k1o.net Reddit

7.2 Ephemeral UDP Port Range (EPR)

Configure NEKO_WEBRTC_EPR (e.g., 59000-59100) and expose identical host:container UDP range; don’t remap—ICE candidates must match reachable ports. neko.m1k1o.net
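Because the host and container ranges must match exactly, it is worth validating the EPR string before wiring it into both the env var and the port mapping. A small illustrative parser (not part of Neko):

```python
def parse_epr(epr: str) -> range:
    """Parse an EPR string like "59000-59100" into an inclusive port
    range. The same range must be exposed host:container with no
    remapping, or ICE candidates will advertise unreachable ports."""
    lo, hi = (int(p) for p in epr.split("-"))
    if not (0 < lo <= hi <= 65535):
        raise ValueError(f"invalid port range: {epr}")
    return range(lo, hi + 1)

ports = parse_epr("59000-59100")  # 101 UDP ports
```

Size the range to expected concurrent peers; each connected client consumes ports from it.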

7.3 UDP/TCP Multiplexing

Alternatively specify single udpmux / tcpmux ports when firewall pinholes are scarce; open both protocols for fallback where UDP blocked. neko.m1k1o.net

7.4 Public vs NAT’d IPs

Set nat1to1 when advertising a different reachable address (NAT hairpin caveats); or provide an IP retrieval URL to auto‑detect public address; otherwise ICE may hand out unroutable candidates. neko.m1k1o.net

7.5 TURN Integration

Provide STUN/TURN server JSON (frontend/back‑end separation) via env vars; example Coturn compose snippet in docs; TURN recommended when clients sit behind strict NAT/firewalls. neko.m1k1o.net neko.m1k1o.net

7.6 Real‑World Gotchas

Community reverse‑proxy thread shows mis‑set X‑Forwarded headers and missing additional port exposures leading to 502s; verifying correct WebRTC ports resolved issues for some users. Reddit neko.m1k1o.net

8. Reverse Proxy Patterns (HTTP Plane)

8.1 Enable Proxy Trust

Set server.proxy=true so Neko honors X-Forwarded-* headers (important for logging, CSRF, cookie domain/path). Docs warn to adjust WebSocket timeouts because Neko pings every ~10s and expects client heartbeat ~120s. neko.m1k1o.net neko.m1k1o.net

8.2 Traefik v2 Example

Label‑driven routing to backend 8080; integrate TLS cert resolver; ensure UDP media ports separately exposed. neko.m1k1o.net neko.m1k1o.net

8.3 Nginx Example & Header Hygiene

Minimal conf proxies HTTP + WebSocket upgrade; you may add X‑Forwarded‑For/Proto, cache bypass, and long read timeouts—legacy v2 docs show extended header set; community notes correcting X-Forwarded-Proto spelling vs “Protocol.” neko.m1k1o.net neko.m1k1o.net Reddit

8.4 Apache, Caddy, HAProxy Templates

Docs provide working snippets incl. WebSocket rewrite for Apache; one‑liner reverse_proxy for Caddy w/ auto HTTPS; HAProxy ACL routing recipe w/ timeout tuning guidance. neko.m1k1o.net neko.m1k1o.net

9. Authentication & Authorization

9.1 Member vs Session Providers

Auth split: Member Provider validates credentials + returns capability profile; Session Provider persists session state (memory/file). Single member provider active at a time. neko.m1k1o.net

9.2 Capability Flags (Granular Rights)

Per‑user profile booleans drive UI & backend enforcement: admin status; login/API; connect vs watch; host control; share media; clipboard access; send inactive cursor; see inactive cursors; plugin‑specific keys. neko.m1k1o.net GitHub

9.3 Provider Types

  • Multiuser: Two shared passwords (admin/user) generate ephemeral usernames; mirrors legacy v2 behavior.
  • File: Persistent JSON map of users → hashed (optional) passwords + profiles.
  • Object: In‑memory static list; no dup logins.
  • No‑Auth: Open guest access (testing only—danger). neko.m1k1o.net
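For the file provider, each user maps to a password plus a capability profile. The sketch below assembles such a structure; the profile keys mirror the MemberProfile schema earlier in this document, but the exact on-disk layout of members.json and the hashing scheme behind `hash: true` are assumptions — check the file provider docs before relying on this shape:

```python
import json

def member_entry(password: str, *, is_admin: bool = False,
                 can_host: bool = True) -> dict:
    """One hypothetical file-provider entry: password + profile.
    Whether the password is stored hashed depends on the provider's
    hash setting, which is applied server-side."""
    return {
        "password": password,
        "profile": {
            "is_admin": is_admin,
            "can_login": True,
            "can_connect": True,
            "can_watch": True,
            "can_host": can_host,
            "can_access_clipboard": True,
        },
    }

members = {
    "admin": member_entry("s3cret", is_admin=True),
    "viewer": member_entry("guest", can_host=False),
}
members_json = json.dumps(members, indent=2)
```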

9.4 API User Token

Separate non‑interactive admin identity for HTTP API calls; cannot join media; recommend ephemeral/rotated tokens (avoid long‑lived static in exposed rooms). neko.m1k1o.net GitHub

9.5 Cookies

Session cookie name, expiry, secure, httpOnly, and domain/path are configurable; disabling cookies falls back to a token in client local storage—less secure (XSS risk). Keep cookies enabled for production. neko.m1k1o.net

10. Security Considerations

  • Surface reduction via containerization: Browsing occurs inside an isolated container; you can discard state or run read‑only images for throwaway sessions; community privacy guides emphasize non‑retention setups. fossengineer.com GitHub
  • Transport security & certs: Terminate TLS at your reverse proxy (Traefik/Caddy/Certbot etc.); ensure WebSocket upgrades & long timeouts; see official reverse proxy examples. neko.m1k1o.net neko.m1k1o.net
  • Auth hardening: Use strong unique admin/user passwords (or file/object providers w/ hashed credentials); avoid enabling no‑auth in public deployments; scope API tokens tightly. neko.m1k1o.net neko.m1k1o.net
  • Cookie vs token leakage: Leaving cookies enabled (secure, httpOnly) prevents script access to session; disabling pushes token into JS‑accessible storage increasing exfiltration risk. neko.m1k1o.net
  • Firewalling media ports: Only expose required UDP/TCP ranges; where possible, restrict source IPs or require authenticated TURN; community reports of leaving ports closed manifest as connection failures rather than leaks—but mis‑config can open broad EPR ranges; plan network policy. neko.m1k1o.net Reddit
  • Extension install policy: Browser policies in images may block arbitrary extension installs; you must explicitly allow if you need them—reduces attack surface by default. neko.m1k1o.net neko.m1k1o.net

11. Performance & Tuning

  • Screen resolution & frame rate: NEKO_DESKTOP_SCREEN / NEKO_SCREEN env controls virtual display mode (e.g., 1920x1080@30); higher rates = more bandwidth/CPU/GPU; choose based on clients & uplink. fossengineer.com neko.m1k1o.net
  • Shared memory size: Modern Chromium‑family browsers need large /dev/shm; examples allocate shm_size: 2gb; undersizing leads to crashes. fossengineer.com neko.m1k1o.net
  • Bandwidth estimator (experimental adaptive bitrate): Optional server‑side estimator can downgrade/upgrade encodes based on measured throughput; disabled by default; numerous thresholds/backoffs tunable. neko.m1k1o.net GitHub
  • Hardware accel vs CPU encode tradeoffs: GPU flavors reduce encode latency but add driver complexity; docs call out limited support maturity; Heise notes Neko can leverage Intel/Nvidia accelerated builds. neko.m1k1o.net heise online
  • Resource guidance for Chromium variants: Heise reports ≥2GB RAM allocation recommended when running Chromium‑based browsers in containers; plan host sizing accordingly. heise online neko.m1k1o.net

12. Administration & Operations

  • Logging & Debugging: Enable debug logging via log.level=debug or env NEKO_DEBUG=1; GStreamer verbosity via GST_DEBUG; Pion debug by PION_LOG_DEBUG=all; inspect docker logs and browser dev console. neko.m1k1o.net GitHub
  • Metrics & Profiling: Prometheus metrics endpoint + pprof instrumentation introduced in v3 support operational monitoring and performance investigation. GitHub neko.m1k1o.net
  • Upgrades / Migration from v2: Config modularization in v3; backward compatibility shims but deprecated; consult v3 docs + legacy reverse proxy header diffs when migrating. GitHub neko.m1k1o.net neko.m1k1o.net
  • Embedding Auto‑Login: For kiosk/iframe use, append ?usr=<user>&pwd=<pwd> to URL to bypass login prompt for viewers—use carefully; combine w/ restricted capability profile. neko.m1k1o.net neko.m1k1o.net
  • Clipboard Behavior: When accessed over HTTPS in supported host browsers (Chromium family), Neko hides its own clipboard button, deferring to native Clipboard API integration; not a bug. neko.m1k1o.net GitHub
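The auto‑login embedding pattern above can be sketched as a small URL builder; only the usr/pwd query parameters come from the docs, the helper itself is illustrative:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def embed_url(base: str, user: str, password: str) -> str:
    """Append ?usr=&pwd= auto-login params for kiosk/iframe embedding.
    Pair this only with a restricted, low-privilege viewer profile,
    since the credentials are visible in the URL."""
    scheme, netloc, path, query, frag = urlsplit(base)
    extra = urlencode({"usr": user, "pwd": password})
    query = f"{query}&{extra}" if query else extra
    return urlunsplit((scheme, netloc, path, query, frag))

url = embed_url("https://neko.example.com/", "viewer", "guest")
# → "https://neko.example.com/?usr=viewer&pwd=guest"
```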

13. Ecosystem Projects

  • Neko Rooms: Multi‑room orchestration wrapper that spins up independent Neko instances (ephemeral or persistent) with simplified onboarding (scripts, HTTPS via Let’s Encrypt, Traefik/NGINX automation); useful when you need per‑group isolation. neko.m1k1o.net fossengineer.com Reddit
  • Neko Apps: Library of containerized app bundles beyond browsers—expands use cases to general remote Linux app streaming; complements Rooms for scaling out multi‑app catalogs. fossengineer.com neko.m1k1o.net
  • Neko VPN (experimental): Mentioned in docs nav as companion project enabling tunneled access paths; explore if you need integrated network overlay to reach internal apps through Neko. neko.m1k1o.net neko.m1k1o.net
  • Umbrel Packaging: Curated home‑server integration; one‑click install, Umbrel tunneling for remote reachability, version tracking; good for homelab / non‑Docker‑experts. apps.umbrel.com neko.m1k1o.net

14. Comparison Touchpoints

  • vs. Kasm Workspaces: Heise positions Neko as the lightweight alternative—Kasm provides full multi‑tenant workspace management & security layers but is heavier; Neko is simpler, container‑first, optimized for shared live sessions rather than individual isolated desktops (though you can run per‑user instances). heise online GitHub
  • vs. Hyperbeam API (hosted embeddable co‑browse): Neko offers a similar embeddable shared browser experience but is self‑hosted, giving you data control & on‑prem compliance; Heise explicitly calls out analogous embedding. heise online neko.m1k1o.net
  • vs. Generic Remote Desktop (VNC/NoVNC/Guacamole): WebRTC yields smoother video + audio sync and lower interactive latency compared to image‑diff or poll‑based remotes; community commentary and docs emphasize superior streaming for media/watch usage. fossengineer.com GitHub

15. Practical Config Snippets

```yaml
version: "3.4"
services:
  neko:
    image: "ghcr.io/m1k1o/neko/firefox:latest"   # pick flavor/tag
    restart: unless-stopped
    shm_size: 2gb
    ports:
      - "8080:8080"                       # HTTP / signaling
      - "59000-59100:59000-59100/udp"     # WebRTC EPR
    environment:
      NEKO_DESKTOP_SCREEN: 1920x1080@30
      NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD: ${NEKO_ADMIN:?err}
      NEKO_MEMBER_MULTIUSER_USER_PASSWORD: ${NEKO_USER:?err}
      NEKO_WEBRTC_EPR: 59000-59100
      NEKO_WEBRTC_ICELITE: 1
      NEKO_DEBUG: 0
      # optionally front/back STUN/TURN JSON:
      # NEKO_WEBRTC_ICESERVERS_FRONTEND: '[{"urls":["stun:stun.l.google.com:19302"]}]'
      # NEKO_WEBRTC_ICESERVERS_BACKEND:  '[]'
    # volumes:
    #   # mount for persistent member/session files if using the file provider
    #   # - ./data:/opt/neko
```

fossengineer.com neko.m1k1o.net neko.m1k1o.net

```nginx
server {
  listen 443 ssl http2;
  server_name neko.example.com;

  location / {
    proxy_pass http://127.0.0.1:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_cache_bypass $http_upgrade;
  }
}
```
neko.m1k1o.net neko.m1k1o.net Reddit

```yaml
# Minimal member provider switch to file-backed users with hashed passwords
member:
  provider: file
  file:
    path: "/opt/neko/members.json"
    hash: true
session:
  file: "/opt/neko/sessions.json"
  api_token: "<short-lived-random-hex>"
```

neko.m1k1o.net GitHub

16. Operational Runbook Checklist

  • Preflight: Pick image flavor + arch; allocate ≥2GB RAM (Chromium); set shm_size; open media ports (EPR or mux); decide auth provider; create strong creds. neko.m1k1o.net heise online neko.m1k1o.net
  • Launch: Compose up; confirm logs show listening on 8080 + WebRTC ports; test LAN client first; verify ICE candidates reachable (browser dev console). neko.m1k1o.net neko.m1k1o.net
  • Secure: Put behind TLS proxy; enable proxy trust; restrict ports/firewall; rotate API tokens; store hashed passwords. neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net
  • Scale / Multi‑tenant: Use Neko Rooms or orchestration (k8s, compose bundles) to spin per‑team instances; leverage REST API + metrics for automation & autoscaling triggers. GitHub neko.m1k1o.net fossengineer.com
  • Troubleshoot: Turn on debug envs; inspect GStreamer logs for encode issues; validate reverse proxy headers; check that WebRTC ports aren’t blocked (common 502 confusion). neko.m1k1o.net Reddit neko.m1k1o.net

17. Roadmap Glimpses & Future Directions

Recent release notes hint at additional session backends (Redis/Postgres), richer plugin ecosystem, and potential RDP/VNC relay modes where Neko acts as a WebRTC gateway rather than running the browser locally. Heise reports interest in direct protocol relay; docs flag “in the future” for expanded session providers. heise online neko.m1k1o.net GitHub

18. Community Lore / Field Notes

Homelabbers use Neko to co‑watch media, punch through restrictive corporate firewalls (when Neko host has outbound freedom), and expose full Linux desktops (KDE) to lightweight tablets. These anecdotes underscore why low‑latency WebRTC streaming and easy multi‑user control were prioritized. Reddit fossengineer.com GitHub

Reverse‑proxy misconfig (wrong header name, missing EPR exposure) is a recurring community stumbling block; always validate both HTTP and media planes. Reddit neko.m1k1o.net neko.m1k1o.net

Neko Browser + Playwright/CDP Integration Deep Dive

Neko (“n.eko”) is an open‑source, self‑hosted virtual browser that streams a full Linux desktop (not just a headless DOM) over WebRTC so multiple remote users can view and interactively control the same session in real time. It targets collaborative browsing, watch parties, remote support, embedded browser surfaces, and hardened “throwaway” cloud browsing where nothing persists locally. neko.m1k1o.net GitHub heise online

Unlike simple remote‑automation containers, Neko can run any Linux GUI application—browsers (Firefox, Chromium, Brave, Vivaldi, Waterfox, Tor, etc.), media players like VLC, full desktop environments (XFCE, KDE), and bespoke tools—because it captures an Xorg display and streams audio/video frames to clients via WebRTC. This breadth makes it viable for shared debugging sessions, interactive presentations, and as a privacy “jump box” into otherwise restricted networks. GitHub heise online Umbrel App Store

Multi‑user collaboration is first‑class: user roles, admin elevation, shared cursor visibility, host (keyboard/mouse) control arbitration, clipboard access, media sharing, and plugin‑scoped per‑user settings are governed by Neko’s v3 authentication system (Member + Session providers). This replaces v2’s simple dual‑password model and lets you express richer authorization matrices or plug in external identity sources. neko.m1k1o.net neko.m1k1o.net

Versioning: v3 vs v2 & Legacy Mode

Neko v3 reorganized configuration into modular namespaces (server, member, session, webrtc, desktop, capture, plugins, etc.) and introduced providers; however, v3 retains backward compatibility with v2 environment variables when NEKO_LEGACY=true is set (and some legacy features are auto‑detected). A migration table maps every major v2 var to its v3 equivalent (e.g., NEKO_SCREEN → NEKO_DESKTOP_SCREEN; NEKO_PASSWORD → NEKO_MEMBER_MULTIUSER_USER_PASSWORD; NEKO_NAT1TO1 → NEKO_WEBRTC_NAT1TO1). This is critical when modernizing older compose files (like the snippet you shared) to avoid silent fallbacks and dual‑stream cursor quirks. neko.m1k1o.net neko.m1k1o.net

Heise’s Neko 3.0 coverage underscores why migrating matters: new browser flavors (Waterfox, additional Chromium builds, ARM variants), GPU‑accelerated options, HTTP/JPEG screencast endpoints, plugin ecosystem growth, and structural config changes—all shipping under a maintained Apache‑2.0 project—mean staying current pays dividends in stability and capability. heise online GitHub

Community quick‑start guides still widely circulate v2 envs (e.g., NEKO_SCREEN, NEKO_PASSWORD, NEKO_ICELITE, NEKO_EPR), which “work” only because legacy support remains—but they obscure v3 tuning knobs and can yield performance or auth surprises (e.g., no granular per‑user policy). Use the migration mapping to upgrade; I’ll show a patched compose below. fossengineer.com neko.m1k1o.net

Authentication Model (v3)

Authentication splits into Member Provider (who are you? what can you do?) and Session Provider (state & tokens). The multiuser provider emulates v2’s “user password” + “admin password” flow; you enable it via NEKO_MEMBER_PROVIDER=multiuser, then supply NEKO_MEMBER_MULTIUSER_USER_PASSWORD and NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD, optionally overriding default per‑role capability profiles (host, watch, clipboard, etc.). For tighter control, switch to file or object providers to define fixed accounts, hashed passwords, and granular profiles; or noauth for unsecured demo setups (never production). Session storage can persist to file; API access can be separately tokenized (NEKO_SESSION_API_TOKEN). neko.m1k1o.net

When exposing Neko programmatically (embedding in an app, auto‑provisioning rooms, LLM agents), consider disabling cookies or providing short‑lived API tokens; but weigh increased XSS risk if tokens leak into client JS when cookies are off. v3 exposes cookie flags (secure, http_only, domain/path scoping) so you can harden deployment behind TLS. neko.m1k1o.net

WebRTC Transport Essentials

For smooth low‑latency A/V + input streaming you must correctly expose Neko’s WebRTC ports. Three main patterns:

  1. Ephemeral UDP Port Range (EPR) — Specify a contiguous range (e.g., 56000-56100) via NEKO_WEBRTC_EPR and map the exact same range host:container without remap. Each new participant consumes ports; size range accordingly. neko.m1k1o.net
  2. UDP/TCP Multiplexing — Collapse to a single well‑known port (e.g., 59000) as NEKO_WEBRTC_UDPMUX / NEKO_WEBRTC_TCPMUX for NAT‑challenged environments; trade throughput. neko.m1k1o.net
  3. ICE Servers — Provide STUN/TURN front/back split: NEKO_WEBRTC_ICESERVERS_FRONTEND (what clients see) and ..._BACKEND (what the server dials internally); JSON‑encoded arrays. Required when clients are off‑LAN and UDP paths are blocked. neko.m1k1o.net
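As an illustration (hostnames and credentials are hypothetical), both variables take JSON-encoded arrays of standard ICE server objects:

```shell
# hypothetical TURN/STUN endpoints; note the JSON array syntax
NEKO_WEBRTC_ICESERVERS_FRONTEND='[{"urls":["stun:stun.l.google.com:19302"]},{"urls":["turn:turn.example.com:3478"],"username":"neko","credential":"secret"}]'
NEKO_WEBRTC_ICESERVERS_BACKEND='[{"urls":["turn:10.0.0.5:3478"],"username":"neko","credential":"secret"}]'
```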

If you run behind NAT, set NEKO_WEBRTC_NAT1TO1 to the public (hairpin‑reachable) address; otherwise clients may ICE‑candidate a private IP and fail to connect. Automatic public IP fetch is available but you can override with NEKO_WEBRTC_IP_RETRIEVAL_URL. neko.m1k1o.net neko.m1k1o.net

Do not rely on your HTTP reverse proxy to relay WebRTC media. Nginx/Traefik only front the signaling/control (HTTP(S)/WS) on port 8080; actual RTP/DTLS flows use the ports you expose above and must be reachable end‑to‑end or via TURN. neko.m1k1o.net neko.m1k1o.net

Reverse Proxy & Timeouts

When fronting Neko with nginx/Traefik/etc., enable proxy trust in server config (server.proxy=true / NEKO_SERVER_PROXY=1 in v3) so real client IPs from X-Forwarded-* are honored. Neko sends WS pings ~10s; clients heartbeat ~120s—so bump proxy read timeouts accordingly or users drop during long idle automation runs. Official nginx sample shows required Upgrade/Connection headers for WebSocket upgrade; community Nginx Proxy Manager threads confirm these plus extended proxy_read_timeout and forwarded IP headers to avoid 502s and broken control channels. neko.m1k1o.net Reddit
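A minimal nginx location sketch for the signaling/control plane, consistent with the points above (the `neko:8080` upstream name is an assumption for a typical compose network):

```
location / {
    proxy_pass http://neko:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_read_timeout 300s;   # well above the ~10s WS ping / ~120s heartbeat
}
```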

Container Security / Chromium Sandboxing

Chromium inside containers often needs elevated namespaces to run its sandbox; many headless automation images either add --no-sandbox (reduced isolation) or grant --cap-add=SYS_ADMIN and supporting kernel flags so Chrome's sandbox works. Puppeteer's Docker docs call out the SYS_ADMIN requirement for their hardened image, and Neko's own v2 troubleshooting notes that forgetting SYS_ADMIN yields a black screen in Chromium variants: evidence the capability remains relevant. Decide between two postures: a secure host kernel plus SYS_ADMIN (preferred, keeps the full sandbox), or --no-sandbox with the risk accepted. The sample supervisord snippet you posted already includes --no-sandbox, so SYS_ADMIN is belt and suspenders there, but it is still recommended for stability in GPU/namespace operations. pptr.dev neko.m1k1o.net

Enabling Chrome DevTools Protocol (CDP) in Neko for Playwright

Your goal: let humans drive the streamed Neko Chromium UI and attach automation via Playwright. Playwright supports attaching to any existing Chromium instance that exposes a DevTools endpoint via chromium.connectOverCDP(endpointURL), where endpointURL can be the HTTP JSON version URL or direct WS endpoint; the returned browser exposes existing contexts/pages. Lower fidelity than full Playwright protocol, but ideal for “co‑drive” scenarios. Playwright GitHub

Once connected, you can open a raw CDPSession per page/context to send protocol commands (e.g., Runtime.evaluate, Animation.enable), mirroring the manual WebSocket probes in your test.js. This is useful for diagnostics, performance metrics, and low‑level tweaks Playwright doesn’t expose natively. Playwright Playwright

Remote Debugging Flags & Port Forward Pattern

Modern Chromium removed unrestricted --remote-debugging-address=0.0.0.0 for security; recommended practice is bind the DevTools socket to localhost within the container (e.g., --remote-debugging-port=9223), then selectively forward or reverse‑proxy to an external port (e.g., 9222) with an auth / ACL layer (nginx, socat, SSH tunnel). Your nginx‑cdp sidecar implements precisely this 9222→9223 pass‑through with WebSocket upgrade and long timeouts—aligning with guidance from the Dockerized Chromium remote debugging discussion. Stack Overflow neko.m1k1o.net

Review of Your web-agent/neko-with-playwright Compose Snippet

You posted a two‑service stack: neko (using m1k1o/neko:chromium) and an nginx-cdp sidecar that shares the neko service's network namespace via network_mode; supervisord launches Chromium with CDP flags and disables the sandbox and GPU; nginx maps host 9222 to internal 9223 to front DevTools with WebSocket keepalive/timeouts. Ports published: 52000→8080 (tcp?) and 9222 (tcp). Issues and improvements:

  • 1. Legacy Env Vars – You’re mixing v2 (NEKO_SCREEN, NEKO_PASSWORD*, NEKO_ICELITE, NEKO_NAT1TO1) in a v3 world; while legacy support exists, you lose granular control and risk double cursor streams (cursor once in video, once separate) plus awkward auth extension later. Upgrade to v3 vars (NEKO_DESKTOP_SCREEN, NEKO_MEMBER_PROVIDER=multiuser, NEKO_MEMBER_MULTIUSER_*, NEKO_WEBRTC_ICELITE, NEKO_WEBRTC_NAT1TO1). neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net

  • 2. Missing WebRTC Ports – No UDP EPR or mux port is exposed, so remote WebRTC will fail off‑box unless clients are on the container host network and fallback mechanisms kick in. Add either an EPR range mapping and NEKO_WEBRTC_EPR or UDPMUX/TCPMUX single‑port mapping. neko.m1k1o.net neko.m1k1o.net

  • 3. Public vs Private Subnet – Your custom Docker subnet 17.100.0.0/16 collides with publicly routed Apple allocations (17.0.0.0/8 owned by Apple); choose RFC1918 (e.g., 172.31.0.0/16 or 10.67.0.0/16) to avoid confusing clients seeing ICE candidates referencing real vs container ranges. Proper NAT1TO1 matters when advertising ICE addresses. neko.m1k1o.net neko.m1k1o.net

  • 4. Proxy Headers & Timeouts – Good start; ensure proxy_read_timeout is comfortably above Neko’s WebSocket ping interval (~10s)—set ≥60s or higher—and that NEKO_SERVER_PROXY=1 (or config) is set so Neko trusts forwarded IPs; align with official reverse proxy doc + community NPM thread. neko.m1k1o.net Reddit

  • 5. Chromium Capability / Sandbox – You added cap_add: SYS_ADMIN (good) and --no-sandbox (less secure). Consider removing --no-sandbox once you confirm kernel support; Neko experiences black screens without SYS_ADMIN in Chromium images; Puppeteer’s hardened image docs reinforce giving SYS_ADMIN if you want sandbox. pptr.dev neko.m1k1o.net

  • 6. Password Hygiene – Hard‑coding neko / admin is fine for testing but never production; switch to secrets or .env injection; multiuser provider makes it easy. neko.m1k1o.net neko.m1k1o.net

  • 7. NAT Hairpin & ICE Lite – You set NEKO_ICELITE=0 (full ICE) and NAT1TO1 to the container IP; if you actually need WAN access, supply your public IP or domain. ICE Lite mode is only appropriate when the server has a public reflexive address, and the official docs warn not to mix it with external ICE servers. neko.m1k1o.net

  • 8. Debug Logging – When diagnosing CDP or WebRTC handshake, enable NEKO_DEBUG=1 and optional GST_DEBUG per FAQ; huge time saver. neko.m1k1o.net

Hardened & Modernized Compose Example (v3 Vars, CDP Enabled)

Below is an updated docker-compose.yml (org‑mode src). Key changes:

  • Switched to GHCR explicit version tag (pin for reproducibility).
  • RFC1918 subnet.
  • Proper WebRTC EPR exposure.
  • v3 auth vars.
  • Proxy flag so Neko trusts sidecar.
  • Optional API token for automation mgmt.
  • Chromium started with localhost‑bound remote debugging; nginx sidecar terminates TLS (optional) & ACLs; you can env‑inject allowed upstream (e.g., ngrok tunnel).
  • Dropped --no-sandbox (commented) to prefer secure sandbox; toggle per your threat model.
  • Added healthcheck & log volumes.
version: "3.8"

x-neko-env: &neko-env
  NEKO_DESKTOP_SCREEN: 1920x1080@30
  NEKO_MEMBER_PROVIDER: multiuser
  NEKO_MEMBER_MULTIUSER_USER_PASSWORD: ${NEKO_USER_PASSWORD:-neko}
  NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD: ${NEKO_ADMIN_PASSWORD:-admin}
  NEKO_WEBRTC_EPR: 56000-56100          # match ports below
  NEKO_WEBRTC_ICELITE: "false"          # full ICE unless static public IP
  NEKO_WEBRTC_NAT1TO1: ${NEKO_PUBLIC_IP:-}      # set a literal IP; leave unset to auto-detect
  NEKO_SERVER_PROXY: "true"             # trust reverse proxy headers
  NEKO_SESSION_API_TOKEN: ${NEKO_API_TOKEN:-}   # optional; blank disables
  NEKO_DEBUG: ${NEKO_DEBUG:-0}

services:
  neko:
    image: ghcr.io/m1k1o/neko/chromium:3.0.4
    container_name: neko
    restart: unless-stopped
    shm_size: 2gb
    networks:
      proxy:
        ipv4_address: 172.31.0.3
    ports:
      - "8080:8080/tcp"                 # web / signaling
      - "56000-56100:56000-56100/udp"   # WebRTC EPR (must match env)
      - "9222:9222/tcp"                 # CDP endpoint; published here because the
                                        # sidecar shares this network namespace
    environment:
      <<: *neko-env
    cap_add:
      - SYS_ADMIN                       # required if not using --no-sandbox
    volumes:
      - neko-data:/var/lib/neko         # persistent config / sessions (bind as needed)
      - neko-logs:/var/log/neko
    configs:
      - source: supervisord_chromium
        target: /etc/neko/supervisord/chromium.conf

  nginx-cdp:
    image: nginx:alpine
    container_name: neko-cdp
    network_mode: "service:neko"        # join the neko service's network namespace
    depends_on:
      - neko
    environment:
      # restrict which hosts may speak CDP (use allowlist or auth)
      ALLOWED_CDP_ORIGIN: ${ALLOWED_CDP_ORIGIN:-127.0.0.1}
    configs:
      - source: nginx_cdp_conf
        target: /etc/nginx/conf.d/cdp.conf
    # no ports here: containers using network_mode: "service:..." cannot publish
    # ports themselves, so 9222 is published on the neko service above

networks:
  proxy:
    ipam:
      config:
        - subnet: 172.31.0.0/16

volumes:
  neko-data:
  neko-logs:

configs:
  supervisord_chromium:
    content: |
      [program:chromium]
      environment=HOME="/home/%(ENV_USER)s",USER="%(ENV_USER)s",DISPLAY="%(ENV_DISPLAY)s"
      ; add --no-sandbox to the flags below only if you drop SYS_ADMIN; keep the
      ; window visible (no --headless) so Neko can capture and stream it
      command=/usr/bin/chromium \
        --remote-debugging-port=9223 \
        --remote-debugging-address=127.0.0.1 \
        --remote-allow-origins="*" \
        --disable-web-security \
        --disable-features=VizDisplayCompositor,TranslateUI \
        --disable-extensions \
        --disable-dev-shm-usage \
        --enable-automation \
        --disable-background-timer-throttling \
        --disable-backgrounding-occluded-windows \
        --disable-renderer-backgrounding \
        --force-devtools-available \
        --disable-ipc-flooding-protection \
        --enable-blink-features=IdleDetection \
        --disable-gpu
      stopsignal=INT
      autorestart=true
      priority=800
      user=%(ENV_USER)s
      stdout_logfile=/var/log/neko/chromium.log
      stdout_logfile_maxbytes=100MB
      stdout_logfile_backups=10
      redirect_stderr=true

      [program:openbox]
      environment=HOME="/home/%(ENV_USER)s",USER="%(ENV_USER)s",DISPLAY="%(ENV_DISPLAY)s"
      command=/usr/bin/openbox --config-file /etc/neko/openbox.xml
      autorestart=true
      priority=300
      user=%(ENV_USER)s
      stdout_logfile=/var/log/neko/openbox.log
      stdout_logfile_maxbytes=100MB
      stdout_logfile_backups=10
      redirect_stderr=true

  nginx_cdp_conf:
    # note: $$ escapes $ from compose variable interpolation
    content: |
      map $$http_upgrade $$connection_upgrade {
        default upgrade;
        ''      close;
      }

      upstream chrome {
        server 127.0.0.1:9223;
        keepalive 32;
      }

      server {
        listen 9222;

        # Optional IP allowlist (simple example); extend w/ auth / mTLS as needed
        allow 127.0.0.1;
        allow ::1;
        # env-subst ALLOWED_CDP_ORIGIN could template additional allow lines
        deny all;

        location / {
          proxy_pass http://chrome;

          proxy_http_version 1.1;
          proxy_set_header Upgrade $$http_upgrade;
          proxy_set_header Connection $$connection_upgrade;

          proxy_set_header Host $$host:$$server_port;
          proxy_set_header X-Real-IP $$remote_addr;
          proxy_set_header X-Forwarded-For $$proxy_add_x_forwarded_for;

          proxy_read_timeout 86400s;
          proxy_send_timeout 86400s;
          proxy_connect_timeout 7200s;
          proxy_cache off;
          proxy_buffering off;
          proxy_max_temp_file_size 0;
          proxy_request_buffering off;

          proxy_set_header Sec-WebSocket-Extensions $$http_sec_websocket_extensions;
          proxy_set_header Sec-WebSocket-Key $$http_sec_websocket_key;
          proxy_set_header Sec-WebSocket-Version $$http_sec_websocket_version;

          proxy_socket_keepalive on;
          keepalive_timeout 300s;
          keepalive_requests 1000;
        }
      }

neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net Stack Overflow pptr.dev Reddit

Minimal .env Illustration (override at deploy)

NEKO_USER_PASSWORD=supersecretuser
NEKO_ADMIN_PASSWORD=supersecretadmin
NEKO_PUBLIC_IP=203.0.113.45        # example; or set DNS name in upstream LB/TURN
NEKO_API_TOKEN=change-me-64-hex    # .env files are not shell; pre-generate, e.g. with: openssl rand -hex 32
NEKO_DEBUG=1
ALLOWED_CDP_ORIGIN=10.0.0.0/8      # example ACL range for automation runners

neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net

Playwright Attach Script (Improved)

Key best practices: discover the browser WebSocket endpoint from /json/version; create a context if none is returned (some builds start with zero pages under the new headless mode); gracefully handle targets; optionally filter to the Neko desktop window by URL. Example:

// attach-neko.js
const { chromium } = require('playwright');

(async () => {
  const cdpHttp = process.env.NEKO_CDP_URL || 'http://localhost:9222';

  // Attach to the existing Chromium exposed by Neko's CDP proxy.
  const browser = await chromium.connectOverCDP(cdpHttp);

  // In many cases Neko's running Chromium already has a default context.
  // If none, create one.
  const defaultContext = browser.contexts().length
    ? browser.contexts()[0]
    : await browser.newContext();

  // Reuse the first existing page or open a new one.
  const page = defaultContext.pages()[0] || await defaultContext.newPage();

  await page.goto('https://example.com');
  console.log('Neko page title:', await page.title());

  // Get a raw CDP session if you want low-level control.
  const client = await page.context().newCDPSession(page);
  const version = await client.send('Browser.getVersion').catch(() => null);
  console.log('CDP Browser Version:', version);

  // Keep the browser open for human co-driving; do NOT close().
})();

Playwright Playwright GitHub

Diagnostic CDP Ping Script (Refined from Your test.js / test4.js)

Below is a leaner diagnostic that:

  1. Fetches /json/version;
  2. Opens WebSocket;
  3. Discovers targets;
  4. Attaches to first non‑extension page;
  5. Evaluates an expression;
  6. Logs failures cleanly.
// cdp-diagnostics.js
const WebSocket = require('ws');
const fetch = (...args) => import('node-fetch').then(({ default: f }) => f(...args));

(async () => {
  const base = process.env.NEKO_CDP_URL || 'http://localhost:9222';
  const version = await (await fetch(`${base}/json/version`)).json();
  const wsUrl = version.webSocketDebuggerUrl;

  const ws = new WebSocket(wsUrl, { perMessageDeflate: false });
  let id = 0;

  function send(method, params, sessionId) {
    ws.send(JSON.stringify({ id: ++id, method, params, sessionId }));
  }

  ws.on('open', () => {
    console.log('CDP connected');
    send('Target.setDiscoverTargets', { discover: true });
  });

  let firstSession;
  ws.on('message', data => {
    const msg = JSON.parse(data);
    if (msg.method === 'Target.targetCreated') {
      const t = msg.params.targetInfo;
      if (t.type === 'page' && !t.url.startsWith('chrome-extension://')) {
        send('Target.attachToTarget', { targetId: t.targetId, flatten: true });
      }
    } else if (msg.method === 'Target.attachedToTarget' && !firstSession) {
      firstSession = msg.params.sessionId;
      send('Runtime.enable', {}, firstSession);
      send('Runtime.evaluate', { expression: '1+1' }, firstSession);
    } else if (msg.id && msg.result) {
      console.log('Result', msg.id, msg.result);
    } else if (msg.error) {
      console.error('CDP Error', msg.error);
    }
  });
})();

Stack Overflow Playwright Playwright

Operational Checklist for Playwright‑Augmented Neko

| Check | Why | How to Verify |
|---|---|---|
| Chromium started with --remote-debugging-port (localhost) | Required for CDP attach; safer than 0.0.0.0 | curl http://<host>:9222/json/version returns JSON |
| CDP proxy ACL in place | Prevent hostile takeover of your shared session | Restrict IPs or auth in nginx; a test from an unauthorized host fails |
| WebRTC ports reachable | Avoid black screens / frozen video | webrtc-internals in the client; docker logs show ICE candidate errors |
| SYS_ADMIN vs --no-sandbox decision documented | Security posture clarity | Confirm container start flags; run chrome://sandbox |
| Multiuser passwords rotated | Prevent drive-by admin | Use secrets; verify login role mapping |
| Proxy timeout > heartbeat | Prevent surprise disconnects during long automation | Nginx proxy_read_timeout >= 120s |
| Debug logging toggled for incident response | Rapid triage | NEKO_DEBUG=1, GST_DEBUG=3 when needed |
neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net pptr.dev neko.m1k1o.net

Example Hybrid Workflow: Humans Steer, Agents Assist

A common pattern in agentic stacks:

  1. Human opens Neko in browser, logs in as admin (multiuser).
  2. Automation runner (Playwright script / LLM agent) attaches over CDP using service account limited by firewall.
  3. Agent performs scripted setup (login, nav, cookie seeding) then relinquishes; human sees results instantly.
  4. If a human takeover changes UI state, the agent can observe CDP events (Target/Runtime) to detect when to resume.

This model avoids re‑launching browsers and preserves session continuity Neko already streams to participants. Playwright neko.m1k1o.net GitHub

Deployment Channels & Ecosystem

You can deploy via raw Docker/Compose, room orchestration stacks (neko‑rooms), homelab bundles (Umbrel App Store), or community charts/templates; packaging often pre‑wires reverse proxy + TLS but may lag in env var updates—review and update to v3 syntax after install. Umbrel App Store GitHub neko.m1k1o.net

Troubleshooting Quick Hits

  • Black screen (cursor only) in the Chromium flavor → missing SYS_ADMIN or a misconfigured sandbox; confirm the capability or drop the sandbox flag. neko.m1k1o.net pptr.dev

  • WebRTC connect stalls / DTLS not started → exposed UDP mismatch or firewall block; check EPR mapping & NAT1TO1; review server logs at debug level. neko.m1k1o.net neko.m1k1o.net

  • Users disconnect behind proxy → heartbeat vs proxy timeout mismatch; ensure proxy_read_timeout >120s and server.proxy enabled. neko.m1k1o.net Reddit

  • CDP connect refused → nginx sidecar not up or ACL blocking; verify /json/version at 9222 and upstream 9223 reachable in container. Stack Overflow neko.m1k1o.net

  • Legacy envs ignored → upgrade to v3 names or set NEKO_LEGACY=true explicitly; review migration matrix. neko.m1k1o.net

Neko v3 WebRTC & WebSocket Control: Frame/State, Keyboard, Mouse (Cited)

TL;DR:

  • All browser control in Neko v3 is mediated over a single /api/ws WebSocket after session authentication.
  • Browser frames are not delivered directly over the WS as video; rather, the WS carries control, signaling, events, and input (mouse/keyboard) JSON, with media frames (video, audio) negotiated via WebRTC ICE as a peer connection.
  • Full workflow: REST login → WS upgrade (/api/ws) → system/init → WebRTC signal/request → ICE handshake → frames sent to client, controls sent from client.

1. Authentication Modes

| Mode | REST Call | Response | WS Upgrade Auth |
|---|---|---|---|
| Cookie (default) | POST /api/login {username, password} | Set-Cookie: NEKO_SESSION | Cookie auto-sent |
| Token (stateless) | POST /api/login for {token} | Opaque JWT/Bearer | ?token=... or Bearer header |
| Legacy (query) | (multiuser only) skip REST, ?password= | (none) | ?password in query triggers v2 |

2. WebSocket Upgrade URL (With/Without Path Prefix)

  • Mainline: wss://host[:port]/api/ws?token=<TOKEN>
  • With path-prefix: e.g. wss://proxy.example.com/neko/api/ws?token=...
  • Alt: cookies or Authorization: Bearer ... supported.
  • Legacy /ws endpoint: only if enabled or in legacy mode.

3. Connection Lifecycle

  1. Upgrade: Gorilla WS server handles /api/ws, performs token/cookie/Bearer/session check.
  2. Init: Server pushes system/init (JSON: session id, settings, role).
  3. Heartbeat: Server sends system/heartbeat; clients may reply with client/heartbeat, but liveness relies on WebSocket ping/pong.
  4. All interaction now flows over the socket: control events (keyboard/mouse), signaling (ICE, SDP), system/broadcast, errors.
  5. All client state (host, cursor, input, session, etc.) is managed by events.

4. How Media (Frames) and Control Flow

Media

  • Video and audio frames do not go over the WebSocket; they are WebRTC media streams (negotiated via signaling on WS).
  • To initiate frame streaming, send: {"event":"signal/request","payload":{"video":{},"audio":{}}}
  • Server replies with ICE candidates/SDP; client opens WebRTC peer connection.
  • Browser’s actual frames (video, audio) arrive via WebRTC MediaStream.
  • Some deployments gate input on a completed WebRTC handshake:
    1. send {"event":"signal/request","payload":{"video":{},"audio":{}}}
    2. perform SDP/ICE (offer/provide → answer; exchange candidates)
    3. only then will control events be honored.

Input: Keyboard/Mouse

Input is sent from client to server as JSON events (current protocol):

  • Pointer move: {"event":"control/move","payload":{"x":123,"y":456}}
  • Mouse scroll: {"event":"control/scroll", "payload": {"delta_x": 0, "delta_y": 120}}
  • Mouse press: {"event":"control/buttonpress","payload":{"x":123,"y":456,"code":1}}
  • Mouse down: {"event":"control/buttondown","payload":{"x":123,"y":456,"code":1}}
  • Mouse up: {"event":"control/buttonup","payload":{"x":123,"y":456,"code":1}}
    • button codes: 1=left, 2=middle, 3=right
  • Key press: {"event":"control/keypress","payload":{"keysym":65}}
  • Key down: {"event":"control/keydown","payload":{"keysym":65}}
  • Key up: {"event":"control/keyup","payload":{"keysym":65}}
    • printable chars use ASCII; control keys use X11 keysyms (e.g., Enter=65293, Esc=65307, Arrows=65361–65364)

Host arbitration:

  • Request: {"event":"control/request"}
  • Release: {"event":"control/release"}
  • Note: many servers will implicitly grant host on first valid control event, but explicit control/request is cleaner.

Coordinates expected by server are pixels. If your client uses normalized [0..1], convert and clamp before send.
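A minimal conversion helper, assuming the client tracks the remote screen size (names hypothetical):

```python
def to_pixels(nx: float, ny: float, width: int, height: int) -> tuple[int, int]:
    """Convert normalized [0..1] coordinates to clamped integer pixels."""
    x = min(max(round(nx * (width - 1)), 0), width - 1)
    y = min(max(round(ny * (height - 1)), 0), height - 1)
    return x, y

# build a move event for roughly the center of a 1920x1080 screen
x, y = to_pixels(0.5, 0.5, 1920, 1080)
event = {"event": "control/move", "payload": {"x": x, "y": y}}
```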

Avoid legacy examples that use control/click, control/mouse, control/key, or control/request_host; these events are not recognized by current servers.

Control quick reference (current)

| Action | Event | Payload schema |
|---|---|---|
| Move pointer | control/move | { "x": int, "y": int } |
| Mouse scroll | control/scroll | { "delta_x": int, "delta_y": int } |
| Mouse button press | control/buttonpress | { "x": int, "y": int, "code": 1, 2, or 3 } |
| Mouse button down/up | control/buttondown / control/buttonup | { "x": int, "y": int, "code": 1, 2, or 3 } |
| Key press | control/keypress | { "keysym": int } |
| Key down/up | control/keydown / control/keyup | { "keysym": int } |
| Request host | control/request | {} |
| Release host | control/release | {} |

Common keysyms: Enter=65293, Backspace=65288, Tab=65289, Escape=65307, Left=65361, Up=65362, Right=65363, Down=65364, Control=65507. Button codes: 1=left, 2=middle, 3=right.
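A hypothetical helper applying these rules (the keysym table mirrors the values quoted above):

```python
SPECIAL_KEYSYMS = {
    "Enter": 65293, "Backspace": 65288, "Tab": 65289, "Escape": 65307,
    "Left": 65361, "Up": 65362, "Right": 65363, "Down": 65364, "Control": 65507,
}

def keysym_for(key: str) -> int:
    """Printable ASCII maps to its code point; named control keys use X11 keysyms."""
    if key in SPECIAL_KEYSYMS:
        return SPECIAL_KEYSYMS[key]
    if len(key) == 1 and 0x20 <= ord(key) <= 0x7E:
        return ord(key)
    raise ValueError(f"no keysym known for {key!r}")

def keypress_event(key: str) -> dict:
    """Build a control/keypress message for the given key."""
    return {"event": "control/keypress", "payload": {"keysym": keysym_for(key)}}
```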

5. Production-Grade Client Implementation Notes

Building a robust programmatic client requires handling several subtle but critical aspects of the Neko protocol beyond basic event sending.

5.1 Connection Management & Reconnection

A production client must anticipate WebSocket disconnects. Simply reconnecting is insufficient. The correct pattern is a connection manager that, upon a detected drop:

  1. Completely tears down the old state: cancel all background async tasks (like loggers and signal consumers), and close the RTCPeerConnection. This prevents resource leaks and "zombie" tasks.
  2. Initiates a new connection attempt, often with exponential backoff to avoid spamming a downed server.
  3. Upon successful reconnection, rebuilds all necessary components: a new Signaler, new background tasks, and a new RTCPeerConnection for the subsequent WebRTC handshake.
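The teardown-then-reconnect loop can be sketched as follows; `connect` and `teardown` are hypothetical callables standing in for your session setup and cleanup logic:

```python
import asyncio
import random

def backoff_delays(base: float = 1.0, cap: float = 30.0, factor: float = 2.0):
    """Yield exponentially growing reconnect delays with small jitter, capped."""
    delay = base
    while True:
        yield delay + random.uniform(0, delay * 0.1)
        delay = min(delay * factor, cap)

async def run_with_reconnect(connect, teardown):
    """connect() runs one session until the socket drops; teardown() must cancel
    background tasks and close the RTCPeerConnection before the next attempt."""
    delays = backoff_delays()
    while True:
        try:
            await connect()           # returns or raises when the WS drops
        except (ConnectionError, OSError):
            pass                      # broaden the except clause as needed
        finally:
            await teardown()          # always rebuild from a clean slate
        await asyncio.sleep(next(delays))
```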

5.2 Dynamic Keyboard Mapping

While a client can maintain a static table of common X11 keysyms, the most robust approach is to listen for the keyboard/map event from the server. This event provides a JSON object mapping key names to their precise keysym values for the server's specific environment.

  • Compatibility: A client should handle both legacy ({...}) and current ({"map": {...}}) payload shapes for this event.
  • Strategy: Maintain a default, built-in keysym map but dynamically update or extend it with the values received from the server at runtime. This ensures maximum compatibility across different keyboard layouts and server versions.
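A small normalizer covering both payload shapes (names hypothetical):

```python
def parse_keyboard_map(payload: dict) -> dict[str, int]:
    """Accept both legacy ({name: keysym, ...}) and current ({"map": {...}}) shapes."""
    mapping = payload.get("map", payload)
    return {str(k): int(v) for k, v in mapping.items()}

DEFAULT_KEYSYMS = {"Enter": 65293, "Escape": 65307}

def merged_keysyms(server_payload: dict) -> dict[str, int]:
    """Start from built-in defaults, overridden by the server-provided map."""
    merged = dict(DEFAULT_KEYSYMS)
    merged.update(parse_keyboard_map(server_payload))
    return merged
```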

5.3 Liveness and Error Handling

  • Heartbeats: The server periodically sends a system/heartbeat event and also issues WebSocket pings. Clients may optionally send client/heartbeat, but connection liveness primarily depends on WebSocket ping/pong; lack of client/heartbeat alone does not cause disconnection.
  • Error Events: A client should listen for and surface all error/... events. These provide crucial feedback from the server if a sent command was malformed, rejected, or failed, which is essential for debugging.

5.4 Robust Input Handling

  • Host Control: A robust client should monitor the control/host event. If auto-host functionality is desired, the client should track its own session_id (from system/init) and automatically send a control/request if it observes that the host has been dropped (has_host: false) or given to another session.
  • Sequential Double-Click: To reliably simulate a double-click, a client should send two control/buttonpress (or down/up pairs) events sequentially, with a small delay (e.g., 50-100ms) in between. Using async.gather or sending them simultaneously can result in the server's input handler swallowing one of the events.
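A sketch of the sequential pattern with a pluggable sender (names hypothetical); the demo records events with a fake sender instead of a live WebSocket:

```python
import asyncio

async def double_click(send, x: int, y: int, code: int = 1, gap: float = 0.08):
    """Send two sequential button presses; awaiting between them avoids the
    server's input handler swallowing one of the events."""
    for _ in range(2):
        await send({"event": "control/buttonpress",
                    "payload": {"x": x, "y": y, "code": code}})
        await asyncio.sleep(gap)

# demo: record events with a fake sender
events = []
async def fake_send(evt):
    events.append(evt)

asyncio.run(double_click(fake_send, 100, 200, gap=0.0))
print(len(events))  # 2
```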

5.5 Defensive WebRTC Handshake

When parsing the signal/provide or signal/offer events, a client should be defensive about the iceservers array.

  • Strict Field Mapping: Only parse and use the fields relevant to aiortc or the target WebRTC library (typically urls, username, and credential). Ignore any extra fields to avoid errors.
  • URL Flexibility: Be prepared to handle both a single url key and a urls array.
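A defensive parser along these lines (function name hypothetical):

```python
def normalize_ice_servers(raw: list[dict]) -> list[dict]:
    """Keep only fields WebRTC libraries understand; accept url or urls keys."""
    out = []
    for entry in raw:
        urls = entry.get("urls") or entry.get("url")
        if urls is None:
            continue                      # skip malformed entries
        if isinstance(urls, str):
            urls = [urls]                 # promote single url to a list
        server = {"urls": urls}
        for key in ("username", "credential"):
            if key in entry:
                server[key] = entry[key]  # ignore any extra fields
        out.append(server)
    return out
```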

6. References and Further Reading

MosaicML Streaming (MDS) — Practical Guide for Codex Agents

This document explains what the MosaicML Streaming library is, why it’s useful for large, sharded datasets, and how to write and read MDS shards locally or to an S3-compatible store (MinIO, R2, etc.). It includes runnable patterns you can copy into data pipelines.


Why Streaming/MDS?

  • Sharded, resumable, deterministic data loading across workers/nodes, with built-in shuffling and fault tolerance. Useful when the dataset is bigger than local disk/RAM or lives in object storage.
  • MDS is the on-disk format used by Streaming: a set of shard files (e.g., shard.00000.mds[.zstd]) plus an index.json that records metadata/offsets for fast random access.

Key Concepts (10,000-ft view)

  • MDSWriter: builds shards from Python dict samples (columnar schema you define), supports compression and direct upload to remote storage.
  • StreamingDataset: reads from a local cache and/or directly from remote (S3, GCS, etc.), handling shuffling, ordering, and resumption deterministically.
  • S3-compatible endpoints: Use standard AWS creds + S3_ENDPOINT_URL to point to MinIO, Cloudflare R2, etc.

Minimal: Write an MDS dataset

from streaming import MDSWriter
import json, io, numpy as np

# 1) Define the schema: map column -> storage type.
# Supported logical types include 'bytes', 'str', numeric scalars, etc. (see docs)
columns = {
    "frame_table": "bytes",   # packed npy/npz/etc.
    "meta_json":   "str",     # utf-8 JSON string
}

# 2) Writer: local out + optional remote (s3://bucket/prefix).
# Compression & shard size are critical for throughput.
with MDSWriter(
    out="data/mds/train",                 # local staging/cache
    columns=columns,
    compression="zstd",                   # fast decode in training
    hashes="sha1",                        # integrity
    size_limit="128MB",                   # shard target
    remote="s3://my-bucket/datasets/ui-trajs"  # optional: upload shards as they close
) as w:
    for episode in make_episodes():
        # Pack the frame table as bytes (e.g., np.save to a BytesIO)
        buf = io.BytesIO()
        np.save(buf, episode["frame_table"].astype("float32"))
        sample = {
            "frame_table": buf.getvalue(),
            "meta_json":   json.dumps(episode["meta"], ensure_ascii=False),
        }
        w.write(sample)

Notes

  • remote=... enables automatic upload of completed shards to object storage while writing.
  • Prefer compression="zstd" for read-time throughput; it's commonly used in training loops.

⸻

Configure S3 / S3-Compatible Remotes

Set standard AWS creds and (for non-AWS) an endpoint override:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
# For MinIO / Cloudflare R2 / etc.:
export S3_ENDPOINT_URL="https://<your-endpoint>"   # e.g., https://<accountid>.r2.cloudflarestorage.com

S3_ENDPOINT_URL is the knob for S3-compatible providers.

⸻

Read an MDS dataset (local cache + remote)

from streaming import StreamingDataset
from torch.utils.data import DataLoader

ds = StreamingDataset(
    remote="s3://my-bucket/datasets/ui-trajs",  # remote is optional if data is fully local
    local="data/cache/ui-trajs",                # local cache directory
    shuffle=True, shuffle_seed=1337,            # deterministic across workers
)

def collate(batch):
    # Each item is a dict keyed by your columns; decode bytes here if desired
    # (e.g., np.load on "frame_table", json.loads on "meta_json").
    return batch

loader = DataLoader(ds, batch_size=8, num_workers=8, collate_fn=collate, persistent_workers=True)
for batch in loader:
    ...

	•	StreamingDataset handles shuffling/resume deterministically across workers and epochs.
	•	Local cache fills on demand; you can pin/cache shards for repeated jobs.
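Decoding is the mirror of the writer on this page. A minimal round-trip sketch using the same `frame_table`/`meta_json` columns (the decoder function name is illustrative; no S3 access is needed to verify it):

```python
import io
import json

import numpy as np

def decode_sample(item: dict) -> dict:
    """Decode one sample written with the schema above."""
    return {
        "frame_table": np.load(io.BytesIO(item["frame_table"])),
        "meta": json.loads(item["meta_json"]),
    }

# Round-trip check against a hand-built sample:
buf = io.BytesIO()
np.save(buf, np.zeros((4, 3), dtype="float32"))
sample = {
    "frame_table": buf.getvalue(),
    "meta_json": json.dumps({"task": "demo"}),
}
decoded = decode_sample(sample)
```

A function like this drops naturally into the `collate` hook shown above.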

⸻

Practical Tips
	•	Shard size: 64–256 MB is a good default for training throughput; too small increases open/seek overhead, too large hurts parallelism. Tune per infrastructure; this range aligns with the MDS design defaults.
	•	Compression: zstd balances ratio & decode speed for training (common choice in large-scale pipelines).
	•	Schema: store big blobs under bytes (e.g., numpy npz) and small metadata as str/scalars—keeps readers simple.
	•	Determinism: set shuffle_seed once for the run; Streaming takes care of global ordering across ranks.
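As a quick sanity check on the shard-size tip, a back-of-envelope shard count (the `shard_mb` knob is a hypothetical mirror of `size_limit`, and the input is assumed to be the compressed dataset size):

```python
def shard_count(dataset_gb: float, shard_mb: int = 128) -> int:
    """Approximate number of shards produced at a given size_limit:
    one shard per shard_mb of compressed data, at least one shard."""
    return max(1, round(dataset_gb * 1024 / shard_mb))

# e.g., a 50 GB compressed dataset at 128 MB shards:
n = shard_count(50)  # 400 shards
```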

⸻

FAQ

Q: Can I mix multiple datasets?
Yes—construct multiple StreamingDataset streams or concatenate datasets; Streaming supports mixing sources and remote stores. See the docs’ “main concepts” and examples.

Q: How do I resume mid-epoch?
The index and shard metadata let Streaming resume deterministically; you don’t need custom offset logic.

Q: Non-AWS S3?
Use S3_ENDPOINT_URL with your provider’s endpoint (R2/MinIO/etc.).

---

Neko Agent API Mode - Architecture Design Document

Executive Summary

This document outlines the design for a new --api mode for the Neko Agent system that exposes the existing WebRTC-based GUI automation agent as a RESTful API service. The design includes containerization, orchestration with Kubernetes, load balancing, and auto-scaling strategies to handle production workloads.

Current Architecture Analysis

Existing System (src/agent.py)

The current Neko Agent is a WebRTC-based GUI automation system with the following key components:

  • NekoAgent Class: Core agent that manages WebRTC connections and performs GUI automation
  • WebRTC Integration: Uses aiortc for peer-to-peer connections with Neko Chrome containers
  • AI Model: ShowUI-2B (Qwen2VL) for visual reasoning and action planning
  • Action Execution: Supports CLICK, INPUT, SCROLL, SWIPE, TAP, ANSWER actions
  • Signaling: WebSocket-based communication with Neko server
  • Frame Processing: Real-time video frame capture and analysis
  • Metrics: Prometheus integration for monitoring

Current Limitations for API Mode

  1. Single Session: Designed for one agent per WebRTC connection
  2. Blocking Operations: Synchronous task execution model
  3. No Request Queuing: No API endpoint abstraction
  4. Resource Management: No dynamic resource allocation
  5. Scaling Constraints: Manual deployment and management

Proposed API Mode Architecture

1. System Overview

graph TB
    subgraph "Infrastructure Layer"
        LB[Load Balancer<br/>Kubernetes]
        GW[API Gateway<br/>FastAPI]
        AM[Agent Manager<br/>Orchestrator]
    end
    
    subgraph "Processing Layer"
        TQ[Task Queue<br/>Redis/RQ]
        AP[Agent Pods<br/>Multi-Container]
    end
    
    subgraph "Execution Layer"
        NI[Neko Instances<br/>Chrome/Edge]
        GPU[GPU Resources<br/>CUDA/MIG]
    end
    
    subgraph "Agent Pod Layout"
        subgraph "Pod Components"
            NA[Neko Agent API<br/>Core Automation]
            CS[Capture Sidecar<br/>Training Data MDS]
            TS[TTS Sidecar<br/>F5-TTS + WebRTC]
        end
        SR[Shared Resources:<br/>GPU, Storage, Voices, Network]
    end
    
    LB --> GW
    GW --> AM
    GW --> TQ
    AM --> AP
    TQ --> AP
    AP --> NI
    AP --> GPU
    GPU --> AP
    
    AP -.-> NA
    AP -.-> CS
    AP -.-> TS
    NA --- SR
    CS --- SR
    TS --- SR

2. Core Components

2.1 Training Data Capture Service (src/capture.py)

Technology: Python with MosaicML Streaming, WebSocket client, HTTP polling

Purpose: Captures training data from Neko sessions for model improvement and fine-tuning

Responsibilities:

  • Real-time episode recording during agent sessions
  • Screenshot capture at configurable FPS via HTTP polling
  • Action annotation parsing from chat messages
  • MDS (MosaicML Streaming) shard generation with S3 upload
  • Episode lifecycle management with automatic timeout handling

Integration with API Mode:

  • Runs as sidecar container alongside agent instances
  • Monitors WebSocket chat for /start, /stop, and Action: commands
  • Captures synchronized video frames and action annotations
  • Outputs training episodes as ZIP archives in MDS format
  • Supports remote storage (S3) for distributed training pipelines

Key Features:

  • Episode-based Recording: Packages complete task sessions as single training samples
  • Streaming Upload: Direct-to-S3 capability with local/remote storage options
  • Frame Synchronization: Precise timestamp alignment between actions and screenshots
  • Compression: ZSTD compression for efficient storage and transfer
  • 12-Factor Compliance: Configuration via environment variables only

2.2 Text-to-Speech Service (src/yap.py)

Technology: F5-TTS, aiortc WebRTC audio, PyAV, asyncio

Purpose: Provides real-time voice synthesis and audio streaming to Neko sessions

Responsibilities:

  • F5-TTS voice synthesis with custom voice models
  • Real-time audio streaming via WebRTC
  • Voice management and hot-reloading
  • Punctuation-aware text segmentation
  • Low-latency audio pipeline with jitter buffering

Integration with API Mode:

  • Deploys as companion service for agent instances
  • WebRTC audio track injection into Neko sessions
  • Chat command interface for voice control
  • Streaming and batch TTS modes
  • Voice configuration management via JSON registry

Key Features:

  • Multi-Voice Support: Hot-reloadable voice registry with custom models
  • Streaming TTS: /yap:begin ... /yap:end for incremental synthesis
  • Low Latency: Parallel TTS workers with crossfade audio splicing
  • Voice Customization: Live rate/pitch adjustments and style parameters
  • Production Audio: 48kHz WebRTC-compatible output with jitter buffering
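The punctuation-aware segmentation mentioned above can be sketched as follows. This is an illustrative stand-in, not the actual yap.py implementation; `max_len` is a hypothetical knob:

```python
import re

def segment_text(text: str, max_len: int = 120) -> list[str]:
    """Split on sentence-final punctuation, then fold short pieces
    back together so each TTS chunk stays under max_len characters."""
    pieces = [p.strip() for p in re.split(r'(?<=[.!?;:])\s+', text) if p.strip()]
    out: list[str] = []
    for p in pieces:
        if out and len(out[-1]) + 1 + len(p) <= max_len:
            out[-1] = out[-1] + " " + p  # fold into the previous chunk
        else:
            out.append(p)
    return out
```

Short chunks keep synthesis latency low, which is what makes the parallel-worker pipeline above effective.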

2.3 API Gateway Service

Technology: FastAPI with async/await support

Responsibilities:

  • HTTP request handling and validation
  • Authentication and authorization
  • Rate limiting and quotas
  • Request/response transformation
  • API documentation (OpenAPI/Swagger)
  • Health checks and metrics

Key Endpoints:

# Core automation tasks
POST /api/v1/tasks              # Create new automation task
GET  /api/v1/tasks/{task_id}    # Get task status and results
PUT  /api/v1/tasks/{task_id}    # Update task parameters
DELETE /api/v1/tasks/{task_id}  # Cancel running task
GET  /api/v1/tasks              # List tasks with filtering
POST /api/v1/sessions           # Create persistent session

# Training data capture
POST /api/v1/capture/episodes   # Start episode recording
GET  /api/v1/capture/episodes/{episode_id}  # Get episode status
DELETE /api/v1/capture/episodes/{episode_id}  # Stop episode recording
GET  /api/v1/capture/datasets   # List available training datasets
POST /api/v1/capture/datasets/{dataset_id}/upload  # Upload to remote storage
GET  /api/v1/capture/config     # Get capture configuration
PUT  /api/v1/capture/config     # Update capture settings

# Text-to-Speech and voice
POST /api/v1/tts/speak          # Immediate text-to-speech
POST /api/v1/tts/stream/start   # Begin streaming TTS session
POST /api/v1/tts/stream/text    # Add text to streaming session
POST /api/v1/tts/stream/stop    # End streaming TTS session
DELETE /api/v1/tts/stop         # Stop all TTS and clear queue

# Voice management
GET  /api/v1/voices             # List available voices
POST /api/v1/voices             # Add new voice
GET  /api/v1/voices/{voice_id}  # Get voice details
PUT  /api/v1/voices/{voice_id}  # Update voice configuration
DELETE /api/v1/voices/{voice_id}  # Remove voice
POST /api/v1/voices/reload      # Hot-reload voice registry

# System
GET  /api/v1/health             # Health check endpoint
GET  /api/v1/metrics            # Prometheus metrics

Request/Response Format:

// POST /api/v1/tasks - Core automation task
{
  "task": "Search for Python tutorials on YouTube",
  "mode": "web",
  "max_steps": 10,
  "timeout": 300,
  "callback_url": "https://app.example.com/webhook",
  "options": {
    "screen_resolution": "1920x1080",
    "save_screenshots": true,
    "refinement_steps": 5,
    "enable_capture": true,     // Enable training data capture
    "enable_tts": true,         // Enable voice output
    "voice_id": "default"       // Voice for TTS
  }
}

// Response
{
  "task_id": "task_123456",
  "status": "queued",
  "created_at": "2024-01-15T10:30:00Z",
  "estimated_completion": "2024-01-15T10:35:00Z"
}

// POST /api/v1/capture/episodes - Start training capture
{
  "task_description": "Navigate to YouTube and search for tutorials",
  "neko_session_id": "session_abc123",
  "capture_options": {
    "fps": 2.0,
    "jpeg_quality": 85,
    "min_frames": 4,
    "timeout_seconds": 900
  }
}

// Response
{
  "episode_id": "episode_xyz789",
  "status": "recording",
  "started_at": "2024-01-15T10:30:00Z",
  "estimated_size_mb": 120
}

// POST /api/v1/tts/speak - Immediate TTS
{
  "text": "I'm now searching for Python tutorials on YouTube",
  "voice_id": "instructor",
  "params": {
    "rate": 1.0,
    "pitch": 0.0
  }
}

// Response
{
  "request_id": "tts_req_456",
  "status": "speaking",
  "estimated_duration_ms": 3500
}

// POST /api/v1/voices - Add new voice
{
  "voice_id": "instructor",
  "ref_audio": "/voices/instructor.wav",
  "ref_text": "This is an instructional voice for tutorials",
  "styles": ["calm", "explanatory"],
  "params": {
    "rate": 0.95,
    "pitch": -1.0
  }
}

// Response
{
  "voice_id": "instructor",
  "status": "registered",
  "created_at": "2024-01-15T10:30:00Z"
}
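To exercise these endpoints, a client only needs to assemble the JSON shown above and POST it. A hedged sketch of a payload builder for the proposed task endpoint (the endpoint path, field names, and `build_task_payload` helper follow this design document, not a shipped API):

```python
import json

def build_task_payload(task: str, mode: str = "web", max_steps: int = 10,
                       timeout: int = 300, **options) -> str:
    """Serialize a request body for the proposed POST /api/v1/tasks endpoint."""
    body = {"task": task, "mode": mode, "max_steps": max_steps, "timeout": timeout}
    if options:
        body["options"] = options  # e.g., enable_tts=True, voice_id="default"
    return json.dumps(body)

payload = build_task_payload(
    "Search for Python tutorials on YouTube",
    enable_capture=True,
    enable_tts=True,
)
# POST with any HTTP client, e.g.:
#   requests.post(f"{base_url}/api/v1/tasks", data=payload,
#                 headers={"Content-Type": "application/json"})
```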

2.4 Agent Manager (Orchestrator)

Technology: Python with asyncio, Kubernetes Python client

Responsibilities:

  • Agent lifecycle management
  • Resource allocation and cleanup
  • Task routing to available agents
  • Agent health monitoring
  • Scaling decisions based on queue depth

Architecture:

class AgentManager:
    def __init__(self):
        self.k8s_client = kubernetes.client.ApiClient()
        self.agent_pools = {}
        self.task_queue = RedisQueue()
        self.metrics = PrometheusMetrics()
    
    async def provision_agent(self, requirements: AgentRequirements) -> AgentInstance: ...
    async def scale_agent_pool(self, pool_name: str, replicas: int) -> None: ...
    async def route_task(self, task: Task) -> AgentInstance: ...
    async def cleanup_completed_agents(self) -> None: ...

2.5 Modified Agent Core

Changes to src/agent.py:

class APINekoAgent(NekoAgent):
    def __init__(self, *args, api_mode=True, **kwargs):
        super().__init__(*args, **kwargs)
        self.api_mode = api_mode
        self.current_task = None
        self.task_queue = asyncio.Queue()
        
    async def api_run(self) -> None:
        """API mode main loop - processes tasks from queue"""
        while not self.shutdown.is_set():
            try:
                task = await self.task_queue.get()
                self.current_task = task
                result = await self.execute_task(task)
                await self.report_result(result)
            except Exception as e:
                await self.report_error(e)
    
    async def execute_task(self, task: APITask) -> TaskResult:
        """Execute single API task and return results"""
        self.nav_task = task.description
        # Existing navigation logic with API adaptations
        return TaskResult(task_id=task.id, status="completed", ...)

3. Containerization Strategy

3.1 Multi-Stage Docker Build

# Base image with CUDA support for GPU inference
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 as base

# Python environment setup
RUN apt-get update && apt-get install -y \
    python3.11 python3-pip \
    libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Dependencies stage
FROM base as deps
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model download stage (optional optimization)
FROM deps as models
RUN python -c "from transformers import Qwen2VLForConditionalGeneration; \
    Qwen2VLForConditionalGeneration.from_pretrained('showlab/ShowUI-2B')"

# Application stage  
FROM models as app
WORKDIR /app
COPY src/ ./src/
COPY pyproject.toml ./
RUN pip install -e .

# API mode entrypoint
CMD ["python", "-m", "agent", "--api", "--port", "8000"]

Capture Service Dockerfile:

FROM python:3.11-slim as base

# System dependencies
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements-capture.txt .
RUN pip install --no-cache-dir -r requirements-capture.txt

# Application
WORKDIR /app
COPY src/capture.py ./src/
COPY pyproject.toml ./
RUN pip install -e .

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
    CMD python -c "import capture; print('ok')"

# Entry point
CMD ["python", "src/capture.py"]

TTS Service Dockerfile:

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 as base

# System dependencies for audio processing
RUN apt-get update && apt-get install -y \
    python3.11 python3-pip \
    libsndfile1 libasound2-dev \
    ffmpeg libavcodec-dev libavformat-dev \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies including F5-TTS
COPY requirements-tts.txt .
RUN pip install --no-cache-dir -r requirements-tts.txt

# Download F5-TTS models (optional optimization)
RUN python -c "from f5_tts.api import F5TTS; F5TTS()"

# Application
WORKDIR /app
COPY src/yap.py ./src/
COPY voices/ ./voices/
COPY pyproject.toml ./
RUN pip install -e .

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
    CMD python -c "from yap import Settings; s=Settings(); print('ok' if s.validate()==[] else 'fail')"

# Entry point
CMD ["python", "src/yap.py"]

3.2 Container Images

Agent Image: neko-agent-api:latest

  • Contains the modified agent with API capabilities
  • GPU-enabled for model inference
  • Health checks and graceful shutdown
  • Prometheus metrics export

Gateway Image: neko-api-gateway:latest

  • FastAPI application
  • Authentication middleware
  • Rate limiting and request validation
  • OpenAPI documentation

Manager Image: neko-agent-manager:latest

  • Kubernetes orchestration logic
  • Agent lifecycle management
  • Auto-scaling algorithms

Capture Image: neko-capture:latest

  • Training data capture service (src/capture.py)
  • MosaicML Streaming for MDS shard generation
  • S3-compatible storage support for remote upload
  • WebSocket client for Neko session monitoring
  • Screenshot polling and episode management

TTS Image: neko-yap:latest

  • Text-to-speech service (src/yap.py)
  • F5-TTS model inference with GPU acceleration
  • WebRTC audio streaming capabilities
  • Voice management and hot-reloading
  • Real-time audio pipeline with jitter buffering

Voice Registry Sidecar: neko-voices:latest

  • Shared voice model storage
  • Voice file hosting and management
  • Hot-reloadable voice configurations
  • Persistent volume for voice assets

4. Kubernetes Deployment

4.1 Namespace Organization

apiVersion: v1
kind: Namespace
metadata:
  name: neko-agents
  labels:
    app: neko-agent-system
---
# Resource quotas and limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: neko-agents-quota
  namespace: neko-agents
spec:
  hard:
    requests.nvidia.com/gpu: "20"
    limits.memory: "200Gi"
    persistentvolumeclaims: "10"

4.2 Enhanced Agent Deployment with Sidecars

apiVersion: apps/v1
kind: Deployment
metadata:
  name: neko-agent-pool
  namespace: neko-agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: neko-agent
  template:
    metadata:
      labels:
        app: neko-agent
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-t4
      volumes:
      - name: voices-storage
        persistentVolumeClaim:
          claimName: neko-voices-pvc
      - name: capture-storage
        emptyDir: {}
      containers:
      # Main agent container
      - name: neko-agent
        image: neko-agent-api:latest
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "2"
          limits:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
        env:
        - name: NEKO_API_MODE
          value: "true"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        ports:
        - containerPort: 8000
          name: api
        - containerPort: 9000
          name: metrics
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      
      # Training data capture sidecar
      - name: neko-capture
        image: neko-capture:latest
        env:
        - name: NEKO_WS
          value: "ws://localhost:8000/api/ws"
        - name: CAPTURE_OUT
          value: "/capture/mds"
        - name: CAPTURE_REMOTE
          valueFrom:
            secretKeyRef:
              name: neko-s3-config
              key: remote_uri
              optional: true
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: neko-s3-config
              key: access_key_id
              optional: true
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: neko-s3-config
              key: secret_access_key
              optional: true
        volumeMounts:
        - name: capture-storage
          mountPath: /capture
        resources:
          requests:
            memory: "512Mi"
            cpu: "0.5"
          limits:
            memory: "1Gi"
            cpu: "1"
      
      # TTS sidecar
      - name: neko-yap
        image: neko-yap:latest
        env:
        - name: YAP_WS
          value: "ws://localhost:8000/api/ws"
        - name: YAP_VOICES_DIR
          value: "/voices"
        - name: YAP_SR
          value: "48000"
        - name: YAP_PARALLEL
          value: "2"
        volumeMounts:
        - name: voices-storage
          mountPath: /voices
        resources:
          requests:
            # Kubernetes extended resources must be whole integers; to share a
            # physical GPU with the main agent, use NVIDIA time-slicing or MIG.
            nvidia.com/gpu: 1
            memory: "2Gi"
            cpu: "1"
          limits:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "2"
        livenessProbe:
          exec:
            command:
            - python
            - -c
            - "from yap import Settings; s=Settings(); exit(0 if s.validate()==[] else 1)"
          initialDelaySeconds: 45
          periodSeconds: 30

4.3 Persistent Storage for Voices

# Persistent Volume Claim for voice assets
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: neko-voices-pvc
  namespace: neko-agents
spec:
  accessModes:
    - ReadWriteMany  # Multiple pods can mount
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-client  # Use NFS or similar for shared access
---
# Secret for S3 configuration (training data upload)
apiVersion: v1
kind: Secret
metadata:
  name: neko-s3-config
  namespace: neko-agents
type: Opaque
stringData:
  remote_uri: "s3://neko-training-data/episodes"
  access_key_id: "AKIA..."
  secret_access_key: "..."
  region: "us-west-2"

4.4 Service Configuration

# Service for agent API endpoints
apiVersion: v1
kind: Service
metadata:
  name: neko-agent-service
  namespace: neko-agents
spec:
  selector:
    app: neko-agent
  ports:
  - name: api
    port: 8000
    targetPort: 8000
  - name: metrics
    port: 9000
    targetPort: 9000
  type: ClusterIP
---
# Headless service for TTS WebRTC (if needed)
apiVersion: v1
kind: Service
metadata:
  name: neko-tts-service
  namespace: neko-agents
spec:
  selector:
    app: neko-agent
  ports:
  - name: webrtc
    port: 8080
    targetPort: 8080
    protocol: UDP
  clusterIP: None

4.5 Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: neko-agent-hpa
  namespace: neko-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: neko-agent-pool
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # GPU utilization is not a built-in Resource metric (only cpu/memory are);
  # surface it as a Pods metric via the DCGM exporter + Prometheus Adapter.
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "80"
  - type: Pods
    pods:
      metric:
        name: task_queue_length
      target:
        type: AverageValue
        averageValue: "5"

5. Load Balancing and Auto-Scaling Strategy

5.1 Multi-Tier Load Balancing

Tier 1: External Load Balancer

  • Cloud provider load balancer (e.g., AWS ALB/NLB, GCP Cloud Load Balancing, Azure Load Balancer)
  • SSL termination and geographical routing
  • Health checks and failover

Tier 2: Ingress Controller

  • NGINX Ingress with session affinity
  • Rate limiting and request buffering
  • WebSocket upgrade support for agent communication

Tier 3: Service Mesh (Optional)

  • Istio for advanced traffic management
  • Circuit breaker patterns
  • Observability and tracing

5.2 Intelligent Auto-Scaling

Queue-Based Scaling:

class QueueBasedScaler:
    def __init__(self, min_replicas: int = 2, max_replicas: int = 20):
        self.queue_threshold_scale_up = 10
        self.queue_threshold_scale_down = 2
        self.avg_task_duration = 30  # seconds
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        
    def calculate_desired_replicas(self, queue_length: int, current_replicas: int) -> int:
        if queue_length > self.queue_threshold_scale_up:
            # Predictive scaling based on queue growth rate
            desired = min(current_replicas * 2, self.max_replicas)
        elif queue_length < self.queue_threshold_scale_down:
            desired = max(current_replicas // 2, self.min_replicas)
        else:
            desired = current_replicas
        return desired

GPU Utilization Scaling:

  • Monitor GPU memory usage and compute utilization
  • Scale up when average GPU utilization > 80%
  • Scale down when GPU utilization < 30% for 5 minutes

Cost-Optimized Node Selection:

  • Prefer spot instances for batch workloads
  • Use GPU node pools with automatic provisioning
  • Implement preemption handling for graceful task migration
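Preemption handling in the last bullet boils down to catching SIGTERM (sent on spot reclaim or node drain) and draining instead of dying mid-task. A minimal sketch, assuming the worker checkpoints and re-queues its in-flight task elsewhere:

```python
import signal
import threading

shutdown = threading.Event()

def _on_sigterm(signum, frame):
    # Stop pulling new tasks; the in-flight task can checkpoint and re-queue.
    shutdown.set()

signal.signal(signal.SIGTERM, _on_sigterm)

# Worker loop pattern:
#   while not shutdown.is_set():
#       task = queue.get(timeout=1)
#       ...
```

Pair this with a Kubernetes `terminationGracePeriodSeconds` long enough to cover one task checkpoint.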

5.3 Advanced Scaling Features

Predictive Scaling (2024 Best Practice):

Model Selection Rationale: For Neko Agent workloads, we recommend a Prophet + XGBoost ensemble approach due to:

  • Variable task durations (30s to 30+ minutes)
  • Bursty request patterns with business hour seasonality
  • Multi-dimensional resource requirements (CPU, GPU, memory)
  • Need for 5-15 minute prediction horizon to account for container startup times

Core Implementation:

import numpy as np
import pandas as pd
from prophet import Prophet
import xgboost as xgb
from sklearn.ensemble import VotingRegressor
from typing import Dict, List, Tuple, Optional
import asyncio
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LoadPrediction:
    timestamp: datetime
    predicted_requests: float
    predicted_gpu_hours: float
    predicted_memory_gb: float
    confidence_interval: Tuple[float, float]
    recommendation: str  # 'scale_up', 'scale_down', 'maintain'

class NekoLoadPredictor:
    """
    Ensemble model for predicting Neko Agent resource demands.
    
    Uses Prophet for seasonality/trends + XGBoost for complex patterns.
    Optimized for multi-container pods with GPU sharing constraints.
    """
    
    def __init__(self, prediction_horizon_minutes: int = 10):
        self.horizon = prediction_horizon_minutes
        self.prophet_model = None
        self.xgb_model = None
        self.ensemble_model = None
        self.feature_columns = [
            'hour_of_day', 'day_of_week', 'is_weekend', 'is_holiday',
            'queue_depth', 'avg_task_duration', 'active_pods',
            'gpu_utilization_pct', 'memory_utilization_pct',
            'requests_last_5min', 'requests_last_15min', 'requests_last_hour',
            'tts_requests_ratio', 'capture_enabled_ratio'
        ]
        self.logger = logging.getLogger(__name__)
        
    def _engineer_features(self, 
                          historical_metrics: pd.DataFrame, 
                          current_time: datetime) -> pd.DataFrame:
        """
        Engineer features for prediction model.
        
        :param historical_metrics: Time series of system metrics
        :param current_time: Current timestamp for prediction
        :return: Engineered feature DataFrame
        """
        df = historical_metrics.copy()
        
        # Time-based features
        df['hour_of_day'] = df['timestamp'].dt.hour
        df['day_of_week'] = df['timestamp'].dt.dayofweek
        df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_holiday'] = df['timestamp'].dt.date.isin([h.date() for h in self._get_holidays()]).astype(int)
        
        # Rolling window features (time-offset rolling requires a DatetimeIndex)
        df = df.set_index('timestamp', drop=False).sort_index()
        df['requests_last_5min'] = df['request_count'].rolling('5min').sum()
        df['requests_last_15min'] = df['request_count'].rolling('15min').sum()
        df['requests_last_hour'] = df['request_count'].rolling('60min').sum()
        
        # Task complexity indicators
        df['avg_task_duration'] = df['total_task_seconds'].rolling('15min').sum() / df['request_count'].rolling('15min').sum()
        df['tts_requests_ratio'] = df['tts_requests'] / df['request_count'].clip(lower=1)
        df['capture_enabled_ratio'] = df['capture_requests'] / df['request_count'].clip(lower=1)
        
        # Resource utilization
        df['gpu_utilization_pct'] = df['gpu_memory_used'] / df['gpu_memory_total'] * 100
        df['memory_utilization_pct'] = df['memory_used'] / df['memory_total'] * 100
        
        return df.fillna(0)
    
    def _get_holidays(self) -> List[datetime]:
        """Get list of holidays that might affect traffic patterns."""
        # Simplified - in production, use holidays library
        return [
            datetime(2024, 1, 1),   # New Year
            datetime(2024, 7, 4),   # Independence Day
            datetime(2024, 12, 25), # Christmas
            # Add more holidays based on user base geography
        ]
    
    async def train_models(self, historical_data: pd.DataFrame):
        """
        Train the ensemble prediction model.
        
        :param historical_data: Historical metrics with features and targets
        """
        self.logger.info("Training predictive scaling models...")
        
        # Prepare data
        df = self._engineer_features(historical_data, datetime.now())
        
        # Prophet for seasonality and trends
        prophet_df = df[['timestamp', 'request_count']].rename(
            columns={'timestamp': 'ds', 'request_count': 'y'}
        )
        
        self.prophet_model = Prophet(
            daily_seasonality=True,
            weekly_seasonality=True,
            yearly_seasonality=False,  # Need more data
            changepoint_prior_scale=0.05,  # Conservative for stability
            interval_width=0.8
        )
        
        # Add business hours regressor
        prophet_df['business_hours'] = (
            (prophet_df['ds'].dt.hour >= 9) & 
            (prophet_df['ds'].dt.hour < 17) & 
            (prophet_df['ds'].dt.dayofweek < 5)
        ).astype(int)
        
        self.prophet_model.add_regressor('business_hours')
        self.prophet_model.fit(prophet_df)
        
        # XGBoost for complex patterns
        feature_df = df[self.feature_columns].fillna(0)
        target_requests = df['request_count'].values
        target_gpu_hours = df['gpu_hours_consumed'].values
        target_memory = df['memory_gb_peak'].values
        
        self.xgb_model = {
            'requests': xgb.XGBRegressor(
                n_estimators=100,
                max_depth=6,
                learning_rate=0.1,
                subsample=0.8,
                random_state=42
            ),
            'gpu_hours': xgb.XGBRegressor(
                n_estimators=100,
                max_depth=6,
                learning_rate=0.1,
                subsample=0.8,
                random_state=42
            ),
            'memory_gb': xgb.XGBRegressor(
                n_estimators=100,
                max_depth=6,
                learning_rate=0.1,
                subsample=0.8,
                random_state=42
            )
        }
        
        # Train XGBoost models
        self.xgb_model['requests'].fit(feature_df, target_requests)
        self.xgb_model['gpu_hours'].fit(feature_df, target_gpu_hours)
        self.xgb_model['memory_gb'].fit(feature_df, target_memory)
        
        self.logger.info("Model training completed successfully")
    
    async def predict_load(self, 
                          current_metrics: Dict[str, float], 
                          prediction_time: datetime) -> LoadPrediction:
        """
        Predict future load using ensemble model.
        
        :param current_metrics: Current system metrics
        :param prediction_time: Time to predict for
        :return: Load prediction with confidence intervals
        """
        if not (self.prophet_model and self.xgb_model):
            raise ValueError("Models not trained. Call train_models() first.")
        
        # Prophet prediction (seasonal/trend)
        future_df = pd.DataFrame({
            'ds': [prediction_time],
            'business_hours': [
                1 if (9 <= prediction_time.hour < 17 and prediction_time.weekday() < 5) else 0
            ]
        })
        
        prophet_forecast = self.prophet_model.predict(future_df)
        prophet_requests = max(0, prophet_forecast['yhat'].iloc[0])
        confidence_lower = prophet_forecast['yhat_lower'].iloc[0]
        confidence_upper = prophet_forecast['yhat_upper'].iloc[0]
        
        # XGBoost prediction (complex patterns)
        current_features = pd.DataFrame([{
            'hour_of_day': prediction_time.hour,
            'day_of_week': prediction_time.weekday(),
            'is_weekend': 1 if prediction_time.weekday() >= 5 else 0,
            'is_holiday': 1 if prediction_time.date() in [h.date() for h in self._get_holidays()] else 0,
            'queue_depth': current_metrics.get('queue_depth', 0),
            'avg_task_duration': current_metrics.get('avg_task_duration', 180),
            'active_pods': current_metrics.get('active_pods', 1),
            'gpu_utilization_pct': current_metrics.get('gpu_utilization_pct', 0),
            'memory_utilization_pct': current_metrics.get('memory_utilization_pct', 0),
            'requests_last_5min': current_metrics.get('requests_last_5min', 0),
            'requests_last_15min': current_metrics.get('requests_last_15min', 0),
            'requests_last_hour': current_metrics.get('requests_last_hour', 0),
            'tts_requests_ratio': current_metrics.get('tts_requests_ratio', 0.3),
            'capture_enabled_ratio': current_metrics.get('capture_enabled_ratio', 0.8),
        }])
        
        xgb_requests = max(0, self.xgb_model['requests'].predict(current_features)[0])
        xgb_gpu_hours = max(0, self.xgb_model['gpu_hours'].predict(current_features)[0])
        xgb_memory_gb = max(0, self.xgb_model['memory_gb'].predict(current_features)[0])
        
        # Ensemble: weight Prophet (seasonality) and XGBoost (patterns).
        # Favor Prophet on weekdays; lean more on XGBoost for weekends.
        prophet_weight = 0.7 if current_features['is_weekend'].iloc[0] == 0 else 0.4
        xgb_weight = 1 - prophet_weight
        
        final_requests = prophet_weight * prophet_requests + xgb_weight * xgb_requests
        
        # Generate scaling recommendation
        current_capacity = current_metrics.get('active_pods', 1) * current_metrics.get('pods_max_requests_per_hour', 20)
        recommendation = self._generate_recommendation(
            predicted_load=final_requests,
            current_capacity=current_capacity,
            current_utilization=current_metrics.get('current_utilization_pct', 0)
        )
        
        return LoadPrediction(
            timestamp=prediction_time,
            predicted_requests=final_requests,
            predicted_gpu_hours=xgb_gpu_hours,
            predicted_memory_gb=xgb_memory_gb,
            confidence_interval=(confidence_lower, confidence_upper),
            recommendation=recommendation
        )
    
    def _generate_recommendation(self, 
                               predicted_load: float, 
                               current_capacity: float,
                               current_utilization: float) -> str:
        """Generate scaling recommendation based on prediction."""
        utilization_threshold_up = 80.0    # Scale up if predicted > 80%
        utilization_threshold_down = 30.0  # Scale down if predicted < 30%
        
        predicted_utilization = (predicted_load / current_capacity) * 100
        
        # Conservative scaling to avoid thrashing
        if predicted_utilization > utilization_threshold_up and current_utilization > 60:
            return 'scale_up'
        elif predicted_utilization < utilization_threshold_down and current_utilization < 40:
            return 'scale_down'
        else:
            return 'maintain'

class PredictiveScaler:
    """
    Production-ready predictive scaler for Neko Agent pods.
    """
    
    def __init__(self, 
                 k8s_client,
                 namespace: str = "neko-agents",
                 deployment_name: str = "neko-agent-pool"):
        self.predictor = NekoLoadPredictor()
        self.k8s_client = k8s_client
        self.namespace = namespace
        self.deployment_name = deployment_name
        self.logger = logging.getLogger(__name__)
        self.scaling_history = []
        self.min_replicas = 2
        self.max_replicas = 50
        
    async def collect_metrics(self) -> Dict[str, float]:
        """Collect current system metrics for prediction."""
        # This would integrate with Prometheus/Grafana
        # Simplified example:
        return {
            'queue_depth': await self._get_queue_depth(),
            'active_pods': await self._get_active_pods(),
            'gpu_utilization_pct': await self._get_gpu_utilization(),
            'memory_utilization_pct': await self._get_memory_utilization(),
            'requests_last_5min': await self._get_recent_requests(5),
            'requests_last_15min': await self._get_recent_requests(15),
            'requests_last_hour': await self._get_recent_requests(60),
            'avg_task_duration': await self._get_avg_task_duration(),
            'tts_requests_ratio': await self._get_tts_ratio(),
            'capture_enabled_ratio': await self._get_capture_ratio(),
        }
    
    async def predict_and_scale(self):
        """Main prediction and scaling loop."""
        try:
            current_metrics = await self.collect_metrics()
            prediction_time = datetime.now() + timedelta(minutes=10)
            
            prediction = await self.predictor.predict_load(current_metrics, prediction_time)
            
            self.logger.info(
                f"Load prediction: {prediction.predicted_requests:.1f} requests, "
                f"recommendation: {prediction.recommendation}"
            )
            
            if prediction.recommendation == 'scale_up':
                await self._scale_up(prediction)
            elif prediction.recommendation == 'scale_down':
                await self._scale_down(prediction)
            
            # Record scaling decision for model improvement
            self.scaling_history.append({
                'timestamp': datetime.now(),
                'prediction': prediction,
                'action_taken': prediction.recommendation,
                'current_metrics': current_metrics
            })
            
        except Exception as e:
            self.logger.error(f"Predictive scaling error: {e}")
    
    async def _scale_up(self, prediction: LoadPrediction):
        """Scale up agent pods based on prediction."""
        current_replicas = await self._get_current_replicas()
        
        # Calculate required replicas with safety margin
        requests_per_pod_per_hour = 20  # Conservative estimate
        required_replicas = max(
            current_replicas + 1,  # Minimum increment
            int(np.ceil(prediction.predicted_requests / requests_per_pod_per_hour * 1.2))  # 20% safety margin
        )
        
        target_replicas = min(required_replicas, self.max_replicas)
        
        if target_replicas > current_replicas:
            await self._update_replicas(target_replicas)
            self.logger.info(f"Scaled up from {current_replicas} to {target_replicas} replicas")
    
    async def _scale_down(self, prediction: LoadPrediction):
        """Scale down agent pods based on prediction."""
        current_replicas = await self._get_current_replicas()
        
        # Conservative scale-down to avoid thrashing
        if len(self.scaling_history) > 0:
            recent_scale_up = any(
                h['action_taken'] == 'scale_up' 
                for h in self.scaling_history[-3:]  # Last 3 decisions
            )
            if recent_scale_up:
                self.logger.info("Skipping scale-down due to recent scale-up")
                return
        
        requests_per_pod_per_hour = 15  # More conservative for scale-down
        required_replicas = max(
            self.min_replicas,
            int(np.ceil(prediction.predicted_requests / requests_per_pod_per_hour))
        )
        
        target_replicas = max(required_replicas, current_replicas - 1)  # Scale down by at most one replica
        
        if target_replicas < current_replicas:
            await self._update_replicas(target_replicas)
            self.logger.info(f"Scaled down from {current_replicas} to {target_replicas} replicas")
    
    async def _get_current_replicas(self) -> int:
        """Get current number of replicas from Kubernetes."""
        # Implementation would use Kubernetes API
        pass
    
    async def _update_replicas(self, target_replicas: int):
        """Update deployment replica count."""
        # Implementation would use Kubernetes API
        pass
    
    # Additional helper methods for metrics collection...
    async def _get_queue_depth(self) -> float:
        # Redis queue length
        pass
    
    async def _get_gpu_utilization(self) -> float:
        # NVIDIA DCGM metrics
        pass

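The scaler above still needs something to drive it periodically. A minimal sketch of that wiring follows; the `scaling_loop` name, the 5-minute cadence, and the `iterations` bound (there only to make the loop testable) are illustrative choices, not part of the classes above:

```python
import asyncio
from typing import Optional

async def scaling_loop(scaler, interval_s: float = 300.0,
                       iterations: Optional[int] = None) -> int:
    """Call scaler.predict_and_scale() every interval_s seconds.

    iterations bounds the loop (useful for tests); None runs forever.
    Returns the number of completed iterations.
    """
    count = 0
    while iterations is None or count < iterations:
        await scaler.predict_and_scale()
        count += 1
        if iterations is None or count < iterations:
            await asyncio.sleep(interval_s)
    return count
```

In production this would be the pod's main coroutine, started once at boot alongside the metrics server.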
Model Training Pipeline:

class ModelTrainingPipeline:
    """
    Automated training pipeline for predictive scaling models.
    Runs weekly to incorporate new patterns and seasonal changes.
    """
    
    def __init__(self, prometheus_client, model_storage_path: str):
        self.prometheus = prometheus_client
        self.storage_path = model_storage_path
        self.logger = logging.getLogger(__name__)
    
    async def collect_training_data(self, days_back: int = 30) -> pd.DataFrame:
        """Collect historical metrics from Prometheus for training."""
        end_time = datetime.now()
        start_time = end_time - timedelta(days=days_back)
        
        queries = {
            'request_count': 'sum(rate(neko_api_requests_total[5m])) * 300',
            'gpu_hours_consumed': 'sum(neko_gpu_utilization_percent) / 100 * 5/60',
            'memory_gb_peak': 'max(neko_memory_usage_bytes) / 1024/1024/1024',
            'total_task_seconds': 'sum(neko_task_duration_seconds)',
            'tts_requests': 'sum(rate(neko_tts_requests_total[5m])) * 300',
            'capture_requests': 'sum(rate(neko_capture_episodes_total[5m])) * 300',
            # Add more metrics as needed
        }
        
        historical_data = []
        current_time = start_time
        
        while current_time < end_time:
            row = {'timestamp': current_time}
            for metric_name, query in queries.items():
                try:
                    result = await self.prometheus.query(query, time=current_time)
                    row[metric_name] = float(result[0]['value'][1]) if result else 0.0
                except Exception as e:
                    self.logger.warning(f"Failed to get {metric_name}: {e}")
                    row[metric_name] = 0.0
            
            historical_data.append(row)
            current_time += timedelta(minutes=5)
        
        return pd.DataFrame(historical_data)
    
    async def train_and_evaluate(self):
        """Train models and evaluate performance."""
        self.logger.info("Starting model training pipeline...")
        
        # Collect data
        training_data = await self.collect_training_data()
        
        # Split train/test (last 20% for testing)
        split_idx = int(len(training_data) * 0.8)
        train_data = training_data[:split_idx]
        test_data = training_data[split_idx:]
        
        # Train models
        predictor = NekoLoadPredictor()
        await predictor.train_models(train_data)
        
        # Evaluate performance
        mae_scores = []
        mape_scores = []
        
        for _, row in test_data.iterrows():
            if row.name % 12 == 0:  # Every hour
                current_metrics = {
                    'queue_depth': 0,  # Simplified for evaluation
                    'active_pods': 3,
                    'gpu_utilization_pct': 50,
                    # ... other metrics
                }
                
                prediction = await predictor.predict_load(
                    current_metrics, 
                    row['timestamp'] + timedelta(minutes=10)
                )
                
                actual = row['request_count']
                predicted = prediction.predicted_requests
                
                mae_scores.append(abs(actual - predicted))
                if actual > 0:
                    mape_scores.append(abs(actual - predicted) / actual * 100)
        
        mae = np.mean(mae_scores)
        mape = np.mean(mape_scores)
        
        self.logger.info(f"Model evaluation - MAE: {mae:.2f}, MAPE: {mape:.2f}%")
        
        # Save model if performance is acceptable
        if mape < 20:  # 20% MAPE threshold
            await self._save_model(predictor)
            self.logger.info("Model saved successfully")
        else:
            self.logger.warning("Model performance below threshold, keeping previous version")
    
    async def _save_model(self, predictor: NekoLoadPredictor):
        """Save trained model to persistent storage."""
        import joblib
        model_data = {
            'prophet_model': predictor.prophet_model,
            'xgb_model': predictor.xgb_model,
            'trained_at': datetime.now(),
            'feature_columns': predictor.feature_columns
        }
        joblib.dump(model_data, f"{self.storage_path}/neko_scaling_model.pkl")
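A matching loader lets the scaler start from the last good bundle at boot. This is a hedged sketch: `load_scaling_model` is a hypothetical helper, and the pluggable `loader` argument (defaulting to `joblib.load`) exists only to keep the function easy to test without the joblib dependency:

```python
from typing import Callable, Optional

def load_scaling_model(path: str,
                       loader: Optional[Callable[[str], dict]] = None) -> Optional[dict]:
    """Load a model bundle written by _save_model(); None if missing or incomplete.

    The required keys mirror the dict that _save_model() persists.
    """
    if loader is None:
        import joblib
        loader = joblib.load
    try:
        data = loader(path)
    except (OSError, EOFError):
        # Missing file or truncated write: fall back to retraining
        return None
    required = {'prophet_model', 'xgb_model', 'trained_at', 'feature_columns'}
    if not isinstance(data, dict) or not required.issubset(data):
        return None
    return data
```

Returning `None` rather than raising lets the caller decide whether to retrain immediately or run without predictions until the weekly pipeline produces a bundle.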

Integration with Kubernetes HPA:

# Custom HPA with predictive scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: neko-agent-predictive-hpa
  namespace: neko-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: neko-agent-pool
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: neko_predicted_load
        selector:
          matchLabels:
            deployment: neko-agent-pool
      target:
        type: Value
        value: "20"  # Target requests per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # React quickly to spikes
      policies:
      - type: Percent
        value: 50   # Max 50% increase per minute
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Conservative scale-down
      policies:
      - type: Percent
        value: 10   # Max 10% decrease per 5 minutes
        periodSeconds: 300
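As a sanity check on the target above: with an AverageValue-style external metric, the replica count the HPA converges on reduces to a ceiling division of total predicted load by the per-pod target, clamped to the replica bounds. This is a back-of-the-envelope sketch of that arithmetic, not the HPA's full ratio algorithm:

```python
import math

def hpa_desired_replicas(predicted_load: float, per_pod_target: float = 20.0,
                         min_replicas: int = 2, max_replicas: int = 50) -> int:
    """Approximate desired replicas: ceil(load / target), clamped to bounds."""
    desired = math.ceil(predicted_load / per_pod_target)
    return max(min_replicas, min(max_replicas, desired))
```

For example, a predicted 95 requests/hour against the 20-per-pod target yields 5 replicas; very low load is floored at `minReplicas: 2`.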

Multi-Region Deployment:

  • Deploy agent pools across multiple availability zones
  • Use regional load balancers with latency-based routing
  • Cross-region task replication for disaster recovery

6. Monitoring and Observability

6.1 Metrics Collection

System Metrics:

  • Request rate, latency, and error rate
  • Queue depth and processing time
  • Resource utilization (CPU, GPU, memory)
  • Agent lifecycle events

Business Metrics:

  • Task success/failure rates by type
  • Average task completion time
  • User adoption and API usage patterns
  • Cost per task execution

Custom Prometheus Metrics:

# Agent-specific metrics
from prometheus_client import Gauge, Histogram

ACTIVE_AGENTS = Gauge('neko_active_agents_total', 'Number of active agent instances')
TASK_DURATION = Histogram('neko_task_duration_seconds', 'Task execution time', ['task_type'])
QUEUE_SIZE = Gauge('neko_task_queue_size', 'Number of tasks in queue')
GPU_UTILIZATION = Gauge('neko_gpu_utilization_percent', 'GPU utilization per agent', ['agent_id'])

6.2 Logging Strategy

Structured Logging:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "agent_id": "agent-pod-xyz",
  "task_id": "task_123456",
  "event": "task_completed",
  "duration": 45.2,
  "actions_executed": 8,
  "success": true
}
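One way to emit records in that shape from Python is a stdlib `logging.Formatter` subclass; a sketch, where the `JsonFormatter` name and the field list are illustrative rather than part of the existing codebase:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line, in the shape above.

    Structured fields passed via logging's extra= kwarg are merged in.
    """

    EXTRA_FIELDS = ('agent_id', 'task_id', 'duration',
                    'actions_executed', 'success')

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'level': record.levelname,
            'event': record.getMessage(),
        }
        # extra= attaches fields as attributes on the LogRecord
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)
```

A call like `log.info('task_completed', extra={'task_id': 'task_123456', 'duration': 45.2, 'success': True})` then produces a line Logstash can ingest without a grok pattern.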

Log Aggregation: ELK Stack (Elasticsearch, Logstash, Kibana) or equivalent
Distributed Tracing: OpenTelemetry with Jaeger for request flow tracking

7. Security Considerations

7.1 API Security

Authentication: JWT tokens with role-based access control
Authorization: Resource-level permissions and quota enforcement
Input Validation: Request schema validation and sanitization
Rate Limiting: Per-user and per-endpoint limits
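To make the JWT piece concrete, here is a minimal HS256 signature-and-expiry check using only the standard library. It is an illustrative sketch: production code should use a maintained library (e.g. PyJWT) that also validates `alg`, `aud`, and the other registered claims:

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

def _b64url_decode(part: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding
    return base64.urlsafe_b64decode(part + '=' * (-len(part) % 4))

def verify_jwt_hs256(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims dict if signature and exp claim check out, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split('.')
        signed = f'{header_b64}.{payload_b64}'.encode()
        expected = hmac.new(secret, signed, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None
    if not isinstance(claims, dict):
        return None
    if claims.get('exp', float('inf')) < time.time():
        return None
    return claims
```

The gateway would run this in a request dependency, then check the `role` claim against the endpoint's required permission before enqueueing a task.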

7.2 Container Security

Image Scanning: Vulnerability scanning in CI/CD pipeline
Runtime Security: Pod Security Standards (PSS) and AppArmor profiles
Secret Management: Kubernetes secrets with encryption at rest
Network Policies: Microsegmentation within the cluster

7.3 Model Security

Model Integrity: Checksum verification for downloaded models
Inference Security: Input validation and output filtering
Resource Isolation: GPU memory isolation between agents
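The checksum step can be as simple as streaming the downloaded file through SHA-256 and comparing against a digest pinned in configuration. A sketch (`verify_model_checksum` is a hypothetical helper, not existing code):

```python
import hashlib
import hmac

def verify_model_checksum(path: str, expected_sha256: str,
                          chunk_size: int = 1 << 20) -> bool:
    """Hash the file in 1 MiB chunks and compare against the pinned hex digest."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    # Constant-time comparison, same habit as for auth tokens
    return hmac.compare_digest(digest.hexdigest(), expected_sha256.lower())
```

Agents would refuse to load ShowUI-2B/Qwen2VL weights when this returns False, failing the pod rather than serving from a tampered model.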

8. Implementation Roadmap

Phase 1: Core API Development (4 weeks)

  • Implement FastAPI gateway with basic endpoints
  • Modify agent.py for API mode support
  • Create Docker containers for all components
  • Basic Kubernetes deployment manifests

Phase 2: Training Data & TTS Integration (3 weeks)

  • Deploy capture.py as sidecar service with MDS integration
  • Implement yap.py TTS service with F5-TTS and WebRTC
  • Configure voice management and persistent storage
  • Set up S3 integration for training data upload
  • Add new API endpoints for capture and TTS

Phase 3: Scaling Infrastructure (3 weeks)

  • Implement agent manager and orchestration logic
  • Set up Redis task queue and job processing
  • Configure HPA with custom metrics for multi-container pods
  • Load testing and performance optimization with sidecars

Phase 4: Production Hardening (3 weeks)

  • Implement monitoring and alerting for all services
  • Security hardening and compliance (including voice data)
  • Documentation and API reference (capture + TTS endpoints)
  • CI/CD pipeline integration with multi-stage builds

Phase 5: Advanced Features (4 weeks)

  • Predictive scaling algorithms with multi-container awareness
  • Multi-region deployment with voice model replication
  • Advanced monitoring and cost optimization for GPU sharing
  • Performance analytics and training data insights

9. Cost Analysis

9.1 Resource Requirements

Per Enhanced Agent Pod (Main + Sidecars):

  • GPU: 1x NVIDIA T4 or A10 (shared: 1.0 agent + 0.5 TTS) ($0.35-$1.10/hour)
  • CPU: 3.5-7 cores (2-4 agent + 1-2 TTS + 0.5-1 capture) ($0.14-$0.28/hour)
  • Memory: 6.5-13GB (4-8 agent + 2-4 TTS + 0.5-1 capture) ($0.0065-$0.013/hour)
  • Storage: 60GB SSD + 10GB voice models ($0.007/hour)

Additional Components per Pod:

  • Capture sidecar: Minimal overhead, primarily I/O and storage
  • TTS sidecar: GPU sharing, ~50% memory overhead
  • Voice storage: Shared across pods via NFS/PVC

Total Cost per Enhanced Pod/Hour: ~$0.52-$1.42 (+30% for sidecars)

Supporting Infrastructure:

  • Load balancer: $0.025/hour
  • Redis cluster: $0.10-$0.30/hour
  • Monitoring stack: $0.05-$0.15/hour

9.2 Cost Optimization Strategies

  1. Spot Instances: 60-90% cost reduction for fault-tolerant workloads
  2. Right-sizing: Dynamic resource allocation based on workload
  3. Reserved Instances: 30-60% discount for predictable baseline load
  4. Multi-tenancy: Share GPU resources across multiple small tasks

10. Conclusion

The proposed API mode architecture transforms the existing Neko Agent from a single-session WebRTC client into a comprehensive, cloud-native AI automation platform. The integration of capture.py and yap.py creates a complete solution with training data collection and real-time voice interaction capabilities.

Key Benefits:

  • Horizontal Scalability: Support for hundreds of concurrent automation tasks with integrated sidecars
  • Training Data Pipeline: Automated collection of high-quality training episodes in MosaicML format
  • Voice Interaction: Real-time TTS with F5-TTS and custom voice models via WebRTC
  • Cost Efficiency: Dynamic scaling with GPU sharing between agent and TTS workloads
  • Developer Experience: RESTful API with training and voice management endpoints
  • Production Ready: Monitoring, security, and operational excellence across all services
  • Future Proof: Built on modern container orchestration with ML/AI workflow patterns

Enhanced Capabilities:

  • Self-Improving: Continuous training data collection enables model refinement
  • Multimodal: Visual automation + audio output for richer user experiences
  • Enterprise Ready: S3 integration, voice asset management, and compliance features
  • Flexible Deployment: Sidecar pattern allows optional feature enablement per workload

The implementation leverages proven 2024 best practices for AI workload orchestration, GPU resource management, and microservices architecture, while adding cutting-edge training data collection and voice synthesis capabilities.

Total Implementation Estimate: 17 weeks for full production deployment (including training/TTS features)
Expected Performance: 100+ concurrent tasks, <2s API response time, 99.9% availability, <500ms TTS latency
Estimated Operating Cost: $0.52-$1.42 per enhanced pod-hour, with potential 50-60% reduction through optimization
Training Data Output: ~2-5GB per hour of operation, ready for distributed training pipelines

Glossary

General Terms

Neko Server : A containerized remote desktop solution that provides browser-based access to desktop environments via WebRTC.

WebRTC : Web Real-Time Communication protocol used for peer-to-peer audio, video, and data transmission in web browsers.

ICE (Interactive Connectivity Establishment) : Protocol for establishing peer-to-peer connections across NATs and firewalls.

SDP (Session Description Protocol) : Protocol for describing media sessions and connection parameters.

Component-Specific Terms

Manual Control CLI

REPL (Read-Eval-Print Loop) : Interactive command-line interface for real-time control and testing.

Signaler : WebSocket connection manager that handles message routing and reconnection.

Broker : Event distribution system that routes incoming messages to appropriate topic queues.

Normalized Coordinates : Coordinate system using 0.0-1.0 range instead of pixel values for resolution independence.
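For example, mapping a normalized point back to pixels for a given frame size (an illustrative sketch; the clamp keeps (1.0, 1.0) on the last pixel rather than one past the edge):

```python
def to_pixels(nx: float, ny: float, width: int, height: int):
    """Convert normalized (0.0-1.0) coordinates to integer pixel coordinates."""
    px = min(int(nx * width), width - 1)
    py = min(int(ny * height), height - 1)
    return px, py
```

The same click action thus lands proportionally in the same place whether the Neko session runs at 1280x720 or 1920x1080.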

Host Control : Exclusive mouse and keyboard control permission on the Neko server.

Core Agent

Action Space : Set of possible actions the agent can perform (click, type, scroll, etc.).

Observation Space : Visual and contextual information the agent receives from the environment.

Training Data : Collected sequences of observations and actions used for model training.

Capture Service

MDS (MosaicML Dataset) : Efficient dataset format optimized for streaming and distributed training.

Trajectory : Sequence of state-action pairs representing a complete task execution.

Temporal Alignment : Synchronization of visual frames with corresponding action timestamps.

TTS Service

Voice Synthesis : Process of generating spoken audio from text input.

Audio Streaming : Real-time transmission of audio data over WebRTC channels.

Voice Activity Detection (VAD) : Technology for detecting presence of human speech in audio.

Technical Terms

Async/Await : Python asynchronous programming pattern for concurrent execution.

Task Cancellation : Graceful shutdown mechanism for asyncio background tasks.

Exponential Backoff : Retry strategy with increasing delays between attempts.
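As used for WebSocket reconnects, a jittered version of this strategy looks like the following sketch (the helper name and defaults are illustrative):

```python
import asyncio
import random

async def with_backoff(op, *, attempts: int = 5,
                       base: float = 0.5, cap: float = 30.0):
    """Retry an async operation with exponentially growing, jittered delays.

    Delays follow base * 2**attempt (capped at cap), scaled by random
    jitter so many clients do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return await op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            delay = min(cap, base * (2 ** attempt))
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```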

Event Loop : Core of asyncio that manages and executes asynchronous operations.

Queue (asyncio.Queue) : Thread-safe queue implementation for passing data between async tasks.

Protocol Terms

Event Payload : Data structure containing the actual content of WebSocket messages.

Heartbeat : Periodic messages sent to maintain connection and detect timeouts.

Session ID : Unique identifier for each client connection to the Neko server.

Control Protocol : Set of message types and formats for remote desktop interaction.

Development Terms

mdBook : Documentation generator that creates books from Markdown files.

PEP 257 : Python Enhancement Proposal defining docstring conventions.

PEP 287 : Python Enhancement Proposal defining reStructuredText format for docstrings.

Type Hints : Python annotations that specify expected types for function parameters and return values.