Neko Agent Documentation
Welcome to the Neko Agent documentation! This project provides an AI-powered automation agent that interfaces with Neko servers to perform GUI automation tasks through WebRTC connections.
What is Neko Agent?
Neko Agent is a Python-based automation system that:
- Connects to Neko servers via WebRTC for real-time GUI interaction
- Uses AI vision models (ShowUI-2B/Qwen2VL) for visual reasoning and action planning
- Captures training data automatically during automation sessions
- Provides voice synthesis capabilities via F5-TTS and WebRTC audio
- Supports various actions like clicking, typing, scrolling, and navigation
Key Components
- src/agent.py - Core automation agent with WebRTC integration
- src/capture.py - Training data capture service using MosaicML Streaming
- src/yap.py - Text-to-speech service with F5-TTS and voice management
Getting Started
To get started with Neko Agent:
- Set up the development environment using Nix flakes
- Configure a Neko server for GUI automation targets
- Run the agent to begin automation tasks
- Optionally enable capture for training data collection
See the Getting Started guide for detailed setup instructions.
Architecture
The system is designed around WebRTC-based communication with Neko containers, allowing for:
- Real-time video frame analysis and action execution
- Automated training data collection in MosaicML format
- Voice feedback through WebRTC audio streams
- Scalable deployment patterns for production use
For technical details, see the Architecture Overview.
Getting Started
This guide will help you set up and run the Neko Agent system for AI-powered GUI automation.
Prerequisites
- Nix with flakes enabled - for development environment
- Neko server - target environment for automation
- NVIDIA GPU (optional) - for optimal AI model performance
Development Environment Setup
The project uses Nix flakes for reproducible development environments. Choose the appropriate shell based on your needs:
Available Development Shells
# Basic CPU environment
nix develop
# GPU-accelerated environment (recommended)
nix develop .#gpu
# Shell with additional AI CLI tools
nix develop .#ai
# Docker-compose helper environment for the bundled Neko stack
nix develop .#neko
# Documentation tooling
nix develop .#docs
# Optimised variants
nix develop .#cpu-opt
nix develop .#gpu-opt
# Trusted Execution Environment tooling
nix develop .#tee
GPU Environment (Recommended)
For the best performance with AI models:
nix develop .#gpu
This provides:
- CUDA 12.8 toolkit and libraries
- PyTorch with CUDA support
- All required Python dependencies
- Pre-configured environment variables
Neko Server Setup
The agent requires a running Neko server to connect to. You can either:
Option 1: Use Docker Compose (Included)
# Enter the Neko environment
nix develop .#neko
# Start Neko services
neko-services up
# View logs
neko-services logs
# Stop services
neko-services down
This starts a Neko server at http://localhost:8080 with Chrome ready for automation.
Option 2: External Neko Server
Configure connection to an existing Neko server by setting environment variables:
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"
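These same variables drive the agent's REST login handshake. A minimal sketch of that handshake in Python, assuming the `/api/login` endpoint shown in the troubleshooting sections of these docs and a `token` field in the response:

```python
import json
import os
import urllib.request

def login_request(url, user, password):
    # Build the REST login endpoint and JSON body (path taken from the
    # troubleshooting examples elsewhere in these docs).
    return f"{url.rstrip('/')}/api/login", {"username": user, "password": password}

def neko_login():
    # Read credentials from the environment, as the agent does, and
    # return the token field of the response (field name is an assumption).
    endpoint, body = login_request(
        os.environ["NEKO_URL"], os.environ["NEKO_USER"], os.environ["NEKO_PASS"]
    )
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp).get("token")
```
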
Running the Agent
Basic Usage
# Run with a simple task
uv run src/agent.py --task "Navigate to google.com and search for 'AI automation'"
# Run with REST authentication (the agent performs the login handshake)
uv run src/agent.py \
--task "Your automation task" \
--neko-url "http://localhost:8080" \
--username "user" \
--password "password"
# Keep the agent online and accept new tasks from chat
uv run src/agent.py --online --neko-url "http://localhost:8080" --username user --password password
With Training Data Capture
To enable training data collection during automation:
# Terminal 1: Start capture service
uv run src/capture.py
# Terminal 2: Run agent (capture watches chat messages for /start and /stop)
uv run src/agent.py --task "Your task description"
See Training Data Capture for detailed capture configuration.
With Voice Output
For voice feedback during automation:
# Terminal 1: Start TTS service (F5-TTS + WebRTC audio)
uv run src/yap.py
# Terminal 2: Run the agent – leave audio enabled (default) so the browser hears YAP
uv run src/agent.py --task "Describe what you see"
Configuration
Environment Variables
Create a .env file in the project root (copy .env.example first):
# Neko server connection
NEKO_URL=http://localhost:8080
NEKO_USER=user
NEKO_PASS=password
# Agent behaviour
NEKO_TASK="Default task description"
NEKO_MODE=web
NEKO_MAX_STEPS=8
NEKO_AUDIO=1
REFINEMENT_STEPS=5
NEKO_ICE_POLICY=strict
NEKO_RTCP_KEEPALIVE=0
NEKO_SKIP_INITIAL_FRAMES=5
NEKO_FORCE_EXIT_GUARD_MS=0
NEKO_LOGLEVEL=INFO
NEKO_LOG_FORMAT=text
NEKO_METRICS_PORT=9000
# FRAME_SAVE_PATH="./tmp/frame.png"
# CLICK_SAVE_PATH="./tmp/actions"
# Capture settings (optional)
CAPTURE_OUT=./data/mds
CAPTURE_FPS=2.0
CAPTURE_REMOTE=s3://your-bucket/training-data
# TTS settings (optional)
YAP_VOICES_DIR=./voices
YAP_SR=48000
YAP_PARALLEL=2
YAP_MAX_CHARS=350
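Inside the services these variables are typically read with typed defaults; a small hypothetical helper illustrating the pattern (names and defaults mirror the listing above, but the real parsing lives in the service scripts):

```python
import os

def env(name, default, cast=str):
    # Read an environment variable, casting it; fall back to the default.
    raw = os.getenv(name)
    return cast(raw) if raw is not None else default

# A few of the settings from the listing above, with their defaults.
NEKO_URL = env("NEKO_URL", "http://localhost:8080")
NEKO_MAX_STEPS = env("NEKO_MAX_STEPS", 8, int)
CAPTURE_FPS = env("CAPTURE_FPS", 2.0, float)
```
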
GPU Configuration
For optimal GPU usage:
# Set specific GPU
export CUDA_VISIBLE_DEVICES=0
# Configure memory allocation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Set target architecture
export TORCH_CUDA_ARCH_LIST=8.6
Verification
Test your setup:
# Check dependencies
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
# Test agent connection
uv run src/agent.py --healthcheck
# Explore capture options (the script exits when interrupted)
uv run src/capture.py --help
# Test TTS (if enabled)
uv run src/yap.py --healthcheck
Next Steps
- Read the Architecture Overview to understand the system design
- Explore Training Data Capture to collect data for model improvement
- Check the Developer Guide for technical details
- Review Neko Integration for advanced server configuration
Troubleshooting
Common Issues
CUDA not detected:
# Verify NVIDIA drivers
nvidia-smi
# Check CUDA installation in shell
echo $CUDA_HOME
Neko connection failed:
# Test Neko server accessibility
curl http://localhost:8080/health
# Check WebSocket endpoint
curl -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" \
http://localhost:8080/api/ws
Python import errors:
# Regenerate environment
nix develop .#gpu --command python -c "import streaming; import torch; print('OK')"
For more help, see the Developer Guide or check the project issues.
Manual Control Guide
The Manual Control CLI is an interactive tool that lets you remotely control any Neko server through your terminal. Think of it as a command-line remote desktop that you can use for testing, administration, or automating tasks on hosted desktop environments.
Quick Start
Prerequisites
- Python 3.8 or newer
- Access to a Neko server (either local or hosted)
- Network connectivity to the Neko server
Launching the CLI
The manual control tool ships as src/manual.py
inside this repository. From a Nix shell you can run:
uv run src/manual.py --help
First Connection
The easiest way to connect is using your Neko server credentials:
uv run src/manual.py \
--neko-url "https://your-neko-server.com" \
--username "your-username" \
--password "your-password"
If successful, you'll see:
Starting Neko manual CLI. Type 'help' for commands. Ctrl+D or 'quit' to exit.
neko>
Basic Usage
Getting Help
Type help at any time to see all available commands:
neko> help
This displays a comprehensive list of all commands with their syntax.
Essential Commands
Mouse Control
# Move mouse cursor to specific coordinates
neko> move 100 200
# Click at current mouse position
neko> click
# Move and click in one command
neko> tap 300 400
# Right-click
neko> rclick
# Double-click
neko> dblclick
Keyboard Input
# Type text
neko> text "Hello, World!"
# Press specific keys
neko> key Enter
neko> key Escape
neko> key F5
# Click somewhere and then type
neko> input 500 300 "username@example.com"
Navigation
# Scroll in different directions
neko> scroll down 3
neko> scroll up
neko> scroll left 2
# Drag/swipe gestures
neko> swipe 100 100 300 300
CLI switches
manual.py mirrors the agent’s authentication behaviour: provide either a full WebSocket URL (--ws) or REST credentials (--neko-url, --username, --password). Additional flags include:
Flag | Purpose |
---|---|
--norm | Interpret coordinates as 0.0–1.0 floats instead of pixels |
--size 1920x1080 | Override the reported screen size when using pixel coordinates |
--no-auto-host | Do not automatically request host control on connect |
--no-media | Skip WebRTC negotiation (only for debugging signalling) |
--no-audio | Disable audio subscription |
By default the CLI learns the keyboard layout advertised by the server and stores logs under NEKO_LOGFILE if that environment variable is set.
Normalised coordinates example
uv run src/manual.py --norm \
--neko-url "https://your-server.com" \
--username "admin" --password "secret"
# Now coordinates are between 0 and 1
neko> move 0.5 0.5 # Center of screen
neko> tap 0.1 0.9 # Bottom-left area
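The mapping from normalised to pixel coordinates can be sketched as follows (the clamping and rounding behaviour here are assumptions, not the CLI's exact code):

```python
def to_pixels(nx, ny, width, height):
    # Map 0.0-1.0 coordinates onto a width x height pixel grid,
    # clamping so that 1.0 stays on-screen.
    px = min(width - 1, max(0, round(nx * width)))
    py = min(height - 1, max(0, round(ny * height)))
    return px, py
```

So with a 1920x1080 screen, `move 0.5 0.5` lands on the centre pixel and `1.0` maps to the last column or row rather than falling off the edge.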
Common Use Cases
Website Testing
Test web applications by automating browser interactions:
# Navigate to a website
neko> tap 400 60 # Click address bar
neko> text "https://example.com"
neko> key Enter
# Fill out a form
neko> tap 300 200 # Click username field
neko> text "testuser"
neko> key Tab # Move to next field
neko> text "password123"
neko> key Enter # Submit form
# Test navigation
neko> tap 500 300 # Click a link
neko> key F5 # Refresh page
neko> key Alt+Left # Go back
Application Testing
Test desktop applications:
# Open application menu
neko> key Super # Windows/Super key
neko> text "calculator"
neko> key Enter
# Use the application
neko> tap 200 300 # Click number 5
neko> tap 250 350 # Click plus
neko> tap 200 300 # Click number 5
neko> tap 300 400 # Click equals
System Administration
Perform administrative tasks:
# Open terminal
neko> key Ctrl+Alt+t
# Run system commands
neko> text "sudo apt update"
neko> key Enter
neko> text "your-password" # If prompted
neko> key Enter
# Navigate file system
neko> text "cd /var/log"
neko> key Enter
neko> text "ls -la"
neko> key Enter
File Management
Work with files and folders:
# Open file manager
neko> key Super
neko> text "files"
neko> key Enter
# Navigate and create files
neko> key Ctrl+n # New folder
neko> text "TestFolder"
neko> key Enter
neko> dblclick # Enter folder
neko> rclick # Right-click for context menu
neko> text "n" # New file (if available)
Advanced Features
Clipboard Operations
# Copy, cut, and paste
neko> select_all # Select all text
neko> copy # Copy selection
neko> tap 400 500 # Click elsewhere
neko> paste # Paste content
# Paste specific text
neko> paste "This is custom text to paste"
Session Management
If you have admin rights, you can manage other users:
# List all connected users
neko> sessions
# Force take control from another user
neko> force-take
# Kick a specific user (use session ID from 'sessions' command)
neko> kick abc123-def456-789
# Release control
neko> force-release
Screen Size Management
# Check current screen size
neko> size
# Set virtual screen size (affects coordinate scaling)
neko> size 1920x1080
Raw Protocol Access
For advanced users, send custom commands:
# Send custom JSON messages to the server
neko> raw '{"event":"system/stats"}'
neko> raw '{"event":"control/scroll","payload":{"delta_x":0,"delta_y":240}}'
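These raw messages follow a simple envelope: an `event` name plus an optional `payload` object. A helper for composing them (only the two events shown above come from these docs; other event names depend on the server):

```python
import json

def raw_event(event, **payload):
    # Compose the JSON envelope expected by the `raw` command.
    msg = {"event": event}
    if payload:
        msg["payload"] = payload
    return json.dumps(msg)

# Example: raw_event("control/scroll", delta_x=0, delta_y=240)
```
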
Configuration
Environment Variables
Set these to avoid typing credentials each time:
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"
export NEKO_SIZE="1920x1080"
# Now you can just run:
uv run src/manual.py
Command Line Options
Option | Description | Example |
---|---|---|
--ws | Direct WebSocket URL | wss://neko.example.com/api/ws?token=... |
--neko-url | Neko server URL (for REST login) | https://neko.example.com |
--username | Login username | admin |
--password | Login password | secretpass |
--norm | Use 0-1 coordinates | (flag only) |
--size | Virtual screen size | 1920x1080 |
--no-auto-host | Don't auto-request control | (flag only) |
--no-media | Skip video/audio setup | (flag only) |
--no-audio | Disable audio | (flag only) |
Logging
Enable detailed logging for troubleshooting:
export NEKO_LOGLEVEL="DEBUG"
export NEKO_LOGFILE="/tmp/neko-manual.log"
uv run src/manual.py --neko-url "..." --username "..." --password "..."
# In another terminal, watch the logs:
tail -f /tmp/neko-manual.log
Tips and Best Practices
Timing and Delays
For automation scripts, you may need to add delays between actions:
# The tool doesn't have built-in delays, but you can use shell scripting:
echo -e "tap 300 200\ntext hello\nkey Enter" | uv run src/manual.py --neko-url "..." --username "..." --password "..."
Screen Resolution
Always check the screen size when starting:
neko> size
size 1920x1080 normalized=false
This helps you understand the coordinate system you're working with.
Connection Issues
If you have connection problems:
- Check credentials: Make sure username/password are correct
- Verify URL: Ensure the Neko server URL is accessible
- Network connectivity: Test basic HTTP access to the server
- Enable debug logging: Use NEKO_LOGLEVEL=DEBUG for detailed logs
Multiple Users
Be aware that Neko servers can have multiple users connected:
- Only one user can have "host" control at a time
- Use the sessions command to see who else is connected
- Use host and unhost to request/release control
- Admin users can use force-take and force-release
Automation Scripts
For repetitive tasks, consider creating shell scripts:
#!/bin/bash
# test-website.sh
export NEKO_URL="https://neko.example.com"
export NEKO_USER="admin"
export NEKO_PASS="password"
{
echo "tap 400 60" # Click address bar
echo "text https://example.com" # Type URL
echo "key Enter" # Navigate
sleep 3 # Wait for page load
echo "tap 300 200" # Click login button
echo "text testuser" # Username
echo "key Tab" # Next field
echo "text testpass" # Password
echo "key Enter" # Submit
} | uv run src/manual.py
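The same piping pattern works from Python; a sketch using `subprocess` (the `uv run src/manual.py` invocation is carried over from above and assumes the repository layout, with credentials in NEKO_URL / NEKO_USER / NEKO_PASS):

```python
import subprocess

def build_script(commands):
    # Join CLI commands into the newline-separated stdin stream
    # that manual.py reads, one command per line.
    return "\n".join(commands) + "\n"

def run_script(commands, timeout=120):
    # Feed the script to the manual CLI and capture its output.
    return subprocess.run(
        ["uv", "run", "src/manual.py"],
        input=build_script(commands),
        text=True, capture_output=True, timeout=timeout,
    )
```

Unlike the shell version, this makes it easy to insert Python-side waits (`time.sleep`) or generate command lists programmatically.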
Troubleshooting
Common Issues
"Connection failed"
- Verify the Neko server is running and accessible
- Check that the URL, username, and password are correct
- Ensure no firewall is blocking the connection
"Command not found: manual.py"
- Make sure you're in the correct directory
- Use the full path: python /path/to/src/manual.py
"No host control"
- Another user may have control; use sessions to check
- Try the host command to request control
- Admin users can use force-take
Commands not working
- Verify you have host control
- Check if coordinates are correct for the screen size
- Use the size command to verify resolution
High latency
- Network delay between you and the Neko server
- Try reducing the number of rapid commands
- Consider using the server closer to your location
Getting Help
If you encounter issues:
- Enable debug logging with NEKO_LOGLEVEL=DEBUG
- Check the log output for error messages
- Verify basic connectivity to the Neko server
- Review the Developer Guide for technical details
The Manual Control CLI is a powerful tool for remote desktop automation and testing. With practice, you can efficiently control any Neko server environment from your terminal.
Training Data Capture User Guide
The Neko capture tool records user interface interactions as training data for machine learning models. It captures screenshots and action sequences from Neko browser sessions, packaging them into datasets ready for training computer vision and UI automation models.
Quick Start
Prerequisites
- Running Neko server with WebSocket API enabled
- Python environment with the mosaicml-streaming package installed
- (Optional) S3-compatible storage credentials for remote upload
Basic Setup
1. Install dependencies:
pip install mosaicml-streaming requests websockets
2. Set your Neko connection:
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"
3. Start capturing:
python src/capture.py
4. Record an episode in Neko chat:
/start Navigate to login page and enter credentials
Action: {"type": "click", "coordinate": [150, 200]}
Action: {"type": "type", "text": "username"}
/stop
That's it! Your episode is now saved as training data in ./data/mds/.
What Gets Captured
The capture tool records complete "episodes" of UI interaction:
- Screenshots: JPEG images captured at regular intervals (default: 2 FPS)
- Actions: Structured annotations of user interactions (clicks, typing, etc.)
- Metadata: Task descriptions, timestamps, screen dimensions
- Context: Complete session information for reproducible training
Each episode becomes a single training sample packaged as:
episode.zip
├── meta.json # Episode metadata and schema
├── frames/
│ ├── 000000.jpg # Sequential screenshots
│ ├── 000001.jpg
│ └── ...
├── frames.ndjson # Frame timing information
└── actions.ndjson # Action annotations
Architecture
Data Flow
- Connect to the Neko WebSocket API (/api/ws) for chat monitoring
- Listen for task boundaries (/start and /stop commands) and action annotations
- Capture screenshots at a configurable FPS via the HTTP endpoint (/api/room/screen/shot.jpg)
- Package episodes as ZIP archives containing metadata, frames, and action sequences
- Write to MDS shards with automatic S3 upload and shard rotation
Configuration
All configuration is handled via environment variables for 12-factor compliance:
Connection Settings
# REST login (preferred method)
export NEKO_URL="https://neko.example.com"
export NEKO_USER="username"
export NEKO_PASS="password"
# OR direct WebSocket URL (bypasses REST)
export NEKO_WS="wss://neko.example.com/api/ws?token=..."
Output Configuration
# Local MDS directory
export CAPTURE_OUT="./data/mds"
# Remote storage (optional)
export CAPTURE_REMOTE="s3://bucket/prefix"
export CAPTURE_KEEP_LOCAL=0 # 0=delete local after upload, 1=keep
# MDS shard settings
export CAPTURE_COMPRESSION="zstd:6" # Compression algorithm
export CAPTURE_SHARD_SIZE="512mb" # Size before shard rotation
export CAPTURE_HASHES="sha1" # Integrity checking
Capture Parameters
export CAPTURE_FPS=2 # Screenshots per second
export CAPTURE_JPEG_QUALITY=85 # JPEG quality (0-100)
export CAPTURE_MIN_FRAMES=4 # Minimum frames to save episode
export CAPTURE_EPISODE_TIMEOUT=900 # Auto-end after N seconds
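These knobs bound episode size, so a quick back-of-envelope check is worth doing before long sessions. A sketch (the 120 kB average JPEG size is an assumption; measure your own frames):

```python
def episode_budget(fps=2.0, timeout_s=900, avg_jpeg_kb=120):
    # Worst-case frames per episode, and the raw screenshot volume
    # in MiB before MDS compression.
    frames = int(fps * timeout_s)
    return frames, frames * avg_jpeg_kb / 1024
```

At the defaults above (2 FPS, 900 s timeout) an episode tops out at 1800 frames, roughly 200 MiB of raw JPEGs.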
S3 Authentication
For S3 or S3-compatible storage:
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
export AWS_SESSION_TOKEN="..." # Optional for temporary credentials
# For S3-compatible endpoints (MinIO, R2, etc.)
export S3_ENDPOINT_URL="https://s3.example.com"
Common Use Cases
Training UI Automation Models
Capture complete workflows for training models to automate repetitive tasks:
# Set up for high-quality capture
export CAPTURE_FPS=3
export CAPTURE_JPEG_QUALITY=90
Then in Neko chat:
/start Fill out customer registration form
Action: {"type": "click", "coordinate": [245, 156], "element": "first_name_field"}
Action: {"type": "type", "text": "John"}
Action: {"type": "click", "coordinate": [245, 186], "element": "last_name_field"}
Action: {"type": "type", "text": "Doe"}
Action: {"type": "click", "coordinate": [245, 216], "element": "email_field"}
Action: {"type": "type", "text": "john.doe@example.com"}
Action: {"type": "click", "coordinate": [300, 350], "element": "submit_button"}
/stop
Collecting Web Navigation Data
Record browsing sessions for training navigation models:
# Lower FPS for longer sessions
export CAPTURE_FPS=1
export CAPTURE_EPISODE_TIMEOUT=1800 # 30 minutes
Then in Neko chat:
/start Research product reviews for laptop purchase
Action: {"type": "navigate", "url": "https://amazon.com"}
Action: {"type": "type", "text": "gaming laptop", "element": "search_box"}
Action: {"type": "click", "coordinate": [400, 45], "element": "search_button"}
Action: {"type": "click", "coordinate": [200, 180], "element": "product_link"}
Action: {"type": "scroll", "direction": "down", "amount": 500}
/stop
Capturing Error Recovery Workflows
Document how to handle and recover from errors:
In Neko chat:
/start Handle login failure and password reset
Action: {"type": "type", "text": "wrong_password", "element": "password_field"}
Action: {"type": "click", "coordinate": [300, 200], "element": "login_button"}
Action: {"type": "click", "coordinate": [250, 150], "element": "forgot_password_link"}
Action: {"type": "type", "text": "user@example.com", "element": "email_field"}
Action: {"type": "click", "coordinate": [200, 180], "element": "send_reset_button"}
/stop
Configuration Guide
Basic Configuration
For most users, these environment variables are sufficient:
# Required: Neko connection
export NEKO_URL="https://your-neko-server.com"
export NEKO_USER="your-username"
export NEKO_PASS="your-password"
# Optional: Local storage (defaults to ./data/mds)
export CAPTURE_OUT="/path/to/training/data"
# Optional: Capture quality
export CAPTURE_FPS=2 # Screenshots per second
export CAPTURE_JPEG_QUALITY=85 # Image quality (0-100)
Cloud Storage Setup
AWS S3
export CAPTURE_REMOTE="s3://my-training-bucket/ui-episodes"
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
MinIO/Self-hosted S3
export CAPTURE_REMOTE="s3://training-data/episodes"
export S3_ENDPOINT_URL="https://minio.mycompany.com"
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"
Cloudflare R2
export CAPTURE_REMOTE="s3://my-r2-bucket/training"
export S3_ENDPOINT_URL="https://your-account.r2.cloudflarestorage.com"
export AWS_ACCESS_KEY_ID="your-r2-token"
export AWS_SECRET_ACCESS_KEY="your-r2-secret"
Advanced Configuration
Fine-tune capture behavior for specific use cases:
# Episode management
export CAPTURE_MIN_FRAMES=10 # Minimum frames to save episode
export CAPTURE_EPISODE_TIMEOUT=600 # Auto-stop after 10 minutes
export CAPTURE_KEEP_LOCAL=1 # Keep local copies when uploading
# Storage optimization
export CAPTURE_COMPRESSION="zstd:9" # Maximum compression
export CAPTURE_SHARD_SIZE="1gb" # Larger shards for fewer files
export CAPTURE_HASHES="sha1,md5" # Multiple integrity checks
# Debugging
export CAPTURE_LOGLEVEL="DEBUG"
Running the Capture Tool
Command Line Usage
Basic usage with environment variables:
python src/capture.py
Override settings with command line flags:
python src/capture.py \
--neko-url https://neko.example.com \
--username myuser \
--password mypass \
--out ./custom/data/path \
--remote s3://mybucket/training-data \
--fps 3.0 \
--jpeg-quality 95 \
--episode-timeout 1200
Direct WebSocket Connection
Skip REST authentication with a direct WebSocket URL:
export NEKO_WS="wss://neko.example.com/api/ws?token=your-jwt-token"
python src/capture.py
Implementation Details
Core Classes
EpisodeBuffer (src/capture.py:161)
- Manages temporary storage for a single episode
- Handles frame and action storage during capture
- Finalizes episode data into ZIP archive format
NekoSession (src/capture.py:342)
- WebSocket client for Neko server communication
- Handles authentication via REST API
- Manages screenshot polling and message queuing
Capture (src/capture.py:542)
- Main orchestrator coordinating the capture workflow
- Processes chat events for episode boundaries
- Manages episode lifecycle and MDS writing
Episode Lifecycle
- Start Detection: Chat message matching the /start <task> pattern
- Frame Capture: Screenshot polling thread captures frames at the specified FPS
- Action Parsing: Chat messages matching Action: {...} are parsed and stored
- End Conditions:
  - Manual /stop command
  - Episode timeout (default 900 seconds)
  - Application shutdown
- Finalization: Episode packaged and written to MDS if the minimum frame count is met
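The boundary detection above can be sketched as a small classifier over chat lines (the regexes here are illustrative; the authoritative patterns live in src/capture.py):

```python
import json
import re

START_RE = re.compile(r"^/start\s+(.+)$")
ACTION_RE = re.compile(r"^Action:\s*(\{.*\})\s*$")

def parse_chat(line):
    # Classify a chat line into one of the lifecycle events described above.
    if m := START_RE.match(line):
        return ("start", m.group(1))
    if line.strip() == "/stop":
        return ("stop", None)
    if m := ACTION_RE.match(line):
        try:
            return ("action", json.loads(m.group(1)))
        except json.JSONDecodeError:
            return ("malformed", line)  # handled gracefully, not fatal
    return (None, None)
```
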
MDS Schema
Each MDS record contains:
Column | Type | Description |
---|---|---|
episode_id | str | Unique episode identifier |
task | str | Task description from /start command |
payload | bytes | ZIP archive containing episode data |
num_frames | int | Number of screenshot frames captured |
num_actions | int | Number of action annotations |
started_at | str | Episode start timestamp (ISO 8601) |
ended_at | str | Episode end timestamp (ISO 8601) |
screen_w | int | Screen width in pixels |
screen_h | int | Screen height in pixels |
agent | json | Agent metadata and configuration |
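Writing records with this schema uses MosaicML's `MDSWriter`; a sketch with the column types from the table (the authoritative schema lives in src/capture.py, and the import is deferred so the snippet only needs `mosaicml-streaming` when actually writing):

```python
# Column schema mirroring the table above.
COLUMNS = {
    "episode_id": "str",
    "task": "str",
    "payload": "bytes",
    "num_frames": "int",
    "num_actions": "int",
    "started_at": "str",
    "ended_at": "str",
    "screen_w": "int",
    "screen_h": "int",
    "agent": "json",
}

def open_writer(out="./data/mds", remote=None):
    # Deferred import: requires the mosaicml-streaming package.
    from streaming import MDSWriter
    dest = (out, remote) if remote else out
    return MDSWriter(out=dest, columns=COLUMNS,
                     compression="zstd:6", hashes=["sha1"],
                     size_limit="512mb")
```

The writer is used as a context manager, with one `writer.write(record)` call per finished episode.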
Error Handling and Resilience
Network Resilience
- Automatic WebSocket reconnection with exponential backoff
- Screenshot polling continues through temporary network issues
- MDS writer handles partial uploads and resumes interrupted transfers
Data Integrity
- SHA1 hashing for shard integrity verification
- Episode validation before MDS writing
- Graceful handling of malformed action annotations
Resource Management
- Temporary episode directories cleaned up after packaging
- Queue overflow protection prevents memory exhaustion
- Configurable episode timeouts prevent runaway captures
Integration with Training Pipelines
Loading Data
from streaming import StreamingDataset
import io
import json
import zipfile
# Create streaming dataset
dataset = StreamingDataset(
remote="s3://bucket/training-data",
local="./cache",
shuffle=True
)
# Process episodes
for sample in dataset:
episode_id = sample['episode_id']
task = sample['task']
payload = sample['payload']
# Extract episode contents
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
meta = json.loads(zf.read('meta.json'))
frames_index = [json.loads(line) for line in zf.read('frames.ndjson').decode().splitlines()]
actions = [json.loads(line) for line in zf.read('actions.ndjson').decode().splitlines()]
# Load frame images
for frame_info in frames_index:
frame_data = zf.read(frame_info['file'])
# Process frame...
Random Window Sampling
The episode-as-record format enables efficient random window sampling for sequence models:
import io
import json
import random
import zipfile

def sample_windows(episode_payload, window_size=32):
with zipfile.ZipFile(io.BytesIO(episode_payload)) as zf:
frames_index = [json.loads(line) for line in zf.read('frames.ndjson').decode().splitlines()]
if len(frames_index) < window_size:
return None
start_idx = random.randint(0, len(frames_index) - window_size)
window = frames_index[start_idx:start_idx + window_size]
# Load frames for this window
frames = []
for frame_info in window:
frame_data = zf.read(frame_info['file'])
frames.append(frame_data)
return frames
Troubleshooting
Quick Diagnostics
First, check if the capture tool can connect to your Neko server:
# Test basic connectivity
curl -I https://your-neko-server.com/api/health
# Test WebSocket endpoint (if available)
python -c "
import asyncio, websockets
async def main():
    async with websockets.connect('wss://your-neko-server.com/api/ws'):
        print('WebSocket connection successful')
asyncio.run(main())
"
Common Issues & Solutions
❌ Connection Issues
Problem: Connection refused or Unable to connect to Neko server
Solutions:
- Verify the Neko server is running: curl https://your-neko-server.com
- Check firewall settings - ensure ports 80/443 and WebSocket ports are open
- Verify the URL format: use https:// (not http://) for secure connections
- Test from the same network as your Neko server first
Problem: SSL certificate verification failed
Solutions:
- For self-signed certificates, set export PYTHONHTTPSVERIFY=0 (development only)
- Or add the certificate to the system trust store
- Use the IP address instead of the hostname if DNS is the issue
❌ Authentication Issues
Problem: Authentication failed or 401 Unauthorized
Solutions:
- Double-check username/password: echo $NEKO_USER $NEKO_PASS
- Verify the user has WebSocket API access in the Neko admin panel
- Try a direct WebSocket token if available:
# Get a token via the REST API first
curl -X POST https://your-neko-server.com/api/login \
  -H "Content-Type: application/json" \
  -d '{"username":"user","password":"pass"}'
# Use the token directly
export NEKO_WS="wss://your-neko-server.com/api/ws?token=YOUR_TOKEN"
❌ Episode Recording Issues
Problem: Episodes not being saved/captured
Solutions:
1. Check command format: Commands must be exact:
/start task description here ✅
/Start task description ❌ (wrong case)
/ start task description ❌ (extra space)
2. Verify action format: Actions must be valid JSON:
Action: {"type": "click", "coordinate": [100, 200]} ✅
Action: {type: "click", coordinate: [100, 200]} ❌ (missing quotes)
Action {"type": "click"} ❌ (missing colon)
3. Check minimum frames: Episodes with fewer than CAPTURE_MIN_FRAMES frames are discarded:
export CAPTURE_MIN_FRAMES=1  # Save all episodes
4. Enable debug logging:
export CAPTURE_LOGLEVEL=DEBUG
python src/capture.py
❌ Storage Issues
Problem: Permission denied writing to the local directory
Solutions:
- Check directory permissions: ls -la ./data/
- Create the directory manually: mkdir -p ./data/mds
- Use a different path: export CAPTURE_OUT=/tmp/capture-data
Problem: S3 upload failures
Solutions:
1. Verify credentials:
aws sts get-caller-identity  # Test AWS credentials
2. Check bucket permissions: Ensure your credentials can write to the bucket
3. Test endpoint connectivity:
# For MinIO/R2
curl -I $S3_ENDPOINT_URL
4. Debug with minimal config:
unset CAPTURE_REMOTE  # Disable S3, use local only
python src/capture.py
❌ Performance Issues
Problem: High memory usage or slow performance
Solutions:
1. Reduce the capture rate:
export CAPTURE_FPS=1             # Lower frame rate
export CAPTURE_JPEG_QUALITY=70   # Lower image quality
2. Use shorter episodes:
export CAPTURE_EPISODE_TIMEOUT=300  # 5 minutes max
3. Monitor resources:
# Watch memory usage
watch 'ps aux | grep capture.py'
# Check disk space
df -h ./data/mds/
Debugging Commands
View Current Configuration
python src/capture.py --help # See all options
env | grep CAPTURE # Show capture settings
env | grep NEKO # Show connection settings
Test Components Individually
Test REST Authentication:
python -c "
import requests, os
r = requests.post(f'{os.getenv(\"NEKO_URL\")}/api/login',
json={'username': os.getenv('NEKO_USER'),
'password': os.getenv('NEKO_PASS')})
print(f'Status: {r.status_code}')
print(f'Response: {r.text}')
"
Test Screenshot Endpoint:
python -c "
import requests, os
r = requests.get(f'{os.getenv(\"NEKO_URL\")}/api/room/screen/shot.jpg')
print(f'Status: {r.status_code}')
print(f'Content-Type: {r.headers.get(\"content-type\")}')
with open('/tmp/test_screenshot.jpg', 'wb') as f:
f.write(r.content)
print('Screenshot saved to /tmp/test_screenshot.jpg')
"
Log Analysis
Enable detailed logging to diagnose issues:
export CAPTURE_LOGLEVEL=DEBUG
python src/capture.py 2>&1 | tee capture.log
Key log messages to look for:
- REST login ok - Authentication successful
- Screen size: 1920x1080 - WebSocket connection established
- Episode [id] started - Episode recording began
- Episode [id] committed - Episode saved successfully
- WS loop error - WebSocket connection issues
- Shot.jpg non-OK - Screenshot endpoint problems
Getting Help
If you're still having issues:
- Check the logs with CAPTURE_LOGLEVEL=DEBUG
- Verify your environment with env | grep -E "(NEKO|CAPTURE|AWS)"
- Test components separately using the debug commands above
- Create a minimal test case:
# Simplest possible configuration
export NEKO_URL="https://your-server.com"
export NEKO_USER="testuser"
export NEKO_PASS="testpass"
export CAPTURE_OUT="/tmp/test-capture"
export CAPTURE_LOGLEVEL="DEBUG"
python src/capture.py
Performance Tuning
For optimal performance in different scenarios:
High-Quality Training Data
export CAPTURE_FPS=3
export CAPTURE_JPEG_QUALITY=95
export CAPTURE_COMPRESSION="zstd:3" # Faster compression
export CAPTURE_SHARD_SIZE="256mb" # Smaller shards for faster upload
Long Sessions/Low Memory
export CAPTURE_FPS=1
export CAPTURE_JPEG_QUALITY=70
export CAPTURE_EPISODE_TIMEOUT=1800
export CAPTURE_COMPRESSION="zstd:9" # Maximum compression
Local Development
unset CAPTURE_REMOTE # No S3 upload
export CAPTURE_COMPRESSION="none" # Faster local writes
export CAPTURE_LOGLEVEL="INFO"
Best Practices
Episode Design
Keep episodes focused: Record single, complete tasks rather than mixing multiple workflows:
# Good: focused task
/start Complete user registration process
# ... perform registration steps ...
/stop
# Poor: mixed tasks
/start Do registration then check email then browse products
Use descriptive task names: Help your training pipeline understand the data:
/start Handle payment form validation errors
/start Navigate product catalog with filters
/start Recover from session timeout during checkout
Include error scenarios: Capture both success and failure paths:
/start Login with invalid credentials and recover
/start Handle network interruption during file upload
/start Deal with form validation errors
Action Annotation Guidelines
Be consistent with action types: Use standardized action schemas:
{"type": "click", "coordinate": [x, y], "element": "button_id"}
{"type": "type", "text": "input text", "element": "field_id"}
{"type": "scroll", "direction": "down", "amount": 300}
{"type": "navigate", "url": "https://example.com"}
{"type": "wait", "duration": 2.0, "reason": "page_load"}
Include context in actions: Add semantic information when possible:
{"type": "click", "coordinate": [200, 100], "element": "login_button", "intent": "submit_form"}
{"type": "type", "text": "user@example.com", "element": "email_field", "intent": "enter_credentials"}
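A lightweight validator can enforce these schemas before annotated actions enter a training set. The sketch below is illustrative only: the `REQUIRED_FIELDS` mapping and `validate_action` helper are assumptions derived from the examples above, not part of `capture.py`.

```python
import json

# Required fields per action type, following the example schemas above.
REQUIRED_FIELDS = {
    "click": {"coordinate"},
    "type": {"text"},
    "scroll": {"direction"},
    "navigate": {"url"},
    "wait": {"duration"},
}

def validate_action(action):
    """Return True if the action has a known type and all required fields."""
    required = REQUIRED_FIELDS.get(action.get("type"))
    if required is None:
        return False
    return required.issubset(action.keys())

actions = [json.loads(line) for line in [
    '{"type": "click", "coordinate": [200, 100], "element": "login_button"}',
    '{"type": "wait", "reason": "page_load"}',  # missing "duration"
]]
print([validate_action(a) for a in actions])  # -> [True, False]
```

Extra fields such as `element` or `intent` pass through untouched, so the semantic context described above is preserved.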
Storage Management
Choose appropriate compression: Balance speed vs storage:
- Development: CAPTURE_COMPRESSION="none" (fastest)
- Production: CAPTURE_COMPRESSION="zstd:6" (balanced)
- Archive: CAPTURE_COMPRESSION="zstd:9" (smallest)
Optimize shard sizes for your infrastructure:
- Fast networks: CAPTURE_SHARD_SIZE="1gb" (fewer files)
- Slow networks: CAPTURE_SHARD_SIZE="128mb" (faster uploads)
- Mobile/edge: CAPTURE_SHARD_SIZE="64mb" (reliable transfers)
Use appropriate retention policies:
# Keep local copies for immediate access
export CAPTURE_KEEP_LOCAL=1
# Or upload and clean for space efficiency
export CAPTURE_KEEP_LOCAL=0
Advanced Usage
Continuous Capture Workflows
For long-running capture sessions, run the service under systemd or Docker for reliability:
Systemd service (/etc/systemd/system/neko-capture.service):
[Unit]
Description=Neko Training Data Capture
After=network.target
[Service]
Type=simple
User=capture
WorkingDirectory=/opt/neko-capture
Environment=NEKO_URL=https://neko.example.com
Environment=NEKO_USER=capture-bot
EnvironmentFile=/etc/neko-capture/config
ExecStart=/usr/bin/python3 src/capture.py
Restart=always
RestartSec=30
[Install]
WantedBy=multi-user.target
Docker deployment:
FROM python:3.11-slim
RUN pip install streaming requests websockets
COPY src/capture.py /app/
ENV NEKO_URL=https://neko.example.com
CMD ["python", "/app/capture.py"]
Multi-instance Scaling
Run multiple capture instances for parallel data collection:
# Instance 1: UI automation tasks
export CAPTURE_OUT="./data/ui-automation"
export CAPTURE_REMOTE="s3://training/ui-automation"
python src/capture.py &
# Instance 2: Navigation tasks
export CAPTURE_OUT="./data/navigation"
export CAPTURE_REMOTE="s3://training/navigation"
python src/capture.py &
# Instance 3: Error handling
export CAPTURE_OUT="./data/error-recovery"
export CAPTURE_REMOTE="s3://training/error-recovery"
python src/capture.py &
Integration with Training Pipelines
Streaming data loader example:
from streaming import StreamingDataset
import torch
from torch.utils.data import DataLoader
class UIDataset(StreamingDataset):
    def __getitem__(self, idx):
        sample = super().__getitem__(idx)
        # Extract and process the episode payload (process_episode is user-defined)
        payload = sample['payload']
        episode = self.process_episode(payload)
        return {
            'frames': episode['frames'],
            'actions': episode['actions'],
            'task': sample['task'],
        }

# Use with PyTorch
dataset = UIDataset(remote="s3://training/episodes")
loader = DataLoader(dataset, batch_size=32, num_workers=4)
Quality Assurance
Validate episodes before training:
def validate_episode(episode_data):
    """Check episode quality before using for training."""
    frames = episode_data['frames']
    actions = episode_data['actions']
    # Check minimum length
    if len(frames) < 10:
        return False, "Too few frames"
    # Check action/frame ratio
    if len(actions) / len(frames) < 0.1:
        return False, "Too few actions relative to frames"
    # Check for valid click coordinates
    for action in actions:
        if action.get('type') == 'click':
            coord = action.get('coordinate', [])
            if len(coord) != 2:
                return False, "Malformed click coordinates"
            if not (0 <= coord[0] <= 1920 and 0 <= coord[1] <= 1080):
                return False, "Invalid click coordinates"
    return True, "Valid episode"
Monitor capture quality:
# Check recent episodes
python -c "
from streaming import StreamingDataset
ds = StreamingDataset(local='./data/mds')
for i, sample in enumerate(ds):
    if i >= 10: break
    print(f'Episode {sample[\"episode_id\"]}: {sample[\"num_frames\"]} frames, {sample[\"num_actions\"]} actions')
"
Finetuning on Captured Episodes
This guide explains how to finetune the agent's ShowUI/Qwen2VL model using the episodes recorded by src/capture.py (MosaicML Streaming, aka MDS). All commands assume you are inside the Nix development shell (nix develop).
What It Trains
The training script turns each annotated action in an episode into a supervised example:
- Prompt: the same chat template used by the agent (system + task + image and optional short action history).
- Target: the JSON action string (e.g., {"action":"CLICK", ...}) observed in chat as Action: {...} during capture.
This directly teaches the model to emit the agent’s expected action schema.
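As a rough sketch of that conversion, the snippet below shows the shape of one supervised example. It is illustrative only: the real trainer builds the prompt through the agent's chat template and the Hugging Face processor, so `to_training_pair` and its exact formatting are assumptions.

```python
import json

def to_training_pair(task, action, history):
    """Build an illustrative (prompt, target) pair from one captured action.

    The actual trainer uses the agent's chat template via the HF processor;
    this sketch only shows the shape of the supervision signal.
    """
    lines = [f"Task: {task}"]
    if history:
        lines.append("Previous actions: " + "; ".join(history))
    lines.append("<image>")  # placeholder for the frame at this step
    prompt = "\n".join(lines)
    # Target is the JSON action string, exactly as logged in chat.
    target = "Action: " + json.dumps(action, separators=(",", ":"))
    return prompt, target

prompt, target = to_training_pair(
    "Search for AI news",
    {"action": "CLICK", "position": [0.5, 0.3]},
    ["INPUT", "CLICK"],
)
print(target)  # -> Action: {"action":"CLICK","position":[0.5,0.3]}
```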
Environment Variables
Set these in .env (already scaffolded in .env.example):
# Data sources (MDS)
TRAIN_LOCAL="./data/mds" # defaults to CAPTURE_OUT
# TRAIN_REMOTE="s3://bucket/prefix" # optional remote MDS
TRAIN_CACHE="./data/cache" # local cache for StreamingDataset
# Output
TRAIN_OUTPUT="./checkpoints" # where to save model + processor
# Logging
TRAIN_LOGLEVEL=INFO # DEBUG|INFO|WARNING|ERROR
TRAIN_LOG_FORMAT=text # text|json
# Optimization
TRAIN_EPOCHS=1
TRAIN_BATCH=1
TRAIN_ACCUM=1
TRAIN_LR=5e-6
TRAIN_WD=0.0
TRAIN_MAX_STEPS=0 # 0 disables cap
TRAIN_MAX_SAMPLES_PER_EPOCH=0 # 0 disables per-epoch cap
TRAIN_HISTORY_STEPS=0 # include N prior actions in prompt
SEED=1337
The model ID and image sizes inherit from the agent defaults:
REPO_ID="showlab/ShowUI-2B"
SIZE_SHORTEST_EDGE=224
SIZE_LONGEST_EDGE=1344
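A shortest/longest-edge constraint is typically resolved as below. This is a hedged sketch of the geometry only; the actual Qwen2VL processor implements its own variant and additionally rounds to patch-size multiples.

```python
def fit_edges(w, h, shortest=224, longest=1344):
    """Scale (w, h) so the short edge meets `shortest` and the long edge
    never exceeds `longest`, preserving aspect ratio."""
    scale = shortest / min(w, h)      # hit the shortest-edge floor
    if max(w, h) * scale > longest:   # but cap the longest edge
        scale = longest / max(w, h)
    return round(w * scale), round(h * scale)

print(fit_edges(1920, 1080))  # 16:9 frame -> (398, 224)
```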
Running Training
# Basic run (uses .env)
python src/train.py
# With overrides
TRAIN_BATCH=2 TRAIN_EPOCHS=1 python src/train.py --output ./ckpt/showui-ft
# Using just
just train
Checkpoints are written under TRAIN_OUTPUT (epoch-* and final/). Use them with the agent by pointing REPO_ID to the checkpoint folder.
Notes
- The script streams episodes with streaming.StreamingDataset (MDS). It can read from a local directory and/or a remote (S3/R2/MinIO) when configured.
- Mixed precision is enabled on CUDA automatically. On Apple MPS it uses the default precision.
- This is a minimal reference trainer for reproducible fine-tuning. For large jobs, consider adding gradient checkpointing, DeepSpeed, or PEFT adapters in your environment.
Voice Synthesis (YAP)
YAP (Yet Another Presenter) provides real-time text-to-speech capabilities for your Neko automation sessions. Convert text messages into natural-sounding speech that plays directly in the browser.
What is YAP?
YAP transforms text into speech using advanced AI voice synthesis (F5-TTS) and streams the audio through WebRTC to your Neko browser session. This enables:
- Voice announcements during automation tasks
- Interactive conversations through chat commands
- Live narration of automation steps
- Multiple voice personas with custom characteristics
Quick Start
1. Prerequisites
Before using YAP, ensure you have:
- A running Neko server (see Getting Started)
- GPU environment for optimal performance: nix develop .#gpu
- Basic voice files in the ./voices directory
2. Start YAP Service
# Connect to local Neko server
export NEKO_URL="http://localhost:8080"
export NEKO_USER="user"
export NEKO_PASS="password"
uv run src/yap.py
You should see:
[12:34:56] yap INFO - WS connected
[12:34:56] yap INFO - RTC answer sent; audio track live.
[12:34:56] yap INFO - Voices reloaded (1 entries)
3. Test Voice Output
In your Neko browser chat, type:
/yap Hello! I am your voice assistant.
You should hear the text spoken through the browser audio.
Voice Commands
YAP responds to commands in the Neko chat interface:
Immediate Speech
Speak text immediately:
/yap Good morning! Ready to start automation.
/yap The task has been completed successfully.
Streaming Mode
For longer conversations or live narration:
/yap:begin
I'm starting the automation task now...
Navigating to the website...
Filling out the form...
Submitting the data...
/yap:end
In streaming mode, YAP processes text incrementally as you type, enabling natural conversation flow.
Stop/Clear Queue
Cancel current speech and clear the queue:
/yap:stop
Voice Management
Default Voice Setup
YAP needs at least one voice configured. Create a basic setup:
1. Create voices directory:
mkdir -p voices
2. Add a reference audio file:
# Record or copy a 3-10 second WAV file
cp your-voice-sample.wav voices/default.wav
3. YAP will auto-create voices.json on first run with default settings.
Adding New Voices
Add voices through chat commands:
/yap:voice add --spk alice --ref ./voices/alice.wav --ref-text "Hello, my name is Alice" --styles "friendly,calm"
Parameters:
- --spk: Voice ID/name
- --ref: Path to reference audio file (WAV format, 3-10 seconds)
- --ref-text: Transcript of the reference audio
- --styles: Comma-separated style tags
- --rate: Speech speed (0.5-2.0, default 1.0)
- --pitch: Pitch shift in semitones (-12 to +12, default 0.0)
Switching Voices
Change the active voice and parameters:
/yap:voice set --spk alice
/yap:voice set --spk bob --rate 1.2 --pitch -0.5
/yap:voice set --rate 0.8
Reload Voice Configuration
After manually editing voices/voices.json:
/yap:voice reload
Configuration
Basic Settings
Set these environment variables before starting YAP:
# Connection (choose one method)
export YAP_WS="wss://demo.neko.com/api/ws?token=your_token"
# OR
export NEKO_URL="https://demo.neko.com"
export NEKO_USER="username"
export NEKO_PASS="password"
# Voice directory
export YAP_VOICES_DIR="./voices"
export YAP_SPK_DEFAULT="default"
Audio Quality
# Audio format (recommended settings)
export YAP_SR=48000 # Sample rate (Hz)
export YAP_AUDIO_CHANNELS=1 # Channels (1=mono, 2=stereo)
export YAP_FRAME_MS=20 # WebRTC frame size
# Processing
export YAP_PARALLEL=2 # TTS worker threads
export YAP_MAX_CHARS=350 # Max characters per chunk
export YAP_OVERLAP_MS=30 # Audio crossfade overlap
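YAP_MAX_CHARS bounds the text handed to each TTS worker. The sketch below shows one way punctuation-aware segmentation can work; it is illustrative, and yap.py's actual segmenter may differ (for example, by also hard-splitting single sentences longer than the limit).

```python
import re

def segment(text, max_chars=350):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Split on sentence-ending punctuation, keeping the punctuation attached.
    sentences = re.findall(r"[^.!?]+[.!?]*\s*", text)
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += s
    if current.strip():
        chunks.append(current.strip())
    return chunks

print(segment("First sentence. Second one! A third?", max_chars=20))
# -> ['First sentence.', 'Second one! A third?']
```

Smaller chunks reach the workers sooner (lower latency); larger chunks give the synthesizer more context (better prosody), which is the trade-off the tuning settings above expose.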
Performance Tuning
# For faster response (lower quality)
export YAP_MAX_CHARS=200
export YAP_PARALLEL=4
# For better quality (higher latency)
export YAP_MAX_CHARS=500
export YAP_OVERLAP_MS=50
# Buffer management
export YAP_JITTER_MAX_SEC=6.0 # Audio buffer size
Voice Configuration File
YAP stores voice settings in voices/voices.json:
{
  "default": {
    "ref_audio": "./voices/default.wav",
    "ref_text": "This is my default voice sample.",
    "styles": ["neutral"],
    "params": {
      "rate": 1.0,
      "pitch": 0.0
    }
  },
  "alice": {
    "ref_audio": "./voices/alice.wav",
    "ref_text": "Hello, my name is Alice and I sound friendly.",
    "styles": ["friendly", "energetic"],
    "params": {
      "rate": 1.1,
      "pitch": 0.2
    }
  },
  "narrator": {
    "ref_audio": "./voices/narrator.wav",
    "ref_text": "I will be narrating the automation process.",
    "styles": ["professional", "clear"],
    "params": {
      "rate": 0.9,
      "pitch": -0.3
    }
  }
}
Voice Parameters
- ref_audio: Path to reference WAV file (3-10 seconds, clear speech)
- ref_text: Exact transcript of the reference audio
- styles: Descriptive tags (friendly, professional, calm, energetic)
- rate: Speech speed multiplier (0.5=slow, 1.0=normal, 2.0=fast)
- pitch: Pitch adjustment in semitones (negative=lower, positive=higher)
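Because the registry is plain JSON, it can be loaded and defaulted with the standard library alone. A minimal sketch (`load_voices` is a hypothetical helper, not yap.py's actual loader):

```python
import json

def load_voices(registry_text):
    """Parse a voices.json payload and apply defaults for missing params."""
    voices = json.loads(registry_text)
    for name, cfg in voices.items():
        params = cfg.setdefault("params", {})
        params.setdefault("rate", 1.0)   # normal speed
        params.setdefault("pitch", 0.0)  # no shift
        cfg.setdefault("styles", [])
        if "ref_audio" not in cfg:
            raise ValueError(f"voice {name!r} is missing ref_audio")
    return voices

raw = '{"default": {"ref_audio": "./voices/default.wav", "ref_text": "sample"}}'
voices = load_voices(raw)
print(voices["default"]["params"])  # -> {'rate': 1.0, 'pitch': 0.0}
```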
Usage Scenarios
Automation Announcements
Start YAP alongside your automation agent:
# Terminal 1: Start YAP
uv run src/yap.py
# Terminal 2: Run automation with voice (audio is enabled by default)
uv run src/agent.py --task "Fill out contact form"
The agent can announce progress:
/yap Starting automation task: Fill out contact form
/yap Navigating to the website...
/yap Form submitted successfully!
Interactive Sessions
Use YAP for live interaction during manual control:
# Terminal 1: Start YAP
uv run src/yap.py
# Terminal 2: Manual control
python src/manual.py
Then control both automation and voice through chat:
!click 100 200
/yap Clicked on the submit button
!type "Hello World"
/yap Entered text in the field
Multi-Voice Conversations
Set up different voices for different purposes:
# Set up voices
/yap:voice add --spk system --ref ./voices/system.wav --ref-text "System notification" --rate 0.9
/yap:voice add --spk user --ref ./voices/user.wav --ref-text "User interaction" --rate 1.1
# Use in conversation
/yap:voice set --spk system
/yap System initialization complete.
/yap:voice set --spk user
/yap Thank you for the update!
Troubleshooting
No Audio Output
Check browser audio permissions:
- Click the browser's address bar lock icon
- Ensure "Sound" is allowed
- Check browser volume settings
Verify WebRTC connection:
- Open browser developer tools (F12)
- Go to Console tab
- Look for WebRTC connection messages
- Check for audio stream indicators
Test connection:
uv run src/yap.py --healthcheck
Poor Audio Quality
Check reference audio:
- Use high-quality WAV files (16-bit, 22kHz+)
- 3-10 second samples with clear speech
- No background noise or music
- Single speaker only
Adjust processing:
# Increase overlap for smoother transitions
export YAP_OVERLAP_MS=50
# Reduce chunk size for lower latency
export YAP_MAX_CHARS=250
High Latency
Optimize for speed:
# Increase parallel workers
export YAP_PARALLEL=4
# Reduce chunk size
export YAP_MAX_CHARS=200
# Reduce buffer size
export YAP_JITTER_MAX_SEC=3.0
Check GPU usage:
# Monitor GPU usage while YAP is running
nvidia-smi -l 1
Connection Issues
WebSocket connection failed:
# Test WebSocket endpoint
curl -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" \
http://localhost:8080/api/ws
Authentication failed:
# Test REST login
curl -X POST http://localhost:8080/api/login \
-H "Content-Type: application/json" \
-d '{"username":"user","password":"password"}'
Check firewall/network:
- Ensure ports 8080 (HTTP) and WebRTC ports are accessible
- Test with STUN server connectivity
- Check for corporate proxy/firewall blocking WebRTC
Debug Mode
Enable detailed logging:
export YAP_LOGLEVEL=DEBUG
export YAP_LOG_FORMAT=json
uv run src/yap.py 2>&1 | jq .
Look for:
- WebSocket connection status
- WebRTC negotiation progress
- TTS processing times
- Audio buffer status
Advanced Usage
Custom Voice Training
For better voice quality, record multiple reference samples:
1. Record varied samples:
# Different emotions/styles
voices/alice-happy.wav
voices/alice-serious.wav
voices/alice-excited.wav
2. Test different samples:
/yap:voice set --spk alice-happy
/yap I'm so excited about this automation!
/yap:voice set --spk alice-serious
/yap Please review these results carefully.
Integration with Automation
Modify automation scripts to include voice feedback:
# In your automation script
import requests

def announce(text):
    """Send voice announcement to YAP"""
    requests.post(f"{neko_url}/api/chat", json={
        "message": f"/yap {text}"
    })

# Use in automation
announce("Starting login process")
agent.click_login_button()
announce("Login successful, proceeding to dashboard")
Docker Deployment
Deploy YAP as a container service:
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y ffmpeg
# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application and voices
COPY src/ /app/src/
COPY voices/ /app/voices/
WORKDIR /app
# Configuration
ENV YAP_VOICES_DIR=/app/voices
ENV YAP_SR=48000
ENV YAP_PARALLEL=2
ENTRYPOINT ["python", "src/yap.py"]
Next Steps
- Learn about Training Data Capture to improve voice models
- Explore Core Agent for automation integration
- Read TTS Service Technical Details for advanced configuration
- Check Neko Integration for server setup options
Related Guides
- Getting Started - Initial system setup
- Training Data Capture - Data collection for improvements
- Manual Control CLI - Interactive testing
- Architecture Overview - System design
Architecture Overview
This document provides a comprehensive overview of the Neko Agent system architecture, covering the core components, data flow, and integration patterns.
System Overview
graph TB
    subgraph "Development Environment"
        Agent["`**Neko Agent** (src/agent.py) • WebRTC Client • AI Vision Model • Action Execution`"]
        Capture["`**Capture Service** (src/capture.py) • Training Data Collection • MosaicML Streaming • S3 Upload`"]
        TTS["`**TTS Service** (src/yap.py) • F5-TTS Synthesis • Voice Management • WebRTC Audio`"]
    end
    subgraph "Neko Server"
        Chrome["`**Chrome Container** • GUI Target • WebRTC Server • Screen Sharing`"]
        NekoAPI["`**Neko API** • Authentication • WebSocket Signaling • Session Management`"]
    end
    subgraph "External Services"
        S3["`**S3 Storage** • Training Data • Model Artifacts • Voice Assets`"]
        Models["`**AI Models** • ShowUI-2B (Vision) • F5-TTS (Audio) • Custom Voices`"]
    end
    Agent <-->|WebRTC Data| Chrome
    Agent <-->|Control API| NekoAPI
    Agent <-->|Screenshots| Capture
    Agent <-->|Voice Commands| TTS
    Capture -->|MDS Shards| S3
    TTS <-->|Audio Stream| Chrome
    TTS <-->|Voice Models| Models
    Agent <-->|Vision Inference| Models
Core Components
1. Neko Agent (src/agent.py)
The main automation agent that orchestrates GUI interactions through AI-powered decision making.
Key Features:
- WebRTC Integration: Real-time connection to Neko Chrome containers
- AI Vision: ShowUI-2B (Qwen2VL) model for visual reasoning and action planning
- Action Execution: Supports CLICK, INPUT, SCROLL, SWIPE, TAP, ANSWER actions
- Prometheus Metrics: Built-in observability and performance monitoring
Architecture:
class NekoAgent:
    def __init__(self):
        self.model = ShowUI2B()
        self.webrtc = WebRTCClient()
        self.metrics = PrometheusMetrics()

    async def run(self, task: str):
        # 1. Connect to Neko via WebRTC
        await self.webrtc.connect()
        # 2. Capture current screen
        frame = await self.webrtc.get_frame()
        # 3. AI model analyzes screen and plans action
        action = await self.model.analyze(frame, task)
        # 4. Execute action through WebRTC
        await self.webrtc.execute_action(action)
        # 5. Repeat until task complete
2. Capture Service (src/capture.py)
Automated training data collection system that records agent sessions for model improvement.
Key Features:
- Episode Recording: Packages complete automation sessions as training samples
- MosaicML Streaming: Generates MDS shards for efficient distributed training
- S3 Integration: Direct upload to cloud storage with configurable backends
- WebSocket Monitoring: Real-time capture of agent actions and screen states
Data Flow:
sequenceDiagram
    participant User
    participant Agent
    participant Capture
    participant Neko
    participant S3
    User->>Agent: Start automation task
    Agent->>Capture: Begin episode recording
    loop Every Action
        Agent->>Neko: Execute action
        Neko->>Agent: Screen update
        Agent->>Capture: Log action + timestamp
        Capture->>Capture: Store frame + metadata
    end
    Agent->>Capture: End episode
    Capture->>Capture: Generate ZIP archive
    Capture->>S3: Upload MDS shard
3. TTS Service (src/yap.py)
Real-time text-to-speech system providing voice feedback during automation.
Key Features:
- F5-TTS Integration: High-quality voice synthesis with custom voice models
- WebRTC Audio: Direct audio streaming to Neko sessions
- Voice Management: Hot-reloadable voice registry with style parameters
- Streaming Support: Both immediate and incremental text processing
Audio Pipeline:
graph LR
    Text["`Text Input /yap commands`"] --> Segment["`Segmentation Punctuation-aware chunks`"]
    Segment --> Workers["`Parallel Workers F5-TTS synthesis Rate/pitch control`"]
    Workers --> Splicer["`Audio Splicer Crossfade blending Jitter buffering`"]
    Splicer --> WebRTC["`WebRTC Output 48kHz audio stream Real-time delivery`"]
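The splicer's crossfade step can be sketched in a few lines. This linear-ramp version over plain sample lists is illustrative only; the real service operates on PCM buffers and applies jitter buffering as well.

```python
def crossfade(a, b, overlap):
    """Splice two audio chunks with a linear crossfade over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    out = a[:len(a) - overlap]
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # ramp 0 -> 1 across the overlap
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

# 48 kHz * 30 ms = 1440 samples of overlap, as with YAP_OVERLAP_MS=30
print(len(crossfade([0.0] * 4800, [1.0] * 4800, 1440)))  # -> 8160
```

The overlap trades a little total duration for smooth chunk boundaries, which is why YAP_OVERLAP_MS above defaults to a few tens of milliseconds.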
Data Flow and Integration
WebRTC Communication
The system uses WebRTC for low-latency communication with Neko containers:
# WebRTC connection establishment
async def connect_to_neko():
# 1. REST API authentication
token = await authenticate(neko_url, username, password)
# 2. WebSocket signaling
ws_url = f"wss://{host}/api/ws?token={token}"
signaling = await websockets.connect(ws_url)
# 3. WebRTC peer connection
pc = RTCPeerConnection()
# 4. Add video track (for screen capture)
@pc.on("track")
async def on_track(track):
if track.kind == "video":
frame = await track.recv()
# Process frame for AI analysis
# 5. Exchange SDP offer/answer
await exchange_sdp(pc, signaling)
Training Data Format
The capture service generates training samples in this format:
episode_xyz789.zip
├── meta.json # Episode metadata
├── frames/ # Screenshot sequence
│ ├── 000000.jpg # Frame 0 (t=0.0s)
│ ├── 000001.jpg # Frame 1 (t=0.5s)
│ └── ...
├── frames.ndjson # Frame index with timestamps
└── actions.ndjson # Action sequence with timestamps
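Given that layout, an episode payload can be unpacked with the standard library alone. The sketch below is illustrative (`read_episode` is not part of the codebase, and the demo builds a minimal in-memory archive rather than a real capture):

```python
import io
import json
import zipfile

def read_episode(payload):
    """Unpack an episode archive laid out as in the tree above."""
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        meta = json.loads(zf.read("meta.json"))
        actions = [json.loads(l) for l in zf.read("actions.ndjson").splitlines() if l.strip()]
        frames = sorted(n for n in zf.namelist()
                        if n.startswith("frames/") and n.endswith(".jpg"))
    return {"meta": meta, "actions": actions, "frames": frames}

# Build a tiny in-memory episode to demonstrate the layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("meta.json", json.dumps({"episode_id": "xyz789"}))
    zf.writestr("frames/000000.jpg", b"\xff\xd8fake")
    zf.writestr("actions.ndjson", '{"type": "click"}\n')

ep = read_episode(buf.getvalue())
print(ep["meta"]["episode_id"], len(ep["frames"]), len(ep["actions"]))  # -> xyz789 1 1
```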
MDS Shard Structure:
# MosaicML Streaming format
columns = {
    "episode_id": "str",
    "task": "str",
    "payload": "bytes",  # ZIP archive
    "num_frames": "int",
    "num_actions": "int",
    "started_at": "str",
    "ended_at": "str",
    "screen_w": "int",
    "screen_h": "int",
}
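Each record written to a shard must match these columns exactly. A sketch of assembling one sample (`make_sample` and the field values are illustrative; the actual writer is `streaming.MDSWriter(out=..., columns=columns)`):

```python
def make_sample(episode_id, task, payload, meta):
    """Assemble one record matching the MDS columns above."""
    return {
        "episode_id": episode_id,
        "task": task,
        "payload": payload,  # the episode ZIP bytes
        "num_frames": int(meta["num_frames"]),
        "num_actions": int(meta["num_actions"]),
        "started_at": meta["started_at"],
        "ended_at": meta["ended_at"],
        "screen_w": int(meta["screen_w"]),
        "screen_h": int(meta["screen_h"]),
    }

# Illustrative values only; a real sample carries the actual archive bytes.
sample = make_sample("xyz789", "Fill out contact form", b"PK-zip-bytes", {
    "num_frames": 42, "num_actions": 7,
    "started_at": "2024-01-01T00:00:00Z", "ended_at": "2024-01-01T00:01:00Z",
    "screen_w": 1920, "screen_h": 1080,
})
print(sorted(sample))  # column names match the schema above
```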
Voice Model Management
The TTS service manages voices through a JSON registry:
{
  "instructor": {
    "ref_audio": "/voices/instructor.wav",
    "ref_text": "This is an instructional voice for tutorials",
    "styles": ["calm", "explanatory"],
    "params": {
      "rate": 0.95,
      "pitch": -1.0
    }
  }
}
Deployment Patterns
Development Mode
# Terminal 1: Neko server
nix develop .#neko
neko-services up
# Terminal 2: Agent
nix develop .#gpu
uv run src/agent.py --task "automation task"
# Terminal 3: Capture (optional)
uv run src/capture.py
# Terminal 4: TTS (optional)
uv run src/yap.py
Production Considerations
For production deployments, consider:
- Container Orchestration: Deploy components as separate containers
- GPU Resource Management: Efficient sharing between agent and TTS
- Storage Strategy: S3-compatible backends for training data
- Monitoring: Prometheus metrics and health checks
- Scaling: Multiple agent instances with load balancing
See the API Mode Design for detailed production architecture plans.
Error Handling and Resilience
Connection Resilience
# Exponential backoff for WebRTC reconnection
async def connect_with_retry():
    backoff = 1.0
    max_retries = 5
    for attempt in range(max_retries):
        try:
            return await establish_webrtc_connection()
        except ConnectionError:
            await asyncio.sleep(backoff)
            backoff = min(30.0, backoff * 2)
    raise MaxRetriesExceeded()
Graceful Degradation
- Capture failures: Agent continues without training data collection
- TTS unavailable: Silent operation mode
- Model errors: Fallback to basic automation patterns
- Network issues: Local caching and offline capabilities
Performance Characteristics
Latency Profile
- Frame processing: ~100-200ms (GPU inference)
- Action execution: ~50-100ms (WebRTC round-trip)
- Voice synthesis: ~200-500ms (F5-TTS generation)
- Training data write: ~10-50ms (local storage)
Resource Usage
- GPU Memory: 2-4GB (ShowUI-2B + F5-TTS)
- System Memory: 1-2GB (Python runtime + buffers)
- Network: 5-10 Mbps (WebRTC video + audio)
- Storage: Variable (training data accumulation)
Security Considerations
Authentication
- JWT tokens for Neko API access
- WebSocket secure connections (WSS)
- S3 IAM roles for training data upload
Data Privacy
- Local frame processing (no external API calls)
- Configurable training data retention
- Optional data encryption at rest
Network Security
- TLS/SSL for all external connections
- WebRTC DTLS encryption
- Firewall-friendly STUN/TURN protocols
Future Architecture Evolution
The current architecture is designed to support future enhancements:
- API Mode: RESTful service with horizontal scaling
- Multi-Modal: Vision + audio + text understanding
- Distributed Training: Cloud-native ML pipelines
- Edge Deployment: Lightweight inference containers
See API Mode Design for the complete next-generation architecture.
Component Documentation
This section provides detailed documentation for each component of the Neko Agent system.
Overview
The Neko Agent system consists of four main components:
- Core Agent (src/agent.py) - Main automation engine
- Capture Service (src/capture.py) - Training data collection
- Manual Control CLI (src/manual.py) - Interactive remote control interface
- TTS Service (src/yap.py) - Voice synthesis and audio
Each component is designed to work independently while providing seamless integration when used together.
Component Interaction
graph TD
    Agent[Core Agent] --> Neko[Neko Server]
    Agent -.->|Optional| Capture[Capture Service]
    Agent -.->|Optional| TTS[TTS Service]
    Manual[Manual Control CLI] --> Neko
    Manual -.->|Admin API| Sessions[Session Management]
    Capture --> Storage[Training Data Storage]
    TTS --> Audio[WebRTC Audio Stream]
    Neko --> Chrome[Chrome Container]
    Audio --> Chrome
Development Workflow
When developing with multiple components:
- Start Neko Server (if using local setup)
- Launch Core Agent for basic automation
- Add Capture Service for training data collection
- Use Manual Control CLI for testing and debugging
- Add TTS Service for voice feedback
Each component has its own configuration and can be enabled/disabled as needed.
Next Steps
- Review individual component documentation in the subsections
- See Development Setup for configuration details
- Check Architecture Overview for system design
Core Agent (src/agent.py)
The core agent (src/agent.py) is the main automation engine of the Neko Agent system. It provides AI-powered GUI automation through WebRTC connections to Neko servers, using the ShowUI-2B vision model for intelligent action planning and execution.
Overview
The agent operates as a production-ready, 12-Factor App compliant service that:
- Connects to Neko servers via WebRTC for real-time GUI interaction
- Uses ShowUI-2B (Qwen2VL) for visual reasoning and action planning
- Executes precise actions with iterative refinement for improved accuracy
- Supports multiple modes including web automation and mobile device control
- Provides comprehensive metrics via Prometheus for monitoring and observability
- Handles graceful shutdown with proper resource cleanup
Architecture
graph TB
    subgraph "Agent Core"
        Agent[NekoAgent]
        Settings[Settings]
        Signaler[Signaler]
    end
    subgraph "Frame Sources"
        WebRTC[WebRTCFrameSource]
        Lite[LiteFrameSource]
    end
    subgraph "AI Processing"
        Model[ShowUI-2B Model]
        Processor[AutoProcessor]
        Executor[ThreadPoolExecutor]
    end
    subgraph "External Services"
        Neko[Neko Server]
        Metrics[Prometheus Metrics]
        Storage[Frame/Click Storage]
    end
    Agent --> Signaler
    Agent --> WebRTC
    Agent --> Lite
    Agent --> Model
    Agent --> Processor
    Signaler <-->|WebSocket| Neko
    WebRTC <-->|WebRTC| Neko
    Lite <-->|HTTP Polling| Neko
    Agent --> Metrics
    Agent --> Storage
    Model --> Executor
Core Classes
1. Settings
Purpose: Centralized configuration management following 12-Factor App principles.
Key Features:
- Environment variable-driven configuration
- Validation with error reporting
- Support for both development and production deployments
- Port priority handling ($PORT overrides NEKO_METRICS_PORT)
Configuration Options:
| Category | Environment Variable | Default | Description |
|---|---|---|---|
| Model | REPO_ID | showlab/ShowUI-2B | Hugging Face model repository |
| | SIZE_SHORTEST_EDGE | 224 | Image resize shortest edge |
| | SIZE_LONGEST_EDGE | 1344 | Image resize longest edge |
| Network | NEKO_WS | wss://neko.example.com/api/ws | WebSocket URL |
| | NEKO_ICE_POLICY | strict | ICE candidate policy |
| | NEKO_STUN_URL | stun:stun.l.google.com:19302 | STUN server |
| | NEKO_TURN_URL | - | TURN server (optional) |
| Behavior | NEKO_MAX_STEPS | 8 | Maximum automation steps |
| | REFINEMENT_STEPS | 5 | Click refinement iterations |
| | NEKO_AUDIO | 1 | Enable audio streaming |
| Logging | NEKO_LOGLEVEL | INFO | Log level |
| | NEKO_LOG_FORMAT | text | Log format (text/json) |
| | NEKO_LOGFILE | - | Log file path (optional) |
| Storage | FRAME_SAVE_PATH | - | Frame screenshot storage |
| | CLICK_SAVE_PATH | - | Action visualization storage |
| | OFFLOAD_FOLDER | ./offload | Model cache directory |
| Metrics | PORT / NEKO_METRICS_PORT | 9000 | Prometheus metrics port |
Example Usage:
# Load configuration from environment
settings = Settings.from_env()

# Validate configuration
errors = settings.validate()
if errors:
    for error in errors:
        print(f"Configuration error: {error}")
    sys.exit(1)
2. NekoAgent
Purpose: Main orchestration class that manages WebRTC connections, AI processing, and action execution.
Initialization:
agent = NekoAgent(
    model=model,              # ShowUI-2B model instance
    processor=processor,      # AutoProcessor for model inputs
    ws_url="wss://neko.example.com/api/ws",
    nav_task="Navigate to google.com and search for AI",
    nav_mode="web",           # "web" or "phone"
    settings=settings,
    logger=logger,
    max_steps=10,
    refinement_steps=3,
    audio=True,
    online=False,             # False=single task, True=multi-task chat mode
)
Key Methods:
async def run()
Main execution loop that handles connection lifecycle, reconnection logic, and task processing.
Flow:
- Signal handling - Register SIGINT/SIGTERM handlers
- Connection establishment - WebSocket connection with backoff
- Media negotiation - WebRTC SDP offer/answer exchange
- Frame processing - Continuous screen capture and AI analysis
- Action execution - Translate AI decisions to control commands
- Graceful shutdown - Clean resource cleanup
async def _navigate_once(img, history, step)
Performs single navigation step using AI model inference.
Parameters:
- img: Current screen image (PIL Image)
- history: List of previous actions for context
- step: Current step number for logging
Process:
- Prepare inputs - Format image and chat template for model
- Model inference - Generate action prediction using ShowUI-2B
- Action parsing - Extract structured action from model output
- Iterative refinement - Crop and re-infer for precise click coordinates
- Action execution - Send control commands via WebSocket
Refinement Logic:
for i in range(self.refinement_steps):
    # Initial inference on full image or cropped region
    action = await model_inference(current_img)
    if action.type == "CLICK":
        # Crop around predicted click location
        current_img, crop_box = self._crop_image(original_img, action.position)
        # Re-infer on cropped image for better precision
    else:
        break  # No refinement for non-click actions
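The crop-and-re-infer arithmetic can be sketched as below. `crop_box` and `refine` are illustrative stand-ins for the agent's `_crop_image` logic, and the 50% crop fraction is a hypothetical choice.

```python
def crop_box(pos, w, h, frac=0.5):
    """Return a pixel crop box of size frac*(w, h) centered on a normalized point,
    clamped to stay inside the image."""
    cw, ch = int(w * frac), int(h * frac)
    left = min(max(int(pos[0] * w - cw / 2), 0), w - cw)
    top = min(max(int(pos[1] * h - ch / 2), 0), h - ch)
    return left, top, left + cw, top + ch

def refine(pos_in_crop, box, w, h):
    """Map a click predicted inside the crop back to full-image normalized coords."""
    left, top, right, bottom = box
    x = (left + pos_in_crop[0] * (right - left)) / w
    y = (top + pos_in_crop[1] * (bottom - top)) / h
    return x, y

box = crop_box([0.5, 0.3], 1920, 1080)
print(box, refine([0.5, 0.5], box, 1920, 1080))
# A centered re-prediction maps back to the original point (0.5, 0.3).
```

Each refinement round shrinks the region the model must localize within, which is why a few iterations improve click precision.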
async def _execute_action(action, size)
Translates high-level actions into low-level control commands.
Supported Actions:
| Action | Description | Parameters |
|---|---|---|
| CLICK | Mouse click at coordinates | position: [x, y] (normalized) |
| INPUT | Type text string | value: str |
| SCROLL | Scroll in direction | direction: "up/down/left/right" |
| SWIPE | Touch swipe gesture | startPosition: [x, y], endPosition: [x, y] |
| TAP | Mobile tap action | position: [x, y] |
| ANSWER | Task completion signal | value: "final answer" |
Example Action Execution:
# CLICK action
action = {
    "action": "CLICK",
    "position": [0.5, 0.3],  # Normalized coordinates (50%, 30%)
    "value": None,
}

# Converts to pixel coordinates and sends a control command
x, y = int(0.5 * screen_width), int(0.3 * screen_height)
await signaler.send({
    "event": "control/buttonpress",
    "payload": {"x": x, "y": y, "code": 1},  # Left click
})
3. Frame Sources
The agent supports multiple frame capture mechanisms:
WebRTCFrameSource
Purpose: Real-time frame capture via WebRTC video streams.
Features:
- Low latency - Direct WebRTC video track access
- High quality - Full resolution screen capture
- Automatic buffering - Handles frame timing and delivery
- Error resilience - Graceful handling of stream interruptions
LiteFrameSource
Purpose: HTTP-based frame capture for environments where WebRTC is unavailable.
Features:
- Simple protocol - HTTP GET requests for screenshots
- Fallback mode - Works when WebRTC fails or is blocked
- Configurable polling - Adjustable frame rate
- Resource efficient - Lower bandwidth than video streams
4. Signaler
Purpose: WebSocket communication layer for control and signaling.
Features:
- Event-driven messaging - Pub/sub pattern with topic routing
- Automatic reconnection - Exponential backoff strategy
- Concurrent I/O - Separate read/write loops
- Message queuing - Buffered sending with backpressure handling
Event Types:
- signal/* - WebRTC signaling (offer/answer/ICE)
- control/* - GUI control commands (mouse/keyboard)
- chat/* - Text messaging for online mode
- system/* - Session management and status
Action Space and Navigation Modes
Web Mode (nav_mode="web")
Optimized for: Desktop web browsers, web applications
Action Space:
Available Actions:
1. CLICK(position): Click at normalized coordinates [x, y] where 0 <= x, y <= 1
2. INPUT(value): Type the specified text string
3. SCROLL(direction): Scroll in direction (up, down, left, right)
4. ANSWER(value): Provide final answer when task is complete
Examples:
- CLICK(position=[0.5, 0.3]): Click at center-left of screen
- INPUT(value="search query"): Type text into focused element
- SCROLL(direction="down"): Scroll page downward
- ANSWER(value="Task completed successfully"): Signal completion
Phone Mode (nav_mode="phone")
Optimized for: Mobile device interfaces, touch interactions
Action Space:
Available Actions:
1. TAP(position): Tap at normalized coordinates [x, y] where 0 <= x, y <= 1
2. INPUT(value): Enter text using virtual keyboard
3. SWIPE(startPosition, endPosition): Swipe from start to end coordinates
4. SCROLL(direction): Scroll content in specified direction
5. ANSWER(value): Provide final answer when task is complete
Examples:
- TAP(position=[0.8, 0.1]): Tap near top-right corner
- SWIPE(startPosition=[0.5, 0.8], endPosition=[0.5, 0.2]): Swipe up
- SCROLL(direction="down"): Scroll content downward
AI Model Integration
ShowUI-2B (Qwen2VL)
Model Details:
- Architecture: Qwen2VL-based multimodal model specialized for GUI understanding
- Input: RGB images + text instructions
- Output: Structured action commands
- Performance: ~100-200ms inference time on GPU
Processing Pipeline:
# 1. Image preprocessing
image = resize_and_validate_image(frame, settings, logger)

# 2. Chat template preparation
content = [
    {"type": "text", "text": system_prompt},
    {"type": "text", "text": f"Task: {nav_task}"},
    {"type": "text", "text": f"Action history: {history}"},
    {"type": "image", "image": image, "size": {...}},
]

# 3. Model inference
inputs = processor(text=[text], images=[image], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# 4. Action parsing
action = safe_parse_action(decoded_output, nav_mode="web")
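The final parsing step has to tolerate free-form model output. A minimal sketch of what a safe parser might look like (the real safe_parse_action in src/agent.py may differ; the regex-plus-literal_eval approach shown here is an assumption):

```python
import ast
import re
from typing import Any, Dict, Optional

ALLOWED_WEB_ACTIONS = {"CLICK", "INPUT", "SCROLL", "ANSWER"}

def safe_parse_action(decoded: str, nav_mode: str = "web") -> Optional[Dict[str, Any]]:
    """Extract the first dict-looking span from model output and validate it.

    Returns None (rather than raising) on any failure, so the caller can
    count it as a parse error and skip the frame.
    """
    match = re.search(r"\{.*\}", decoded, re.DOTALL)
    if not match:
        return None
    try:
        # literal_eval only accepts Python literals -- safer than eval()
        action = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    if not isinstance(action, dict):
        return None
    if action.get("action") not in ALLOWED_WEB_ACTIONS:
        return None
    return action
```

Returning None on bad output (instead of raising) is what lets the agent increment its parse-error counter and simply move on to the next frame.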
Iterative Refinement: The agent implements a novel refinement strategy for improved click accuracy:
- Initial prediction on full screen image
- Crop extraction around predicted click location (50% of original size)
- Re-inference on the cropped region for finer coordinate precision
- Coordinate transformation back to full screen space
This typically improves click accuracy by 15-25% over single-shot inference.
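The crop-and-remap arithmetic can be sketched as follows (the 50% crop fraction comes from the text above; the clamping behavior and function names are assumptions, and the real _crop_image may differ):

```python
from typing import Tuple

def crop_around(size: Tuple[int, int], pos: Tuple[float, float],
                frac: float = 0.5) -> Tuple[int, int, int, int]:
    """Crop box (left, top, right, bottom) covering `frac` of the original
    size, centered on a normalized position and clamped to the image."""
    w, h = size
    cw, ch = int(w * frac), int(h * frac)
    left = min(max(int(pos[0] * w) - cw // 2, 0), w - cw)
    top = min(max(int(pos[1] * h) - ch // 2, 0), h - ch)
    return left, top, left + cw, top + ch

def to_full_coords(pos_in_crop: Tuple[float, float],
                   box: Tuple[int, int, int, int],
                   full_size: Tuple[int, int]) -> Tuple[float, float]:
    """Map a normalized position inside the crop back to normalized
    full-image coordinates."""
    left, top, right, bottom = box
    w, h = full_size
    x = (left + pos_in_crop[0] * (right - left)) / w
    y = (top + pos_in_crop[1] * (bottom - top)) / h
    return x, y
```

For example, a click predicted at (0.5, 0.5) on a 1000x800 frame yields a 500x400 crop centered on the same point, and a refined prediction inside that crop maps back to full-screen coordinates via `to_full_coords`.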
Operation Modes
Offline Mode (Single Task)
Usage: Automated execution of a single, predefined task.
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task="Find the current weather in San Francisco",
    nav_mode="web",
    online=False,  # Single-task mode
    max_steps=10,
)
await agent.run()  # Executes the task and exits
Behavior:
- Immediate control - Requests host control on connection
- Task execution - Runs navigation loop until completion or max steps
- Automatic exit - Terminates after task completion
- Host control maintained - Keeps control throughout session
Online Mode (Multi-Task)
Usage: Interactive agent that responds to chat commands for multiple tasks.
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task="",   # No initial task
    nav_mode="web",
    online=True,   # Multi-task chat mode
    max_steps=15,
)
await agent.run()  # Runs indefinitely, responding to chat
Behavior:
- On-demand control - Only requests control when active task starts
- Chat monitoring - Listens for task commands in chat messages
- Task queuing - Handles multiple sequential tasks
- Persistent session - Remains connected between tasks
- Host control release - Releases control when idle
Chat Commands:
Start a new task:
User: Navigate to amazon.com and find wireless headphones under $100
The agent will:
1. Request host control
2. Execute the navigation task
3. Release host control when complete
4. Wait for next command
Metrics and Observability
Prometheus Metrics
The agent exports comprehensive metrics for monitoring and alerting:
# Connection metrics
reconnects = Counter('neko_agent_reconnects_total', 'WebSocket reconnection attempts')
frames_processed = Counter('neko_agent_frames_processed_total', 'Video frames processed')
# Performance metrics
inference_latency = Histogram('neko_agent_inference_seconds', 'AI model inference time')
action_execution_time = Histogram('neko_agent_action_execution_seconds', 'Action execution time')
# Quality metrics
actions_executed = Counter('neko_agent_actions_executed_total', 'Actions executed by type', ['action_type'])
parse_errors = Counter('neko_agent_parse_errors_total', 'Action parsing failures')
Example Queries:
# Average inference latency over 5 minutes
rate(neko_agent_inference_seconds_sum[5m]) / rate(neko_agent_inference_seconds_count[5m])
# Action success rate
1 - (rate(neko_agent_parse_errors_total[5m]) / rate(neko_agent_frames_processed_total[5m]))
# Actions per minute by type
rate(neko_agent_actions_executed_total[1m]) * 60
Logging
Structured Logging: Supports both text and JSON formats for different environments.
# Text format (development)
export NEKO_LOG_FORMAT=text
# Output: 2024-01-15 10:30:00 INFO Model inference completed (step=3) | Duration: 0.15s | Device: GPU
# JSON format (production)
export NEKO_LOG_FORMAT=json
# Output: {"timestamp":"2024-01-15T10:30:00Z","level":"INFO","message":"Model inference completed","step":3,"duration":0.15,"device":"GPU"}
Log Levels:
- DEBUG - Detailed execution flow, frame processing details
- INFO - Task progress, action execution, performance metrics
- WARNING - Recoverable errors, fallback activations
- ERROR - Critical failures, connection issues
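The JSON output shown above can be produced with a small logging.Formatter. This is an illustrative sketch, not the agent's actual logging setup; the fixed set of extra fields is an assumption:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, mirroring the format above."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Include fields passed via logger.info(..., extra={...})
        for key in ("step", "duration", "device"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("neko-agent-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Model inference completed", extra={"step": 3, "duration": 0.15})
```

One-object-per-line output like this is what makes the production logs directly consumable by tools such as jq or a log shipper.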
Configuration Examples
Development Environment
# Basic development setup
export NEKO_WS="ws://localhost:8080/api/ws"
export NEKO_LOGLEVEL="DEBUG"
export NEKO_LOG_FORMAT="text"
export FRAME_SAVE_PATH="./debug/frames"
export CLICK_SAVE_PATH="./debug/actions"
export NEKO_MAX_STEPS="15"
export REFINEMENT_STEPS="3"
Production Environment
# Production configuration
export NEKO_WS="wss://neko.prod.example.com/api/ws"
export NEKO_LOGLEVEL="INFO"
export NEKO_LOG_FORMAT="json"
export NEKO_LOGFILE="/var/log/neko-agent.log"
export PORT="8080" # Metrics server port
export NEKO_MAX_STEPS="20"
export REFINEMENT_STEPS="5"
export NEKO_RTCP_KEEPALIVE="1" # For NAT traversal
export NEKO_FORCE_EXIT_GUARD_MS="5000" # Force exit after 5s
GPU Optimization
# GPU-specific settings
export CUDA_VISIBLE_DEVICES="0"
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export TORCH_CUDA_ARCH_LIST="8.6" # For RTX 30xx series
export OFFLOAD_FOLDER="/fast-ssd/model-cache" # SSD for model loading
Error Handling and Recovery
Connection Resilience
# Automatic reconnection with exponential backoff
async def connect_with_backoff(self):
    backoff = 1.0
    max_backoff = 60.0
    while not self.shutdown.is_set():
        try:
            await self._establish_connection()
            return
        except Exception as e:
            logger.warning(f"Connection failed: {e}, retrying in {backoff}s")
            await asyncio.sleep(backoff)
            backoff = min(max_backoff, backoff * 1.5)
Inference Timeout Handling
# Model inference with timeout and cancellation
try:
    async with asyncio.timeout(120.0):  # 2-minute timeout (asyncio.timeout requires Python 3.11+)
        outputs = await model_inference_task
except asyncio.TimeoutError:
    model_inference_task.cancel()
    logger.error("Model inference timeout, skipping frame")
    return None
Frame Source Fallback
# Automatic fallback from WebRTC to HTTP polling
if webrtc_frame_source.failed:
    logger.warning("WebRTC frame source failed, falling back to HTTP polling")
    self.frame_source = LiteFrameSource(settings, logger)
    await self.frame_source.start(session_id)
Performance Optimization
GPU Memory Management
# Efficient GPU memory usage
torch.cuda.empty_cache() # Clear cache between tasks
model.half() # Use FP16 for faster inference
torch.backends.cudnn.benchmark = True # Optimize for fixed input sizes
Frame Processing Pipeline
# Asynchronous frame processing
async def process_frames():
    async for frame in frame_source:
        # Non-blocking processing
        task = asyncio.create_task(self._navigate_once(frame, history, step))
        # Rate limiting to avoid overwhelming the model
        await asyncio.sleep(0.1)  # 10 FPS maximum
Model Loading Optimization
# Efficient model initialization
model = Qwen2VLForConditionalGeneration.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,               # Half precision
    device_map="auto",                       # Automatic device placement
    offload_folder=settings.offload_folder,  # Disk offloading for large models
    trust_remote_code=True,
)
Usage Examples
Basic Web Automation
# Simple web navigation task
import asyncio

from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from agent import NekoAgent, Settings, setup_logging

async def main():
    settings = Settings.from_env()
    logger = setup_logging(settings)

    # Load the AI model
    model = Qwen2VLForConditionalGeneration.from_pretrained("showlab/ShowUI-2B")
    processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")

    # Create and run the agent
    agent = NekoAgent(
        model=model,
        processor=processor,
        ws_url="ws://localhost:8080/api/ws",
        nav_task="Go to google.com and search for 'machine learning tutorials'",
        nav_mode="web",
        settings=settings,
        logger=logger,
    )
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())
Interactive Online Mode
# Multi-task interactive agent
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url="wss://neko.example.com/api/ws",
    nav_task="",   # Start without a task
    nav_mode="web",
    online=True,   # Enable chat-based task assignment
    settings=settings,
    logger=logger,
)
# The agent will respond to chat messages like:
#   "Please navigate to amazon.com and find wireless keyboards under $50"
#   "Go to the weather website and check the forecast for tomorrow"
await agent.run()
Custom Action Refinement
# Agent with custom refinement settings
agent = NekoAgent(
    model=model,
    processor=processor,
    ws_url=ws_url,
    nav_task=task,
    nav_mode="web",
    refinement_steps=5,  # More refinement passes for better accuracy
    max_steps=20,        # Allow longer task sequences
    settings=settings,
    logger=logger,
)
Testing and Debugging
Health Check
# Validate configuration
uv run src/agent.py --healthcheck
# Expected output:
# Configuration validation: PASSED
# Model loading: PASSED
# WebSocket connectivity: PASSED
# All systems ready
Debug Mode
# Enable comprehensive debugging
export NEKO_LOGLEVEL="DEBUG"
export FRAME_SAVE_PATH="./debug/frames"
export CLICK_SAVE_PATH="./debug/actions"
uv run src/agent.py --task "test navigation" --max-steps 3
Frame Analysis
The agent can save frames and action visualizations for debugging:
# Saved frame: ./debug/frames/frame_001_1642261234.567.png
# Saved action: ./debug/actions/action_step_001_1642261234.567_CLICK.png
Action visualizations include:
- Red circle - Predicted click location
- Green overlay - Text input areas
- Blue arrows - Scroll directions
- Purple box - Crop regions during refinement
Integration with Other Components
With Capture Service
# The agent automatically integrates with capture service
# No additional configuration needed - capture service monitors WebSocket
export CAPTURE_OUT="./training-data"
export CAPTURE_REMOTE="s3://training-bucket/episodes"
# Start capture in background
uv run src/capture.py &
# Run agent (capture will automatically record)
uv run src/agent.py --task "automation task"
With TTS Service
# Enable voice announcements during automation
export YAP_VOICES_DIR="./voices"
# Start TTS service
uv run src/yap.py &
# Run agent with voice enabled (audio is negotiated automatically)
uv run src/agent.py --task "automation task"
Future Enhancements
The agent architecture is designed to support planned enhancements:
- API Mode - RESTful service with horizontal scaling
- Multi-Modal Input - Voice commands and text instructions
- Advanced Vision Models - Support for newer GUI understanding models
- Distributed Inference - Model sharding across multiple GPUs
- Custom Action Types - Extensible action framework
- Real-Time Learning - Online adaptation from user feedback
See API Mode Design for detailed future architecture plans.
Capture Service
Manual Control CLI
The Manual Control CLI (src/manual.py) provides a production-ready command-line interface for manually interacting with Neko v3 servers. This tool implements the complete WebRTC signaling handshake required by modern Neko servers and offers a comprehensive REPL for remote desktop control.
Overview
The Manual CLI is designed for:
- Manual testing of Neko server functionality
- Remote administration tasks on hosted desktops
- Development and debugging of Neko integrations
- Quality assurance testing of desktop applications
Key Features
- Full WebRTC Signaling: Complete SDP/ICE handshake for control acceptance
- Robust Authentication: REST login with automatic token management
- Auto-reconnection: Intelligent reconnection with exponential backoff
- REPL Interface: Interactive command-line with comprehensive controls
- Multiple Input Methods: Mouse, keyboard, scrolling, and gesture support
- Admin Commands: Session management and force control operations
- Coordinate Systems: Support for both pixel and normalized (0..1) coordinates
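The normalized-coordinate mode (--norm) simply scales 0..1 fractions by the virtual screen size before a control event is sent. A sketch of the assumed conversion (the helper name is hypothetical):

```python
from typing import Tuple

def to_pixels(x: float, y: float, width: int, height: int,
              normalized: bool) -> Tuple[int, int]:
    """Convert a REPL coordinate pair to integer pixels.

    With --norm, inputs are fractions of the virtual screen size;
    otherwise they are assumed to already be pixels.
    """
    if normalized:
        return int(round(x * width)), int(round(y * height))
    return int(x), int(y)
```

So on a 1920x1080 screen, `move 0.5 0.25` in normalized mode targets the same point as `move 960 270` in pixel mode.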
Architecture
Core Components
graph TD
  CLI[ManualCLI] --> Signaler[WebSocket Signaler]
  CLI --> Broker[Event Broker]
  Signaler --> WS[WebSocket Connection]
  Signaler --> Tasks[Background Tasks]
  Broker --> SysQ[System Queue]
  Broker --> SigQ[Signal Queue]
  Broker --> MiscQ[Misc Queue]
  CLI --> PC[RTCPeerConnection]
  PC --> ICE[ICE Negotiation]
  PC --> SDP[SDP Exchange]
  WS --> Neko[Neko Server]
  ICE --> Neko
  SDP --> Neko
Class Hierarchy
- Broker - Event routing and message distribution
- Signaler - WebSocket connection management
- ManualCLI - Main application controller and REPL
Usage
Basic Connection
# Direct WebSocket connection
python manual.py --ws "wss://neko.example.com/api/ws?token=abc123"
# REST authentication
python manual.py --neko-url "https://neko.example.com" \
--username "admin" --password "password"
Configuration Options
Flag | Environment | Description |
---|---|---|
--ws | NEKO_WS | Direct WebSocket URL with token |
--neko-url | NEKO_URL | Base URL for REST authentication |
--username | NEKO_USER | Login username |
--password | NEKO_PASS | Login password |
--size | NEKO_SIZE | Virtual screen size (e.g., "1920x1080") |
--norm | - | Use normalized 0..1 coordinates |
--no-auto-host | - | Disable automatic host control requests |
--no-media | - | Skip WebRTC signaling (may break control) |
--no-audio | - | Disable audio streaming |
Environment Variables
# Logging configuration
export NEKO_LOGLEVEL=DEBUG
export NEKO_LOGFILE=/tmp/neko-manual.log
# Connection parameters
export NEKO_URL=https://neko.example.com
export NEKO_USER=admin
export NEKO_PASS=secretpassword
export NEKO_SIZE=1920x1080
REPL Commands
Navigation and Control
Mouse Operations
move X Y # Move cursor to coordinates
click [button] # Click at current position (left/right/middle)
lclick # Left click shortcut
rclick # Right click shortcut
dblclick # Double click with proper timing
hover X Y # Move to coordinates without clicking
tap X Y [button] # Move and click in one command
swipe x1 y1 x2 y2 # Drag gesture between two points
Keyboard Input
key <KeyName> # Press a specific key (Escape, F5, Control, etc.)
text "some text" # Type text at current cursor focus
input X Y "text" # Click at coordinates and type text
enter # Press Enter/Return key
Scrolling
scroll <direction> [amount] # Scroll up/down/left/right (amount defaults to 1)
Clipboard Operations
copy # Send copy command
cut # Send cut command
select_all # Select all content
paste [text] # Paste clipboard or specific text
Connection Management
host # Request host control
unhost # Release host control
size [WxH] # Show or set virtual screen size
raw '{"event":"...","payload":{}}' # Send raw JSON message
Administrative Commands
force-take # Force take host control (admin only)
force-release # Force release host control (admin only)
kick <sessionId> # Force disconnect a session (admin only)
sessions # List all connected users and session IDs
Utility Commands
help # Show command reference
quit / exit # Close connection and exit
API Reference
Core Classes
Broker
Event routing system that distributes incoming WebSocket messages to appropriate topic queues.
class Broker:
    def __init__(self)
    def topic_queue(self, topic: str, maxsize: int = 512) -> asyncio.Queue
    def publish(self, msg: Dict[str, Any]) -> None
Topic Routing:
- signal/ events → signal queue (WebRTC signaling)
- system/, control/, screen/, keyboard/, session/, error/ events → system queue
- All others → misc queue
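The routing rules above can be sketched as follows (a minimal illustration; the drop-oldest backpressure policy is an assumption, not necessarily what manual.py does):

```python
import asyncio
from typing import Any, Dict

SYSTEM_PREFIXES = ("system/", "control/", "screen/",
                   "keyboard/", "session/", "error/")

class Broker:
    """Route incoming WebSocket messages to per-topic asyncio queues."""

    def __init__(self) -> None:
        self._queues: Dict[str, asyncio.Queue] = {}

    def topic_queue(self, topic: str, maxsize: int = 512) -> asyncio.Queue:
        if topic not in self._queues:
            self._queues[topic] = asyncio.Queue(maxsize=maxsize)
        return self._queues[topic]

    def publish(self, msg: Dict[str, Any]) -> None:
        event = msg.get("event", "")
        if event.startswith("signal/"):
            topic = "signal"
        elif event.startswith(SYSTEM_PREFIXES):
            topic = "system"
        else:
            topic = "misc"
        queue = self.topic_queue(topic)
        if queue.full():          # drop the oldest message on backpressure
            queue.get_nowait()
        queue.put_nowait(msg)
```

Keeping publish() synchronous means the WebSocket read loop never blocks on a slow consumer; each consumer awaits its own topic queue.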
Signaler
WebSocket connection manager with automatic reconnection and background message processing.
class Signaler:
    def __init__(self, url: str, **wsopts)
    async def connect_with_backoff(self) -> "Signaler"
    async def close(self) -> None
    async def send(self, msg: Dict[str, Any]) -> None
Key Features:
- Exponential backoff reconnection (1s → 30s max)
- Separate read/write loops for optimal performance
- Graceful shutdown with task cleanup
- Message queuing during disconnection
ManualCLI
Main application controller that orchestrates connection management, WebRTC signaling, and user interaction.
class ManualCLI:
    def __init__(self, ws: str, width: int, height: int,
                 normalized: bool, auto_host: bool,
                 request_media: bool, base_url: Optional[str] = None,
                 token: Optional[str] = None, audio: bool = True)
    async def start(self) -> None
    async def handle(self, line: str) -> None
Utility Functions
parse_size(s: str) -> Tuple[int, int]
Parse screen size string (e.g., "1920x1080") into width/height tuple.
ws_from_rest_login(neko_url: str, username: str, password: str, *, timeout: float = 10.0) -> Tuple[str, str, str]
Perform REST authentication and derive WebSocket URL with token.
name_keysym(name: str) -> int
Map key names to X11 keysym codes for keyboard events.
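As an illustration, parse_size might be implemented roughly like this (a sketch; the exact error messages are assumptions):

```python
from typing import Tuple

def parse_size(s: str) -> Tuple[int, int]:
    """Parse a 'WIDTHxHEIGHT' string (e.g. '1920x1080') into (width, height)."""
    try:
        w, h = s.lower().split("x", 1)
        width, height = int(w), int(h)
    except ValueError:
        raise ValueError(f"invalid size {s!r}, expected WIDTHxHEIGHT")
    if width <= 0 or height <= 0:
        raise ValueError(f"size must be positive, got {s!r}")
    return width, height
```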
Protocol Implementation
WebRTC Signaling Flow
1. Initial Request: Send signal/request with video/audio preferences
2. Offer Reception: Wait for signal/offer or signal/provide from the server
3. ICE Configuration: Parse and validate ICE servers (strict mapping)
4. SDP Exchange: Set the remote description, create an answer, send the local description
5. ICE Candidates: Process incoming candidates and send local candidates
6. Media Handling: Log received tracks without processing them (control-only)
Connection Lifecycle
sequenceDiagram
  participant C as Client
  participant S as Signaler
  participant N as Neko Server
  participant R as RTC Connection
  C->>S: connect_with_backoff()
  S->>N: WebSocket connection
  S->>C: Connection established
  C->>N: signal/request (media preferences)
  C->>N: control/request (host control)
  N->>S: signal/offer (SDP + ICE servers)
  S->>R: Create RTCPeerConnection
  R->>N: signal/answer (SDP response)
  loop ICE Negotiation
    N->>R: signal/candidate
    R->>N: signal/candidate
  end
  R->>C: Connection ready
  loop Event Processing
    N->>S: system/heartbeat
    S->>N: client/heartbeat
    C->>N: control/move, control/click, etc.
    N->>S: control/host, screen/updated, etc.
  end
Error Handling and Reconnection
The CLI implements robust error handling:
- WebSocket Errors: Automatic reconnection with exponential backoff
- Authentication Failures: Clear error messages and exit
- Protocol Errors: Graceful degradation and retry
- Resource Cleanup: Proper task cancellation and connection shutdown
Integration Examples
Basic Automation Script
import asyncio

from manual import ManualCLI, ws_from_rest_login

async def automate_task():
    # Get the WebSocket URL via REST login
    ws_url, base_url, token = ws_from_rest_login(
        "https://neko.example.com",
        "admin",
        "password",
    )

    # Create the CLI instance
    cli = ManualCLI(
        ws=ws_url,
        width=1920,
        height=1080,
        normalized=False,
        auto_host=True,
        request_media=True,
        base_url=base_url,
        token=token,
    )

    # Start the connection and wait for ready
    await cli.start()

    # Perform automation tasks
    await cli.handle("move 100 100")
    await cli.handle("click")
    await cli.handle('text "Hello World"')
    await cli.handle("key Enter")

if __name__ == "__main__":
    asyncio.run(automate_task())
Custom Event Handler
class CustomCLI(ManualCLI):
    async def _event_logger(self):
        """Override the event logger with custom handling."""
        q = self.signaler.broker.topic_queue("system")
        while self.running and self.signaler and not self.signaler._closed.is_set():
            try:
                msg = await q.get()
                ev = msg.get("event", "")
                payload = msg.get("payload", {})
                # Custom event processing
                if ev == "control/host":
                    print(f"Host control changed: {payload}")
                elif ev == "session/created":
                    print(f"New session: {payload}")
            except asyncio.CancelledError:
                break
Security Considerations
Authentication
- Always use HTTPS/WSS in production
- Store credentials securely (environment variables, not code)
- Implement proper token rotation for long-running sessions
Network Security
- Validate SSL certificates in production
- Use VPN or private networks for sensitive operations
- Monitor connection logs for suspicious activity
Access Control
- Limit admin command usage to authorized users
- Implement session timeouts for inactive connections
- Log all control operations for audit trails
Troubleshooting
Common Issues
Connection Failures
# Enable debug logging
export NEKO_LOGLEVEL=DEBUG
export NEKO_LOGFILE=/tmp/debug.log
# Check WebSocket connectivity
python manual.py --ws "wss://neko.example.com/api/ws?token=test"
Authentication Errors
# Verify REST API access
curl -X POST https://neko.example.com/api/login \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"password"}'
WebRTC Issues
- ICE Failures: Check firewall and NAT configuration
- SDP Errors: Verify codec compatibility
- Media Problems: Use the --no-audio flag for control-only mode
Performance Issues
- High Latency: Check network connectivity and server load
- Dropped Connections: Reduce ping_interval in the WebSocket options
- Memory Leaks: Monitor task cleanup during reconnection
Debug Commands
# Raw protocol inspection
await cli.handle('raw \'{"event":"system/stats"}\'')
# Connection diagnostics
await cli.handle('raw \'{"event":"signal/stats"}\'')
# Session information
await cli.handle('sessions')
Development
Testing
# Unit tests
python -m pytest tests/test_manual.py
# Integration tests with local Neko server
docker-compose up neko
python manual.py --neko-url http://localhost:8080 --username admin --password neko
# Load testing
python scripts/stress_test_manual.py --connections 10 --duration 300
Contributing
When modifying manual.py:
- Maintain Protocol Compatibility: Don't break existing WebRTC signaling
- Add Comprehensive Tests: Cover new REPL commands and edge cases
- Update Documentation: Keep this guide and docstrings current
- Follow PEP Standards: Code must comply with PEP 257/287 for docstrings
- Handle Errors Gracefully: All new features must have proper error handling
Extending Functionality
Adding New Commands
# In the ManualCLI.handle() method
if cmd == "my_command":
    await self._handle_my_command(rest)
    return

# Handler method implementation
async def _handle_my_command(self, args) -> None:
    """Handle the 'my_command' REPL command.

    :param args: Command arguments from user input
    :type args: list
    """
    if not args:
        print("usage: my_command <parameter>")
        return
    # Command implementation
    await self._safe_send({
        "event": "custom/my_event",
        "payload": {"param": args[0]},
    })
Custom Event Processing
# In the _event_logger() method
elif ev == "custom/my_event":
    self._handle_custom_event(payload)

def _handle_custom_event(self, payload: dict) -> None:
    """Process custom server events.

    :param payload: Event payload from server
    :type payload: dict
    """
    # Custom event handling logic
    pass
This documentation provides comprehensive coverage of the Manual Control CLI, from basic usage to advanced development scenarios. The tool serves as both a practical utility for Neko server interaction and a reference implementation for WebRTC signaling protocols.
TTS Service (yap.py)
The YAP (Yet Another Presenter) service provides real-time text-to-speech functionality through WebRTC audio streaming. It connects to Neko servers and responds to chat commands for immediate or streaming speech synthesis.
Overview
YAP transforms text messages into high-quality speech using F5-TTS and broadcasts the audio through WebRTC. It supports both immediate synthesis and streaming mode for real-time conversation scenarios.
Key Features
- Real-time TTS: Low-latency speech synthesis using F5-TTS
- WebRTC Integration: Direct audio streaming to Neko browser sessions
- Voice Management: Hot-reloadable voice configurations with custom parameters
- Streaming Mode: Incremental text processing for live conversations
- Chat Commands: Interactive control through Neko chat interface
- Parallel Processing: Multi-threaded TTS workers for improved performance
- Audio Pipeline: Crossfade splicing and jitter buffering for smooth playback
Architecture
graph TD
  A[Chat Commands] --> B[Command Parser]
  B --> C[Stream Assembler]
  C --> D[Text Segmentation]
  D --> E[TTS Pipeline]
  E --> F[Parallel F5-TTS Workers]
  F --> G[Audio Splicer]
  G --> H[PCM Queue]
  H --> I[WebRTC Audio Track]
  I --> J[Neko Browser]
  K[Voice Manager] --> E
  L[WebSocket Signaler] --> A
  L --> M[WebRTC Peer Connection]
  M --> I
Installation
Dependencies
# Core dependencies
pip install aiortc av websockets numpy
# F5-TTS (required for speech synthesis)
pip install git+https://github.com/SWivid/F5-TTS.git
# Optional resampling backends (first available will be used)
pip install torchaudio # Preferred
# OR
pip install scipy # Alternative
# Linear fallback is built-in
System Requirements
- Python 3.8+
- CUDA-capable GPU (recommended for F5-TTS)
- WebRTC-compatible environment
Configuration
YAP follows 12-factor app principles with environment-based configuration:
Connection Settings
Variable | Default | Description |
---|---|---|
YAP_WS / NEKO_WS | - | Direct WebSocket URL |
NEKO_URL | - | REST API base URL |
NEKO_USER | - | Username for REST authentication |
NEKO_PASS | - | Password for REST authentication |
Audio Settings
Variable | Default | Description |
---|---|---|
YAP_SR | 48000 | Output sample rate (Hz) |
YAP_AUDIO_CHANNELS | 1 | Audio channels (1=mono, 2=stereo) |
YAP_FRAME_MS | 20 | WebRTC frame size (10/20/30/40/60ms) |
YAP_JITTER_MAX_SEC | 6.0 | PCM buffer maximum duration |
TTS Pipeline Settings
Variable | Default | Description |
---|---|---|
YAP_PARALLEL | 2 | Parallel TTS worker threads |
YAP_CHUNK_TARGET_SEC | 3.0 | Target chunk duration hint |
YAP_MAX_CHARS | 350 | Maximum characters per chunk |
YAP_OVERLAP_MS | 30 | Audio crossfade overlap |
Voice Settings
Variable | Default | Description |
---|---|---|
YAP_VOICES_DIR | ./voices | Voice configuration directory |
YAP_SPK_DEFAULT | default | Default speaker ID |
Logging & ICE Settings
Variable | Default | Description |
---|---|---|
YAP_LOGLEVEL | INFO | Log level (DEBUG/INFO/WARNING/ERROR) |
YAP_LOG_FORMAT | text | Log format (text/json) |
YAP_STUN_URL | stun:stun.l.google.com:19302 | STUN server |
YAP_ICE_POLICY | strict | ICE server policy (strict/all) |
Usage
Basic Usage
# Direct WebSocket connection
export YAP_WS="wss://demo.neko.com/api/ws?token=your_token"
python src/yap.py
# REST API authentication
export NEKO_URL="https://demo.neko.com"
export NEKO_USER="username"
export NEKO_PASS="password"
python src/yap.py
Command Line Options
python src/yap.py --help
# Common options
python src/yap.py \
--ws "wss://host/api/ws?token=..." \
--sr 48000 \
--channels 1 \
--parallel 4 \
--loglevel DEBUG
Health Check
# Validate configuration
python src/yap.py --healthcheck
Chat Commands
YAP responds to commands in Neko chat:
Immediate Speech
/yap Hello, this is immediate speech synthesis!
Streaming Mode
/yap:begin
This text will be processed incrementally...
More text gets added and synthesized in chunks...
/yap:end
Voice Control
# Switch active voice and parameters
/yap:voice set --spk alice --rate 1.2 --pitch 0.5
# Add new voice
/yap:voice add --spk bob --ref /path/to/reference.wav --ref-text "Reference text" --styles "calm,friendly"
# Reload voice configurations
/yap:voice reload
Queue Management
# Stop current synthesis and clear queue
/yap:stop
Voice Management
Voice Configuration (voices.json)
{
  "default": {
    "ref_audio": "./voices/default.wav",
    "ref_text": "This is a default reference recording.",
    "styles": ["calm", "neutral"],
    "params": {
      "rate": 1.0,
      "pitch": 0.0
    }
  },
  "alice": {
    "ref_audio": "./voices/alice_sample.wav",
    "ref_text": "Hello, I'm Alice and this is my voice.",
    "styles": ["friendly", "energetic"],
    "params": {
      "rate": 1.1,
      "pitch": 0.2
    }
  }
}
Voice Parameters
- rate: Speech speed multiplier (0.5-2.0)
- pitch: Pitch shift in semitones (-12 to +12)
- styles: Descriptive tags for voice characteristics
- ref_audio: Path to reference audio file (WAV format)
- ref_text: Transcript of reference audio
Hot Reloading
Voice configurations can be updated without restarting:
1. Edit voices.json in the voices directory
2. Send the /yap:voice reload command in chat
3. Changes take effect immediately
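A hot-reloadable registry of this kind can be sketched in a few lines (the real VoiceManager in src/yap.py may differ; the fallback-to-default behavior for unknown speakers is an assumption):

```python
import json
import os
from typing import Any, Dict

class VoiceRegistry:
    """Minimal hot-reloadable voice registry backed by voices.json."""

    def __init__(self, voices_dir: str, default_speaker: str = "default"):
        self.path = os.path.join(voices_dir, "voices.json")
        self.default_speaker = default_speaker
        self.voices: Dict[str, Any] = {}
        self.reload()

    def reload(self) -> None:
        """Re-read voices.json; invoked on /yap:voice reload."""
        with open(self.path, "r", encoding="utf-8") as f:
            self.voices = json.load(f)

    def get(self, speaker: str) -> Dict[str, Any]:
        """Return the voice config, falling back to the default speaker."""
        return self.voices.get(speaker, self.voices[self.default_speaker])
```

Because reload() swaps the whole dict at once, in-flight synthesis keeps its old config while new requests pick up the edited file.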
Technical Implementation
Audio Pipeline
- Text Segmentation: Smart punctuation-aware chunking
- Parallel Synthesis: Multi-threaded F5-TTS workers
- Audio Splicer: Crossfade blending between chunks
- PCM Queue: Jitter-buffered audio streaming
- WebRTC Track: Real-time audio transmission
Text Processing
# Punctuation-aware segmentation
chunks = segment_text("Hello world! How are you today?", max_chars=50)
# Result: ["Hello world!", "How are you today?"]
# Streaming assembly with opportunistic emission
assembler = StreamAssembler(max_chars=100)
ready_chunks = assembler.feed("Partial text...")
final_chunks = assembler.flush()
Audio Formats
- Input: F5-TTS generated audio (variable sample rate)
- Processing: Float32 waveforms with resampling
- Output: 16-bit PCM at configured sample rate
- WebRTC: Opus-encoded frames for transmission
Resampling Backends
YAP automatically selects the best available resampling backend:
- torchaudio (preferred): High-quality GPU acceleration
- scipy: CPU-based signal processing
- linear (fallback): Simple interpolation
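The backend selection chain might be implemented like this (a sketch; only the linear fallback is spelled out in full, and the torchaudio/scipy paths assume those packages' standard resampling APIs):

```python
import numpy as np

def _linear_resample(wave: np.ndarray, sr_from: int, sr_to: int) -> np.ndarray:
    """Fallback: simple linear-interpolation resampler."""
    if sr_from == sr_to:
        return wave
    n_out = int(round(len(wave) * sr_to / sr_from))
    x_out = np.linspace(0.0, len(wave) - 1, n_out)
    return np.interp(x_out, np.arange(len(wave)), wave).astype(wave.dtype)

def resample(wave: np.ndarray, sr_from: int, sr_to: int) -> np.ndarray:
    """Try torchaudio, then scipy, then linear interpolation."""
    try:
        import torch
        import torchaudio.functional as AF
        t = torch.from_numpy(wave.astype(np.float32)).unsqueeze(0)
        return AF.resample(t, sr_from, sr_to).squeeze(0).numpy()
    except ImportError:
        pass
    try:
        from math import gcd
        from scipy.signal import resample_poly
        g = gcd(sr_from, sr_to)
        return resample_poly(wave, sr_to // g, sr_from // g).astype(np.float32)
    except ImportError:
        pass
    return _linear_resample(wave, sr_from, sr_to)
```

Catching only ImportError keeps genuine processing errors visible while still degrading gracefully when an optional backend is absent.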
Performance Tuning
Parallel Workers
# Increase for better throughput (GPU memory permitting)
export YAP_PARALLEL=4
Chunk Sizing
# Smaller chunks = lower latency, higher overhead
export YAP_MAX_CHARS=200
# Larger chunks = higher latency, better efficiency
export YAP_MAX_CHARS=500
Audio Buffer
# Reduce for lower latency (risk of underruns)
export YAP_JITTER_MAX_SEC=3.0
# Increase for stability (higher latency)
export YAP_JITTER_MAX_SEC=10.0
Troubleshooting
Common Issues
No audio output
- Check WebRTC connection status in browser developer tools
- Verify audio permissions in browser
- Confirm STUN/TURN server connectivity
High latency
- Reduce YAP_MAX_CHARS for smaller chunks
- Decrease the YAP_JITTER_MAX_SEC buffer size
- Increase YAP_PARALLEL workers (if GPU memory permits)
Audio artifacts
- Increase YAP_OVERLAP_MS for smoother crossfades
- Check F5-TTS reference audio quality
- Verify sample rate consistency
Connection failures
- Validate WebSocket URL and authentication
- Check firewall settings for WebRTC ports
- Test with simpler STUN configuration
Debug Logging
export YAP_LOGLEVEL=DEBUG
export YAP_LOG_FORMAT=json
python src/yap.py 2>&1 | jq .
Health Validation
# Check configuration and dependencies
python src/yap.py --healthcheck
API Reference
Core Classes
YapApp
Main application coordinator handling WebSocket signaling, WebRTC setup, and command processing.
app = YapApp(settings, logger)
await app.run() # Main event loop
TTSPipeline
Manages parallel TTS synthesis with crossfade splicing.
pipeline = TTSPipeline(voices, tts, pcmq, sr_out, ch_out, overlap_ms, parallel, logger, max_chars)
await pipeline.speak_text("Hello world", speaker="alice")
VoiceManager
Hot-reloadable voice configuration registry.
voices = VoiceManager(voices_dir, default_speaker, logger)
voice = voices.get("alice") # Get voice config
voices.reload() # Hot reload from JSON
PCMQueue
Thread-safe jitter buffer for WebRTC audio streaming.
pcmq = PCMQueue(sr=48000, channels=1, max_sec=6.0, logger=logger)
pcmq.push(audio_samples) # Producer
samples = pcmq.pull(frame_size) # Consumer
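The push/pull contract can be illustrated with a stripped-down jitter buffer. This sketch (which omits the logger and metrics of the real class) shows the two behaviors that matter: underruns are padded with silence, and the buffer is capped at max_sec to bound latency:

```python
import threading
import numpy as np

class PCMQueue:
    """Sketch: producers push samples; the audio track pulls fixed frames."""

    def __init__(self, sr=48000, channels=1, max_sec=6.0):
        self.sr, self.channels = sr, channels
        self.max_samples = int(sr * max_sec)
        self.buf = np.zeros(0, dtype=np.int16)
        self.lock = threading.Lock()

    def push(self, samples: np.ndarray) -> None:
        with self.lock:
            self.buf = np.concatenate([self.buf, samples.astype(np.int16)])
            if len(self.buf) > self.max_samples:  # cap latency: drop oldest
                self.buf = self.buf[-self.max_samples:]

    def pull(self, frame_size: int) -> np.ndarray:
        with self.lock:
            out = self.buf[:frame_size]
            self.buf = self.buf[frame_size:]
        if len(out) < frame_size:  # underrun: pad with silence
            out = np.pad(out, (0, frame_size - len(out)))
        return out

pcmq = PCMQueue(sr=48000, channels=1, max_sec=6.0)
pcmq.push(np.ones(1000, dtype=np.int16))
frame = pcmq.pull(960)  # one 20 ms frame at 48 kHz
```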
Key Functions
segment_text(text, max_chars)
Intelligent text segmentation respecting punctuation boundaries.
apply_rate_pitch(wave, sr, rate, pitch)
Apply naive prosody transformations (rate/pitch adjustments).
_resample(wave, sr_from, sr_to)
Multi-backend audio resampling with automatic fallback.
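As a rough illustration of what punctuation-aware segmentation looks like, here is a greedy sentence-packing sketch (the real segment_text in src/yap.py may differ in its boundary rules):

```python
import re

def segment_text(text: str, max_chars: int) -> list[str]:
    """Split text into sentences, then greedily pack sentences into
    chunks no longer than max_chars (oversize sentences pass through)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + 1 + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks

print(segment_text("One. Two is longer. Three!", max_chars=12))
# → ['One.', 'Two is longer.', 'Three!']
```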
Integration Examples
Docker Deployment
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
        ffmpeg \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application
COPY src/ /app/src/
COPY voices/ /app/voices/
WORKDIR /app
# Configuration via environment
ENV YAP_SR=48000
ENV YAP_PARALLEL=2
ENV YAP_VOICES_DIR=/app/voices
ENTRYPOINT ["python", "src/yap.py"]
Docker Compose
version: '3.8'
services:
  yap:
    build: .
    environment:
      - NEKO_URL=http://neko:8080
      - NEKO_USER=admin
      - NEKO_PASS=password
      - YAP_PARALLEL=4
      - YAP_LOGLEVEL=INFO
    volumes:
      - ./voices:/app/voices
    depends_on:
      - neko
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yap-tts
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yap-tts
  template:
    metadata:
      labels:
        app: yap-tts
    spec:
      containers:
        - name: yap
          image: neko-agent/yap:latest
          env:
            - name: NEKO_URL
              valueFrom:
                secretKeyRef:
                  name: neko-config
                  key: url
            - name: NEKO_USER
              valueFrom:
                secretKeyRef:
                  name: neko-config
                  key: username
            - name: NEKO_PASS
              valueFrom:
                secretKeyRef:
                  name: neko-config
                  key: password
            - name: YAP_PARALLEL
              value: "4"
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          volumeMounts:
            - name: voices
              mountPath: /app/voices
      volumes:
        - name: voices
          persistentVolumeClaim:
            claimName: yap-voices
Future Enhancements
- Voice Cloning: Real-time voice learning from user samples
- Emotion Control: Dynamic emotional expression parameters
- SSML Support: Advanced speech markup for prosody control
- Multi-language: Automatic language detection and synthesis
- Audio Effects: Real-time audio processing (reverb, filters)
- Metrics: Prometheus/OpenTelemetry integration for monitoring
Related Components
- Core Agent: Main automation engine that can trigger TTS
- Capture Service: Training data collection for voice models
- Neko Integration: Browser environment and WebRTC setup
Source Reference
File: src/yap.py:1
Lines of Code: ~1550
Dependencies: aiortc, av, websockets, numpy, F5-TTS
Development Setup
This guide covers the development environment setup, coding standards, and contribution workflow for the Neko Agent project.
Development Environment
Prerequisites
- Nix with flakes - Package management and development shells
- Git - Version control
- NVIDIA GPU (optional but recommended) - For AI model acceleration
Setting Up
1. Clone the repository:
git clone <repository-url>
cd neko-agent
2. Enter development environment:
# For GPU development (recommended)
nix develop .#gpu
# For CPU-only development
nix develop
3. Verify setup:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
Available Development Shells
Shell | Purpose | Key Packages |
---|---|---|
default | Basic Python development | PyTorch CPU, transformers, websockets |
gpu | GPU-accelerated development | CUDA 12.8, PyTorch GPU, all dependencies |
docs | Documentation development | mdBook, Sphinx, preprocessing tools |
neko | Neko server management | Docker, Colima, compose tools |
ai | AI tool integration | Claude Code CLI, OpenAI tools |
Project Structure
├── src/ # Source code
│ ├── agent.py # Core automation agent
│ ├── capture.py # Training data capture
│ └── yap.py # TTS service
├── docs/ # Documentation
│ ├── src/ # mdBook source files
│ └── book.toml # Documentation configuration
├── voices/ # Voice models and assets
├── data/ # Training data output
├── overlays/ # Nix package overlays
├── nix/ # Nix configuration
└── flake.nix # Development environment
Coding Standards
Python Style
- Follow PEP 8 for code style
- Use type hints for all function signatures
- Write docstrings for public functions and classes
- Use async/await for I/O operations
Example:
async def process_frame(frame: np.ndarray, task: str) -> Optional[Action]:
    """
    Process a video frame and determine the next action.

    :param frame: RGB frame data as numpy array
    :param task: Natural language task description
    :return: Action to execute, or None if task complete
    """
    # Implementation here
    pass
Error Handling
- Use specific exceptions rather than generic Exception
- Log errors with context for debugging
- Implement graceful degradation where possible
try:
    result = await risky_operation()
except SpecificError as e:
    logger.warning(f"Operation failed: {e}, falling back to default")
    result = default_fallback()
Configuration Management
- Use environment variables for configuration
- Provide sensible defaults in code
- Document all configuration options
NEKO_URL = os.environ.get("NEKO_URL", "http://localhost:8080")
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", "3"))
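One way to keep this pattern consistent across settings is a small typed accessor that validates as it reads. The env_int helper below is a hypothetical example, not part of the codebase:

```python
import os

def env_int(name: str, default: int, minimum: int = 0) -> int:
    """Read an integer setting from the environment, falling back to a
    documented default and rejecting out-of-range values."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}")
    if value < minimum:
        raise ValueError(f"{name} must be >= {minimum}, got {value}")
    return value

MAX_RETRIES = env_int("MAX_RETRIES", default=3, minimum=1)
```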
Testing
Running Tests
# Run all tests
python -m pytest
# Run with coverage
python -m pytest --cov=src
# Run specific test file
python -m pytest tests/test_agent.py
Test Structure
- Unit tests for individual functions
- Integration tests for component interaction
- End-to-end tests for full workflows
Writing Tests
import pytest
from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_agent_action_execution():
    """Test that agent executes actions correctly."""
    agent = NekoAgent()
    with patch.object(agent, 'webrtc_client') as mock_client:
        mock_client.execute_action = AsyncMock()
        action = ClickAction(x=100, y=200)
        await agent.execute_action(action)
        mock_client.execute_action.assert_called_once_with(action)
Documentation
Building Documentation
# Enter docs environment
nix develop .#docs
# Serve documentation locally
cd docs && mdbook serve --open
# Build static documentation
cd docs && mdbook build
Documentation Standards
- Write in Markdown using mdBook extensions
- Include code examples for all APIs
- Use mermaid diagrams for architecture
- Keep docs up-to-date with code changes
Git Workflow
Branch Strategy
- main - Stable release branch
- dev - Development integration branch
- feature/* - Feature development branches
- fix/* - Bug fix branches
Commit Guidelines
- Use conventional commits format
- Write descriptive messages explaining the why
- Keep commits atomic - one logical change per commit
Examples:
feat(agent): add support for swipe actions
fix(capture): handle missing frames gracefully
docs(api): update WebRTC connection examples
refactor(tts): improve voice loading performance
Pull Request Process
- Create feature branch from dev
- Implement changes following coding standards
- Add/update tests for new functionality
- Update documentation if needed
- Submit PR with clear description
- Address review feedback
- Merge after approval
Debugging
Logging Configuration
# Set log levels
export AGENT_LOGLEVEL=DEBUG
export CAPTURE_LOGLEVEL=INFO
export YAP_LOGLEVEL=WARNING
Common Debug Tasks
WebRTC connection issues:
# Check Neko server status
curl http://localhost:8080/health
# Test WebSocket endpoint
wscat -c ws://localhost:8080/api/ws?token=<token>
AI model problems:
# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"
# Test model loading
python -c "from transformers import Qwen2VLForConditionalGeneration; print('OK')"
Training data capture:
# Verify MDS output
python -c "from streaming import StreamingDataset; print('MDS OK')"
# Check S3 connectivity
aws s3 ls s3://your-bucket/
Performance Optimization
Profiling
# Profile with cProfile
python -m cProfile -o profile.stats src/agent.py
# Analyze with snakeviz
snakeviz profile.stats
GPU Memory Management
# Monitor GPU usage
nvidia-smi -l 1
# Set memory allocation strategy
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
WebRTC Optimization
- Reduce frame rate for lower bandwidth
- Adjust video quality based on task complexity
- Use hardware encoding when available
Contributing
Getting Started
- Read the documentation to understand the system
- Set up development environment
- Pick an issue from the project board
- Ask questions if anything is unclear
Areas for Contribution
- New action types (drag, hover, etc.)
- Additional AI models integration
- Performance improvements
- Documentation enhancements
- Test coverage expansion
Code Review Guidelines
- Be constructive in feedback
- Explain the reasoning behind suggestions
- Test the changes locally when possible
- Approve promptly for good contributions
Release Process
Version Management
- Semantic versioning (MAJOR.MINOR.PATCH)
- Tag releases in Git
- Update changelog for each release
Deployment
- Build Docker images for production
- Test in staging environment
- Deploy with rolling updates
- Monitor metrics post-deployment
Support
Getting Help
- Check documentation first
- Search existing issues on GitHub
- Ask in discussions for general questions
- Open issues for bugs or feature requests
Community Guidelines
- Be respectful and inclusive
- Help others when you can
- Share knowledge and experiences
- Follow the code of conduct
Nix Flake Development Environment
The Neko Agent project uses a sophisticated Nix flake to provide reproducible, cross-platform development environments with specialized configurations for different use cases. This document provides comprehensive documentation of all flake features, development shells, and usage patterns.
Overview
The flake (flake.nix) is designed around multiple specialized development environments that cater to different aspects of the project:
- AI/ML Development - GPU-accelerated environments with CUDA support
- Documentation - Publishing and development of project documentation
- Container Operations - Docker and Neko server management
- Performance Optimization - CPU-optimized builds with architecture-specific flags
- TEE Deployment - Trusted Execution Environment deployment with attestation
- Registry Management - Multi-registry container deployment support
- Cross-Platform Support - Works on x86_64-linux and aarch64-darwin (Apple Silicon)
graph TB
  subgraph "Nix Flake Architecture"
    Flake[flake.nix]
    Inputs[External Inputs]
    Overlays[Custom Overlays]
    Shells[Development Shells]
    Packages[Docker Images]
    Apps[Utility Apps]
  end
  subgraph "External Dependencies"
    Nixpkgs[nixpkgs/nixos-unstable]
    MLPkgs[nixvital/ml-pkgs]
  end
  subgraph "Custom Overlays"
    WebRTC[WebRTC Stack]
    ML[ML Libraries]
    Audio[Audio Processing]
    Optimization[CPU Optimization]
  end
  subgraph "Development Shells"
    Default[default]
    GPU[gpu]
    AI[ai]
    Neko[neko]
    Docs[docs]
    CPUOpt[cpu-opt]
    GPUOpt[gpu-opt]
    TEE[tee]
  end
  Flake --> Inputs
  Flake --> Overlays
  Flake --> Shells
  Flake --> Packages
  Flake --> Apps
  Inputs --> Nixpkgs
  Inputs --> MLPkgs
  Overlays --> WebRTC
  Overlays --> ML
  Overlays --> Audio
  Overlays --> Optimization
  Shells --> Default
  Shells --> GPU
  Shells --> AI
  Shells --> Neko
  Shells --> Docs
  Shells --> CPUOpt
  Shells --> GPUOpt
  Shells --> TEE
Flake Inputs
External Dependencies
inputs = {
  nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  ml-pkgs.url = "github:nixvital/ml-pkgs";
};
Input | Source | Purpose |
---|---|---|
nixpkgs | nixos-unstable | Latest packages and system libraries |
ml-pkgs | nixvital/ml-pkgs | Specialized ML/AI packages (PyTorch, CUDA) |
Why nixos-unstable?
- Latest packages - Access to newest versions of AI/ML libraries
- CUDA support - Most recent NVIDIA driver and toolkit support (CUDA 12.8)
- Python ecosystem - Up-to-date Python packages for transformers and WebRTC
- Security updates - Timely security patches for all dependencies
Build Metadata and Reproducibility
The flake includes comprehensive build metadata for reproducible builds and attestation:
buildInfo = rec {
  timestamp = "${year}-${month}-${day}T${hour}:${minute}:${second}Z";
  revision = self.rev or self.dirtyRev or "unknown";
  shortRev = builtins.substring 0 8 revision;
  version = if (self ? rev) then shortRev else "${shortRev}-dirty";
  nixpkgsRev = nixpkgs.rev or "unknown";

  imageMetadata = {
    "org.opencontainers.image.title" = "Neko Agent";
    "org.opencontainers.image.created" = timestamp;
    "org.opencontainers.image.revision" = revision;
    "dev.neko.build.reproducible" = "true";
  };
};
Custom Overlay System
The flake uses a comprehensive overlay system to provide packages not available in standard Nixpkgs:
WebRTC and Media Stack
nekoOverlays = [
  (import ./overlays/pylibsrtp.nix)  # Secure RTP protocol
  (import ./overlays/aioice.nix)     # Async ICE implementation
  (import ./overlays/aiortc.nix)     # WebRTC for Python
  # ... more overlays
];
Overlay | Package | Purpose |
---|---|---|
pylibsrtp.nix | pylibsrtp | Secure Real-time Transport Protocol for WebRTC |
aioice.nix | aioice | Asynchronous ICE (Interactive Connectivity Establishment) |
aiortc.nix | aiortc | WebRTC implementation for Python with media support |
AI/ML and Audio Processing
Overlay | Package | Purpose |
---|---|---|
streaming.nix | streaming | MosaicML Streaming for training data |
f5-tts.nix | f5-tts | F5-TTS voice synthesis model |
vocos.nix | vocos | Neural vocoder for audio generation |
ema-pytorch.nix | ema-pytorch | Exponential Moving Average for PyTorch |
transformers-stream-generator.nix | transformers-stream-generator | Streaming text generation |
bitsandbytes.nix | bitsandbytes | 8-bit optimizers for PyTorch |
Pi-Zero PyTorch Dependencies
The flake includes comprehensive packaging for pi-zero-pytorch and its dependencies:
Overlay | Package | Purpose |
---|---|---|
pi-zero-pytorch/pi-zero-pytorch.nix | pi-zero-pytorch | Main π0 implementation in PyTorch |
pi-zero-pytorch/einx.nix | einx | Universal tensor operations with Einstein notation |
pi-zero-pytorch/x-transformers.nix | x-transformers | Transformer architectures library |
pi-zero-pytorch/rotary-embedding-torch.nix | rotary-embedding-torch | Rotary positional embeddings |
pi-zero-pytorch/accelerated-scan.nix | accelerated-scan | Accelerated scan operations |
pi-zero-pytorch/bidirectional-cross-attention.nix | bidirectional-cross-attention | Cross-attention mechanisms |
pi-zero-pytorch/hl-gauss-pytorch.nix | hl-gauss-pytorch | Gaussian operations for ML |
pi-zero-pytorch/evolutionary-policy-optimization.nix | evolutionary-policy-optimization | Evolution strategies |
Performance Optimization
Overlay | Package | Purpose |
---|---|---|
cached-path.nix | cached-path | Efficient file caching utilities |
znver2-flags.nix | nekoZnver2Env | AMD Zen2 CPU optimization flags |
vmm-cli.nix | vmm-cli | Virtual machine management CLI |
Example Znver2 Optimization:
# Generated environment variables for AMD Zen2 CPUs
export NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"
export RUSTFLAGS="-C target-cpu=znver2 -C target-feature=+sse2,+sse4.2,+avx,+avx2,+fma,+bmi1,+bmi2"
External ML Packages
ml-pkgs.overlays.torch-family # Provides torch-bin, torchvision-bin, etc.
Benefits:
- Pre-compiled binaries - Faster setup without compilation
- CUDA integration - Proper CUDA toolkit linkage
- Consistent versions - Matching PyTorch ecosystem versions
Development Shells
1. Default Shell (default)
Purpose: Basic Python development with CPU-only PyTorch.
Usage:
nix develop
# or
nix develop .#default
Includes:
- Python Environment: PyTorch CPU, Transformers, WebRTC stack
- System Tools: FFmpeg, Git, Curl, Just, pkg-config
- Node.js Ecosystem: Node 20, NPM for AI tools
- AI CLI Tools: OpenAI Codex, Anthropic Claude Code (auto-installed)
Python Packages:
# Core ML/AI
transformers
torch (CPU)
torchvision
pillow
accelerate
# WebRTC and networking
websockets
av (PyAV for video processing)
pylibsrtp
aioice
aiortc
# Data and streaming
streaming (MosaicML)
f5-tts
numpy
scipy
zstandard
xxhash
tqdm
# Monitoring
prometheus-client
When to Use:
- Initial project setup and exploration
- Development on systems without NVIDIA GPUs
- Testing compatibility with CPU-only environments
- CI/CD pipelines where GPU access is unavailable
2. GPU Shell (gpu)
Purpose: GPU-accelerated development with CUDA 12.8 support.
Usage:
nix develop .#gpu
NVIDIA hosts: When running outside NixOS you will typically need nixGL to expose the system GPU. Use:
NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:nix-community/nixGL#nixGLNvidia -- nix develop .#gpu
This wraps the GPU shell with the right OpenGL/EGL libraries from the host driver.
Additional Features over Default:
- CUDA Toolkit 12.8 - Complete CUDA development environment
- cuDNN and NCCL - Optimized neural network and communication libraries
- GPU-enabled PyTorch - Tensor operations on NVIDIA GPUs
- Environment Variables - Automatic CUDA path and library configuration
CUDA Environment Setup:
# Automatically configured
export CUDA_HOME=/nix/store/.../cuda-12.8
export CUDA_PATH=$CUDA_HOME
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# GPU control
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
export CUDA_MODULE_LOADING=LAZY
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Verification Commands:
# Check CUDA installation
nvidia-smi
nvcc --version
# Test PyTorch GPU support
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"
When to Use:
- AI model inference and training
- GPU-accelerated image/video processing
- Development requiring CUDA libraries
- Performance-critical workloads
3. AI Shell (ai)
Purpose: Lightweight environment focused on AI development tools.
Usage:
nix develop .#ai
Includes:
- Core System Tools - FFmpeg, Git, networking utilities
- Node.js Environment - Node 20, NPM
- AI CLI Tools - Automatic installation of OpenAI and Anthropic CLIs
- Minimal Footprint - No heavy ML libraries, faster startup
AI Tools Installed:
# OpenAI Codex CLI
npm install -g @openai/codex
# Anthropic Claude Code CLI
npm install -g @anthropic-ai/claude-code
Environment Setup:
# NPM global packages in project directory
export NPM_CONFIG_PREFIX=$PWD/.npm-global
export PATH=$NPM_CONFIG_PREFIX/bin:$PATH
When to Use:
- AI-assisted development workflows
- Code generation and review tasks
- Integration with AI development services
- Quick environment for AI tool testing
4. Neko Shell (neko)
Purpose: Container and Neko server management.
Usage:
nix develop .#neko
Container Stack:
- Colima - Lightweight Docker runtime for macOS/Linux
- Docker & Docker Compose - Container orchestration
- Docker Buildx - Multi-platform image building
- Networking Tools - curl, jq for API interaction
Custom Scripts:
# Neko service management script
neko-services up # Start Neko server
neko-services down # Stop services
neko-services logs # View container logs
neko-services status # Check service status
neko-services restart # Restart services
neko-services update # Pull latest images and restart
Colima Configuration:
# Automatically configured VM
colima start --vm-type vz --cpu 2 --memory 4 \
--mount-type sshfs --mount "~:w"
Docker Environment:
# Automatic Docker socket configuration
export DOCKER_HOST="unix://$HOME/.colima/default/docker.sock"
When to Use:
- Neko server development and testing
- Container image building and deployment
- Docker-based development workflows
- Local testing of production deployments
5. Documentation Shell (docs)
Purpose: Documentation development, building, and publishing.
Usage:
nix develop .#docs
Documentation Stack:
- mdBook - Rust-based documentation generator
- mdBook Extensions:
- mdbook-mermaid - Diagram support
- mdbook-linkcheck - Link validation
- mdbook-toc - Table of contents generation
- Sphinx - Python documentation with reStructuredText support
- Node.js - For additional tooling and preprocessing
Python Documentation Tools:
sphinx # Documentation generator
sphinx-rtd-theme # Read the Docs theme
myst-parser # Markdown support for Sphinx
sphinxcontrib-mermaid # Mermaid diagrams in Sphinx
Available Commands:
# From inside docs/
mdbook serve --open # Development server with live reload
mdbook build # Build static documentation
mdbook test # Test code examples and links
# Sphinx alternative
sphinx-build -b html source build/
When to Use:
- Writing and editing project documentation
- Building documentation for deployment
- Testing documentation changes locally
- Contributing to API reference and guides
6. CPU-Optimized Shell (cpu-opt)
Purpose: Performance-optimized CPU development.
Usage:
nix develop .#cpu-opt
Optimization Features:
- Architecture-Specific Compilation - Znver2 flags for AMD CPUs
- Optimized Python Environment - Performance-tuned package builds
- Compiler Optimizations - -O3, -march=znver2, -mtune=znver2
Generated Optimization Flags (Linux only):
# Compiler flags
export NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"
# Rust flags
export RUSTFLAGS="-C target-cpu=znver2 -C target-feature=+sse2,+sse4.2,+avx,+avx2,+fma,+bmi1,+bmi2 -C link-arg=-Wl,-O1 -C link-arg=--as-needed"
When to Use:
- Performance-critical CPU workloads
- Benchmarking and optimization work
- Production builds targeting specific CPU architectures
- Environments where every bit of CPU performance matters
7. GPU-Optimized Shell (gpu-opt)
Purpose: Maximum performance GPU development with optimizations.
Usage:
nix develop .#gpu-opt
Combined Optimizations:
- All GPU features - CUDA 12.8, cuDNN, NCCL
- CPU optimizations - Znver2 flags for host code
- PyTorch optimizations - Optimized builds with CPU and GPU acceleration
- Memory optimizations - Advanced CUDA memory management
GPU-Specific Optimizations:
# Target specific GPU architecture (configurable)
export TORCH_CUDA_ARCH_LIST=8.6 # RTX 30xx series
# Memory allocation strategy
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Performance Verification:
# Check optimizations are active
echo $NIX_CFLAGS_COMPILE # Should show znver2 flags
echo $TORCH_CUDA_ARCH_LIST # Should show target GPU architecture
# Benchmark performance
python -c "
import torch
import time
x = torch.randn(1000, 1000, device='cuda')
torch.cuda.synchronize()  # CUDA kernels launch asynchronously
start = time.time()
torch.mm(x, x)
torch.cuda.synchronize()  # wait for the kernel before stopping the clock
print(f'GPU matrix multiply: {time.time() - start:.4f}s')
"
When to Use:
- Maximum performance AI inference
- GPU-accelerated training workloads
- Performance benchmarking and optimization
- Production deployments requiring peak performance
8. TEE Shell (tee)
Purpose: Trusted Execution Environment deployment and attestation.
Usage:
nix develop .#tee
TEE Deployment Stack:
- Phala Cloud CLI - Modern CLI for TEE deployments
- Legacy VMM CLI - Compatible with older dstack systems
- Docker & Docker Compose - Container orchestration
- Bun Runtime - Fast JavaScript runtime
- Reproducible Image Builder - Attestation-ready container building
Available Commands:
# Modern Phala CLI
phala auth login <api-key> # Authenticate with Phala Cloud
phala status # Check authentication status
phala cvms list # List Confidential VMs
phala nodes # List available TEE nodes
# Legacy VMM CLI (if needed)
vmm-cli lsvm # List virtual machines
vmm-cli lsimage # List available images
vmm-cli lsgpu # List available GPUs
# Reproducible builds
nix run .#build-images # Build reproducible images
nix run .#deploy-to-tee # Deploy with attestation metadata
nix run .#verify-attestation # Verify TEE attestation
Multi-Registry Support:
# Deploy to ttl.sh (ephemeral registry)
NEKO_REGISTRY=ttl.sh NEKO_TTL=1h nix run .#deploy-to-tee
nix run .#deploy-to-ttl 24h
# Deploy to GitHub Container Registry
NEKO_REGISTRY=ghcr.io/your-org nix run .#deploy-to-tee
# Deploy to Docker Hub
NEKO_REGISTRY=docker.io/your-org nix run .#deploy-to-tee
# Deploy to local registry
NEKO_REGISTRY=localhost:5000/neko nix run .#deploy-to-tee
When to Use:
- Deploying to Trusted Execution Environments
- Creating attestable, reproducible deployments
- Multi-registry container management
- TEE-based inference deployments
- Confidential computing workloads
Docker Images and Packages
The flake builds optimized Docker images for production deployment:
Available Images
The flake now builds multiple specialized images for different components:
# Build all images
nix run .#build-images
# Agent images
nix build .#neko-agent-docker-generic
nix build .#neko-agent-docker-opt
# Capture images
nix build .#neko-capture-docker-generic
nix build .#neko-capture-docker-opt
# YAP (TTS) images
nix build .#neko-yap-docker-generic
nix build .#neko-yap-docker-opt
# Train images
nix build .#neko-train-docker-generic
nix build .#neko-train-docker-opt
1. Generic CUDA Image (neko-agent-docker-generic)
Target: neko-agent:cuda12.8-generic
Features:
- Portable CUDA - Includes PTX for forward compatibility
- CUDA 12.8 - Full toolkit and libraries
- Python Environment - All dependencies with torch-bin
- Broad GPU Support - Works on any CUDA 8.6+ GPU
Configuration:
# Environment variables
CUDA_HOME=/nix/store/.../cuda-12.8
LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib
CUDA_MODULE_LOADING=LAZY
TORCH_CUDA_ARCH_LIST=8.6+PTX # Forward compatibility
Use Cases:
- Multi-GPU deployment environments
- Cloud platforms with varying GPU types
- Development and testing across different hardware
2. Optimized Image (neko-agent-docker-opt)
Target: neko-agent:cuda12.8-sm86-v3
Features:
- Specific GPU targeting - Optimized for RTX 30xx series (sm_86)
- CPU optimizations - Znver2 architecture flags
- Smaller size - No PTX, specific architecture only
- Maximum performance - All available optimizations enabled
Configuration:
# Optimized environment
TORCH_CUDA_ARCH_LIST=8.6 # Specific architecture only
NIX_CFLAGS_COMPILE="-O3 -pipe -march=znver2 -mtune=znver2 -fno-plt"
RUSTFLAGS="-C target-cpu=znver2 ..." # Rust optimizations
Use Cases:
- Production deployments with known hardware
- Performance-critical applications
- Cost-optimized cloud instances
Image Building System
# Helper function for consistent container structure
mkRoot = paths: pkgs.buildEnv {
  name = "image-root";
  inherit paths;
  pathsToLink = [ "/bin" ];
};

# Generic image build
neko-agent-docker-generic = pkgs.dockerTools.buildImage {
  name = "neko-agent:cuda12.8-generic";
  created = "now";
  copyToRoot = mkRoot ([
    runnerGeneric
    pyEnvGeneric
    cuda.cudatoolkit
    cuda.cudnn
    cuda.nccl
    pkgs.bashInteractive
  ] ++ commonSystemPackages);
  config = {
    Env = baseEnv ++ [
      "CUDA_HOME=${cuda.cudatoolkit}"
      "LD_LIBRARY_PATH=${cuda.cudatoolkit}/lib64:${cuda.cudnn}/lib"
      "TORCH_CUDA_ARCH_LIST=8.6+PTX"
    ];
    WorkingDir = "/workspace";
    Entrypoint = [ "/bin/neko-agent" ];
  };
};
Utility Apps
The flake provides comprehensive utility applications for common tasks:
Documentation Apps
# Build documentation
nix run .#docs-build
# Serve documentation with live reload
nix run .#docs-serve
# Check documentation for issues
nix run .#docs-check
Build and Deployment Apps
# Build all Docker images with attestation metadata
nix run .#build-images
# TEE deployment with multi-registry support
nix run .#deploy-to-tee
nix run .#deploy-to-ttl 24h # Quick ttl.sh deployment
nix run .#push-to-ttl 1h # Just push to ttl.sh
# Attestation verification
nix run .#verify-attestation <app-id> <expected-hash>
Container Registry Apps
# Local registry management
nix run .#start-registry # HTTP registry with auth
nix run .#start-registry-https # HTTPS with Tailscale certs
nix run .#stop-registry
# Public exposure
nix run .#start-tailscale-funnel # Expose via Tailscale Funnel
nix run .#start-cloudflare-tunnel # Expose via Cloudflare Tunnel
Registry Configuration Examples:
# Environment variables for registry customization
NEKO_REGISTRY_PORT=5000
NEKO_REGISTRY_USER=neko
NEKO_REGISTRY_PASSWORD=pushme
NEKO_REGISTRY_DATA_DIR=$PWD/registry-data
NEKO_REGISTRY_AUTH_DIR=$PWD/auth
NEKO_REGISTRY_CERTS_DIR=$PWD/certs
# Tailscale Funnel setup
NEKO_REGISTRY=your-device.tail-scale.ts.net/neko
# Cloudflare Tunnel setup
NEKO_CF_TUNNEL_NAME=neko-registry
NEKO_CF_HOSTNAME=registry.example.com
Common Development Workflows
Initial Setup
# Clone repository
git clone <repo-url>
cd neko-agent
# Enter development environment
nix develop .#gpu # or .#default for CPU-only
# Verify setup
python -c "import torch; print(torch.cuda.is_available())"
AI Development Workflow
# 1. Enter GPU environment
nix develop .#gpu
# 2. Load environment variables (if .env exists)
# Automatically loaded by shell hook
# 3. Test model loading
python -c "
from transformers import Qwen2VLForConditionalGeneration
model = Qwen2VLForConditionalGeneration.from_pretrained('showlab/ShowUI-2B')
print('Model loaded successfully')
"
# 4. Run agent
uv run src/agent.py --task "Navigate to google.com"
Documentation Development
# 1. Enter docs environment
nix develop .#docs
# 2. Start development server
nix run .#docs-serve
# Opens browser to http://localhost:3000
# 3. Edit files in docs/src/
# Changes automatically reload in browser
# 4. Build for deployment
nix run .#docs-build
Container Development
# 1. Enter container environment
nix develop .#neko
# 2. Start Neko server
neko-services up
# 3. Check status
neko-services status
# 4. View logs
neko-services logs neko
# 5. Test connection
curl http://localhost:8080/health
Performance Optimization
# 1. Use optimized environment
nix develop .#gpu-opt
# 2. Verify optimizations
echo $NIX_CFLAGS_COMPILE
echo $TORCH_CUDA_ARCH_LIST
# 3. Run performance benchmarks
python benchmarks/inference_speed.py
# 4. Build optimized container
nix build .#neko-agent-docker-opt
TEE Deployment Workflow
# 1. Enter TEE environment
nix develop .#tee
# 2. Build reproducible images
nix run .#build-images
# 3. Deploy to TEE (with registry choice)
# Option A: Use ttl.sh for testing
nix run .#deploy-to-ttl 1h
# Option B: Use GitHub Container Registry
NEKO_REGISTRY=ghcr.io/your-org nix run .#deploy-to-tee
# Option C: Use local registry (start it first)
nix run .#start-registry # In another terminal
NEKO_REGISTRY=localhost:5000/neko nix run .#deploy-to-tee
# 4. Verify attestation (inside TEE)
nix run .#verify-attestation <app-id> <compose-hash>
# 5. Check deployment status
phala cvms list # Modern CLI
# or
vmm-cli lsvm # Legacy CLI
Multi-Registry Development
# Setup local registry for testing
nix run .#start-registry
# Push images to multiple registries
docker tag neko-agent:latest localhost:5000/neko/agent:v1
docker push localhost:5000/neko/agent:v1
# Use Tailscale for team access
nix run .#start-tailscale-funnel
# Use Cloudflare for public access
nix run .#start-cloudflare-tunnel
Environment Variables and Configuration
Automatic .env Loading
All development shells automatically load .env files:
# .env file example
NEKO_WS=ws://localhost:8080/api/ws
NEKO_LOGLEVEL=DEBUG
CUDA_VISIBLE_DEVICES=0
TORCH_CUDA_ARCH_LIST=8.6
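For scripts run outside the Nix shells, the same behavior can be approximated with a minimal loader. load_dotenv here is a hypothetical helper assuming simple KEY=VALUE lines, not part of the flake:

```python
import os

def load_dotenv(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines (comments and blanks skipped) and
    export them without overriding variables already set in the environment."""
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip().strip('"').strip("'")
            loaded[key] = value
            os.environ.setdefault(key, value)
    return loaded
```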
Common Environment Variables
Variable | Purpose | Default | Set By |
---|---|---|---|
CUDA_HOME | CUDA installation path | Auto-detected | GPU shells |
CUDA_VISIBLE_DEVICES | GPU selection | all | User configurable |
PYTORCH_CUDA_ALLOC_CONF | Memory strategy | expandable_segments:True | GPU shells |
NPM_CONFIG_PREFIX | NPM global location | $PWD/.npm-global | All shells |
NIX_CFLAGS_COMPILE | Compiler optimizations | Znver2 flags | Optimized shells |
Shell-Specific Variables
GPU Shells:
export CUDA_MODULE_LOADING=LAZY
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib
Documentation Shell:
# No specific variables, uses standard tool defaults
Container Shell:
export DOCKER_HOST="unix://$HOME/.colima/default/docker.sock"
Cross-Platform Support
Supported Systems
supportedSystems = [ "x86_64-linux" "aarch64-darwin" ];
Platform-Specific Features
x86_64-Linux:
- Full GPU support - NVIDIA CUDA, Docker GPU passthrough
- CPU optimizations - Znver2, Intel architecture targeting
- Container building - Docker images with CUDA support
aarch64-Darwin (Apple Silicon):
- Metal Performance Shaders - GPU acceleration via MPS
- Rosetta compatibility - x86_64 dependencies when needed
- Native performance - ARM64-optimized packages
Platform Detection
# Conditional features based on platform
${pkgs.lib.optionalString pkgs.stdenv.isLinux ''
source ${znver2File}
echo "[cpu-opt] Using znver2 flags: $NIX_CFLAGS_COMPILE"
''}
Troubleshooting
Common Issues
CUDA Not Detected:
# Check NVIDIA drivers
nvidia-smi
# Verify CUDA environment
echo $CUDA_HOME
echo $LD_LIBRARY_PATH
# Test PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"
Solution: Ensure NVIDIA drivers are installed and compatible with CUDA 12.8.
Docker Issues on macOS:
# Check Colima status
colima status
# Restart if needed
colima stop
colima start --vm-type vz --cpu 2 --memory 4
Slow Package Installation:
# Use binary cache
echo "substituters = https://cache.nixos.org https://cuda-maintainers.cachix.org" >> ~/.config/nix/nix.conf
echo "trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY= cuda-maintainers.cachix.org-1:0dq3bujKpuEPiCgBv7/11NEBpCcEKUzZzUNjRgPTOOA=" >> ~/.config/nix/nix.conf
Memory Issues
GPU Memory:
# Monitor GPU memory
nvidia-smi -l 1
# Optimize PyTorch memory
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,expandable_segments:True
System Memory:
# Check available memory
free -h
# Monitor during development
htop
Performance Issues
Check Optimizations:
# Verify CPU flags
cat /proc/cpuinfo | grep flags
# Check compiler optimizations
echo $NIX_CFLAGS_COMPILE
# Benchmark inference
python -c "
import torch
import time
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(1000, 1000, device=device)
start = time.time()
result = torch.mm(x, x)
print(f'{device} time: {time.time() - start:.4f}s')
"
Advanced Usage
Custom Overlays
Create project-specific overlays in `overlays/`:
# overlays/custom-package.nix
final: prev: {
custom-package = prev.python3Packages.buildPythonPackage {
pname = "custom-package";
version = "1.0.0";
src = prev.fetchFromGitHub {
owner = "owner";
repo = "repo";
rev = "v1.0.0";
sha256 = "...";
};
propagatedBuildInputs = with prev.python3Packages; [
numpy
torch
];
};
}
Custom Development Shells
Add new shells to the flake:
# Add to devShells
experimental = pkgs.mkShell {
buildInputs = commonSystemPackages ++ [
# Custom packages
];
shellHook = ''
echo "Experimental environment loaded"
# Custom setup
'';
};
Environment Specialization
Create environment-specific configurations:
# .env.gpu
CUDA_VISIBLE_DEVICES=0
TORCH_CUDA_ARCH_LIST=8.6
# .env.multi-gpu
CUDA_VISIBLE_DEVICES=0,1,2,3
NCCL_DEBUG=INFO
# Load specific environment
cp .env.gpu .env
nix develop .#gpu
Contributing to the Flake
Adding New Packages
- Create overlay in `overlays/new-package.nix`
- Add to overlay list in `nekoOverlays`
- Include in appropriate shells
- Test across platforms
- Update documentation
Testing Changes
# Test specific shell
nix develop .#shell-name --command python -c "import new_package"
# Test all shells
for shell in default gpu ai neko docs cpu-opt gpu-opt; do
echo "Testing $shell..."
nix develop .#$shell --command echo "✓ $shell loads successfully"
done
# Test image builds
nix build .#neko-agent-docker-generic
nix build .#neko-agent-docker-opt
Performance Considerations
- Binary caches - Use Cachix for custom packages
- Layer optimization - Minimize Docker image layers
- Dependency management - Avoid unnecessary dependencies
- Build reproducibility - Pin package versions when needed
This flake system provides a robust, reproducible development environment that scales from local development to production deployment while staying consistent across platforms and use cases.
Nix Flake Python Packaging Guide
This guide explains the nuances and patterns used for packaging Python dependencies in the Neko Agent project's Nix flake. The project uses a sophisticated overlay system to package complex Python libraries that aren't available in standard Nixpkgs, with special attention to PyTorch ecosystem integration and CUDA support.
Overview
The Neko Agent project requires numerous specialized Python packages for AI/ML, WebRTC, and audio processing that either don't exist in Nixpkgs or need custom configurations. The flake uses a comprehensive overlay system to package these dependencies while maintaining compatibility with the PyTorch ecosystem.
Key Packaging Principles
- Torch-bin Integration - Use pre-compiled PyTorch wheels for consistency
- CUDA Compatibility - Ensure proper CUDA toolkit integration
- Dependency Override - Override transitive dependencies to use `torch-bin`
- Test Skipping - Disable problematic tests that require GPU access
- Version Flexibility - Patch restrictive version constraints when needed
Core Patterns
1. Basic Python Package Overlay
self: super: {
python3Packages = super.python3Packages // {
example-package = super.python3Packages.buildPythonPackage rec {
pname = "example-package";
version = "1.0.0";
format = "pyproject";
src = super.fetchFromGitHub {
owner = "owner";
repo = "example-package";
rev = "v${version}";
sha256 = "...";
};
nativeBuildInputs = with super.python3Packages; [ setuptools wheel ];
propagatedBuildInputs = with super.python3Packages; [ numpy scipy ];
doCheck = false;
meta = with super.lib; {
description = "Package description";
homepage = "https://github.com/owner/example-package";
license = licenses.mit;
};
};
};
}
2. PyTorch Integration Pattern
Always use `torch-bin` and override transitive dependencies:
self: super: {
python3Packages = super.python3Packages // {
ml-package = super.python3Packages.buildPythonPackage rec {
# ... basic fields ...
propagatedBuildInputs = with super.python3Packages; [
super.python3Packages."torch-bin"
super.python3Packages."torchvision-bin"
# Override packages that depend on torch
(accelerate.override { torch = super.python3Packages."torch-bin"; })
(torchdiffeq.override { torch = super.python3Packages."torch-bin"; })
transformers
numpy
];
doCheck = false; # Skip GPU-dependent tests
};
};
}
3. CUDA-Aware Package Pattern
For packages needing CUDA compilation:
self: super:
let
cudaPackages = super.cudaPackages_12_8;
cuda-redist = super.symlinkJoin {
name = "cuda-redist-${cudaPackages.cudaMajorMinorVersion}";
paths = with cudaPackages; [
(super.lib.getDev cuda_cccl)
(super.lib.getDev libcublas)
# ... more CUDA libs
];
};
in
{
python3Packages = super.python3Packages // {
cuda-package = super.python3Packages.buildPythonPackage rec {
# ... basic fields ...
postPatch = ''
substituteInPlace setup.py \
--replace-fail "find_cuda_libs()" "['${cuda-redist}/lib/libcudart.so']"
'';
nativeBuildInputs = [ super.cmake cudaPackages.cuda_nvcc ];
buildInputs = [ cuda-redist ];
CUDA_HOME = "${cuda-redist}";
NVCC_PREPEND_FLAGS = [ "-I${cuda-redist}/include" ];
doCheck = false;
};
};
}
4. Version Constraint Patching
Relax overly restrictive version constraints:
postPatch = ''
substituteInPlace pyproject.toml \
--replace "numpy<=1.24.0" "numpy" \
--replace "pydantic>=2.0,<2.5" "pydantic>=2.0"
'';
Real-World Examples
F5-TTS Package
Demonstrates torch overrides, version patching, and platform-specific dependencies:
self: super: {
python3Packages = super.python3Packages // {
f5-tts = super.python3Packages.buildPythonPackage rec {
pname = "f5-tts";
version = "1.1.7";
src = super.fetchFromGitHub {
owner = "SWivid";
repo = "F5-TTS";
rev = "main";
sha256 = "sha256-MtPyqS5aNrq929pHMlDp3HFUSf+i9xYDb5xMA0Eqh9Y=";
};
propagatedBuildInputs = with super.python3Packages; [
# Torch overrides
(accelerate.override { torch = super.python3Packages."torch-bin"; })
(x-transformers.override { torch = super.python3Packages."torch-bin"; })
# Custom overlays
cached-path ema-pytorch vocos
# Standard deps
transformers gradio librosa soundfile
] ++ super.lib.optionals (!super.stdenv.isAarch64) [
bitsandbytes # x86_64 only
];
# Version constraint fixes
postPatch = ''
substituteInPlace pyproject.toml \
--replace "numpy<=1.26.4" "numpy" \
--replace "pydantic<=2.10.6" "pydantic"
'';
doCheck = false;
};
};
}
Pi-Zero PyTorch
Shows complex dependency chains and overlay usage:
self: super: {
python3Packages = super.python3Packages // {
pi-zero-pytorch = super.python3Packages.buildPythonPackage rec {
pname = "pi-zero-pytorch";
version = "0.2.5";
src = super.fetchFromGitHub {
owner = "lucidrains";
repo = "pi-zero-pytorch";
rev = "1c13cbbcc236c3cb4fd18213c543188fdf083b33";
sha256 = "1xl38c5spv798xydljahq8fw9nglwcw95s95zpn11cm9ra4rg4ib";
};
nativeBuildInputs = [ super.python3Packages.hatchling ];
propagatedBuildInputs = with super.python3Packages; [
# Torch overrides
(accelerate.override { torch = super.python3Packages."torch-bin"; })
# Custom overlays (all from overlays/)
einx x-transformers rotary-embedding-torch
accelerated-scan bidirectional-cross-attention
# Use patched einops from self
(self.python3Packages.einops)
beartype jaxtyping scipy tqdm
super.python3Packages."torch-bin"
];
doCheck = false;
};
};
}
Advanced Techniques
Using `self` vs `super`
- `super`: Access packages before overlay modifications
- `self`: Access final overlaid packages (use carefully to avoid infinite recursion)
propagatedBuildInputs = with super.python3Packages; [
numpy # Use super for standard packages
(self.python3Packages.custom-overlay-package) # Use self for overlay deps
(accelerate.override { torch = super.python3Packages."torch-bin"; })
];
Conditional Dependencies
propagatedBuildInputs = with super.python3Packages; [
numpy scipy
] ++ super.lib.optionals (!super.stdenv.isAarch64) [
bitsandbytes # x86_64 only
] ++ super.lib.optionals super.stdenv.isLinux [
nvidia-ml-py # Linux only
];
Source Fetching
# Preferred: Specific tag/commit
src = super.fetchFromGitHub {
owner = "owner"; repo = "repo";
tag = "v${version}"; # or rev = "commit-hash";
hash = "sha256-..."; # Use nix-prefetch-github
};
# PyPI when available
src = super.fetchPypi {
inherit pname version;
hash = "sha256-...";
};
Build Systems
# Modern pyproject.toml
format = "pyproject";
nativeBuildInputs = [ super.python3Packages.hatchling ];
# Legacy setup.py
format = "setuptools";
nativeBuildInputs = [ super.python3Packages.setuptools ];
Best Practices
Hash Management
# Get hashes for sources
nix-prefetch-github owner repo --rev v1.0.0
nix hash to-sri --type sha256 abc123... # Convert to SRI format
Testing Overlays
# Test single package
nix-shell -p '(python3.withPackages (ps: [ ps.my-package ]))'
nix-shell --run 'python -c "import my_package"'
Dependency Management
- Keep dependencies minimal and explicit
- Always use `torch-bin` for the PyTorch ecosystem
- Override transitive torch dependencies
Quality Metadata
meta = with super.lib; {
description = "Clear, concise description";
homepage = "https://github.com/owner/repo";
license = licenses.mit;
platforms = platforms.unix;
};
Troubleshooting
Common Issues
Import Errors: Use `super.python3Packages."torch-bin"`, not plain `torch`.
Version Conflicts: Patch constraints with `postPatch`:
postPatch = ''
substituteInPlace setup.py --replace "torch==2.0.0" "torch>=2.0.0"
'';
CUDA Issues: Set proper CUDA environment
buildInputs = [ cuda-redist ];
CUDA_HOME = "${cuda-redist}";
Test Failures: Disable with `doCheck = false;`
Integration with Flake
Adding New Overlays
- Create overlay file in the `overlays/` directory
- Add to the `nekoOverlays` list in `flake.nix`
- Include in Python environments:
# In flake.nix
nekoOverlays = [
(import ./overlays/new-package.nix)
];
# Usage in environments
pyEnvExample = pkgs.python3.withPackages (ps: [
ps.new-package
]);
Overlay Ordering
Dependencies must come first:
nekoOverlays = [
# Base packages first
(import ./overlays/cached-path.nix)
(import ./overlays/ema-pytorch.nix)
# Dependent packages after
(import ./overlays/f5-tts.nix) # Uses ema-pytorch
# Complex packages last
(import ./overlays/pi-zero-pytorch/pi-zero-pytorch.nix)
# Patches to existing packages last
(import ./overlays/einops.nix) # Overrides existing einops
];
Handling GPU-Heavy Test Suites
Some upstream projects (for example, einops) ship notebook-based test suites that start long-running CUDA benchmarks. In CI or local builds these checks routinely exceed the sandbox timeout. Our `overlays/einops.nix` patch disables the notebook runner and its specific `test_notebook_3` case, while still allowing the rest of the package to build reproducibly. Always prefer pinpointing the slow tests (via `disabledTestPaths`/`disabledTests`) instead of globally disabling checks, so that lightweight import tests keep running.
This guide covers the key patterns for packaging Python dependencies in the Neko Agent project with proper PyTorch ecosystem integration and CUDA support.
Neko Repository Cheat Sheet (Current Codebase)
- Project Structure:
  - `server/`: Go backend (`cmd/`, `internal/`, `pkg/`, optional `plugins/`). Build via `server/build` → `server/bin/neko`.
  - `client/`: Vue 2 + TypeScript SPA (`src/`, `public/`), built to `client/dist`.
  - `apps/`: Docker app images (e.g., `firefox/`, `chromium/`, `kde/`, `vlc/`).
  - `runtime/`: Base image and Xorg/driver configs; basis for final images.
  - `utils/`: Tooling; Dockerfile generator at `utils/docker/main.go`.
  - `webpage/`: Docusaurus docs site.
  - Root: `build` (image builder), `docker-compose.yaml` (local run), `config.yml` (sample config).
- Dev Commands:
  - Client dev: `cd client && npm ci && npm run serve` (hot-reload dev server).
  - Client build: `cd client && npm run build` → `client/dist`.
  - Client lint: `cd client && npm run lint`.
  - Server build: `cd server && ./build` (use `./build core` to skip plugins) → `server/bin/neko`.
  - Docker images: base `./build`, app `./build -a firefox`, flavor `./build -f nvidia -a chromium`.
  - Local run: `docker compose up -d` (serves `ghcr.io/m1k1o/neko/firefox:latest` on `:8080`).
- Coding Style:
- Indent 2 spaces (LF). Client: Prettier single quotes, no semicolons, trailing commas. Go: gofmt/go vet; package names short/lowercase; files snake_case.
- Architecture & Entry Points:
  - SPA served as static assets from the container; talks HTTP/WS on `:8080`.
  - `server/cmd/neko/main.go` → `cmd.Execute()`; `server/cmd/root.go` handles config/logging init; `server/cmd/serve.go` wires managers in order: session → member → desktop → capture → webrtc → websocket → api → http.
  - HTTP: `server/internal/http/manager.go` registers `/api`, `/api/ws`, `/health`, `/metrics`, static files, and optional pprof.
  - REST router: `server/internal/api/router.go`. WebSocket: `server/internal/websocket/manager.go`. WebRTC: `server/internal/webrtc/manager.go`. Session/auth: `server/internal/session`, `server/internal/member/*`. Capture/Xorg: `server/internal/capture`, `server/internal/desktop`.
- API Surface (summary; see `server/openapi.yaml`):
  - Auth: `POST /api/login`, `POST /api/logout`, `GET /api/whoami`, `POST /api/profile`.
  - Sessions: `GET /api/sessions`, `GET/DELETE /api/sessions/{id}`, `POST /api/sessions/{id}/disconnect`.
  - Room: `GET/POST /api/room/settings`, broadcast `GET /api/room/broadcast`, `POST /api/room/broadcast/start|stop`.
  - Clipboard: `GET/POST /api/room/clipboard`, image `GET /api/room/clipboard/image.png`.
  - Keyboard: `GET/POST /api/room/keyboard/map`, `GET/POST /api/room/keyboard/modifiers`.
  - Control: `GET /api/room/control`, `POST /api/room/control/{request|release|take|give/{sessionId}|reset}`.
  - Screen: `GET/POST /api/room/screen`, `GET /api/room/screen/configurations`, `GET /api/room/screen/{cast.jpg|shot.jpg}`.
  - Upload: `POST /api/room/upload/{drop|dialog}`, `DELETE /api/room/upload/dialog`.
  - Utility: `POST /api/batch`, `GET /health`, `GET /metrics`.
- WebSocket:
  - Connect to `ws(s)://<host>/api/ws` with a cookie, `Authorization: Bearer <token>`, or `?token=<token>`.
  - Envelope `{ "event": "<string>", "payload": <JSON> }` with events defined under `server/pkg/types/event/events.go` (e.g., `system/init`, `signal/request`, `control/move`, `clipboard/set`, `keyboard/map`, `broadcast/status`, `file_chooser_dialog/opened|closed`).
- Configuration Model:
  - Defaults in `config.yml`; override via `NEKO_*` envs; legacy v2 envs supported behind `NEKO_LEGACY=1` and have v3 equivalents (see `server/internal/config/*`).
Complete REST API (v3)
Authentication
- Methods: Cookie (`NEKO_SESSION`), Bearer token (`Authorization: Bearer <token>`), or query token (`?token=<token>`).
- Default security applies to all endpoints unless explicitly noted; `/health`, `/metrics`, and `/api/login` do not require auth.
General
- GET `/health`: Liveness probe. 200 text/plain.
- GET `/metrics`: Prometheus metrics. 200 text/plain.
- POST `/api/batch`: Execute multiple API requests in one call.
  - Request: `[{ path: string, method: 'GET'|'POST'|'DELETE', body?: any }]`.
  - Response: `[{ path, method, status: number, body?: any }]`.
- GET `/api/stats`: Server/session statistics.
  - Response: `{ has_host, host_id, server_started_at, total_users, last_user_left_at, total_admins, last_admin_left_at }`.
Current Session
- POST `/api/login`: Authenticate and start a session. No auth required.
  - Request: `{ username, password }`.
  - Response: `SessionLoginResponse` = `SessionData` plus optional `{ token }` if cookies are disabled.
- POST `/api/logout`: Terminate current session. 200.
- GET `/api/whoami`: Retrieve current session info.
  - Response: `SessionData`.
- POST `/api/profile`: Update current session's runtime profile (no member sync).
  - Request: `MemberProfile`. 204.
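A minimal login sketch using only the standard library, assuming the endpoint shapes above (the `login` and `bearer_headers` helpers are illustrative, not part of any Neko SDK):

```python
import json
import urllib.request

def bearer_headers(token):
    """Auth headers for token-based requests, per the Authentication section."""
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

def login(base_url, username, password):
    """POST /api/login and return the parsed SessionLoginResponse.

    Performs a network call; run it against a real Neko server.
    """
    req = urllib.request.Request(
        f"{base_url}/api/login",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# session = login("http://localhost:8080", "admin", "neko")
# token = session.get("token")  # present when cookies are disabled
print(bearer_headers("example-token")["Authorization"])
```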
Sessions
- GET `/api/sessions`: List active sessions.
  - Response: `SessionData[]`.
- GET `/api/sessions/{sessionId}`: Get a session by id.
  - Response: `SessionData`. 404 if not found.
- DELETE `/api/sessions/{sessionId}`: Remove a session. 204.
- POST `/api/sessions/{sessionId}/disconnect`: Force disconnect a session. 204.
Room Settings
- GET `/api/room/settings`: Get room settings.
  - Response: `Settings`.
- POST `/api/room/settings`: Update room settings. 204.
  - Request: `Settings`.
Room Broadcast
- GET `/api/room/broadcast`: Get broadcast status.
  - Response: `{ url?: string, is_active: boolean }`.
- POST `/api/room/broadcast/start`: Start RTMP broadcast.
  - Request: `{ url: string }`. 204. Errors: 400 missing URL; 422 already broadcasting; 500 start failure.
- POST `/api/room/broadcast/stop`: Stop broadcast. 204. Error: 422 not broadcasting.
Room Clipboard
- GET `/api/room/clipboard`: Get clipboard text/HTML.
  - Response: `{ text?: string, html?: string }`.
- POST `/api/room/clipboard`: Set clipboard text/HTML. 204.
  - Request: `{ text?: string, html?: string }`.
- GET `/api/room/clipboard/image.png`: Get clipboard image.
  - Response: `image/png` binary.
Room Keyboard
- GET `/api/room/keyboard/map`: Get keyboard map.
  - Response: `{ layout?: string, variant?: string }`.
- POST `/api/room/keyboard/map`: Set keyboard map. 204.
  - Request: `{ layout?: string, variant?: string }`.
- GET `/api/room/keyboard/modifiers`: Get keyboard modifiers.
  - Response: `{ shift?, capslock?, control?, alt?, numlock?, meta?, super?, altgr? }` (all booleans).
- POST `/api/room/keyboard/modifiers`: Set modifiers. 204.
  - Request: same shape as response.
Room Control
- GET `/api/room/control`: Get control status.
  - Response: `{ has_host: boolean, host_id?: string }`.
- POST `/api/room/control/request`: Request host control. 204. Error: 422 already has host.
- POST `/api/room/control/release`: Release control. 204.
- POST `/api/room/control/take`: Take control (admin/host override). 204.
- POST `/api/room/control/give/{sessionId}`: Give control to session. 204.
  - Path: `sessionId` string. Errors: 400 target can't host; 404 unknown session.
- POST `/api/room/control/reset`: Reset control state. 204.
Room Screen
- GET `/api/room/screen`: Get current screen config.
  - Response: `ScreenConfiguration`.
- POST `/api/room/screen`: Change screen config.
  - Request/Response: `ScreenConfiguration`. Errors: 422 invalid config.
- GET `/api/room/screen/configurations`: List available configurations.
  - Response: `ScreenConfiguration[]`.
- GET `/api/room/screen/cast.jpg`: Current screencast JPEG.
  - Response: `image/jpeg`. Errors: 400 screencast disabled; 500 fetch error.
- GET `/api/room/screen/shot.jpg`: On-demand screenshot JPEG.
  - Query: `quality` integer (0–100). Response: `image/jpeg`. Errors: 500 create image.
Room Upload
- POST `/api/room/upload/drop`: Upload files and drop at coordinates. 204.
  - multipart/form-data: `x: number`, `y: number`, `files: file[]`.
  - Errors: 400 upload error; 500 processing error.
- POST `/api/room/upload/dialog`: Upload files to active dialog. 204.
  - multipart/form-data: `files: file[]`. Errors: 422 no dialog.
- DELETE `/api/room/upload/dialog`: Close file chooser dialog. 204. Error: 422 no dialog.
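A sketch of assembling the drop-upload form fields before sending (the `drop_form_fields` helper is hypothetical; only the field names `x`, `y`, and `files` come from the endpoint description above):

```python
def drop_form_fields(x, y, paths):
    """Form fields for POST /api/room/upload/drop (multipart/form-data).

    Field names per the endpoint description; coordinates are stringified
    as form values usually are.
    """
    if not paths:
        raise ValueError("at least one file is required")
    return {"x": str(int(x)), "y": str(int(y)), "files": list(paths)}

# With the third-party `requests` library this could be sent roughly as:
# requests.post(f"{base}/api/room/upload/drop",
#               data={"x": fields["x"], "y": fields["y"]},
#               files=[("files", open(p, "rb")) for p in fields["files"]])
fields = drop_form_fields(120, 240, ["screenshot.png"])
print(fields)
```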
Members
- GET `/api/members`: List members.
  - Query: `limit?: number`, `offset?: number`.
  - Response: `MemberData[]`.
- POST `/api/members`: Create member.
  - Request: `MemberCreate`.
  - Response: `MemberData`. Error: 422 ID exists.
- GET `/api/members/{memberId}`: Get member profile.
  - Response: `MemberProfile`. 404 if not found.
- POST `/api/members/{memberId}`: Update member profile. 204.
  - Request: `MemberProfile`.
- DELETE `/api/members/{memberId}`: Remove member. 204.
- POST `/api/members/{memberId}/password`: Update member password. 204.
  - Request: `{ password: string }`.
- POST `/api/members_bulk/update`: Bulk update member profiles. 204.
  - Request: `{ ids: string[], profile: MemberProfile }`.
- POST `/api/members_bulk/delete`: Bulk delete members. 204.
  - Request: `{ ids: string[] }`.
Schemas (Shapes)
- `SessionData`: `{ id: string, profile: MemberProfile, state: { is_connected: boolean, is_watching: boolean } }`.
- `MemberProfile`: `{ name?: string, is_admin?: boolean, can_login?: boolean, can_connect?: boolean, can_watch?: boolean, can_host?: boolean, can_share_media?: boolean, can_access_clipboard?: boolean, sends_inactive_cursor?: boolean, can_see_inactive_cursors?: boolean, plugins?: object }`.
- `MemberData`: `{ id: string, profile: MemberProfile }`.
- `MemberCreate`: `{ username: string, password: string, profile?: MemberProfile }`.
- `Settings`: `{ private_mode?, locked_controls?, implicit_hosting?, inactive_cursors?, merciful_reconnect?, plugins?: object }`.
- `ScreenConfiguration`: `{ width: number, height: number, rate: number }`.
- `KeyboardMap`: `{ layout?: string, variant?: string }`; `KeyboardModifiers`: all booleans listed above.
- `BroadcastStatus`: `{ url?: string, is_active: boolean }`.
- `ClipboardText`: `{ text?: string, html?: string }`.
- `ErrorMessage`: `{ message: string }` (for 4xx/5xx bodies).
Notes
- Unless stated, successful POSTs return 204 No Content; GETs return JSON bodies per schema.
- Image endpoints return binary JPEG/PNG with appropriate content types.
- Security schemes in effect globally: Bearer, Cookie, or token query; `/api/login` has security disabled.
Complete WebSocket API
- Envelope: `{ "event": string, "payload": any }`. Connect to `ws(s)://<host>/api/ws` with cookie, Bearer, or `?token=`.
- Heartbeats: server sends `system/heartbeat` every ~10s plus low-level pings; clients may send `client/heartbeat` (no payloads).
System
- Server → Client `system/init`: `{ session_id: string, control_host: { id?: string, has_host: boolean, host_id?: string }, screen_size: { width, height, rate }, sessions: { [id]: { id, profile: MemberProfile, state: SessionState } }, settings: Settings, touch_events: boolean, screencast_enabled: boolean, webrtc: { videos: string[] } }`.
- Server → Client (admins) `system/admin`: `{ screen_sizes_list: ScreenSize[], broadcast_status: { is_active: boolean, url?: string } }`.
- Server → Client `system/settings`: `{ id: string, ...Settings }` (actor `id`, new settings).
- Client → Server `system/logs`: `[{ level: 'debug'|'info'|'warn'|'error', fields?: object, message: string }]`.
- Server → Client `system/disconnect`: `{ message: string }`.
Signaling (WebRTC)
- Client → Server `signal/request`: `{ video: { disabled?: boolean, selector?: { id: string, type: 'exact'|'best' }, auto?: boolean }, audio: { disabled?: boolean }, auto?: boolean }`.
- Server → Client `signal/provide`: `{ sdp: string, iceservers: [{ urls: string[], username?: string, credential?: string }], video: { disabled: boolean, id: string, video?: string, auto: boolean }, audio: { disabled: boolean } }`.
- Client ↔ Server `signal/offer` | `signal/answer`: `{ sdp: string }` (direction depends on negotiation; Neko often offers first via `signal/provide`).
- Client → Server `signal/candidate`: `{ candidate: string, sdpMid?: string, sdpMLineIndex?: number }`.
- Client → Server `signal/video`: `{ disabled?: boolean, selector?: { id: string, type: 'exact'|'best' }, auto?: boolean }`.
- Client → Server `signal/audio`: `{ disabled?: boolean }`.
- Server → Client `signal/restart`: `{ sdp: string }` (ICE restart offer).
- Server → Client `signal/close`: null payload (peer disconnected).
Sessions
- Server → Client `session/created`: `{ id, profile: MemberProfile, state: SessionState }`.
- Server → Client `session/deleted`: `{ id }`.
- Server → Client `session/profile`: `{ id, ...MemberProfile }`.
- Server → Client `session/state`: `{ id, is_connected, connected_since?, not_connected_since?, is_watching, watching_since?, not_watching_since? }`.
- Server → Client `session/cursors`: `[{ id: string, cursors: { x: number, y: number }[] }]` (only when `settings.inactive_cursors` is enabled).
Control & Input
- Server ↔ Client `control/request`:
  - Client → Server: no payload (request host control).
  - Server → Client (to current host): `{ id: string }` (requester id).
- Server → Client `control/host`: `{ id: string, has_host: boolean, host_id?: string }`.
- Client → Server `control/release`: no payload (only host; requires `can_host`).
- Client → Server `control/move`: `{ x, y }`.
- Client → Server `control/scroll`: `{ delta_x?: number, delta_y?: number, control_key?: boolean }` (legacy `{ x, y }` supported).
- Client → Server `control/buttonpress|buttondown|buttonup`: `{ code: uint32, x?: number, y?: number }`.
- Client → Server `control/keypress|keydown|keyup`: `{ keysym: uint32, x?: number, y?: number }`.
- Client → Server `control/touchbegin|touchupdate|touchend`: `{ touch_id: uint32, x: number, y: number, pressure: uint8 }`.
- Client → Server `control/cut|control/copy|control/select_all`: no payload.
- Client → Server `control/paste`: `{ text?: string }` (optionally seeds clipboard first).
- Notes: Control requires `profile.can_host` and either being host or implicit hosting; clipboard ops also require `profile.can_access_clipboard`.
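The input events above are plain JSON messages, so an automation client can build them with small constructors. A sketch under the payload shapes listed above (the builder function names are hypothetical; a real sender must also hold host control):

```python
import json

def control_move(x, y):
    """control/move message, per the event list above."""
    return {"event": "control/move", "payload": {"x": x, "y": y}}

def control_keypress(keysym, x=None, y=None):
    """control/keypress message; keysym is an X11 keysym (uint32)."""
    payload = {"keysym": keysym}
    if x is not None:
        payload["x"] = x
    if y is not None:
        payload["y"] = y
    return {"event": "control/keypress", "payload": payload}

def control_scroll(delta_x=0, delta_y=0, control_key=False):
    """control/scroll message using the non-legacy delta fields."""
    return {"event": "control/scroll",
            "payload": {"delta_x": delta_x, "delta_y": delta_y,
                        "control_key": control_key}}

# Each dict would be json.dumps()'d and sent over the /api/ws socket.
print(json.dumps(control_move(100, 200)))
```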
Screen
- Client → Server `screen/set`: `{ width, height, rate }` (admin only).
- Server → Client `screen/updated`: `{ id: string, width, height, rate }`.
Clipboard
- Client → Server `clipboard/set`: `{ text: string }` (host with clipboard permission).
- Server → Client `clipboard/updated`: `{ text: string }` (sent to host on OS clipboard change).
Keyboard
- Client → Server `keyboard/map`: `{ layout?: string, variant?: string }` (host only).
- Client → Server `keyboard/modifiers`: `{ shift?, capslock?, control?, alt?, numlock?, meta?, super?, altgr? }` (host only; omitted fields unchanged).
Broadcast
- Server → Client (admins) `broadcast/status`: `{ is_active: boolean, url?: string }`.
Opaque Send Channel
- Client → Server `send/unicast`: `{ receiver: string, subject: string, body: any }` → forwarded to receiver as `{ sender, receiver, subject, body }`.
- Client → Server `send/broadcast`: `{ subject: string, body: any }` → broadcast to others as `{ sender, subject, body }`.
File Chooser Dialog
- Server → Client `file_chooser_dialog/opened`: `{ id: string }` (host holding dialog).
- Server → Client `file_chooser_dialog/closed`: `{}`.
Shared Types (payload shapes)
- `MemberProfile`: `{ name?: string, is_admin?, can_login?, can_connect?, can_watch?, can_host?, can_share_media?, can_access_clipboard?, sends_inactive_cursor?, can_see_inactive_cursors?, plugins?: object }`.
- `SessionState`: `{ is_connected: boolean, connected_since?: string, not_connected_since?: string, is_watching: boolean, watching_since?: string, not_watching_since?: string }` (ISO timestamps).
- `Settings` (WS): `{ private_mode, locked_logins, locked_controls, control_protection, implicit_hosting, inactive_cursors, merciful_reconnect, heartbeat_interval, plugins?: object }`.
- `ScreenSize`: `{ width: number, height: number, rate: number }`.
- `KeyboardMap`: `{ layout: string, variant: string }`; `KeyboardModifiers`: booleans listed above.
- `Cursor`: `{ x: number, y: number }`.
Heartbeat & Reconnect Behavior
- Layers of liveness:
  - WebSocket ping/pong: the server sends a WS Ping every ~10s and also emits `system/heartbeat` in-band. Browsers auto-reply with Pong; no client code is needed for WS Pong.
  - Client heartbeat event: clients may send `client/heartbeat` at a cadence derived from `settings.heartbeat_interval` (default 120s). The server accepts it but does not strictly require it for liveness; it is useful for analytics and legacy bridges.
- Settings and defaults:
  - `session.heartbeat_interval` (default 120s) is included in `system/init.settings.heartbeat_interval` for the client to display/use.
  - `session.merciful_reconnect` (default true): if a session is "already connected" and a new WS arrives, the server replaces the old connection instead of rejecting it. With this off, a second connection is rejected with reason "already connected".
- Timeouts and proxies:
  - Ensure reverse proxies have read timeouts comfortably above both the WS Ping cadence (~10s) and the client heartbeat interval (≥120s). Recommended: `proxy_read_timeout ≥ 300s`, and forward the Upgrade/Connection headers for WebSocket.
  - Avoid aggressive idle timeouts on L4/L7 load balancers that terminate idle TCP flows under a few minutes; set ≥5 minutes where possible.
- Disconnect semantics:
  - If a WS Ping fails or the socket errors, the server closes the connection and sends `system/disconnect {message}` when possible. The session is marked disconnected; clients should auto-reconnect and renegotiate WebRTC.
  - On WebRTC transport loss, the server emits `signal/close`; clients should re-issue `signal/request` or handle `signal/restart`.
- Troubleshooting frequent drops:
  - Blackouts every N seconds: increase reverse proxy `proxy_read_timeout`/keep-alive timeouts; confirm no CDN or web application firewall is in the WS path.
  - "already connected" on reconnect: enable/keep `session.merciful_reconnect=true` (default), or ensure clients close prior tabs before reconnecting.
  - Mobile/background tabs: some OSes suspend WS/Pong; expect disconnects when backgrounded. The client must reconnect on resume.
Legacy WebSocket Events (v2 Compatibility Proxy)
- Enabled when legacy mode is active (see `server/internal/http/manager.go`, which reads the `legacy` setting via viper). The legacy bridge maps v3 events to v2 event names for old clients and also emits some additional compatibility messages.
- Envelope: legacy messages also use `{event, ...payload}`.
- System: `system/init`: `{ locks: {login?, control?, file_transfer?}: string map, implicit_hosting: boolean, file_transfer: boolean, heartbeat_interval: number }`. `system/disconnect`: `{ title?: string, message: string }`. `system/error`: historical; emitted by legacy paths on errors.
- Members: `member/list`: `{ members: [{ id, name, admin: boolean, muted: boolean }] }`. `member/connected`: `{ id, name, admin, muted }`. `member/disconnected`: `{ id }`.
- Signaling: `signal/provide`: `{ id: string, sdp: string, lite: boolean, ice: [{ urls, username?, credential? }] }`. `signal/offer`: `{ sdp: string }`. `signal/answer`: `{ displayname: string, sdp: string }`. `signal/candidate`: `{ data: string }` (raw ICE JSON as string).
- Control/Admin (compatibility notifications): `control/locked`: `{ id: string }` (lock established). `control/release`: `{ id: string }`. `control/request`: client-originated; `control/requesting`: notification to host; `control/give`: `{ id, target }`. `control/clipboard`: `{ text: string }`. `control/keyboard`: `{ layout?: string, capsLock?: boolean, numLock?: boolean, scrollLock?: boolean }`. `admin/lock|unlock`: `{ resource: 'control'|'login'|'file_transfer', id: string }`. `admin/control|release|give`: `{ id: string }` or `{ id, target }`. `admin/ban|kick|mute|unmute`: `{ id: string }` or `{ target: string, id: string }`.
- Chat (legacy plugin messages): `chat/message`: send/receive `{ id?: string, content: string }`. `chat/emote`: send/receive `{ id?: string, emote: string }`.
- Filetransfer (legacy plugin messages): `filetransfer/list`: `{ cwd: string, files: [{ name, size, is_dir, mtime, perms, ... }] }`. `filetransfer/refresh`: same shape as list (triggered refresh).
- Screen/Broadcast: `screen/configurations`: `{ configurations: { [idx]: { width, height, rates: { [idx]: rate } } } }`. `screen/resolution`: `{ id?: string, width, height, rate }`. `screen/set`: `{ width, height, rate }`. `broadcast/status`: `{ url: string, isActive: boolean }`. `broadcast/create`: `{ url: string }`; `broadcast/destroy`: no payload.
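All legacy traffic shares the `{event, ...payload}` envelope, so a thin client shim can split event routing from payload handling. A minimal sketch (the helper names are mine, not Neko's):

```python
import json

def parse_envelope(raw: str) -> tuple[str, dict]:
    """Split a legacy message {event, ...payload} into (event, payload)."""
    msg = json.loads(raw)
    event = msg.pop("event")  # everything else is the payload
    return event, msg

def make_envelope(event: str, payload: dict) -> str:
    """Build an outbound legacy message with the same flat envelope shape."""
    return json.dumps({"event": event, **payload})
```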
1. What Neko Is (Concept & Origin)
Neko (often styled n.eko) is an open‑source, self‑hosted virtual browser / remote desktop environment: you run a containerized Linux desktop with a preinstalled browser (Firefox, Chromium, etc.) on your own infrastructure; Neko streams the interactive desktop (video, audio, input) to remote clients via WebRTC, so multiple participants can watch and even take control in real time. GitHub neko.m1k1o.net heise online
The project was started by its author after the shutdown of Rabb.it; needing a reliable way to watch anime remotely with friends over limited bandwidth + unstable Discord streaming, he built a WebRTC‑based Dockerized environment so everyone could share a single browser session. This collaborative genesis still shapes Neko’s multi‑user design (shared control queue, watch‑party friendliness). GitHub heise online
Neko targets privacy, isolation, and portability: browsing happens in the container, not on the viewer’s device; host fingerprints/cookies stay server‑side; nothing persistent need touch the client unless you configure it. This “shielded browser” model is highlighted in both the docs and independent coverage (Heise), which also frames Neko as a lightweight VPN alternative for accessing internal resources without distributing full desktop access. neko.m1k1o.net heise online fossengineer.com
2. Primary Use Cases
- Collaborative browsing & watch parties: All participants see the same live browser; host control can be passed; synchronized media playback works well because WebRTC streams the rendered video/audio from the container. GitHub neko.m1k1o.net fossengineer.com
- Interactive presentations, workshops, remote support: Presenter drives a shared browser/desktop; participants can be granted temporary control for demos or troubleshooting. Heise specifically calls out company trainings and support scenarios. GitHub heise online fossengineer.com
- Privacy / throwaway browsing / firewall bypass: Because traffic originates from the Neko host, users can browse sites blocked locally (subject to policy/ethics); community reports note using Neko to get around locked‑down work networks. heise online fossengineer.com Reddit
- Web dev & cross‑browser testing in controlled envs: Spin up specific browser versions (incl. Waterfox, Tor, Chromium variants) to test sites without polluting local machines. neko.m1k1o.net neko.m1k1o.net heise online
- Remote application streaming beyond browsers: Official images include full desktop environments (KDE, Xfce), Remmina (RDP/VNC client), VLC, and more; you can install arbitrary Linux GUI apps, turning Neko into a general remote app delivery layer. GitHub neko.m1k1o.net heise online
- Embedding into other web properties / programmatic rooms: Docs and community guides show URL query param auth for frictionless embedding; REST API + Neko Rooms enable dynamic, ephemeral shareable sessions. neko.m1k1o.net GitHub fossengineer.com
3. High‑Level Architecture
At a high level, a Neko deployment comprises:
- Server container(s): Run the Linux desktop + target browser/application; capture Xorg display frames + PulseAudio; encode via GStreamer; feed into WebRTC pipeline (Pion stack). GitHub neko.m1k1o.net GitHub
- Signaling / control plane: HTTP + WebSocket endpoints manage sessions, auth, and host‑control; periodic ping/heartbeat maintain liveness (esp. behind proxies). neko.m1k1o.net neko.m1k1o.net GitHub
- WebRTC media plane: ICE negotiation (STUN/TURN) to establish peer link(s); selectable port strategy (ephemeral range vs. UDP/TCP mux single port); optional Coturn relay for NAT‑restricted environments. neko.m1k1o.net neko.m1k1o.net GitHub
- Client UI (served over HTTPS): Browser front‑end page that renders the stream in a canvas/video element, sends input events (mouse/keyboard), displays participant cursors, chat/plugins, and exposes host‑control queue. GitHub neko.m1k1o.net neko.m1k1o.net
- Optional ecosystem services: REST API, Prometheus metrics exporter, plugin hooks (chat, file upload), and higher‑level orchestration projects (Neko Rooms / Apps / VPN). GitHub neko.m1k1o.net neko.m1k1o.net
4. Feature Inventory (v3 era)
- Multi‑user concurrent session w/ host handoff + inactive cursors: Participants can join; privileges (watch / host / share media / clipboard) governed per‑member profile. neko.m1k1o.net GitHub neko.m1k1o.net
- Audio + video streaming w/ low latency: WebRTC transport from container to clients; GStreamer capture; stream selector to adjust quality. neko.m1k1o.net GitHub neko.m1k1o.net
- GPU acceleration modes (Intel/Nvidia flavors) & CPU builds: Select appropriate image flavor to offload encoding & improve responsiveness; GPU support maturity varies—docs caution focus currently on CPU images. neko.m1k1o.net heise online neko.m1k1o.net
- Granular auth/authorization (admin vs user; fine‑grained caps): Role bits include can_login, can_connect, can_watch, can_host, can_share_media, can_access_clipboard, etc.; supports multiuser password split, file‑backed users, in‑memory object sets, and no‑auth (dev only). neko.m1k1o.net GitHub
- REST API + API token (admin programmatic control) & batch HTTP: Added in v3; enables external orchestration, dynamic user provisioning, and admin operations without interactive login; API token should be short‑lived in ephemeral rooms. GitHub neko.m1k1o.net
- Prometheus metrics & pprof profiling: Expose runtime health / performance metrics; integrate into observability stacks; profiling hooks assist tuning. GitHub neko.m1k1o.net
- Desktop quality‑of‑life: Clipboard reworked via xclip; drag‑and‑drop & file chooser upload; touchscreen input driver; dynamic resolution via xrandr; cursor image events. GitHub neko.m1k1o.net
- Screencast endpoints + webcam/mic passthrough (experimental): HTTP/JPEG screencast endpoints for snapshots/casting; optional upstream of user webcam/mic. GitHub neko.m1k1o.net
- Plugin system (chat, file upload, user‑scoped plugin config map). GitHub neko.m1k1o.net neko.m1k1o.net
5. Supported Browsers / Apps / Desktops
Neko ships many tagged images; availability varies by architecture and GPU flavor. Current matrix (AMD64 strongest support): Firefox, Waterfox, Tor Browser; Chromium family incl. Google Chrome, Microsoft Edge, Brave, Vivaldi, Opera; plus Ungoogled Chromium. Additional desktop/media apps: KDE, Xfce, Remmina, VLC. ARM support exists for subsets (e.g., Brave & Vivaldi on ARM64; some lack DRM). neko.m1k1o.net heise online GitHub
Community packages (Umbrel) surface a streamlined install for home servers; Umbrel metadata shows current packaged version (3.0.4 at capture) and highlights collaboration + tunneling access patterns. apps.umbrel.com neko.m1k1o.net
6. Deployment Overview (Minimal to Advanced)
6.1 Quick Minimal Docker Run
Pull an image (e.g., the Firefox flavor) and run it mapping the HTTP and WebRTC ports; provide screen size and user/admin passwords via env vars; size shared memory for modern browsers (e.g., 2 GB). A community docker‑compose example (FOSS Engineer) maps `8888:8080` plus a `52000-52100/udp` EPR range and sets `NEKO_MEMBER_MULTIUSER_*` passwords. fossengineer.com neko.m1k1o.net neko.m1k1o.net
6.2 Choosing Registry & Tags
Prefer GitHub Container Registry (GHCR) for stable, flavor‑specific version tags; Docker Hub hosts latest dev (amd64) convenience builds. Semantic versioning (MAJOR.MINOR.PATCH) is supported; `latest` points at the most recent stable—pin explicit tags for reproducibility. neko.m1k1o.net Docker Hub
6.3 Selecting Flavors (CPU vs GPU)
The image suffix selects the hardware‑acceleration stack: `nvidia-*` for CUDA GPUs (AMD64), `intel-*` for VA‑API/QuickSync paths, or base CPU images. Docs caution GPU support may lag; verify in your environment. neko.m1k1o.net heise online
6.4 Architecture Match & Resource Planning
Images published for linux/amd64, arm64, arm/v7; not every browser builds on all arches; some Chromium‑derived variants require ≥2GB RAM (Heise). Check the docs availability matrix before pulling. neko.m1k1o.net heise online
6.5 Persistent State (Data Volumes)
While Neko can be run “throwaway,” you may bind‑mount config, member files, and persistent browser profiles to retain bookmarks, extensions (if policy permits), and user lists; the docs show file/member providers referencing host paths (e.g., `/opt/neko/members.json`). neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net
7. Networking & WebRTC Ports
7.1 Why Ports Matter
WebRTC media does not traverse your HTTP reverse proxy; you must expose the negotiated media ports (or provide a TURN relay). If you only open 443, connections will fail unless multiplexing or a relay is used. neko.m1k1o.net neko.m1k1o.net Reddit
7.2 Ephemeral UDP Port Range (EPR)
Configure `NEKO_WEBRTC_EPR` (e.g., `59000-59100`) and expose an identical host:container UDP range; don’t remap—ICE candidates must match reachable ports. neko.m1k1o.net
7.3 UDP/TCP Multiplexing
Alternatively, specify single `udpmux`/`tcpmux` ports when firewall pinholes are scarce; open both protocols for fallback where UDP is blocked. neko.m1k1o.net
7.4 Public vs NAT’d IPs
Set `nat1to1` when advertising a different reachable address (NAT hairpin caveats apply), or provide an IP retrieval URL to auto‑detect the public address; otherwise ICE may hand out unroutable candidates. neko.m1k1o.net
7.5 TURN Integration
Provide STUN/TURN server JSON (frontend/back‑end separation) via env vars; example Coturn compose snippet in docs; TURN recommended when clients sit behind strict NAT/firewalls. neko.m1k1o.net neko.m1k1o.net
7.6 Real‑World Gotchas
Community reverse‑proxy thread shows mis‑set X‑Forwarded headers and missing additional port exposures leading to 502s; verifying correct WebRTC ports resolved issues for some users. Reddit neko.m1k1o.net
8. Reverse Proxy Patterns (HTTP Plane)
8.1 Enable Proxy Trust
Set `server.proxy=true` so Neko honors `X-Forwarded-*` headers (important for logging, CSRF, and cookie domain/path). The docs warn to adjust WebSocket timeouts because Neko pings every ~10s and expects a client heartbeat every ~120s. neko.m1k1o.net neko.m1k1o.net
8.2 Traefik v2 Example
Label‑driven routing to backend port `8080`; integrate a TLS cert resolver; ensure UDP media ports are separately exposed. neko.m1k1o.net neko.m1k1o.net
8.3 Nginx Example & Header Hygiene
A minimal conf proxies HTTP plus the WebSocket upgrade; you may add X‑Forwarded‑For/Proto, cache bypass, and long read timeouts—legacy v2 docs show an extended header set; community notes correct the `X-Forwarded-Proto` spelling vs “Protocol.” neko.m1k1o.net neko.m1k1o.net Reddit
8.4 Apache, Caddy, HAProxy Templates
Docs provide working snippets, including the WebSocket rewrite for Apache, a one‑liner `reverse_proxy` for Caddy with automatic HTTPS, and an HAProxy ACL routing recipe with timeout‑tuning guidance. neko.m1k1o.net neko.m1k1o.net
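For reference, the Caddy pattern really is a near one‑liner. A sketch with a placeholder hostname and upstream — Caddy provisions TLS and handles the WebSocket upgrade automatically, but WebRTC media ports still bypass it:

```caddyfile
neko.example.com {
    reverse_proxy 127.0.0.1:8080
}
```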
9. Authentication & Authorization
9.1 Member vs Session Providers
Auth split: Member Provider validates credentials + returns capability profile; Session Provider persists session state (memory/file). Single member provider active at a time. neko.m1k1o.net
9.2 Capability Flags (Granular Rights)
Per‑user profile booleans drive UI & backend enforcement: admin status; login/API; connect vs watch; host control; share media; clipboard access; send inactive cursor; see inactive cursors; plugin‑specific keys. neko.m1k1o.net GitHub
9.3 Provider Types
- Multiuser: Two shared passwords (admin/user) generate ephemeral usernames; mirrors legacy v2 behavior.
- File: Persistent JSON map of users → (optionally hashed) passwords + profiles.
- Object: In‑memory static list; duplicate logins are not allowed.
- No‑Auth: Open guest access (testing only—dangerous). neko.m1k1o.net
9.4 API User Token
Separate non‑interactive admin identity for HTTP API calls; cannot join media; recommend ephemeral/rotated tokens (avoid long‑lived static in exposed rooms). neko.m1k1o.net GitHub
9.5 Cookie Controls
Session cookie name, expiry, secure, httpOnly, domain/path configurable; disabling cookies falls back to token in client local storage—less secure (XSS risk). Keep cookies enabled for production. neko.m1k1o.net
10. Security Considerations
- Surface reduction via containerization: Browsing occurs inside an isolated container; you can discard state or run read‑only images for throwaway sessions; community privacy guides emphasize non‑retention setups. fossengineer.com GitHub
- Transport security & certs: Terminate TLS at your reverse proxy (Traefik/Caddy/Certbot etc.); ensure WebSocket upgrades & long timeouts; see official reverse proxy examples. neko.m1k1o.net neko.m1k1o.net
- Auth hardening: Use strong unique admin/user passwords (or file/object providers w/ hashed credentials); avoid enabling no‑auth in public deployments; scope API tokens tightly. neko.m1k1o.net neko.m1k1o.net
- Cookie vs token leakage: Leaving cookies enabled (secure, httpOnly) prevents script access to session; disabling pushes token into JS‑accessible storage increasing exfiltration risk. neko.m1k1o.net
- Firewalling media ports: Only expose required UDP/TCP ranges; where possible, restrict source IPs or require authenticated TURN; community reports of leaving ports closed manifest as connection failures rather than leaks—but mis‑config can open broad EPR ranges; plan network policy. neko.m1k1o.net Reddit
- Extension install policy: Browser policies in images may block arbitrary extension installs; you must explicitly allow if you need them—reduces attack surface by default. neko.m1k1o.net neko.m1k1o.net
11. Performance & Tuning
- Screen resolution & frame rate: The `NEKO_DESKTOP_SCREEN`/`NEKO_SCREEN` env controls the virtual display mode (e.g., `1920x1080@30`); higher rates mean more bandwidth/CPU/GPU; choose based on clients & uplink. fossengineer.com neko.m1k1o.net
- Shared memory size: Modern Chromium‑family browsers need a large `/dev/shm`; examples allocate `shm_size: 2gb`; undersizing leads to crashes. fossengineer.com neko.m1k1o.net
- Bandwidth estimator (experimental adaptive bitrate): An optional server‑side estimator can downgrade/upgrade encodes based on measured throughput; disabled by default; numerous thresholds/backoffs are tunable. neko.m1k1o.net GitHub
- Hardware accel vs CPU encode tradeoffs: GPU flavors reduce encode latency but add driver complexity; docs call out limited support maturity; Heise notes Neko can leverage Intel/Nvidia accelerated builds. neko.m1k1o.net heise online
- Resource guidance for Chromium variants: Heise recommends allocating ≥2GB RAM when running Chromium‑based browsers in containers; plan host sizing accordingly. heise online neko.m1k1o.net
12. Administration & Operations
- Logging & Debugging: Enable debug logging via `log.level=debug` or env `NEKO_DEBUG=1`; GStreamer verbosity via `GST_DEBUG`; Pion debug via `PION_LOG_DEBUG=all`; inspect docker logs and the browser dev console. neko.m1k1o.net GitHub
- Metrics & Profiling: The Prometheus metrics endpoint and pprof instrumentation introduced in v3 support operational monitoring and performance investigation. GitHub neko.m1k1o.net
- Upgrades / Migration from v2: v3 modularized the config; backward‑compatibility shims exist but are deprecated; consult the v3 docs plus the legacy reverse‑proxy header diffs when migrating. GitHub neko.m1k1o.net neko.m1k1o.net
- Embedding Auto‑Login: For kiosk/iframe use, append `?usr=<user>&pwd=<pwd>` to the URL to bypass the login prompt for viewers—use carefully; combine with a restricted capability profile. neko.m1k1o.net neko.m1k1o.net
- Clipboard Behavior: When accessed over HTTPS in supported host browsers (Chromium family), Neko hides its own clipboard button, deferring to native Clipboard API integration; this is not a bug. neko.m1k1o.net GitHub
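When generating embed links programmatically, build the auto‑login query safely rather than string‑concatenating. A sketch assuming the `usr`/`pwd` params described above (helper name is mine):

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def embed_url(base: str, user: str, password: str) -> str:
    """Append ?usr=...&pwd=... auto-login params to a Neko room URL.

    Pair this with a restricted (e.g., view-only) capability profile: the
    credentials are visible to anyone who can read the URL."""
    scheme, netloc, path, query, frag = urlsplit(base)
    extra = urlencode({"usr": user, "pwd": password})
    query = f"{query}&{extra}" if query else extra
    return urlunsplit((scheme, netloc, path, query, frag))
```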
13. Ecosystem Projects
- Neko Rooms: Multi‑room orchestration wrapper that spins up independent Neko instances (ephemeral or persistent) with simplified onboarding (scripts, HTTPS via Let’s Encrypt, Traefik/NGINX automation); useful when you need per‑group isolation. neko.m1k1o.net fossengineer.com Reddit
- Neko Apps: Library of containerized app bundles beyond browsers—expands use cases to general remote Linux app streaming; complements Rooms for scaling out multi‑app catalogs. fossengineer.com neko.m1k1o.net
- Neko VPN (experimental): Mentioned in docs nav as companion project enabling tunneled access paths; explore if you need integrated network overlay to reach internal apps through Neko. neko.m1k1o.net neko.m1k1o.net
- Umbrel Packaging: Curated home‑server integration; one‑click install, Umbrel tunneling for remote reachability, version tracking; good for homelab / non‑Docker‑experts. apps.umbrel.com neko.m1k1o.net
14. Comparison Touchpoints
- vs. Kasm Workspaces: Heise positions Neko as the lightweight alternative—Kasm provides full multi‑tenant workspace management & security layers but is heavier; Neko is simpler, container‑first, optimized for shared live sessions rather than individual isolated desktops (though you can run per‑user instances). heise online GitHub
- vs. Hyperbeam API (hosted embeddable co‑browse): Neko offers a similar embeddable shared browser experience but is self‑hosted, giving you data control & on‑prem compliance; Heise explicitly calls out analogous embedding. heise online neko.m1k1o.net
- vs. Generic Remote Desktop (VNC/NoVNC/Guacamole): WebRTC yields smoother video + audio sync and lower interactive latency compared to image‑diff or poll‑based remotes; community commentary and docs emphasize superior streaming for media/watch usage. fossengineer.com GitHub
15. Practical Config Snippets
```yaml
version: "3.4"
services:
  neko:
    image: "ghcr.io/m1k1o/neko/firefox:latest"  # pick flavor/tag
    restart: unless-stopped
    shm_size: 2gb
    ports:
      - "8080:8080"                     # HTTP / signaling
      - "59000-59100:59000-59100/udp"   # WebRTC EPR
    environment:
      NEKO_DESKTOP_SCREEN: 1920x1080@30
      NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD: ${NEKO_ADMIN:?err}
      NEKO_MEMBER_MULTIUSER_USER_PASSWORD: ${NEKO_USER:?err}
      NEKO_WEBRTC_EPR: 59000-59100
      NEKO_WEBRTC_ICELITE: 1
      NEKO_DEBUG: 0
      # optionally front/back STUN/TURN JSON:
      # NEKO_WEBRTC_ICESERVERS_FRONTEND: '[{"urls":["stun:stun.l.google.com:19302"]}]'
      # NEKO_WEBRTC_ICESERVERS_BACKEND: '[]'
    # volumes:  # mount persistent member/session files if using the file provider
    #   - ./data:/opt/neko
```
fossengineer.com neko.m1k1o.net neko.m1k1o.net
```nginx
server {
    listen 443 ssl http2;
    server_name neko.example.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
    }
}
```
[neko.m1k1o.net](https://neko.m1k1o.net/docs/v3/reverse-proxy-setup) [neko.m1k1o.net](https://neko.m1k1o.net/docs/v2/reverse-proxy) [Reddit](https://www.reddit.com/r/nginxproxymanager/comments/ut8zyu/help_with_setting_up_reverse_proxy_custom_headers/)
```yaml
# Minimal member provider switch to file‑backed users with hashed passwords
member:
  provider: file
  file:
    path: "/opt/neko/members.json"
    hash: true
session:
  file: "/opt/neko/sessions.json"
  api_token: "<short-lived-random-hex>"
```
16. Operational Runbook Checklist
- Preflight: Pick image flavor + arch; allocate ≥2GB RAM (Chromium); set shm_size; open media ports (EPR or mux); decide auth provider; create strong creds. neko.m1k1o.net heise online neko.m1k1o.net
- Launch: Compose up; confirm logs show listening on 8080 + WebRTC ports; test LAN client first; verify ICE candidates reachable (browser dev console). neko.m1k1o.net neko.m1k1o.net
- Secure: Put behind TLS proxy; enable proxy trust; restrict ports/firewall; rotate API tokens; store hashed passwords. neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net
- Scale / Multi‑tenant: Use Neko Rooms or orchestration (k8s, compose bundles) to spin per‑team instances; leverage REST API + metrics for automation & autoscaling triggers. GitHub neko.m1k1o.net fossengineer.com
- Troubleshoot: Turn on debug envs; inspect GStreamer logs for encode issues; validate reverse proxy headers; check that WebRTC ports aren’t blocked (common 502 confusion). neko.m1k1o.net Reddit neko.m1k1o.net
17. Roadmap Glimpses & Future Directions
Recent release notes hint at additional session backends (Redis/Postgres), richer plugin ecosystem, and potential RDP/VNC relay modes where Neko acts as a WebRTC gateway rather than running the browser locally. Heise reports interest in direct protocol relay; docs flag “in the future” for expanded session providers. heise online neko.m1k1o.net GitHub
18. Community Lore / Field Notes
Homelabbers use Neko to co‑watch media, punch through restrictive corporate firewalls (when Neko host has outbound freedom), and expose full Linux desktops (KDE) to lightweight tablets. These anecdotes underscore why low‑latency WebRTC streaming and easy multi‑user control were prioritized. Reddit fossengineer.com GitHub
Reverse‑proxy misconfig (wrong header name, missing EPR exposure) is a recurring community stumbling block; always validate both HTTP and media planes. Reddit neko.m1k1o.net neko.m1k1o.net
Neko Browser + Playwright/CDP Integration Deep Dive
Neko (“n.eko”) is an open‑source, self‑hosted virtual browser that streams a full Linux desktop (not just a headless DOM) over WebRTC so multiple remote users can view and interactively control the same session in real time. It targets collaborative browsing, watch parties, remote support, embedded browser surfaces, and hardened “throwaway” cloud browsing where nothing persists locally. neko.m1k1o.net GitHub heise online
Unlike simple remote‑automation containers, Neko can run any Linux GUI application—browsers (Firefox, Chromium, Brave, Vivaldi, Waterfox, Tor, etc.), media players like VLC, full desktop environments (XFCE, KDE), and bespoke tools—because it captures an Xorg display and streams audio/video frames to clients via WebRTC. This breadth makes it viable for shared debugging sessions, interactive presentations, and as a privacy “jump box” into otherwise restricted networks. GitHub heise online Umbrel App Store
Multi‑user collaboration is first‑class: user roles, admin elevation, shared cursor visibility, host (keyboard/mouse) control arbitration, clipboard access, media sharing, and plugin‑scoped per‑user settings are governed by Neko’s v3 authentication system (Member + Session providers). This replaces v2’s simple dual‑password model and lets you express richer authorization matrices or plug in external identity sources. neko.m1k1o.net neko.m1k1o.net
Versioning: v3 vs v2 & Legacy Mode
Neko v3 reorganized configuration into modular namespaces (server, member, session, webrtc, desktop, capture, plugins, etc.) and introduced providers; however, v3 retains backward compatibility with v2 environment variables when `NEKO_LEGACY=true` is set (and some legacy features are auto‑detected). A migration table maps every major v2 var to its v3 equivalent (e.g., `NEKO_SCREEN` → `NEKO_DESKTOP_SCREEN`; `NEKO_PASSWORD` → `NEKO_MEMBER_MULTIUSER_USER_PASSWORD`; `NEKO_NAT1TO1` → `NEKO_WEBRTC_NAT1TO1`). This is critical when modernizing older compose files (like the snippet you shared) to avoid silent fallbacks and dual‑stream cursor quirks. neko.m1k1o.net neko.m1k1o.net
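The mapping lends itself to a mechanical rewrite of old compose environments. A sketch limited to variables named in this document plus the v2 admin password (the official migration table covers many more; treat the table below as illustrative, not exhaustive):

```python
# v2 → v3 variable names; NEKO_PASSWORD_ADMIN is the v2-era admin counterpart
# of NEKO_PASSWORD (an assumption worth verifying against the official table).
V2_TO_V3 = {
    "NEKO_SCREEN": "NEKO_DESKTOP_SCREEN",
    "NEKO_PASSWORD": "NEKO_MEMBER_MULTIUSER_USER_PASSWORD",
    "NEKO_PASSWORD_ADMIN": "NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD",
    "NEKO_NAT1TO1": "NEKO_WEBRTC_NAT1TO1",
    "NEKO_ICELITE": "NEKO_WEBRTC_ICELITE",
    "NEKO_EPR": "NEKO_WEBRTC_EPR",
}

def migrate_env(env: dict[str, str]) -> dict[str, str]:
    """Rewrite v2 keys to their v3 names, leaving unknown keys untouched."""
    return {V2_TO_V3.get(k, k): v for k, v in env.items()}
```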
Heise’s Neko 3.0 coverage underscores why migrating matters: new browser flavors (Waterfox, additional Chromium builds, ARM variants), GPU‑accelerated options, HTTP/JPEG screencast endpoints, plugin ecosystem growth, and structural config changes—all shipping under a maintained Apache‑2.0 project—mean staying current pays dividends in stability and capability. heise online GitHub
Community quick‑start guides still widely circulate v2 envs (e.g., `NEKO_SCREEN`, `NEKO_PASSWORD`, `NEKO_ICELITE`, `NEKO_EPR`), which “work” only because legacy support remains—but they obscure v3 tuning knobs and can yield performance or auth surprises (e.g., no granular per‑user policy). Use the migration mapping to upgrade; I’ll show a patched compose below. fossengineer.com neko.m1k1o.net
Authentication Model (v3)
Authentication splits into a Member Provider (who are you? what can you do?) and a Session Provider (state & tokens). The multiuser provider emulates v2’s “user password” + “admin password” flow; you enable it via `NEKO_MEMBER_PROVIDER=multiuser`, then supply `NEKO_MEMBER_MULTIUSER_USER_PASSWORD` and `NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD`, optionally overriding the default per‑role capability profiles (host, watch, clipboard, etc.). For tighter control, switch to the file or object providers to define fixed accounts, hashed passwords, and granular profiles; or noauth for unsecured demo setups (never production). Session storage can persist to file; API access can be separately tokenized (`NEKO_SESSION_API_TOKEN`). neko.m1k1o.net
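Put together, a compose `environment` fragment for the multiuser provider might look like this sketch (values are placeholders; this is not the full config surface):

```yaml
environment:
  NEKO_MEMBER_PROVIDER: multiuser
  NEKO_MEMBER_MULTIUSER_USER_PASSWORD: ${NEKO_USER_PW:?set me}
  NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD: ${NEKO_ADMIN_PW:?set me}
  # optional: separate token for non-interactive HTTP API access
  NEKO_SESSION_API_TOKEN: ${NEKO_API_TOKEN:-}
```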
When exposing Neko programmatically (embedding in an app, auto‑provisioning rooms, LLM agents), consider disabling cookies or providing short‑lived API tokens; but weigh the increased XSS risk if tokens leak into client JS when cookies are off. v3 exposes cookie flags (`secure`, `http_only`, domain/path scoping) so you can harden deployment behind TLS. neko.m1k1o.net
WebRTC Transport Essentials
For smooth low‑latency A/V + input streaming you must correctly expose Neko’s WebRTC ports. Three main patterns:
- Ephemeral UDP Port Range (EPR) — Specify a contiguous range (e.g., `56000-56100`) via `NEKO_WEBRTC_EPR` and map the exact same range host:container without remapping. Each new participant consumes ports; size the range accordingly. neko.m1k1o.net
- UDP/TCP Multiplexing — Collapse to a single well‑known port (e.g., `59000`) as `NEKO_WEBRTC_UDPMUX`/`NEKO_WEBRTC_TCPMUX` for NAT‑challenged environments; trades throughput. neko.m1k1o.net
- ICE Servers — Provide a STUN/TURN front/back split: `NEKO_WEBRTC_ICESERVERS_FRONTEND` (what clients see) and `..._BACKEND` (what the server dials internally); JSON‑encoded arrays. Required when clients are off‑LAN and UDP paths are blocked. neko.m1k1o.net
If you run behind NAT, set `NEKO_WEBRTC_NAT1TO1` to the public (hairpin‑reachable) address; otherwise clients may receive an ICE candidate with a private IP and fail to connect. Automatic public‑IP fetch is available, but you can override it with `NEKO_WEBRTC_IP_RETRIEVAL_URL`. neko.m1k1o.net neko.m1k1o.net
Do not rely on your HTTP reverse proxy to relay WebRTC media. Nginx/Traefik only front the signaling/control (HTTP(S)/WS) on port 8080; actual RTP/DTLS flows use the ports you expose above and must be reachable end‑to‑end or via TURN. neko.m1k1o.net neko.m1k1o.net
Reverse Proxy & Timeouts
When fronting Neko with nginx/Traefik/etc., enable proxy trust in the server config (`server.proxy=true` / `NEKO_SERVER_PROXY=1` in v3) so real client IPs from `X-Forwarded-*` are honored. Neko sends WS pings about every 10s; clients heartbeat about every 120s—so bump proxy read timeouts accordingly or users will drop during long idle automation runs. The official nginx sample shows the required `Upgrade`/`Connection` headers for the WebSocket upgrade; community Nginx Proxy Manager threads confirm these plus an extended `proxy_read_timeout` and forwarded‑IP headers to avoid 502s and broken control channels. neko.m1k1o.net Reddit
Container Security / Chromium Sandboxing
Chromium inside containers often needs elevated namespaces to run its sandbox; many headless automation images either add `--no-sandbox` (reduced isolation) or grant `--cap-add=SYS_ADMIN` and supporting kernel flags so Chrome’s sandbox works. Puppeteer’s Docker docs call out the SYS_ADMIN requirement for their hardened image; Neko’s own v2 troubleshooting notes that forgetting SYS_ADMIN yields a black screen in Chromium variants—evidence the capability remains relevant. Decide: secure host kernel + allow SYS_ADMIN (preferred for the full sandbox), or run `--no-sandbox` and accept the risk; the sample supervisord snippet you posted already includes `--no-sandbox`, so SYS_ADMIN is belt‑and‑suspenders but still recommended for stability in GPU/namespace operations. pptr.dev neko.m1k1o.net
Enabling Chrome DevTools Protocol (CDP) in Neko for Playwright
Your goal: let humans drive the streamed Neko Chromium UI and attach automation via Playwright. Playwright supports attaching to any existing Chromium instance that exposes a DevTools endpoint via `chromium.connectOverCDP(endpointURL)`, where `endpointURL` can be the HTTP JSON version URL or a direct WS endpoint; the returned `browser` exposes existing contexts/pages. Lower fidelity than the full Playwright protocol, but ideal for “co‑drive” scenarios. Playwright GitHub
Once connected, you can open a raw CDPSession per page/context to send protocol commands (e.g., `Runtime.evaluate`, `Animation.enable`), mirroring the manual WebSocket probes in your `test.js`. This is useful for diagnostics, performance metrics, and low‑level tweaks Playwright doesn’t expose natively. Playwright Playwright
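A minimal co‑drive sketch using Playwright's Python sync API, assuming the DevTools endpoint is forwarded to `localhost:9222` as in the sidecar pattern this document describes; requires `playwright` installed and a live Neko Chromium, so treat it as illustrative:

```python
def cdp_endpoint(host: str = "localhost", port: int = 9222) -> str:
    """HTTP endpoint accepted by connect_over_cdp (it discovers the WS URL itself)."""
    return f"http://{host}:{port}"

def codrive(url: str) -> str:
    """Attach to the already-running Neko Chromium, drive the visible page,
    and return its title. Humans watching the WebRTC stream see every step."""
    from playwright.sync_api import sync_playwright  # deferred: optional dependency
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_endpoint())
        context = browser.contexts[0]  # reuse the existing (streamed) context
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url)
        # Raw CDP session for protocol-level commands, as discussed above:
        session = context.new_cdp_session(page)
        session.send("Runtime.evaluate", {"expression": "document.title"})
        return page.title()

# Usage (needs a running Neko + forwarded CDP port):
#   codrive("https://example.com")
```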
Remote Debugging Flags & Port Forward Pattern
Modern Chromium removed unrestricted `--remote-debugging-address=0.0.0.0` for security; recommended practice is to bind the DevTools socket to localhost within the container (e.g., `--remote-debugging-port=9223`), then selectively forward or reverse‑proxy it to an external port (e.g., 9222) with an auth/ACL layer (nginx, socat, SSH tunnel). Your nginx‑cdp sidecar implements precisely this 9222→9223 pass‑through with WebSocket upgrade and long timeouts—aligning with guidance from the Dockerized Chromium remote‑debugging discussion. Stack Overflow neko.m1k1o.net
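That forward pattern can be sketched as an nginx server block (ports per the description above; directives are standard nginx, but add auth or allow‑listing before exposing this beyond localhost):

```nginx
# Sidecar sketch: expose container-local DevTools (9223) on 9222,
# preserving the WebSocket upgrade CDP needs.
server {
    listen 9222;
    location / {
        proxy_pass http://127.0.0.1:9223;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 3600s;   # keep long-lived CDP sockets open
        proxy_send_timeout 3600s;
    }
}
```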
Review of Your web-agent/neko-with-playwright Compose Snippet
You posted a two‑service stack: `neko` (using `m1k1o/neko:chromium`) and an `nginx-cdp` sidecar sharing its network namespace (service `network_mode`); supervisord launches Chromium with CDP flags and disables sandbox/GPU; nginx maps host 9222 to internal 9223 to front DevTools with WS keepalive/timeouts. Ports published: 52000→8080 (tcp?) and 9222 (tcp). Issues & improvements:
1. Legacy Env Vars – You’re mixing v2 (`NEKO_SCREEN`, `NEKO_PASSWORD*`, `NEKO_ICELITE`, `NEKO_NAT1TO1`) in a v3 world; while legacy support exists, you lose granular control and risk double cursor streams (cursor once in the video, once separately) plus awkward auth extension later. Upgrade to v3 vars (`NEKO_DESKTOP_SCREEN`, `NEKO_MEMBER_PROVIDER=multiuser`, `NEKO_MEMBER_MULTIUSER_*`, `NEKO_WEBRTC_ICELITE`, `NEKO_WEBRTC_NAT1TO1`). neko.m1k1o.net neko.m1k1o.net neko.m1k1o.net
2. Missing WebRTC Ports – No UDP EPR or mux port is exposed, so remote WebRTC will fail off‑box unless clients are on the container host network and fallback mechanisms kick in. Add either an EPR range mapping with `NEKO_WEBRTC_EPR`, or a UDPMUX/TCPMUX single‑port mapping. neko.m1k1o.net neko.m1k1o.net
3. Public vs Private Subnet – Your custom Docker subnet `17.100.0.0/16` collides with publicly routed Apple allocations (17.0.0.0/8 is owned by Apple); choose RFC1918 space (e.g., `172.31.0.0/16` or `10.67.0.0/16`) to avoid confusing clients that see ICE candidates referencing real vs container ranges. Proper NAT1TO1 matters when advertising ICE addresses. neko.m1k1o.net neko.m1k1o.net
4. Proxy Headers & Timeouts – A good start; ensure `proxy_read_timeout` is comfortably above Neko’s WebSocket ping interval (~10s)—set ≥60s or higher—and that `NEKO_SERVER_PROXY=1` (or config) is set so Neko trusts forwarded IPs; align with the official reverse proxy doc and the community NPM thread. neko.m1k1o.net Reddit
5. Chromium Capability / Sandbox – You added `cap_add: SYS_ADMIN` (good) and `--no-sandbox` (less secure). Consider removing `--no-sandbox` once you confirm kernel support; Neko experiences black screens without SYS_ADMIN in Chromium images; Puppeteer’s hardened‑image docs reinforce granting SYS_ADMIN if you want the sandbox. pptr.dev neko.m1k1o.net
6. Password Hygiene – Hard‑coding `neko`/`admin` is fine for testing but never production; switch to secrets or `.env` injection; the multiuser provider makes this easy. neko.m1k1o.net neko.m1k1o.net
7. NAT Hairpin & ICE Lite – You set `NEKO_ICELITE=0` (full ICE) and NAT1TO1 to the container IP; if you actually need WAN access, supply your public IP or domain; ICE Lite mode is only appropriate when the server has a public reflexive address, and the official doc warns not to mix it with external ICE servers. neko.m1k1o.net
8. Debug Logging – When diagnosing CDP or WebRTC handshakes, enable `NEKO_DEBUG=1` and optionally `GST_DEBUG` per the FAQ; a huge time saver. neko.m1k1o.net
Hardened & Modernized Compose Example (v3 Vars, CDP Enabled)
Below is an updated docker-compose.yml (org-mode src). Key changes:
- Switched to an explicit GHCR version tag (pinned for reproducibility).
- RFC1918 subnet.
- Proper WebRTC EPR exposure.
- v3 auth vars.
- Proxy flag so Neko trusts the sidecar.
- Optional API token for automation management.
- Chromium started with localhost-bound remote debugging; the nginx sidecar terminates TLS (optional) and applies ACLs; you can env-inject the allowed upstream (e.g., an ngrok tunnel).
- Dropped --no-sandbox (left commented) to prefer the secure sandbox; toggle per your threat model.
- Added healthcheck & log volumes.
version: "3.8"
x-neko-env: &neko-env
NEKO_DESKTOP_SCREEN: 1920x1080@30
NEKO_MEMBER_PROVIDER: multiuser
NEKO_MEMBER_MULTIUSER_USER_PASSWORD: ${NEKO_USER_PASSWORD:-neko}
NEKO_MEMBER_MULTIUSER_ADMIN_PASSWORD: ${NEKO_ADMIN_PASSWORD:-admin}
NEKO_WEBRTC_EPR: 56000-56100 # match ports below
NEKO_WEBRTC_ICELITE: "false" # full ICE unless static public IP
NEKO_WEBRTC_NAT1TO1: ${NEKO_PUBLIC_IP:-auto} # set literal IP or leave unset to auto-detect
NEKO_SERVER_PROXY: "true" # trust reverse proxy headers
NEKO_SESSION_API_TOKEN: ${NEKO_API_TOKEN:-} # optional; blank disables
NEKO_DEBUG: ${NEKO_DEBUG:-0}
services:
neko:
image: ghcr.io/m1k1o/neko/chromium:3.0.4
container_name: neko
restart: unless-stopped
shm_size: 2gb
networks:
proxy:
ipv4_address: 172.31.0.3
ports:
- "8080:8080/tcp" # web / signaling
- "56000-56100:56000-56100/udp" # WebRTC EPR (must match env)
environment:
<<: *neko-env
cap_add:
- SYS_ADMIN # required if not using --no-sandbox
volumes:
- neko-data:/var/lib/neko # persistent config / sessions (bind as needed)
- neko-logs:/var/log/neko
configs:
- source: supervisord_chromium
target: /etc/neko/supervisord/chromium.conf
nginx-cdp:
image: nginx:alpine
container_name: neko-cdp
network_mode: "service:neko" # join same net & PID
depends_on:
- neko
environment:
# restrict which hosts may speak CDP (use allowlist or auth)
ALLOWED_CDP_ORIGIN: ${ALLOWED_CDP_ORIGIN:-127.0.0.1}
configs:
- source: nginx_cdp_conf
target: /etc/nginx/conf.d/cdp.conf
ports:
- "9222:9222/tcp" # exposed CDP endpoint proxied to 9223 in container
networks:
proxy:
ipam:
config:
- subnet: 172.31.0.0/16
volumes:
neko-data:
neko-logs:
configs:
supervisord_chromium:
content: |
[program:chromium]
environment=HOME="/home/%(ENV_USER)s",USER="%(ENV_USER)s",DISPLAY="%(ENV_DISPLAY)s"
command=/usr/bin/chromium \
--remote-debugging-port=9223 \
--remote-debugging-address=127.0.0.1 \
--remote-allow-origins="*" \
--disable-web-security \
--disable-features=VizDisplayCompositor \
--disable-extensions \
# --no-sandbox \ # uncomment only if you drop SYS_ADMIN
--disable-dev-shm-usage \
--enable-automation \
--disable-background-timer-throttling \
--disable-backgrounding-occluded-windows \
--disable-renderer-backgrounding \
--force-devtools-available \
--disable-features=TranslateUI \
--disable-ipc-flooding-protection \
--enable-blink-features=IdleDetection \
--headless=new \
--disable-gpu
stopsignal=INT
autorestart=true
priority=800
user=%(ENV_USER)s
stdout_logfile=/var/log/neko/chromium.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
redirect_stderr=true
[program:openbox]
environment=HOME="/home/%(ENV_USER)s",USER="%(ENV_USER)s",DISPLAY="%(ENV_DISPLAY)s"
command=/usr/bin/openbox --config-file /etc/neko/openbox.xml
autorestart=true
priority=300
user=%(ENV_USER)s
stdout_logfile=/var/log/neko/openbox.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
redirect_stderr=true
nginx_cdp_conf:
content: |
map $http_upgrade $connection_upgrade {
default upgrade;
'' close;
}
upstream chrome {
server 127.0.0.1:9223;
keepalive 32;
}
server {
listen 9222;
# Optional IP allowlist (simple example); extend w/ auth / mTLS as needed
allow 127.0.0.1;
allow ::1;
# env-subst ALLOWED_CDP_ORIGIN could template additional allow lines
deny all;
location / {
proxy_pass http://chrome;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_set_header Host $host:$server_port;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_read_timeout 86400s;
proxy_send_timeout 86400s;
proxy_connect_timeout 7200s;
proxy_cache off;
proxy_buffering off;
proxy_max_temp_file_size 0;
proxy_request_buffering off;
proxy_set_header Sec-WebSocket-Extensions $http_sec_websocket_extensions;
proxy_set_header Sec-WebSocket-Key $http_sec_websocket_key;
proxy_set_header Sec-WebSocket-Version $http_sec_websocket_version;
proxy_socket_keepalive on;
keepalive_timeout 300s;
keepalive_requests 1000;
}
}
Minimal .env Illustration (override at deploy)
NEKO_USER_PASSWORD=supersecretuser
NEKO_ADMIN_PASSWORD=supersecretadmin
NEKO_PUBLIC_IP=203.0.113.45 # example; or set DNS name in upstream LB/TURN
NEKO_API_TOKEN=$(openssl rand -hex 32)
NEKO_DEBUG=1
ALLOWED_CDP_ORIGIN=10.0.0.0/8 # example ACL range for automation runners
Playwright Attach Script (Improved)
Key best practices: discover the browser WebSocket endpoint from /json/version; create a context if none is returned (some builds start with zero pages under the new headless mode); gracefully handle targets; optionally filter to the Neko desktop window by URL. Example:
// attach-neko.js
const { chromium } = require('playwright');
(async () => {
const cdpHttp = process.env.NEKO_CDP_URL || 'http://localhost:9222';
// Attach to existing Chromium exposed by Neko's CDP proxy.
const browser = await chromium.connectOverCDP(cdpHttp);
// In many cases Neko's running Chromium already has a default context.
// If none, create one.
const [defaultContext] = browser.contexts().length
? browser.contexts()
: [await browser.newContext()];
// Reuse first existing page or open a new one.
const page = defaultContext.pages()[0] || await defaultContext.newPage();
await page.goto('https://example.com');
console.log('Neko page title:', await page.title());
// Get raw CDP session if you want low-level control.
const client = await page.context().newCDPSession(page);
const version = await client.send('Browser.getVersion').catch(() => null);
console.log('CDP Browser Version:', version);
// Keep browser open for human co-driving; do NOT close().
})();
Diagnostic CDP Ping Script (Refined from Your test.js / test4.js)
Below is a leaner diagnostic that:
- fetches /json/version;
- opens a WebSocket;
- discovers targets;
- attaches to the first non-extension page;
- evaluates an expression;
- logs failures cleanly.
// cdp-diagnostics.js
const WebSocket = require('ws');
const fetch = (...args) => import('node-fetch').then(({default: f}) => f(...args));
(async () => {
const base = process.env.NEKO_CDP_URL || 'http://localhost:9222';
const version = await (await fetch(`${base}/json/version`)).json();
const wsUrl = version.webSocketDebuggerUrl;
const ws = new WebSocket(wsUrl, { perMessageDeflate: false });
let id = 0;
function send(method, params, sessionId) {
ws.send(JSON.stringify({ id: ++id, method, params, sessionId }));
}
ws.on('open', () => {
console.log('CDP connected');
send('Target.setDiscoverTargets', { discover: true });
});
let firstSession;
ws.on('message', data => {
const msg = JSON.parse(data);
if (msg.method === 'Target.targetCreated') {
const t = msg.params.targetInfo;
if (t.type === 'page' && !t.url.startsWith('chrome-extension://')) {
send('Target.attachToTarget', { targetId: t.targetId, flatten: true });
}
} else if (msg.method === 'Target.attachedToTarget' && !firstSession) {
firstSession = msg.params.sessionId;
send('Runtime.enable', {}, firstSession);
send('Runtime.evaluate', { expression: '1+1' }, firstSession);
} else if (msg.id && msg.result) {
console.log('Result', msg.id, msg.result);
} else if (msg.error) {
console.error('CDP Error', msg.error);
}
});
})();
Operational Checklist for Playwright‑Augmented Neko
Check | Why | How to Verify |
---|---|---|
Chromium started with --remote-debugging-port (localhost) | Required for CDP attach; safer than 0.0.0.0 | curl http://<host>:9222/json/version returns JSON |
CDP proxy ACL in place | Prevent hostile takeover of your shared session | restrict IPs or auth in nginx; test from unauthorized host fails |
WebRTC ports reachable | Avoid black screens / frozen video | webrtc-internals in client; docker logs ICE candidate errors |
SYS_ADMIN vs --no-sandbox decision documented | Security posture clarity | Confirm container start flags; run chrome://sandbox |
Multiuser passwords rotated | Prevent drive‑by admin | Use secrets; verify login roles mapping |
Proxy timeout > heartbeat | Prevent surprise disconnects during long automation | Nginx proxy_read_timeout >= 120s |
Debug logging toggled for incident response | Rapid triage | NEKO_DEBUG=1 , GST_DEBUG=3 when needed |
Example Hybrid Workflow: Humans Steer, Agents Assist
A common pattern in agentic stacks:
- Human opens Neko in browser, logs in as admin (multiuser).
- Automation runner (Playwright script / LLM agent) attaches over CDP using service account limited by firewall.
- Agent performs scripted setup (login, nav, cookie seeding) then relinquishes; human sees results instantly.
- If a human takeover triggers UI state changes, the agent can poll CDP events (Target/Runtime) to resume.
This model avoids re-launching browsers and preserves the session continuity Neko already streams to participants.
Deployment Channels & Ecosystem
You can deploy via raw Docker/Compose, room orchestration stacks (neko-rooms), homelab bundles (Umbrel App Store), or community charts/templates; packaging often pre-wires reverse proxy + TLS but may lag behind env var updates, so review and migrate to v3 syntax after install.
Troubleshooting Quick Hits
- Black screen (cursor only) in the Chromium flavor → missing SYS_ADMIN or sandbox misconfiguration; confirm the capability or drop the sandbox flag.
- WebRTC connect stalls / DTLS not started → exposed UDP mismatch or firewall block; check the EPR mapping and NAT1TO1; review server logs at debug level.
- Users disconnect behind a proxy → heartbeat vs. proxy timeout mismatch; ensure proxy_read_timeout > 120s and server.proxy is enabled.
- CDP connect refused → nginx sidecar not up or ACL blocking; verify /json/version at 9222 and that upstream 9223 is reachable in the container.
- Legacy envs ignored → upgrade to v3 names or set NEKO_LEGACY=true explicitly; review the migration matrix.
Neko v3 WebRTC & WebSocket Control: Frame/State, Keyboard, Mouse (Cited)
TL;DR:
- All browser control in Neko v3 is mediated over a single /api/ws WebSocket after session authentication.
- Browser frames are not delivered directly over the WS as video; the WS carries control, signaling, events, and input (mouse/keyboard) JSON, while media frames (video, audio) are negotiated via WebRTC ICE as a peer connection.
- Full workflow: REST login → WS upgrade (/api/ws) → system/init → WebRTC signal/request → ICE handshake → frames sent to the client, controls sent from the client.
1. Authenticate (REST, Cookie, Token, Password)
Mode | REST Call | Response | WS Upgrade Auth |
---|---|---|---|
Cookie (default) | POST /api/login {username, password} | Set-Cookie: NEKO_SESSION | Cookie auto-sent |
Token (stateless) | POST /api/login → {token} | Opaque JWT/Bearer | ?token=... or Bearer header |
Legacy (query) | (multiuser only) skip REST, ?password= | — | ?password in query triggers v2 |
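As a sketch in Python with only the standard library, the token flow above reduces to one REST call plus a URL rewrite for the upgrade; the helper names (`login`, `ws_url`) and hostnames are illustrative, not part of Neko's API:

```python
import json
import urllib.request

def login(base_url: str, username: str, password: str) -> str:
    """POST /api/login and return the session token (token mode)."""
    req = urllib.request.Request(
        f"{base_url}/api/login",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["token"]

def ws_url(base_url: str, token: str) -> str:
    """Derive the /api/ws upgrade URL; preserves any path prefix."""
    scheme = "wss" if base_url.startswith("https") else "ws"
    host_and_prefix = base_url.split("://", 1)[1].rstrip("/")
    return f"{scheme}://{host_and_prefix}/api/ws?token={token}"

# Usage against a hypothetical deployment:
# token = login("https://neko.example.com", "admin", "admin")
# url = ws_url("https://neko.example.com", token)
```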
2. WebSocket Upgrade URL (With/Without Path Prefix)
- Mainline: wss://host[:port]/api/ws?token=<TOKEN>
- With a path prefix: e.g. wss://proxy.example.com/neko/api/ws?token=...
- Alternatives: cookies or Authorization: Bearer ... are supported.
- Legacy /ws endpoint: only if enabled or in legacy mode.
3. Connection Lifecycle
- Upgrade: the Gorilla WebSocket server handles /api/ws, performing token/cookie/Bearer/session checks.
- Init: the server pushes system/init (JSON: session id, settings, role).
- Heartbeat: the server sends system/heartbeat; clients may reply with client/heartbeat, but liveness relies on WebSocket ping/pong.
- All interaction now flows over the socket: control events (keyboard/mouse), signaling (ICE, SDP), system/broadcast, errors.
- All client state (host, cursor, input, session, etc.) is managed by events.
4. How Media (Frames) and Control Flow
Media
- Video and audio frames do not go over the WebSocket; they are WebRTC media streams, negotiated via signaling on the WS.
- To initiate frame streaming, send: {"event":"signal/request","payload":{"video":{},"audio":{}}}
- The server replies with ICE candidates/SDP; the client opens a WebRTC peer connection.
- The browser's actual frames (video, audio) arrive via a WebRTC MediaStream.
- Some deployments gate input on a completed WebRTC handshake:
  1. send {"event":"signal/request","payload":{"video":{},"audio":{}}}
  2. perform SDP/ICE (offer/provide → answer; exchange candidates)
  3. only then will control events be honored.
Input: Keyboard/Mouse
Input is sent from client to server as JSON events (current protocol):
- Pointer move: {"event":"control/move","payload":{"x":123,"y":456}}
- Mouse scroll: {"event":"control/scroll","payload":{"delta_x":0,"delta_y":120}}
- Mouse press: {"event":"control/buttonpress","payload":{"x":123,"y":456,"code":1}}
- Mouse down: {"event":"control/buttondown","payload":{"x":123,"y":456,"code":1}}
- Mouse up: {"event":"control/buttonup","payload":{"x":123,"y":456,"code":1}}
  - Button codes: 1=left, 2=middle, 3=right.
- Key press: {"event":"control/keypress","payload":{"keysym":65}}
- Key down: {"event":"control/keydown","payload":{"keysym":65}}
- Key up: {"event":"control/keyup","payload":{"keysym":65}}
  - Printable characters use their ASCII codes; control keys use X11 keysyms (e.g., Enter=65293, Esc=65307, Arrows=65361–65364).
Host arbitration:
- Request: {"event":"control/request"}
- Release: {"event":"control/release"}
- Note: many servers implicitly grant host on the first valid control event, but an explicit control/request is cleaner.
Coordinates expected by the server are pixels. If your client uses normalized [0..1] coordinates, convert and clamp before sending.
Remove legacy examples using control/click, control/mouse, control/key, or control/request_host; these are not recognized by current servers.
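The normalized-to-pixel conversion can be sketched as a small helper; the 1920x1080 default matches the compose example above and is otherwise an assumption:

```python
def norm_to_px(x: float, y: float, width: int = 1920, height: int = 1080) -> tuple[int, int]:
    """Convert normalized [0..1] coordinates to the integer pixel
    coordinates Neko's control events expect, clamping out-of-range input."""
    cx = min(max(x, 0.0), 1.0)
    cy = min(max(y, 0.0), 1.0)
    # Clamp to width-1 / height-1 so 1.0 maps to the last addressable pixel.
    return min(int(cx * width), width - 1), min(int(cy * height), height - 1)

# norm_to_px(0.5, 0.5) on a 1920x1080 desktop gives the screen center (960, 540).
```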
Control quick reference (current)
Action | Event | Payload schema |
---|---|---|
Move pointer | control/move | { "x": int, "y": int } |
Mouse scroll | control/scroll | { "delta_x": int, "delta_y": int } |
Mouse button press | control/buttonpress | { "x": int, "y": int, "code": 1/2/3 } |
Mouse button down/up | control/buttondown/up | { "x": int, "y": int, "code": 1/2/3 } |
Key press | control/keypress | { "keysym": int } |
Key down/up | control/keydown/up | { "keysym": int } |
Request host | control/request | {} |
Release host | control/release | {} |
Common keysyms: Enter=65293, Backspace=65288, Tab=65289, Escape=65307, Left=65361, Up=65362, Right=65363, Down=65364, Control=65507. Button codes: 1=left, 2=middle, 3=right.
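A thin builder layer over these schemas keeps client code readable; the event names, payload shapes, and keysym values come from the quick reference above, while the helper names themselves are illustrative:

```python
import json

# Non-printable keys use the X11 keysyms listed above.
SPECIAL_KEYSYMS = {"Enter": 65293, "Backspace": 65288, "Tab": 65289,
                   "Escape": 65307, "Left": 65361, "Up": 65362,
                   "Right": 65363, "Down": 65364, "Control": 65507}

def keysym_for(char_or_name: str) -> int:
    """Printable ASCII maps to its codepoint; named keys use X11 keysyms."""
    if len(char_or_name) == 1:
        return ord(char_or_name)
    return SPECIAL_KEYSYMS[char_or_name]

def move(x: int, y: int) -> str:
    return json.dumps({"event": "control/move", "payload": {"x": x, "y": y}})

def click(x: int, y: int, code: int = 1) -> str:
    """code: 1=left, 2=middle, 3=right."""
    return json.dumps({"event": "control/buttonpress",
                       "payload": {"x": x, "y": y, "code": code}})

def keypress(char_or_name: str) -> str:
    return json.dumps({"event": "control/keypress",
                       "payload": {"keysym": keysym_for(char_or_name)}})

def type_text(text: str) -> list[str]:
    """One control/keypress frame per character."""
    return [keypress(c) for c in text]
```

Each returned string is one JSON frame to ship over the /api/ws socket.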
5. Production-Grade Client Implementation Notes
Building a robust programmatic client requires handling several subtle but critical aspects of the Neko protocol beyond basic event sending.
5.1 Connection Management & Reconnection
A production client must anticipate WebSocket disconnects. Simply reconnecting is insufficient. The correct pattern is a connection manager that, upon a detected drop:
- Completely tear down the old state: cancel all background async tasks (such as loggers and signal consumers) and close the RTCPeerConnection. This prevents resource leaks and "zombie" tasks.
- Initiate a new connection attempt, often with exponential backoff to avoid spamming a downed server.
- Upon successful reconnection, rebuild all necessary components: a new Signaler, new background tasks, and a new RTCPeerConnection for the subsequent WebRTC handshake.
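The backoff step can be sketched as a jittered delay schedule; this is a generic pattern rather than anything Neko-specific, and the function name is mine:

```python
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff with full jitter: delay_n ~ U(0, min(cap, base * 2**n)).
    A reconnect loop sleeps for each delay before the next attempt, e.g.:

        for delay in backoff_delays(attempts=8):
            await teardown_old_state()      # cancel tasks, close RTCPeerConnection
            await asyncio.sleep(delay)
            if await try_connect():
                break
    """
    return [random.uniform(0.0, min(cap, base * (2 ** n))) for n in range(attempts)]
```

Full jitter avoids synchronized retry storms when many clients lose the same server at once.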
5.2 Dynamic Keyboard Mapping
While a client can maintain a static table of common X11 keysyms, the most robust approach is to listen for the keyboard/map event from the server. This event provides a JSON object mapping key names to their precise keysym values for the server's specific environment.
- Compatibility: A client should handle both the legacy ({...}) and current ({"map": {...}}) payload shapes for this event.
- Strategy: Maintain a default, built-in keysym map but dynamically update or extend it with the values received from the server at runtime. This ensures maximum compatibility across different keyboard layouts and server versions.
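A minimal sketch of that merge strategy, handling both payload shapes (the function name and default table are illustrative):

```python
# Built-in defaults; the server's keyboard/map values take precedence at runtime.
DEFAULT_KEYSYMS = {"Enter": 65293, "Escape": 65307}

def merge_keyboard_map(current: dict, payload: dict) -> dict:
    """Merge a keyboard/map event payload into the client's keysym table.
    Legacy servers send the map directly ({...}); current servers wrap it
    as {"map": {...}}."""
    server_map = payload.get("map", payload)
    merged = dict(current)
    merged.update(server_map)  # server values win over built-in defaults
    return merged
```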
5.3 Liveness and Error Handling
- Heartbeats: The server periodically sends a system/heartbeat event and also issues WebSocket pings. Clients may optionally send client/heartbeat, but connection liveness primarily depends on WebSocket ping/pong; the lack of client/heartbeat alone does not cause disconnection.
- Error Events: A client should listen for and surface all error/... events. These provide crucial feedback from the server when a sent command is malformed, rejected, or fails, which is essential for debugging.
5.4 Robust Input Handling
- Host Control: A robust client should monitor the control/host event. If auto-host functionality is desired, the client should track its own session_id (from system/init) and automatically send a control/request if it observes that the host has been dropped (has_host: false) or given to another session.
- Sequential Double-Click: To reliably simulate a double-click, a client should send two control/buttonpress (or down/up pair) events sequentially, with a small delay (e.g., 50–100ms) in between. Using asyncio.gather or sending them simultaneously can result in the server's input handler swallowing one of the events.
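A sketch of the sequential pattern, assuming `send` is the client's async WebSocket send callable (the function name is mine):

```python
import asyncio
import json

async def double_click(send, x: int, y: int, delay: float = 0.08) -> None:
    """Send two press events strictly one after the other, never concurrently,
    so the server's input handler registers both clicks."""
    frame = json.dumps({"event": "control/buttonpress",
                        "payload": {"x": x, "y": y, "code": 1}})
    await send(frame)
    await asyncio.sleep(delay)  # 50-100 ms keeps the clicks distinct
    await send(frame)
```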
5.5 Defensive WebRTC Handshake
When parsing the signal/provide or signal/offer events, a client should be defensive about the iceservers array.
- Strict Field Mapping: Only parse and use the fields relevant to aiortc or your target WebRTC library (typically urls, username, and credential). Ignore any extra fields to avoid errors.
- URL Flexibility: Be prepared to handle both a single url key and a urls array.
6. References and Further Reading
- Neko GitHub
- Neko v3 Docs: Configuration
- aiortc Documentation
- Python Websockets Library
- WS Event Types (Source)
MosaicML Streaming (MDS) — Practical Guide for Codex Agents
This document explains what the MosaicML Streaming library is, why it’s useful for large, sharded datasets, and how to write and read MDS shards locally or to an S3-compatible store (MinIO, R2, etc.). It includes runnable patterns you can copy into data pipelines.
Why Streaming/MDS?
- Sharded, resumable, deterministic data loading across workers/nodes, with built-in shuffling and fault tolerance. Useful when the dataset is bigger than local disk/RAM or lives in object storage.
- MDS is the on-disk format used by Streaming: a set of shard files (e.g., shard.00000.mds[.zstd]) plus an index.json that records metadata/offsets for fast random access.
Key Concepts (10,000-ft view)
- MDSWriter: builds shards from Python dict samples (with a columnar schema you define); supports compression and direct upload to remote storage.
- StreamingDataset: reads from a local cache and/or directly from remote storage (S3, GCS, etc.), handling shuffling, ordering, and resumption deterministically.
- S3-compatible endpoints: use standard AWS credentials plus S3_ENDPOINT_URL to point at MinIO, Cloudflare R2, etc.
Minimal: Write an MDS dataset
from streaming import MDSWriter
import json, io, numpy as np
# 1) Define the schema: map column -> storage type.
# Supported logical types include 'bytes', 'str', numeric scalars, etc. (see docs)
columns = {
"frame_table": "bytes", # packed npy/npz/etc.
"meta_json": "str", # utf-8 JSON string
}
# 2) Writer: local out + optional remote (s3://bucket/prefix).
# Compression & shard size are critical for throughput.
with MDSWriter(
    out=("data/mds/train",                     # local staging/cache
         "s3://my-bucket/datasets/ui-trajs"),  # optional remote: shards upload as they close
    columns=columns,
    compression="zstd",   # fast decode in training
    hashes=["sha1"],      # integrity (takes a list of hash algorithms)
    size_limit="128MB",   # shard target
) as w:
    for episode in make_episodes():  # make_episodes() is your own episode generator
        # Pack the frame table as bytes (e.g., np.save to a BytesIO)
        buf = io.BytesIO()
        np.save(buf, episode["frame_table"].astype("float32"))
        sample = {
            "frame_table": buf.getvalue(),
            "meta_json": json.dumps(episode["meta"], ensure_ascii=False),
        }
        w.write(sample)
Notes
• Passing a (local, remote) tuple as out= enables automatic upload of completed shards to object storage while writing.
• Prefer compression="zstd" for read-time throughput; it's commonly used in training loops.
⸻
Configure S3 / S3-Compatible Remotes
Set standard AWS creds and (for non-AWS) an endpoint override:
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
# For MinIO / Cloudflare R2 / etc.:
export S3_ENDPOINT_URL="https://<your-endpoint>" # e.g., https://<accountid>.r2.cloudflarestorage.com
S3_ENDPOINT_URL is the knob for S3-compatible providers.
⸻
Read an MDS dataset (local cache + remote)
from streaming import StreamingDataset
from torch.utils.data import DataLoader
ds = StreamingDataset(
remote="s3://my-bucket/datasets/ui-trajs", # remote is optional if data is fully local
local="data/cache/ui-trajs", # local cache directory
shuffle=True, shuffle_seed=1337, # deterministic across workers
)
def collate(batch):
# Each item is a dict with your columns.
return batch
loader = DataLoader(ds, batch_size=8, num_workers=8, collate_fn=collate, persistent_workers=True)
for batch in loader:
...
• StreamingDataset handles shuffling/resume deterministically across workers and epochs.
• Local cache fills on demand; you can pin/cache shards for repeated jobs.
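Reading samples back is the mirror image of the writer's packing; this helper assumes the np.save/JSON scheme used in the writer example above:

```python
import io
import json

import numpy as np

def decode_sample(sample: dict) -> tuple[np.ndarray, dict]:
    """Unpack one StreamingDataset sample written with the example schema:
    frame_table was serialized with np.save into bytes, meta_json is a
    UTF-8 JSON string."""
    frames = np.load(io.BytesIO(sample["frame_table"]))
    meta = json.loads(sample["meta_json"])
    return frames, meta

# In a collate_fn or training loop:
#   frames, meta = decode_sample(ds[i])
```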
⸻
Practical Tips
• Shard size: 64–256 MB is a good default for training throughput; too small increases open/seek overhead, too large hurts parallelism (tune per infra). This is general best practice aligned with the MDS design.
• Compression: zstd balances ratio & decode speed for training (common choice in large-scale pipelines).
• Schema: store big blobs under bytes (e.g., numpy npz) and small metadata as str/scalars—keeps readers simple.
• Determinism: set shuffle_seed once for the run; Streaming takes care of global ordering across ranks.
⸻
FAQ
Q: Can I mix multiple datasets?
Yes—construct multiple StreamingDataset streams or concatenate datasets; Streaming supports mixing sources and remote stores. See the docs’ “main concepts” and examples.
Q: How do I resume mid-epoch?
The index and shard metadata let Streaming resume deterministically; you don’t need custom offset logic.
Q: Non-AWS S3?
Use S3_ENDPOINT_URL with your provider’s endpoint (R2/MinIO/etc.).
---
Neko Agent API Mode - Architecture Design Document
Executive Summary
This document outlines the design for a new --api mode for the Neko Agent system that exposes the existing WebRTC-based GUI automation agent as a RESTful API service. The design includes containerization, orchestration with Kubernetes, load balancing, and auto-scaling strategies to handle production workloads.
Current Architecture Analysis
Existing System (src/agent.py)
The current Neko Agent is a WebRTC-based GUI automation system with the following key components:
- NekoAgent Class: Core agent that manages WebRTC connections and performs GUI automation
- WebRTC Integration: Uses aiortc for peer-to-peer connections with Neko Chrome containers
- AI Model: ShowUI-2B (Qwen2VL) for visual reasoning and action planning
- Action Execution: Supports CLICK, INPUT, SCROLL, SWIPE, TAP, ANSWER actions
- Signaling: WebSocket-based communication with Neko server
- Frame Processing: Real-time video frame capture and analysis
- Metrics: Prometheus integration for monitoring
Current Limitations for API Mode
- Single Session: Designed for one agent per WebRTC connection
- Blocking Operations: Synchronous task execution model
- No Request Queuing: No API endpoint abstraction
- Resource Management: No dynamic resource allocation
- Scaling Constraints: Manual deployment and management
Proposed API Mode Architecture
1. System Overview
graph TB
    subgraph "Infrastructure Layer"
        LB[Load Balancer<br/>Kubernetes]
        GW[API Gateway<br/>FastAPI]
        AM[Agent Manager<br/>Orchestrator]
    end
    subgraph "Processing Layer"
        TQ[Task Queue<br/>Redis/RQ]
        AP[Agent Pods<br/>Multi-Container]
    end
    subgraph "Execution Layer"
        NI[Neko Instances<br/>Chrome/Edge]
        GPU[GPU Resources<br/>CUDA/MIG]
    end
    subgraph "Agent Pod Layout"
        subgraph "Pod Components"
            NA[Neko Agent API<br/>Core Automation]
            CS[Capture Sidecar<br/>Training Data MDS]
            TS[TTS Sidecar<br/>F5-TTS + WebRTC]
        end
        SR[Shared Resources:<br/>GPU, Storage, Voices, Network]
    end
    LB --> GW
    GW --> AM
    GW --> TQ
    AM --> AP
    TQ --> AP
    AP --> NI
    AP --> GPU
    GPU --> AP
    AP -.-> NA
    AP -.-> CS
    AP -.-> TS
    NA --- SR
    CS --- SR
    TS --- SR
2. Core Components
2.1 Training Data Capture Service (src/capture.py)
Technology: Python with MosaicML Streaming, WebSocket client, HTTP polling
Purpose: Captures training data from Neko sessions for model improvement and fine-tuning
Responsibilities:
- Real-time episode recording during agent sessions
- Screenshot capture at configurable FPS via HTTP polling
- Action annotation parsing from chat messages
- MDS (MosaicML Streaming) shard generation with S3 upload
- Episode lifecycle management with automatic timeout handling
Integration with API Mode:
- Runs as a sidecar container alongside agent instances
- Monitors the WebSocket chat for /start, /stop, and Action: commands
- Captures synchronized video frames and action annotations
- Outputs training episodes as ZIP archives in MDS format
- Supports remote storage (S3) for distributed training pipelines
Key Features:
- Episode-based Recording: Packages complete task sessions as single training samples
- Streaming Upload: Direct-to-S3 capability with local/remote storage options
- Frame Synchronization: Precise timestamp alignment between actions and screenshots
- Compression: ZSTD compression for efficient storage and transfer
- 12-Factor Compliance: Configuration via environment variables only
2.2 Text-to-Speech Service (src/yap.py)
Technology: F5-TTS, aiortc WebRTC audio, PyAV, asyncio
Purpose: Provides real-time voice synthesis and audio streaming to Neko sessions
Responsibilities:
- F5-TTS voice synthesis with custom voice models
- Real-time audio streaming via WebRTC
- Voice management and hot-reloading
- Punctuation-aware text segmentation
- Low-latency audio pipeline with jitter buffering
Integration with API Mode:
- Deploys as companion service for agent instances
- WebRTC audio track injection into Neko sessions
- Chat command interface for voice control
- Streaming and batch TTS modes
- Voice configuration management via JSON registry
Key Features:
- Multi-Voice Support: Hot-reloadable voice registry with custom models
- Streaming TTS: /yap:begin ... /yap:end for incremental synthesis
- Low Latency: Parallel TTS workers with crossfade audio splicing
- Voice Customization: Live rate/pitch adjustments and style parameters
- Production Audio: 48kHz WebRTC-compatible output with jitter buffering
2.3 API Gateway Service
Technology: FastAPI with async/await support
Responsibilities:
- HTTP request handling and validation
- Authentication and authorization
- Rate limiting and quotas
- Request/response transformation
- API documentation (OpenAPI/Swagger)
- Health checks and metrics
Key Endpoints:
# Core automation tasks
POST /api/v1/tasks # Create new automation task
GET /api/v1/tasks/{task_id} # Get task status and results
PUT /api/v1/tasks/{task_id} # Update task parameters
DELETE /api/v1/tasks/{task_id} # Cancel running task
GET /api/v1/tasks # List tasks with filtering
POST /api/v1/sessions # Create persistent session
# Training data capture
POST /api/v1/capture/episodes # Start episode recording
GET /api/v1/capture/episodes/{episode_id} # Get episode status
DELETE /api/v1/capture/episodes/{episode_id} # Stop episode recording
GET /api/v1/capture/datasets # List available training datasets
POST /api/v1/capture/datasets/{dataset_id}/upload # Upload to remote storage
GET /api/v1/capture/config # Get capture configuration
PUT /api/v1/capture/config # Update capture settings
# Text-to-Speech and voice
POST /api/v1/tts/speak # Immediate text-to-speech
POST /api/v1/tts/stream/start # Begin streaming TTS session
POST /api/v1/tts/stream/text # Add text to streaming session
POST /api/v1/tts/stream/stop # End streaming TTS session
DELETE /api/v1/tts/stop # Stop all TTS and clear queue
# Voice management
GET /api/v1/voices # List available voices
POST /api/v1/voices # Add new voice
GET /api/v1/voices/{voice_id} # Get voice details
PUT /api/v1/voices/{voice_id} # Update voice configuration
DELETE /api/v1/voices/{voice_id} # Remove voice
POST /api/v1/voices/reload # Hot-reload voice registry
# System
GET /api/v1/health # Health check endpoint
GET /api/v1/metrics # Prometheus metrics
Request/Response Format:
// POST /api/v1/tasks - Core automation task
{
"task": "Search for Python tutorials on YouTube",
"mode": "web",
"max_steps": 10,
"timeout": 300,
"callback_url": "https://app.example.com/webhook",
"options": {
"screen_resolution": "1920x1080",
"save_screenshots": true,
"refinement_steps": 5,
"enable_capture": true, // Enable training data capture
"enable_tts": true, // Enable voice output
"voice_id": "default" // Voice for TTS
}
}
// Response
{
"task_id": "task_123456",
"status": "queued",
"created_at": "2024-01-15T10:30:00Z",
"estimated_completion": "2024-01-15T10:35:00Z"
}
// POST /api/v1/capture/episodes - Start training capture
{
"task_description": "Navigate to YouTube and search for tutorials",
"neko_session_id": "session_abc123",
"capture_options": {
"fps": 2.0,
"jpeg_quality": 85,
"min_frames": 4,
"timeout_seconds": 900
}
}
// Response
{
"episode_id": "episode_xyz789",
"status": "recording",
"started_at": "2024-01-15T10:30:00Z",
"estimated_size_mb": 120
}
// POST /api/v1/tts/speak - Immediate TTS
{
"text": "I'm now searching for Python tutorials on YouTube",
"voice_id": "instructor",
"params": {
"rate": 1.0,
"pitch": 0.0
}
}
// Response
{
"request_id": "tts_req_456",
"status": "speaking",
"estimated_duration_ms": 3500
}
// POST /api/v1/voices - Add new voice
{
"voice_id": "instructor",
"ref_audio": "/voices/instructor.wav",
"ref_text": "This is an instructional voice for tutorials",
"styles": ["calm", "explanatory"],
"params": {
"rate": 0.95,
"pitch": -1.0
}
}
// Response
{
"voice_id": "instructor",
"status": "registered",
"created_at": "2024-01-15T10:30:00Z"
}
2.2 Agent Manager (Orchestrator)
Technology: Python with asyncio, Kubernetes Python client
Responsibilities:
- Agent lifecycle management
- Resource allocation and cleanup
- Task routing to available agents
- Agent health monitoring
- Scaling decisions based on queue depth
Architecture:
class AgentManager:
    def __init__(self):
        self.k8s_client = kubernetes.client.ApiClient()
        self.agent_pools = {}
        self.task_queue = RedisQueue()
        self.metrics = PrometheusMetrics()

    async def provision_agent(self, requirements: AgentRequirements) -> AgentInstance: ...
    async def scale_agent_pool(self, pool_name: str, replicas: int) -> None: ...
    async def route_task(self, task: Task) -> AgentInstance: ...
    async def cleanup_completed_agents(self) -> None: ...
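`route_task` must pick an agent with spare capacity; one simple policy is least-relative-load. A dependency-free sketch (the `AgentInstance` fields here are assumptions for illustration, not the manager's real class):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AgentInstance:
    """Illustrative stand-in; the real AgentInstance lives in the manager."""
    name: str
    active_tasks: int
    max_tasks: int

def pick_least_loaded(agents: List[AgentInstance]) -> Optional[AgentInstance]:
    """Least-relative-load routing: favor the agent with the most free capacity."""
    candidates = [a for a in agents if a.active_tasks < a.max_tasks]
    if not candidates:
        return None  # leave the task queued; a scale-up decision follows
    return min(candidates, key=lambda a: a.active_tasks / a.max_tasks)

pool = [
    AgentInstance("agent-0", active_tasks=2, max_tasks=2),  # saturated
    AgentInstance("agent-1", active_tasks=1, max_tasks=2),
]
```

Returning `None` on a saturated pool is the signal the manager uses to keep the task queued and trigger scaling.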
2.3 Modified Agent Core
Changes to src/agent.py:
class APINekoAgent(NekoAgent):
    def __init__(self, *args, api_mode=True, **kwargs):
        super().__init__(*args, **kwargs)
        self.api_mode = api_mode
        self.current_task = None
        self.task_queue = asyncio.Queue()

    async def api_run(self) -> None:
        """API-mode main loop: process tasks from the queue until shutdown."""
        while not self.shutdown.is_set():
            try:
                task = await self.task_queue.get()
                self.current_task = task
                result = await self.execute_task(task)
                await self.report_result(result)
            except Exception as e:
                await self.report_error(e)
            finally:
                self.current_task = None

    async def execute_task(self, task: APITask) -> TaskResult:
        """Execute a single API task and return its result."""
        self.nav_task = task.description
        # Existing navigation logic with API adaptations
        return TaskResult(task_id=task.id, status="completed", ...)
3. Containerization Strategy
3.1 Multi-Stage Docker Build
# Base image with CUDA support for GPU inference
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 as base
# Python environment setup
RUN apt-get update && apt-get install -y \
python3.11 python3-pip \
libgl1-mesa-glx libglib2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# Dependencies stage
FROM base as deps
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Model download stage (optional optimization)
FROM deps as models
RUN python3 -c "from transformers import Qwen2VLForConditionalGeneration; \
Qwen2VLForConditionalGeneration.from_pretrained('showlab/ShowUI-2B')"
# Application stage
FROM models as app
WORKDIR /app
COPY src/ ./src/
COPY pyproject.toml ./
RUN pip install -e .
# API mode entrypoint
CMD ["python3", "-m", "agent", "--api", "--port", "8000"]
Capture Service Dockerfile:
FROM python:3.11-slim as base
# System dependencies
RUN apt-get update && apt-get install -y \
libgl1-mesa-glx libglib2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# Python dependencies
COPY requirements-capture.txt .
RUN pip install --no-cache-dir -r requirements-capture.txt
# Application
WORKDIR /app
COPY src/capture.py ./src/
COPY pyproject.toml ./
RUN pip install -e .
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
CMD python -c "import capture; print('ok')"
# Entry point
CMD ["python", "src/capture.py"]
TTS Service Dockerfile:
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 as base
# System dependencies for audio processing
RUN apt-get update && apt-get install -y \
python3.11 python3-pip \
libsndfile1 libasound2-dev \
ffmpeg libavcodec-dev libavformat-dev \
&& rm -rf /var/lib/apt/lists/*
# Python dependencies including F5-TTS
COPY requirements-tts.txt .
RUN pip install --no-cache-dir -r requirements-tts.txt
# Download F5-TTS models (optional optimization)
RUN python3 -c "from f5_tts.api import F5TTS; F5TTS()"
# Application
WORKDIR /app
COPY src/yap.py ./src/
COPY voices/ ./voices/
COPY pyproject.toml ./
RUN pip install -e .
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
CMD python3 -c "from yap import Settings; s=Settings(); print('ok' if s.validate()==[] else 'fail')"
# Entry point
CMD ["python3", "src/yap.py"]
3.2 Container Images
Agent Image: neko-agent-api:latest
- Contains the modified agent with API capabilities
- GPU-enabled for model inference
- Health checks and graceful shutdown
- Prometheus metrics export
Gateway Image: neko-api-gateway:latest
- FastAPI application
- Authentication middleware
- Rate limiting and request validation
- OpenAPI documentation
Manager Image: neko-agent-manager:latest
- Kubernetes orchestration logic
- Agent lifecycle management
- Auto-scaling algorithms
Capture Image: neko-capture:latest
- Training data capture service (src/capture.py)
- MosaicML Streaming for MDS shard generation
- S3-compatible storage support for remote upload
- WebSocket client for Neko session monitoring
- Screenshot polling and episode management
TTS Image: neko-yap:latest
- Text-to-speech service (src/yap.py)
- F5-TTS model inference with GPU acceleration
- WebRTC audio streaming capabilities
- Voice management and hot-reloading
- Real-time audio pipeline with jitter buffering
Voice Registry Sidecar: neko-voices:latest
- Shared voice model storage
- Voice file hosting and management
- Hot-reloadable voice configurations
- Persistent volume for voice assets
4. Kubernetes Deployment
4.1 Namespace Organization
apiVersion: v1
kind: Namespace
metadata:
name: neko-agents
labels:
app: neko-agent-system
---
# Resource quotas and limits
apiVersion: v1
kind: ResourceQuota
metadata:
name: neko-agents-quota
namespace: neko-agents
spec:
hard:
requests.nvidia.com/gpu: "20"
limits.memory: "200Gi"
persistentvolumeclaims: "10"
4.2 Enhanced Agent Deployment with Sidecars
apiVersion: apps/v1
kind: Deployment
metadata:
name: neko-agent-pool
namespace: neko-agents
spec:
replicas: 3
selector:
matchLabels:
app: neko-agent
template:
metadata:
labels:
app: neko-agent
spec:
nodeSelector:
accelerator: nvidia-tesla-t4
volumes:
- name: voices-storage
persistentVolumeClaim:
claimName: neko-voices-pvc
- name: capture-storage
emptyDir: {}
containers:
# Main agent container
- name: neko-agent
image: neko-agent-api:latest
resources:
requests:
nvidia.com/gpu: 1
memory: "4Gi"
cpu: "2"
limits:
nvidia.com/gpu: 1
memory: "8Gi"
cpu: "4"
env:
- name: NEKO_API_MODE
value: "true"
- name: CUDA_VISIBLE_DEVICES
value: "0"
ports:
- containerPort: 8000
name: api
- containerPort: 9000
name: metrics
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
# Training data capture sidecar
- name: neko-capture
image: neko-capture:latest
env:
- name: NEKO_WS
value: "ws://localhost:8000/api/ws"
- name: CAPTURE_OUT
value: "/capture/mds"
- name: CAPTURE_REMOTE
valueFrom:
secretKeyRef:
name: neko-s3-config
key: remote_uri
optional: true
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: neko-s3-config
key: access_key_id
optional: true
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: neko-s3-config
key: secret_access_key
optional: true
volumeMounts:
- name: capture-storage
mountPath: /capture
resources:
requests:
memory: "512Mi"
cpu: "0.5"
limits:
memory: "1Gi"
cpu: "1"
# TTS sidecar
- name: neko-yap
image: neko-yap:latest
env:
- name: YAP_WS
value: "ws://localhost:8000/api/ws"
- name: YAP_VOICES_DIR
value: "/voices"
- name: YAP_SR
value: "48000"
- name: YAP_PARALLEL
value: "2"
volumeMounts:
- name: voices-storage
mountPath: /voices
        resources:
          requests:
            # Kubernetes rejects fractional GPU requests such as 0.5; share the
            # GPU with the main agent via NVIDIA time-slicing or MIG instead
            nvidia.com/gpu: 1
            memory: "2Gi"
            cpu: "1"
          limits:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "2"
livenessProbe:
exec:
command:
- python3
- -c
- "from yap import Settings; s=Settings(); exit(0 if s.validate()==[] else 1)"
initialDelaySeconds: 45
periodSeconds: 30
4.3 Persistent Storage for Voices
# Persistent Volume Claim for voice assets
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: neko-voices-pvc
namespace: neko-agents
spec:
accessModes:
- ReadWriteMany # Multiple pods can mount
resources:
requests:
storage: 10Gi
storageClassName: nfs-client # Use NFS or similar for shared access
---
# Secret for S3 configuration (training data upload)
apiVersion: v1
kind: Secret
metadata:
name: neko-s3-config
namespace: neko-agents
type: Opaque
stringData:
remote_uri: "s3://neko-training-data/episodes"
access_key_id: "AKIA..."
secret_access_key: "..."
region: "us-west-2"
4.4 Service Configuration
# Service for agent API endpoints
apiVersion: v1
kind: Service
metadata:
name: neko-agent-service
namespace: neko-agents
spec:
selector:
app: neko-agent
ports:
- name: api
port: 8000
targetPort: 8000
- name: metrics
port: 9000
targetPort: 9000
type: ClusterIP
---
# Headless service for TTS WebRTC (if needed)
apiVersion: v1
kind: Service
metadata:
name: neko-tts-service
namespace: neko-agents
spec:
selector:
app: neko-agent
ports:
- name: webrtc
port: 8080
targetPort: 8080
protocol: UDP
clusterIP: None
4.5 Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: neko-agent-hpa
namespace: neko-agents
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: neko-agent-pool
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: neko_gpu_utilization_percent  # GPU is not a native HPA resource metric; expose it via DCGM exporter + Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "80"
- type: Pods
pods:
metric:
name: task_queue_length
target:
type: AverageValue
averageValue: "5"
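The `task_queue_length` Pods metric above is not built into Kubernetes: each pod has to expose it on a metrics endpoint, and an adapter (e.g. Prometheus Adapter) has to surface it to the HPA. A minimal sketch of the exposition text such an endpoint would serve, following the Prometheus text exposition format:

```python
def render_queue_metric(queue_length: int, deployment: str = "neko-agent-pool") -> str:
    """Render task_queue_length in the Prometheus text exposition format,
    as a /metrics endpoint would serve it for scraping."""
    return (
        "# HELP task_queue_length Number of queued automation tasks\n"
        "# TYPE task_queue_length gauge\n"
        f'task_queue_length{{deployment="{deployment}"}} {queue_length}\n'
    )
```

In production this would come from a metrics library rather than hand-rolled strings; the sketch only pins down the name and label the HPA manifest refers to.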
5. Load Balancing and Auto-Scaling Strategy
5.1 Multi-Tier Load Balancing
Tier 1: External Load Balancer
- Cloud provider load balancer (ALB/GKE/AKS)
- SSL termination and geographical routing
- Health checks and failover
Tier 2: Ingress Controller
- NGINX Ingress with session affinity
- Rate limiting and request buffering
- WebSocket upgrade support for agent communication
Tier 3: Service Mesh (Optional)
- Istio for advanced traffic management
- Circuit breaker patterns
- Observability and tracing
5.2 Intelligent Auto-Scaling
Queue-Based Scaling:
class QueueBasedScaler:
    def __init__(self, min_replicas: int = 2, max_replicas: int = 20):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.queue_threshold_scale_up = 10
        self.queue_threshold_scale_down = 2
        self.avg_task_duration = 30  # seconds

    def calculate_desired_replicas(self, queue_length: int, current_replicas: int) -> int:
        if queue_length > self.queue_threshold_scale_up:
            # Double capacity, bounded by the pool maximum
            desired = min(current_replicas * 2, self.max_replicas)
        elif queue_length < self.queue_threshold_scale_down:
            desired = max(current_replicas // 2, self.min_replicas)
        else:
            desired = current_replicas
        return desired
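The doubling/halving policy can be restated as a pure function for quick what-if checks; the `min_replicas`/`max_replicas` defaults below are assumptions matching the HPA's 2-20 range:

```python
def desired_replicas(queue_length: int, current: int,
                     min_replicas: int = 2, max_replicas: int = 20,
                     scale_up_at: int = 10, scale_down_at: int = 2) -> int:
    """Doubling/halving queue-based policy with pool bounds applied."""
    if queue_length > scale_up_at:
        return min(current * 2, max_replicas)   # double, capped at the pool max
    if queue_length < scale_down_at:
        return max(current // 2, min_replicas)  # halve, floored at the pool min
    return current
```

With 4 replicas and 15 queued tasks the policy doubles to 8; with 8 replicas and an empty queue it halves to 4.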
GPU Utilization Scaling:
- Monitor GPU memory usage and compute utilization
- Scale up when average GPU utilization > 80%
- Scale down when GPU utilization < 30% for 5 minutes
Cost-Optimized Node Selection:
- Prefer spot instances for batch workloads
- Use GPU node pools with automatic provisioning
- Implement preemption handling for graceful task migration
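Preemption handling boils down to treating the provider's SIGTERM as "stop accepting, drain, exit." A sketch under the assumption that the worker consumes an asyncio queue (`install_preemption_handler` must be called from within the running event loop):

```python
import asyncio
import signal

def install_preemption_handler(stop: asyncio.Event) -> None:
    """Spot nodes deliver SIGTERM before reclaim; flip a stop flag so the
    worker stops accepting new tasks. Call from inside the running loop."""
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)

async def worker_loop(queue: asyncio.Queue, stop: asyncio.Event) -> int:
    """Process tasks until stopped, then drain whatever is already queued."""
    completed = 0
    while not stop.is_set() or not queue.empty():
        try:
            task = queue.get_nowait()
        except asyncio.QueueEmpty:
            if stop.is_set():
                break
            await asyncio.sleep(0.01)  # idle wait for new work
            continue
        await asyncio.sleep(0)  # placeholder for actually executing `task`
        completed += 1
    return completed
```

The drain must finish inside the node's termination grace period; anything left over is re-queued by the manager for another pod.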
5.3 Advanced Scaling Features
Predictive Scaling (2024 Best Practice):
Model Selection Rationale: For Neko Agent workloads, we recommend a Prophet + XGBoost ensemble approach due to:
- Variable task durations (30s to 30+ minutes)
- Bursty request patterns with business hour seasonality
- Multi-dimensional resource requirements (CPU, GPU, memory)
- Need for 5-15 minute prediction horizon to account for container startup times
Core Implementation:
import numpy as np
import pandas as pd
from prophet import Prophet
import xgboost as xgb
from sklearn.ensemble import VotingRegressor
from typing import Dict, List, Tuple, Optional
import asyncio
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta
@dataclass
class LoadPrediction:
timestamp: datetime
predicted_requests: float
predicted_gpu_hours: float
predicted_memory_gb: float
confidence_interval: Tuple[float, float]
recommendation: str # 'scale_up', 'scale_down', 'maintain'
class NekoLoadPredictor:
"""
Ensemble model for predicting Neko Agent resource demands.
Uses Prophet for seasonality/trends + XGBoost for complex patterns.
Optimized for multi-container pods with GPU sharing constraints.
"""
def __init__(self, prediction_horizon_minutes: int = 10):
self.horizon = prediction_horizon_minutes
self.prophet_model = None
self.xgb_model = None
self.ensemble_model = None
self.feature_columns = [
'hour_of_day', 'day_of_week', 'is_weekend', 'is_holiday',
'queue_depth', 'avg_task_duration', 'active_pods',
'gpu_utilization_pct', 'memory_utilization_pct',
'requests_last_5min', 'requests_last_15min', 'requests_last_hour',
'tts_requests_ratio', 'capture_enabled_ratio'
]
self.logger = logging.getLogger(__name__)
def _engineer_features(self,
historical_metrics: pd.DataFrame,
current_time: datetime) -> pd.DataFrame:
"""
Engineer features for prediction model.
:param historical_metrics: Time series of system metrics
:param current_time: Current timestamp for prediction
:return: Engineered feature DataFrame
"""
df = historical_metrics.copy()
# Time-based features
df['hour_of_day'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_holiday'] = df['timestamp'].dt.date.isin([h.date() for h in self._get_holidays()]).astype(int)
        # Rolling window features (time-based windows require `on='timestamp'`)
        df['requests_last_5min'] = df.rolling('5min', on='timestamp')['request_count'].sum()
        df['requests_last_15min'] = df.rolling('15min', on='timestamp')['request_count'].sum()
        df['requests_last_hour'] = df.rolling('60min', on='timestamp')['request_count'].sum()
        # Task complexity indicators
        df['avg_task_duration'] = (
            df.rolling('15min', on='timestamp')['total_task_seconds'].sum()
            / df.rolling('15min', on='timestamp')['request_count'].sum()
        )
        df['tts_requests_ratio'] = df['tts_requests'] / df['request_count'].clip(lower=1)
        df['capture_enabled_ratio'] = df['capture_requests'] / df['request_count'].clip(lower=1)
# Resource utilization
df['gpu_utilization_pct'] = df['gpu_memory_used'] / df['gpu_memory_total'] * 100
df['memory_utilization_pct'] = df['memory_used'] / df['memory_total'] * 100
return df.fillna(0)
def _get_holidays(self) -> List[datetime]:
"""Get list of holidays that might affect traffic patterns."""
# Simplified - in production, use holidays library
return [
datetime(2024, 1, 1), # New Year
datetime(2024, 7, 4), # Independence Day
datetime(2024, 12, 25), # Christmas
# Add more holidays based on user base geography
]
async def train_models(self, historical_data: pd.DataFrame):
"""
Train the ensemble prediction model.
:param historical_data: Historical metrics with features and targets
"""
self.logger.info("Training predictive scaling models...")
# Prepare data
df = self._engineer_features(historical_data, datetime.now())
# Prophet for seasonality and trends
prophet_df = df[['timestamp', 'request_count']].rename(
columns={'timestamp': 'ds', 'request_count': 'y'}
)
self.prophet_model = Prophet(
daily_seasonality=True,
weekly_seasonality=True,
yearly_seasonality=False, # Need more data
changepoint_prior_scale=0.05, # Conservative for stability
interval_width=0.8
)
# Add business hours regressor
prophet_df['business_hours'] = (
(prophet_df['ds'].dt.hour >= 9) &
(prophet_df['ds'].dt.hour < 17) &
(prophet_df['ds'].dt.dayofweek < 5)
).astype(int)
self.prophet_model.add_regressor('business_hours')
self.prophet_model.fit(prophet_df)
# XGBoost for complex patterns
feature_df = df[self.feature_columns].fillna(0)
target_requests = df['request_count'].values
target_gpu_hours = df['gpu_hours_consumed'].values
target_memory = df['memory_gb_peak'].values
self.xgb_model = {
'requests': xgb.XGBRegressor(
n_estimators=100,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
random_state=42
),
'gpu_hours': xgb.XGBRegressor(
n_estimators=100,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
random_state=42
),
'memory_gb': xgb.XGBRegressor(
n_estimators=100,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
random_state=42
)
}
# Train XGBoost models
self.xgb_model['requests'].fit(feature_df, target_requests)
self.xgb_model['gpu_hours'].fit(feature_df, target_gpu_hours)
self.xgb_model['memory_gb'].fit(feature_df, target_memory)
self.logger.info("Model training completed successfully")
async def predict_load(self,
current_metrics: Dict[str, float],
prediction_time: datetime) -> LoadPrediction:
"""
Predict future load using ensemble model.
:param current_metrics: Current system metrics
:param prediction_time: Time to predict for
:return: Load prediction with confidence intervals
"""
if not (self.prophet_model and self.xgb_model):
raise ValueError("Models not trained. Call train_models() first.")
# Prophet prediction (seasonal/trend)
future_df = pd.DataFrame({
'ds': [prediction_time],
'business_hours': [
1 if (9 <= prediction_time.hour < 17 and prediction_time.weekday() < 5) else 0
]
})
prophet_forecast = self.prophet_model.predict(future_df)
prophet_requests = max(0, prophet_forecast['yhat'].iloc[0])
confidence_lower = prophet_forecast['yhat_lower'].iloc[0]
confidence_upper = prophet_forecast['yhat_upper'].iloc[0]
# XGBoost prediction (complex patterns)
current_features = pd.DataFrame([{
'hour_of_day': prediction_time.hour,
'day_of_week': prediction_time.weekday(),
'is_weekend': 1 if prediction_time.weekday() >= 5 else 0,
'is_holiday': 1 if prediction_time.date() in [h.date() for h in self._get_holidays()] else 0,
'queue_depth': current_metrics.get('queue_depth', 0),
'avg_task_duration': current_metrics.get('avg_task_duration', 180),
'active_pods': current_metrics.get('active_pods', 1),
'gpu_utilization_pct': current_metrics.get('gpu_utilization_pct', 0),
'memory_utilization_pct': current_metrics.get('memory_utilization_pct', 0),
'requests_last_5min': current_metrics.get('requests_last_5min', 0),
'requests_last_15min': current_metrics.get('requests_last_15min', 0),
'requests_last_hour': current_metrics.get('requests_last_hour', 0),
'tts_requests_ratio': current_metrics.get('tts_requests_ratio', 0.3),
'capture_enabled_ratio': current_metrics.get('capture_enabled_ratio', 0.8),
}])
xgb_requests = max(0, self.xgb_model['requests'].predict(current_features)[0])
xgb_gpu_hours = max(0, self.xgb_model['gpu_hours'].predict(current_features)[0])
xgb_memory_gb = max(0, self.xgb_model['memory_gb'].predict(current_features)[0])
        # Ensemble: weight Prophet (seasonality) against XGBoost (patterns).
        # Favor Prophet on weekdays; lean more on XGBoost for weekends.
        prophet_weight = 0.7 if current_features['is_weekend'].iloc[0] == 0 else 0.4
xgb_weight = 1 - prophet_weight
final_requests = prophet_weight * prophet_requests + xgb_weight * xgb_requests
# Generate scaling recommendation
current_capacity = current_metrics.get('active_pods', 1) * current_metrics.get('pods_max_requests_per_hour', 20)
recommendation = self._generate_recommendation(
predicted_load=final_requests,
current_capacity=current_capacity,
current_utilization=current_metrics.get('current_utilization_pct', 0)
)
return LoadPrediction(
timestamp=prediction_time,
predicted_requests=final_requests,
predicted_gpu_hours=xgb_gpu_hours,
predicted_memory_gb=xgb_memory_gb,
confidence_interval=(confidence_lower, confidence_upper),
recommendation=recommendation
)
def _generate_recommendation(self,
predicted_load: float,
current_capacity: float,
current_utilization: float) -> str:
"""Generate scaling recommendation based on prediction."""
utilization_threshold_up = 80.0 # Scale up if predicted > 80%
utilization_threshold_down = 30.0 # Scale down if predicted < 30%
        predicted_utilization = (predicted_load / max(current_capacity, 1.0)) * 100
# Conservative scaling to avoid thrashing
if predicted_utilization > utilization_threshold_up and current_utilization > 60:
return 'scale_up'
elif predicted_utilization < utilization_threshold_down and current_utilization < 40:
return 'scale_down'
else:
return 'maintain'
class PredictiveScaler:
"""
Production-ready predictive scaler for Neko Agent pods.
"""
def __init__(self,
k8s_client,
namespace: str = "neko-agents",
deployment_name: str = "neko-agent-pool"):
self.predictor = NekoLoadPredictor()
self.k8s_client = k8s_client
self.namespace = namespace
self.deployment_name = deployment_name
self.logger = logging.getLogger(__name__)
self.scaling_history = []
self.min_replicas = 2
self.max_replicas = 50
async def collect_metrics(self) -> Dict[str, float]:
"""Collect current system metrics for prediction."""
# This would integrate with Prometheus/Grafana
# Simplified example:
return {
'queue_depth': await self._get_queue_depth(),
'active_pods': await self._get_active_pods(),
'gpu_utilization_pct': await self._get_gpu_utilization(),
'memory_utilization_pct': await self._get_memory_utilization(),
'requests_last_5min': await self._get_recent_requests(5),
'requests_last_15min': await self._get_recent_requests(15),
'requests_last_hour': await self._get_recent_requests(60),
'avg_task_duration': await self._get_avg_task_duration(),
'tts_requests_ratio': await self._get_tts_ratio(),
'capture_enabled_ratio': await self._get_capture_ratio(),
}
async def predict_and_scale(self):
"""Main prediction and scaling loop."""
try:
current_metrics = await self.collect_metrics()
prediction_time = datetime.now() + timedelta(minutes=10)
prediction = await self.predictor.predict_load(current_metrics, prediction_time)
self.logger.info(
f"Load prediction: {prediction.predicted_requests:.1f} requests, "
f"recommendation: {prediction.recommendation}"
)
if prediction.recommendation == 'scale_up':
await self._scale_up(prediction)
elif prediction.recommendation == 'scale_down':
await self._scale_down(prediction)
# Record scaling decision for model improvement
self.scaling_history.append({
'timestamp': datetime.now(),
'prediction': prediction,
'action_taken': prediction.recommendation,
'current_metrics': current_metrics
})
except Exception as e:
self.logger.error(f"Predictive scaling error: {e}")
async def _scale_up(self, prediction: LoadPrediction):
"""Scale up agent pods based on prediction."""
current_replicas = await self._get_current_replicas()
# Calculate required replicas with safety margin
requests_per_pod_per_hour = 20 # Conservative estimate
required_replicas = max(
current_replicas + 1, # Minimum increment
int(np.ceil(prediction.predicted_requests / requests_per_pod_per_hour * 1.2)) # 20% safety margin
)
target_replicas = min(required_replicas, self.max_replicas)
if target_replicas > current_replicas:
await self._update_replicas(target_replicas)
self.logger.info(f"Scaled up from {current_replicas} to {target_replicas} replicas")
async def _scale_down(self, prediction: LoadPrediction):
"""Scale down agent pods based on prediction."""
current_replicas = await self._get_current_replicas()
# Conservative scale-down to avoid thrashing
if len(self.scaling_history) > 0:
recent_scale_up = any(
h['action_taken'] == 'scale_up'
for h in self.scaling_history[-3:] # Last 3 decisions
)
if recent_scale_up:
self.logger.info("Skipping scale-down due to recent scale-up")
return
requests_per_pod_per_hour = 15 # More conservative for scale-down
required_replicas = max(
self.min_replicas,
int(np.ceil(prediction.predicted_requests / requests_per_pod_per_hour))
)
target_replicas = max(required_replicas, current_replicas - 1) # Maximum decrement
if target_replicas < current_replicas:
await self._update_replicas(target_replicas)
self.logger.info(f"Scaled down from {current_replicas} to {target_replicas} replicas")
async def _get_current_replicas(self) -> int:
"""Get current number of replicas from Kubernetes."""
# Implementation would use Kubernetes API
pass
async def _update_replicas(self, target_replicas: int):
"""Update deployment replica count."""
# Implementation would use Kubernetes API
pass
# Additional helper methods for metrics collection...
async def _get_queue_depth(self) -> float:
# Redis queue length
pass
async def _get_gpu_utilization(self) -> float:
# NVIDIA DCGM metrics
pass
Model Training Pipeline:
class ModelTrainingPipeline:
"""
Automated training pipeline for predictive scaling models.
Runs weekly to incorporate new patterns and seasonal changes.
"""
def __init__(self, prometheus_client, model_storage_path: str):
self.prometheus = prometheus_client
self.storage_path = model_storage_path
self.logger = logging.getLogger(__name__)
async def collect_training_data(self, days_back: int = 30) -> pd.DataFrame:
"""Collect historical metrics from Prometheus for training."""
end_time = datetime.now()
start_time = end_time - timedelta(days=days_back)
queries = {
'request_count': 'sum(rate(neko_api_requests_total[5m])) * 300',
'gpu_hours_consumed': 'sum(neko_gpu_utilization_percent) / 100 * 5/60',
'memory_gb_peak': 'max(neko_memory_usage_bytes) / 1024/1024/1024',
'total_task_seconds': 'sum(neko_task_duration_seconds)',
'tts_requests': 'sum(rate(neko_tts_requests_total[5m])) * 300',
'capture_requests': 'sum(rate(neko_capture_episodes_total[5m])) * 300',
# Add more metrics as needed
}
historical_data = []
current_time = start_time
while current_time < end_time:
row = {'timestamp': current_time}
for metric_name, query in queries.items():
try:
result = await self.prometheus.query(query, time=current_time)
row[metric_name] = float(result[0]['value'][1]) if result else 0.0
except Exception as e:
self.logger.warning(f"Failed to get {metric_name}: {e}")
row[metric_name] = 0.0
historical_data.append(row)
current_time += timedelta(minutes=5)
return pd.DataFrame(historical_data)
async def train_and_evaluate(self):
"""Train models and evaluate performance."""
self.logger.info("Starting model training pipeline...")
# Collect data
training_data = await self.collect_training_data()
# Split train/test (last 20% for testing)
split_idx = int(len(training_data) * 0.8)
train_data = training_data[:split_idx]
test_data = training_data[split_idx:]
# Train models
predictor = NekoLoadPredictor()
await predictor.train_models(train_data)
# Evaluate performance
mae_scores = []
mape_scores = []
for _, row in test_data.iterrows():
if row.name % 12 == 0: # Every hour
current_metrics = {
'queue_depth': 0, # Simplified for evaluation
'active_pods': 3,
'gpu_utilization_pct': 50,
# ... other metrics
}
prediction = await predictor.predict_load(
current_metrics,
row['timestamp'] + timedelta(minutes=10)
)
actual = row['request_count']
predicted = prediction.predicted_requests
mae_scores.append(abs(actual - predicted))
if actual > 0:
mape_scores.append(abs(actual - predicted) / actual * 100)
mae = np.mean(mae_scores)
mape = np.mean(mape_scores)
self.logger.info(f"Model evaluation - MAE: {mae:.2f}, MAPE: {mape:.2f}%")
# Save model if performance is acceptable
if mape < 20: # 20% MAPE threshold
await self._save_model(predictor)
self.logger.info("Model saved successfully")
else:
self.logger.warning("Model performance below threshold, keeping previous version")
async def _save_model(self, predictor: NekoLoadPredictor):
"""Save trained model to persistent storage."""
import joblib
model_data = {
'prophet_model': predictor.prophet_model,
'xgb_model': predictor.xgb_model,
'trained_at': datetime.now(),
'feature_columns': predictor.feature_columns
}
joblib.dump(model_data, f"{self.storage_path}/neko_scaling_model.pkl")
Integration with Kubernetes HPA:
# Custom HPA with predictive scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: neko-agent-predictive-hpa
namespace: neko-agents
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: neko-agent-pool
minReplicas: 2
maxReplicas: 50
metrics:
- type: External
external:
metric:
name: neko_predicted_load
selector:
matchLabels:
deployment: neko-agent-pool
target:
type: Value
value: "20" # Target requests per pod
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # React quickly to spikes
policies:
- type: Percent
value: 50 # Max 50% increase per minute
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # Conservative scale-down
policies:
- type: Percent
value: 10 # Max 10% decrease per 5 minutes
periodSeconds: 300
Multi-Region Deployment:
- Deploy agent pools across multiple availability zones
- Use regional load balancers with latency-based routing
- Cross-region task replication for disaster recovery
6. Monitoring and Observability
6.1 Metrics Collection
System Metrics:
- Request rate, latency, and error rate
- Queue depth and processing time
- Resource utilization (CPU, GPU, memory)
- Agent lifecycle events
Business Metrics:
- Task success/failure rates by type
- Average task completion time
- User adoption and API usage patterns
- Cost per task execution
Custom Prometheus Metrics:
from prometheus_client import Gauge, Histogram

# Agent-specific metrics
ACTIVE_AGENTS = Gauge('neko_active_agents_total', 'Number of active agent instances')
TASK_DURATION = Histogram('neko_task_duration_seconds', 'Task execution time', ['task_type'])
QUEUE_SIZE = Gauge('neko_task_queue_size', 'Number of tasks in queue')
GPU_UTILIZATION = Gauge('neko_gpu_utilization_percent', 'GPU utilization per agent', ['agent_id'])
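For TASK_DURATION to be useful, every task execution path should be timed with a consistent label set, i.e. `TASK_DURATION.labels(task_type=...).observe(elapsed)`. The sketch below uses a dependency-free stand-in for the histogram so the timing pattern itself is visible; in production the dict would be the real prometheus_client Histogram:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Stand-in for TASK_DURATION above; prometheus_client's Histogram follows
# the same pattern via TASK_DURATION.labels(task_type=t).observe(elapsed).
observations = defaultdict(list)

@contextmanager
def observe_task(task_type: str):
    """Record the wall-clock duration of a task under its task_type label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        observations[task_type].append(time.perf_counter() - start)

with observe_task("web"):
    time.sleep(0.01)  # placeholder for real task execution
```

Wrapping execution in a context manager guarantees failed tasks are timed too, which keeps latency percentiles honest.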
6.2 Logging Strategy
Structured Logging:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "INFO",
"agent_id": "agent-pod-xyz",
"task_id": "task_123456",
"event": "task_completed",
"duration": 45.2,
"actions_executed": 8,
"success": true
}
Log Aggregation: ELK Stack (Elasticsearch, Logstash, Kibana) or equivalent
Distributed Tracing: OpenTelemetry with Jaeger for request flow tracking
7. Security Considerations
7.1 API Security
Authentication: JWT tokens with role-based access control
Authorization: Resource-level permissions and quota enforcement
Input Validation: Request schema validation and sanitization
Rate Limiting: Per-user and per-endpoint limits
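Per-user rate limiting is commonly implemented as a token bucket, which permits short bursts while capping the sustained rate. A minimal sketch with an injectable clock so the behavior is deterministic; in the gateway this would sit in FastAPI middleware keyed by user ID:

```python
import time

class TokenBucket:
    """Per-user limiter: refills `rate` tokens per second up to `capacity`."""
    def __init__(self, rate: float, capacity: int, now: float = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic() if now is None else now

    def allow(self, now: float = None) -> bool:
        """Consume one token if available; `now` is injectable for testing."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket with `rate=1.0, capacity=2` admits a burst of two requests, then one request per second thereafter.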
7.2 Container Security
Image Scanning: Vulnerability scanning in CI/CD pipeline
Runtime Security: Pod Security Standards (PSS) and AppArmor profiles
Secret Management: Kubernetes secrets with encryption at rest
Network Policies: Microsegmentation within the cluster
7.3 Model Security
Model Integrity: Checksum verification for downloaded models
Inference Security: Input validation and output filtering
Resource Isolation: GPU memory isolation between agents
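Checksum verification for model artifacts is a straight SHA-256 comparison against a digest pinned at build or publish time:

```python
import hashlib

def verify_model_checksum(path: str, expected_sha256: str) -> bool:
    """Stream the model file and compare its SHA-256 to a pinned digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

Running this at container start, before loading weights, turns a corrupted or tampered download into a hard failure rather than silent bad inference.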
8. Implementation Roadmap
Phase 1: Core API Development (4 weeks)
- Implement FastAPI gateway with basic endpoints
- Modify agent.py for API mode support
- Create Docker containers for all components
- Basic Kubernetes deployment manifests
Phase 2: Training Data & TTS Integration (3 weeks)
- Deploy capture.py as sidecar service with MDS integration
- Implement yap.py TTS service with F5-TTS and WebRTC
- Configure voice management and persistent storage
- Set up S3 integration for training data upload
- Add new API endpoints for capture and TTS
Phase 3: Scaling Infrastructure (3 weeks)
- Implement agent manager and orchestration logic
- Set up Redis task queue and job processing
- Configure HPA with custom metrics for multi-container pods
- Load testing and performance optimization with sidecars
Phase 4: Production Hardening (3 weeks)
- Implement monitoring and alerting for all services
- Security hardening and compliance (including voice data)
- Documentation and API reference (capture + TTS endpoints)
- CI/CD pipeline integration with multi-stage builds
Phase 5: Advanced Features (4 weeks)
- Predictive scaling algorithms with multi-container awareness
- Multi-region deployment with voice model replication
- Advanced monitoring and cost optimization for GPU sharing
- Performance analytics and training data insights
9. Cost Analysis
9.1 Resource Requirements
Per Enhanced Agent Pod (Main + Sidecars):
- GPU: 1x NVIDIA T4 or A10 (shared: 1.0 agent + 0.5 TTS) ($0.35-$1.10/hour)
- CPU: 3.5-7 cores (2-4 agent + 1-2 TTS + 0.5-1 capture) ($0.14-$0.28/hour)
- Memory: 6.5-13GB (4-8 agent + 2-4 TTS + 0.5-1 capture) ($0.0065-$0.013/hour)
- Storage: 60GB SSD + 10GB voice models ($0.007/hour)
Additional Components per Pod:
- Capture sidecar: Minimal overhead, primarily I/O and storage
- TTS sidecar: GPU sharing, ~50% memory overhead
- Voice storage: Shared across pods via NFS/PVC
Total Cost per Enhanced Pod/Hour: ~$0.52-$1.42 (+30% for sidecars)
Supporting Infrastructure:
- Load balancer: $0.025/hour
- Redis cluster: $0.10-$0.30/hour
- Monitoring stack: $0.05-$0.15/hour
9.2 Cost Optimization Strategies
- Spot Instances: 60-90% cost reduction for fault-tolerant workloads
- Right-sizing: Dynamic resource allocation based on workload
- Reserved Instances: 30-60% discount for predictable baseline load
- Multi-tenancy: Share GPU resources across multiple small tasks
10. Conclusion
The proposed API mode architecture transforms the existing Neko Agent from a single-session WebRTC client into a comprehensive, cloud-native AI automation platform. The integration of capture.py and yap.py creates a complete solution with training data collection and real-time voice interaction capabilities.
Key Benefits:
- Horizontal Scalability: Support for hundreds of concurrent automation tasks with integrated sidecars
- Training Data Pipeline: Automated collection of high-quality training episodes in MosaicML format
- Voice Interaction: Real-time TTS with F5-TTS and custom voice models via WebRTC
- Cost Efficiency: Dynamic scaling with GPU sharing between agent and TTS workloads
- Developer Experience: RESTful API with training and voice management endpoints
- Production Ready: Monitoring, security, and operational excellence across all services
- Future Proof: Built on modern container orchestration with ML/AI workflow patterns
Enhanced Capabilities:
- Self-Improving: Continuous training data collection enables model refinement
- Multimodal: Visual automation + audio output for richer user experiences
- Enterprise Ready: S3 integration, voice asset management, and compliance features
- Flexible Deployment: Sidecar pattern allows optional feature enablement per workload
The implementation leverages proven 2024 best practices for AI workload orchestration, GPU resource management, and microservices architecture, while adding cutting-edge training data collection and voice synthesis capabilities.
- Total Implementation Estimate: 17 weeks for full production deployment (including training/TTS features)
- Expected Performance: 100+ concurrent tasks, <2s API response time, 99.9% availability, <500ms TTS latency
- Estimated Operating Cost: $0.52-$1.42 per enhanced pod-hour, with potential 50-60% reduction through optimization
- Training Data Output: ~2-5GB per hour of operation, ready for distributed training pipelines
References
- Kubernetes AI/ML Best Practices 2024
- WebRTC Scaling Patterns with STUNner
- NVIDIA GPU Operator for Kubernetes
- KubeAI: AI Inference Operator
- Prometheus Operator for Monitoring
Glossary
General Terms
Neko Server : A containerized remote desktop solution that provides browser-based access to desktop environments via WebRTC.
WebRTC : Web Real-Time Communication protocol used for peer-to-peer audio, video, and data transmission in web browsers.
ICE (Interactive Connectivity Establishment) : Protocol for establishing peer-to-peer connections across NATs and firewalls.
SDP (Session Description Protocol) : Protocol for describing media sessions and connection parameters.
Component-Specific Terms
Manual Control CLI
REPL (Read-Eval-Print Loop) : Interactive command-line interface for real-time control and testing.
Signaler : WebSocket connection manager that handles message routing and reconnection.
Broker : Event distribution system that routes incoming messages to appropriate topic queues.
Normalized Coordinates : Coordinate system using 0.0-1.0 range instead of pixel values for resolution independence.
Host Control : Exclusive mouse and keyboard control permission on the Neko server.
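Normalized coordinates can be illustrated with a small sketch. The function name and clamping behavior here are illustrative, not taken from the CLI's implementation:

```python
def to_pixels(nx, ny, width, height):
    """Convert normalized (0.0-1.0) coordinates to integer pixel positions.

    Clamping keeps slightly out-of-range model outputs on-screen.
    """
    nx = min(max(nx, 0.0), 1.0)
    ny = min(max(ny, 0.0), 1.0)
    return round(nx * (width - 1)), round(ny * (height - 1))

# The same normalized point lands proportionally on any resolution,
# which is why normalized coordinates are resolution-independent:
print(to_pixels(0.5, 0.5, 1920, 1080))  # centre of a 1080p screen
print(to_pixels(0.5, 0.5, 1280, 720))   # centre of a 720p screen
```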
Core Agent
Action Space : Set of possible actions the agent can perform (click, type, scroll, etc.).
Observation Space : Visual and contextual information the agent receives from the environment.
Training Data : Collected sequences of observations and actions used for model training.
Capture Service
MDS (Mosaic Data Shard) : Sharded binary dataset format used by MosaicML Streaming, optimized for streaming and distributed training.
Trajectory : Sequence of state-action pairs representing a complete task execution.
Temporal Alignment : Synchronization of visual frames with corresponding action timestamps.
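Temporal alignment can be sketched as nearest-timestamp matching between actions and captured frames. The data layout below is illustrative, not the capture service's actual format:

```python
import bisect

def align_actions_to_frames(frame_ts, actions):
    """Pair each action with the nearest captured frame by timestamp.

    frame_ts: sorted list of frame capture times (seconds).
    actions:  list of (timestamp, action) tuples.
    Returns a list of (frame_index, action) pairs.
    """
    aligned = []
    for ts, action in actions:
        i = bisect.bisect_left(frame_ts, ts)
        # Compare the two neighbouring frames and keep the closer one.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_ts)]
        best = min(candidates, key=lambda j: abs(frame_ts[j] - ts))
        aligned.append((best, action))
    return aligned

frames = [0.0, 0.5, 1.0, 1.5]                  # 2 fps capture
acts = [(0.48, "click"), (1.4, "type")]
print(align_actions_to_frames(frames, acts))   # [(1, 'click'), (3, 'type')]
```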
TTS Service
Voice Synthesis : Process of generating spoken audio from text input.
Audio Streaming : Real-time transmission of audio data over WebRTC channels.
Voice Activity Detection (VAD) : Technique for detecting the presence of human speech in an audio stream.
Technical Terms
Async/Await : Python asynchronous programming pattern for concurrent execution.
Task Cancellation : Graceful shutdown mechanism for asyncio background tasks.
Exponential Backoff : Retry strategy with increasing delays between attempts.
Event Loop : Core of asyncio that manages and executes asynchronous operations.
Queue (asyncio.Queue) : Coroutine-safe queue for passing data between asyncio tasks; unlike queue.Queue, it is not thread-safe and must be used from a single event loop.
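Exponential backoff in an async context can be sketched as follows. The retry count, delays, and jitter policy are illustrative defaults, not the agent's actual configuration:

```python
import asyncio
import random

async def with_backoff(op, retries=5, base=0.5, cap=30.0):
    """Retry an async operation with exponential backoff and jitter.

    The delay doubles each attempt (base, 2*base, 4*base, ...) up to
    `cap`, with random jitter added to avoid reconnection stampedes.
    """
    for attempt in range(retries):
        try:
            return await op()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts; surface the failure
            delay = min(cap, base * 2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))

# Example: an operation that fails twice before succeeding.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "connected"

result = asyncio.run(with_backoff(flaky, base=0.01))
print(result)  # connected
```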
Protocol Terms
Event Payload : Data structure containing the actual content of WebSocket messages.
Heartbeat : Periodic messages sent to maintain connection and detect timeouts.
Session ID : Unique identifier for each client connection to the Neko server.
Control Protocol : Set of message types and formats for remote desktop interaction.
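A heartbeat sender pairs naturally with graceful task cancellation. This is a minimal sketch; the event name and payload shape are assumptions, not the Neko control protocol's actual message format:

```python
import asyncio
import json
import time

async def heartbeat(send, interval=10.0):
    """Send periodic heartbeat events until the task is cancelled."""
    try:
        while True:
            await send(json.dumps({"event": "heartbeat", "ts": time.time()}))
            await asyncio.sleep(interval)
    except asyncio.CancelledError:
        pass  # swallow cancellation so shutdown is graceful

async def demo():
    sent = []

    async def send(msg):        # stub transport standing in for a WebSocket
        sent.append(msg)

    task = asyncio.create_task(heartbeat(send, interval=0.01))
    await asyncio.sleep(0.03)   # let a few heartbeats go out
    task.cancel()
    await task                  # completes normally; cancellation was handled
    return sent

messages = asyncio.run(demo())
print(len(messages))  # at least one heartbeat was sent
```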
Development Terms
mdBook : Documentation generator that creates books from Markdown files.
PEP 257 : Python Enhancement Proposal defining docstring conventions.
PEP 287 : Python Enhancement Proposal defining reStructuredText format for docstrings.
Type Hints : Python annotations that specify expected types for function parameters and return values.