TTS Service (yap.py)
The YAP (Yet Another Presenter) service provides real-time text-to-speech functionality through WebRTC audio streaming. It connects to Neko servers and responds to chat commands for immediate or streaming speech synthesis.
Overview
YAP transforms text messages into high-quality speech using F5-TTS and broadcasts the audio through WebRTC. It supports both immediate synthesis and streaming mode for real-time conversation scenarios.
Key Features
- Real-time TTS: Low-latency speech synthesis using F5-TTS
- WebRTC Integration: Direct audio streaming to Neko browser sessions
- Voice Management: Hot-reloadable voice configurations with custom parameters
- Streaming Mode: Incremental text processing for live conversations
- Chat Commands: Interactive control through Neko chat interface
- Parallel Processing: Multi-threaded TTS workers for improved performance
- Audio Pipeline: Crossfade splicing and jitter buffering for smooth playback
Architecture
graph TD
    A[Chat Commands] --> B[Command Parser]
    B --> C[Stream Assembler]
    C --> D[Text Segmentation]
    D --> E[TTS Pipeline]
    E --> F[Parallel F5-TTS Workers]
    F --> G[Audio Splicer]
    G --> H[PCM Queue]
    H --> I[WebRTC Audio Track]
    I --> J[Neko Browser]
    K[Voice Manager] --> E
    L[WebSocket Signaler] --> A
    L --> M[WebRTC Peer Connection]
    M --> I
Installation
Dependencies
# Core dependencies
pip install aiortc av websockets numpy
# F5-TTS (required for speech synthesis)
pip install git+https://github.com/SWivid/F5-TTS.git
# Optional resampling backends (first available will be used)
pip install torchaudio # Preferred
# OR
pip install scipy # Alternative
# Linear fallback is built-in
System Requirements
- Python 3.8+
- CUDA-capable GPU (recommended for F5-TTS)
- WebRTC-compatible environment
Configuration
YAP follows 12-factor app principles with environment-based configuration:
Connection Settings
Variable | Default | Description |
---|---|---|
YAP_WS / NEKO_WS | - | Direct WebSocket URL |
NEKO_URL | - | REST API base URL |
NEKO_USER | - | Username for REST authentication |
NEKO_PASS | - | Password for REST authentication |
Audio Settings
Variable | Default | Description |
---|---|---|
YAP_SR | 48000 | Output sample rate (Hz) |
YAP_AUDIO_CHANNELS | 1 | Audio channels (1=mono, 2=stereo) |
YAP_FRAME_MS | 20 | WebRTC frame size (10/20/30/40/60ms) |
YAP_JITTER_MAX_SEC | 6.0 | PCM buffer maximum duration |
TTS Pipeline Settings
Variable | Default | Description |
---|---|---|
YAP_PARALLEL | 2 | Parallel TTS worker threads |
YAP_CHUNK_TARGET_SEC | 3.0 | Target chunk duration hint |
YAP_MAX_CHARS | 350 | Maximum characters per chunk |
YAP_OVERLAP_MS | 30 | Audio crossfade overlap |
Voice Settings
Variable | Default | Description |
---|---|---|
YAP_VOICES_DIR | ./voices | Voice configuration directory |
YAP_SPK_DEFAULT | default | Default speaker ID |
Logging & ICE Settings
Variable | Default | Description |
---|---|---|
YAP_LOGLEVEL | INFO | Log level (DEBUG/INFO/WARNING/ERROR) |
YAP_LOG_FORMAT | text | Log format (text/json) |
YAP_STUN_URL | stun:stun.l.google.com:19302 | STUN server |
YAP_ICE_POLICY | strict | ICE server policy (strict/all) |
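For orientation, here is a minimal sketch of how these variables might be read at startup. The Settings class and its field names are illustrative, not the exact structure in src/yap.py:
import os
from dataclasses import dataclass

@dataclass
class Settings:
    ws_url: str
    sample_rate: int
    channels: int
    frame_ms: int
    parallel: int
    max_chars: int

    @classmethod
    def from_env(cls) -> "Settings":
        # Each field falls back to the documented default.
        return cls(
            ws_url=os.environ.get("YAP_WS") or os.environ.get("NEKO_WS", ""),
            sample_rate=int(os.environ.get("YAP_SR", "48000")),
            channels=int(os.environ.get("YAP_AUDIO_CHANNELS", "1")),
            frame_ms=int(os.environ.get("YAP_FRAME_MS", "20")),
            parallel=int(os.environ.get("YAP_PARALLEL", "2")),
            max_chars=int(os.environ.get("YAP_MAX_CHARS", "350")),
        )

settings = Settings.from_env()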
Usage
Basic Usage
# Direct WebSocket connection
export YAP_WS="wss://demo.neko.com/api/ws?token=your_token"
python src/yap.py
# REST API authentication
export NEKO_URL="https://demo.neko.com"
export NEKO_USER="username"
export NEKO_PASS="password"
python src/yap.py
Command Line Options
python src/yap.py --help
# Common options
python src/yap.py \
--ws "wss://host/api/ws?token=..." \
--sr 48000 \
--channels 1 \
--parallel 4 \
--loglevel DEBUG
Health Check
# Validate configuration
python src/yap.py --healthcheck
Chat Commands
YAP responds to commands in Neko chat:
Immediate Speech
/yap Hello, this is immediate speech synthesis!
Streaming Mode
/yap:begin
This text will be processed incrementally...
More text gets added and synthesized in chunks...
/yap:end
Voice Control
# Switch active voice and parameters
/yap:voice set --spk alice --rate 1.2 --pitch 0.5
# Add new voice
/yap:voice add --spk bob --ref /path/to/reference.wav --ref-text "Reference text" --styles "calm,friendly"
# Reload voice configurations
/yap:voice reload
Queue Management
# Stop current synthesis and clear queue
/yap:stop
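Internally, these commands could be dispatched with a small parser along these lines. This is a sketch: the (command, payload) tuple and the "say" label are hypothetical, not YAP's actual internals:
import re
from typing import Optional, Tuple

YAP_CMD = re.compile(r"^/yap(?::(\w+))?\s*(.*)$", re.DOTALL)

def handle_chat(text: str) -> Optional[Tuple[str, str]]:
    """Return (command, payload) for /yap messages, else None."""
    m = YAP_CMD.match(text.strip())
    if not m:
        return None
    sub, payload = m.group(1), m.group(2)
    if sub is None:
        return ("say", payload)  # /yap <text> -> immediate synthesis
    return (sub, payload)        # begin / end / voice / stop

# handle_chat("/yap Hello")                  -> ("say", "Hello")
# handle_chat("/yap:voice set --spk alice")  -> ("voice", "set --spk alice")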
Voice Management
Voice Configuration (voices.json)
{
"default": {
"ref_audio": "./voices/default.wav",
"ref_text": "This is a default reference recording.",
"styles": ["calm", "neutral"],
"params": {
"rate": 1.0,
"pitch": 0.0
}
},
"alice": {
"ref_audio": "./voices/alice_sample.wav",
"ref_text": "Hello, I'm Alice and this is my voice.",
"styles": ["friendly", "energetic"],
"params": {
"rate": 1.1,
"pitch": 0.2
}
}
}
Voice Parameters
- rate: Speech speed multiplier (0.5-2.0)
- pitch: Pitch shift in semitones (-12 to +12)
- styles: Descriptive tags for voice characteristics
- ref_audio: Path to reference audio file (WAV format)
- ref_text: Transcript of reference audio
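A configuration like the one above could be loaded and sanity-checked as follows. This is a sketch; clamping values to the documented ranges is an assumption about how out-of-range parameters are handled:
import json
from pathlib import Path

def load_voices(voices_dir: str) -> dict:
    """Load voices.json and clamp params to the documented ranges."""
    voices = json.loads((Path(voices_dir) / "voices.json").read_text(encoding="utf-8"))
    for cfg in voices.values():
        params = cfg.setdefault("params", {})
        params["rate"] = min(2.0, max(0.5, float(params.get("rate", 1.0))))      # 0.5-2.0
        params["pitch"] = min(12.0, max(-12.0, float(params.get("pitch", 0.0)))) # -12..+12
    return voices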
Hot Reloading
Voice configurations can be updated without restarting:
- Edit voices.json in the voices directory
- Send the /yap:voice reload command in chat
- Changes take effect immediately
Technical Implementation
Audio Pipeline
- Text Segmentation: Smart punctuation-aware chunking
- Parallel Synthesis: Multi-threaded F5-TTS workers
- Audio Splicer: Crossfade blending between chunks (sketched after this list)
- PCM Queue: Jitter-buffered audio streaming
- WebRTC Track: Real-time audio transmission
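The crossfade in the Audio Splicer step can be pictured as a linear blend over the YAP_OVERLAP_MS window. A minimal sketch, not the exact splicer in src/yap.py:
import numpy as np

def splice(a: np.ndarray, b: np.ndarray, sr: int, overlap_ms: int = 30) -> np.ndarray:
    """Join two float32 mono chunks with a linear crossfade."""
    n = min(int(sr * overlap_ms / 1000), len(a), len(b))
    if n == 0:
        return np.concatenate([a, b])
    fade = np.linspace(0.0, 1.0, n, dtype=np.float32)
    blended = a[-n:] * (1.0 - fade) + b[:n] * fade
    return np.concatenate([a[:-n], blended, b[n:]])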
Text Processing
# Punctuation-aware segmentation
chunks = segment_text("Hello world! How are you today?", max_chars=50)
# Result: ["Hello world!", "How are you today?"]
# Streaming assembly with opportunistic emission
assembler = StreamAssembler(max_chars=100)
ready_chunks = assembler.feed("Partial text...")
final_chunks = assembler.flush()
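A simplified implementation of segment_text consistent with the example above might look like this; the real function may also merge short sentences toward the YAP_CHUNK_TARGET_SEC hint:
import re

def segment_text(text: str, max_chars: int = 350) -> list:
    """Greedy punctuation-aware chunking (simplified sketch)."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]
    chunks = []
    for s in sentences:
        while len(s) > max_chars:
            # Hard-wrap overlong sentences at the last space before the limit.
            cut = s.rfind(" ", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            chunks.append(s[:cut].strip())
            s = s[cut:].strip()
        if s:
            chunks.append(s)
    return chunks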
Audio Formats
- Input: F5-TTS generated audio (variable sample rate)
- Processing: Float32 waveforms with resampling
- Output: 16-bit PCM at the configured sample rate (conversion sketched below)
- WebRTC: Opus-encoded frames for transmission
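The float32-to-PCM conversion mentioned above can be as simple as a clip and scale (a sketch):
import numpy as np

def float_to_pcm16(wave: np.ndarray) -> np.ndarray:
    """Convert a float32 waveform in [-1, 1] to 16-bit PCM."""
    return (np.clip(wave, -1.0, 1.0) * 32767.0).astype(np.int16)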
Resampling Backends
YAP automatically selects the best available resampling backend:
- torchaudio (preferred): High-quality GPU acceleration
- scipy: CPU-based signal processing
- linear (fallback): Simple interpolation
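The fallback chain above amounts to a try/except ladder. A sketch; the actual _resample in src/yap.py may differ in detail:
def _resample(wave, sr_from: int, sr_to: int):
    """Resample with the best available backend."""
    if sr_from == sr_to:
        return wave
    try:
        import torch
        import torchaudio.functional as F  # preferred backend
        t = torch.from_numpy(wave).unsqueeze(0)
        return F.resample(t, sr_from, sr_to).squeeze(0).numpy()
    except ImportError:
        pass
    try:
        from math import gcd
        from scipy.signal import resample_poly  # CPU backend
        g = gcd(sr_from, sr_to)
        return resample_poly(wave, sr_to // g, sr_from // g)
    except ImportError:
        pass
    import numpy as np  # built-in linear-interpolation fallback
    n_out = int(round(len(wave) * sr_to / sr_from))
    x_out = np.linspace(0.0, len(wave) - 1.0, n_out)
    return np.interp(x_out, np.arange(len(wave)), wave).astype(wave.dtype)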
Performance Tuning
Parallel Workers
# Increase for better throughput (GPU memory permitting)
export YAP_PARALLEL=4
Chunk Sizing
# Smaller chunks = lower latency, higher overhead
export YAP_MAX_CHARS=200
# Larger chunks = higher latency, better efficiency
export YAP_MAX_CHARS=500
Audio Buffer
# Reduce for lower latency (risk of underruns)
export YAP_JITTER_MAX_SEC=3.0
# Increase for stability (higher latency)
export YAP_JITTER_MAX_SEC=10.0
Troubleshooting
Common Issues
No audio output
- Check WebRTC connection status in browser developer tools
- Verify audio permissions in browser
- Confirm STUN/TURN server connectivity
High latency
- Reduce YAP_MAX_CHARS for smaller chunks
- Decrease the YAP_JITTER_MAX_SEC buffer size
- Increase YAP_PARALLEL workers (if GPU permits)
Audio artifacts
- Increase YAP_OVERLAP_MS for smoother crossfades
- Check F5-TTS reference audio quality
- Verify sample rate consistency
Connection failures
- Validate WebSocket URL and authentication
- Check firewall settings for WebRTC ports
- Test with simpler STUN configuration
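To test with a simpler STUN configuration, a short aiortc probe can confirm that server-reflexive candidates are gathered. This sketch is independent of YAP itself:
import asyncio
from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection

async def stun_check() -> None:
    pc = RTCPeerConnection(RTCConfiguration(
        iceServers=[RTCIceServer(urls=["stun:stun.l.google.com:19302"])]))
    pc.createDataChannel("probe")  # give ICE something to negotiate
    await pc.setLocalDescription(await pc.createOffer())
    # aiortc finishes candidate gathering before setLocalDescription resolves;
    # look for "typ srflx" lines to confirm the STUN server is reachable.
    print(pc.localDescription.sdp)
    await pc.close()

asyncio.run(stun_check())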
Debug Logging
export YAP_LOGLEVEL=DEBUG
export YAP_LOG_FORMAT=json
python src/yap.py 2>&1 | jq .
Health Validation
# Check configuration and dependencies
python src/yap.py --healthcheck
API Reference
Core Classes
YapApp
Main application coordinator handling WebSocket signaling, WebRTC setup, and command processing.
app = YapApp(settings, logger)
await app.run() # Main event loop
TTSPipeline
Manages parallel TTS synthesis with crossfade splicing.
pipeline = TTSPipeline(voices, tts, pcmq, sr_out, ch_out, overlap_ms, parallel, logger, max_chars)
await pipeline.speak_text("Hello world", speaker="alice")
VoiceManager
Hot-reloadable voice configuration registry.
voices = VoiceManager(voices_dir, default_speaker, logger)
voice = voices.get("alice") # Get voice config
voices.reload() # Hot reload from JSON
PCMQueue
Thread-safe jitter buffer for WebRTC audio streaming.
pcmq = PCMQueue(sr=48000, channels=1, max_sec=6.0, logger=logger)
pcmq.push(audio_samples) # Producer
samples = pcmq.pull(frame_size) # Consumer
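A simplified sketch of the push/pull contract (the real class also takes a logger and has more bookkeeping): drop the oldest audio once the max_sec cap is exceeded, and pad with silence on underrun so the WebRTC track keeps a steady frame cadence.
import threading
import numpy as np

class PCMQueue:
    """Minimal jitter-buffer sketch: a bounded FIFO of int16 samples."""
    def __init__(self, sr: int, channels: int, max_sec: float):
        self.max_samples = int(sr * max_sec) * channels
        self.buf = np.zeros(0, dtype=np.int16)
        self.lock = threading.Lock()

    def push(self, samples: np.ndarray) -> None:
        with self.lock:
            self.buf = np.concatenate([self.buf, samples])
            if len(self.buf) > self.max_samples:
                self.buf = self.buf[-self.max_samples:]  # drop oldest audio

    def pull(self, n: int) -> np.ndarray:
        with self.lock:
            out, self.buf = self.buf[:n], self.buf[n:]
        if len(out) < n:  # underrun: pad with silence
            out = np.concatenate([out, np.zeros(n - len(out), dtype=np.int16)])
        return out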
Key Functions
segment_text(text, max_chars)
Intelligent text segmentation respecting punctuation boundaries.
apply_rate_pitch(wave, sr, rate, pitch)
Apply naive prosody transformations (rate/pitch adjustments); a sketch follows below.
_resample(wave, sr_from, sr_to)
Multi-backend audio resampling with automatic fallback.
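One naive reading of apply_rate_pitch implements both controls as a playback-speed change, so a pitch shift also shortens or lengthens the audio. This is an assumption about what "naive" means here, not the confirmed implementation:
import numpy as np

def apply_rate_pitch(wave: np.ndarray, sr: int, rate: float, pitch: float) -> np.ndarray:
    """Naive prosody: resample so pitch (semitones) and rate change together."""
    speed = rate * (2.0 ** (pitch / 12.0))
    n_out = max(1, int(round(len(wave) / speed)))
    x_out = np.linspace(0.0, len(wave) - 1.0, n_out)
    return np.interp(x_out, np.arange(len(wave)), wave).astype(wave.dtype)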
Integration Examples
Docker Deployment
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
ffmpeg \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application
COPY src/ /app/src/
COPY voices/ /app/voices/
WORKDIR /app
# Configuration via environment
ENV YAP_SR=48000
ENV YAP_PARALLEL=2
ENV YAP_VOICES_DIR=/app/voices
ENTRYPOINT ["python", "src/yap.py"]
Docker Compose
version: '3.8'
services:
yap:
build: .
environment:
- NEKO_URL=http://neko:8080
- NEKO_USER=admin
- NEKO_PASS=password
- YAP_PARALLEL=4
- YAP_LOGLEVEL=INFO
volumes:
- ./voices:/app/voices
depends_on:
- neko
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: yap-tts
spec:
replicas: 1
selector:
matchLabels:
app: yap-tts
template:
metadata:
labels:
app: yap-tts
spec:
containers:
- name: yap
image: neko-agent/yap:latest
env:
- name: NEKO_URL
valueFrom:
secretKeyRef:
name: neko-config
key: url
- name: NEKO_USER
valueFrom:
secretKeyRef:
name: neko-config
key: username
- name: NEKO_PASS
valueFrom:
secretKeyRef:
name: neko-config
key: password
- name: YAP_PARALLEL
value: "4"
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
volumeMounts:
- name: voices
mountPath: /app/voices
volumes:
- name: voices
persistentVolumeClaim:
claimName: yap-voices
Future Enhancements
- Voice Cloning: Real-time voice learning from user samples
- Emotion Control: Dynamic emotional expression parameters
- SSML Support: Advanced speech markup for prosody control
- Multi-language: Automatic language detection and synthesis
- Audio Effects: Real-time audio processing (reverb, filters)
- Metrics: Prometheus/OpenTelemetry integration for monitoring
Related Components
- Core Agent: Main automation engine that can trigger TTS
- Capture Service: Training data collection for voice models
- Neko Integration: Browser environment and WebRTC setup
Source Reference
File: src/yap.py:1
Lines of Code: ~1550
Dependencies: aiortc, av, websockets, numpy, F5-TTS