Mod³ — Model Modality Modulator

Give your AI agent a voice.

Mod³ is a Python MCP server that provides text-to-speech for Claude Code, Cursor, and other MCP-compatible AI tools. It runs four TTS engines locally on Apple Silicon, generates speech faster than realtime, and returns immediately so the agent keeps working while audio plays.

What it does

Non-blocking speech -- speak() returns immediately with a job ID. Audio plays in the background. The agent writes code while it talks.
Queue-aware output -- Every speak() response includes queue position, estimated wait time, and active job state. The agent knows what's playing without making a separate status call.
Barge-in detection -- VAD (voice activity detection) monitors the microphone. If the user starts talking, playback stops and the agent is notified. No talking over people.
Turn-taking -- Bidirectional awareness of who's speaking. The agent can check user state before deciding to speak or wait.
Multi-model routing -- Four TTS engines behind one interface. Voice name determines which engine handles the request.
Voice profile registry -- Cloned voices are stored as named profiles under ~/.mod3/voices/ and addressable as first-class voice IDs alongside built-in engine presets.
Continuous open-mic -- Always-on VAD with auto-start barge-in and tunable endpointing; Whisper STT uses multi-strategy deduplication (Z-function, sentence-level, N-way) to eliminate phrase doubling.
Adaptive buffering -- EMA-based arrival rate tracking with dynamic startup threshold. Gapless playback under normal load, graceful degradation under GPU contention.
Structured metrics -- Every call returns TTFA, RTF, per-chunk timing, buffer health, underrun counts, and memory usage. The agent can diagnose its own audio quality.
Observability -- Per-phase wall-time instrumentation and W3C traceparent propagation through CogOSProvider; trace IDs flow from inbound request to every pipeline phase.
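
The deduplication bullet above names a Z-function strategy, but this README does not show how Mod³ applies it. As a minimal sketch, assume the doubling shows up as a transcript whose words are one phrase repeated back-to-back. The names `z_function` and `collapse_doubling` are hypothetical and not part of Mod³.

```python
def z_function(seq):
    """z[i] = length of the longest common prefix of seq and seq[i:]."""
    n = len(seq)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # [left, right) is the rightmost prefix match found so far
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the current match window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and seq[z[i]] == seq[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def collapse_doubling(text):
    """If the words of `text` are one phrase repeated k >= 2 times, return it once."""
    words = text.split()
    n = len(words)
    z = z_function(words)
    for period in range(1, n // 2 + 1):
        # The word list has this period iff the suffix at `period` matches the prefix.
        if n % period == 0 and z[period] == n - period:
            return " ".join(words[:period])
    return text


print(collapse_doubling("turn it off turn it off"))
```

This sketch compares words exactly, so differences in case or punctuation between the two copies defeat it, and it would also collapse a deliberate repetition such as "no no no". A real pipeline would likely normalize tokens and require a minimum phrase length.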

Engines

| Engine | Model | Size | TTFA | Control Surfaces |
|---|---|---|---|---|
| Kokoro | Kokoro-82M-bf16 | 82M | ~60ms | Speed, emphasis (ALL CAPS), pacing (punctuation) |
| Voxtral | Voxtral-4B-TTS-mlx-4bit | 4B | ~500ms | 20 voice presets, multi-language |
| Chatterbox | chatterbox-4bit | ~1B | ~60ms | Emotion/exaggeration (0-1), voice cloning |
| Spark | Spark-TTS-0.5B-bf16 | 0.5B | ~1s | Pitch (5-level), speed, gender |

Models are downloaded on first use via HuggingFace Hub.

Quick Start

git clone https://github.com/myrgic/mod3.git
cd mod3
./setup.sh

HTTP-MCP (recommended)

Start mod3 as a persistent daemon and connect via HTTP-MCP. This is the canonical transport going forward. The daemon stays alive between agent sessions so TTS engines stay warm and multiple clients can share one instance.

# Start the server (or configure as a launchd service)
python server.py --http

Then point your MCP client at the HTTP-MCP endpoint:

{
  "mcpServers": {
    "mod3": {
      "type": "http",
      "url": "http://localhost:7860/mcp"
    }
  }
}
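
MCP clients build the wire messages for you, but it can help to see what travels over the endpoint. MCP is based on JSON-RPC 2.0, and a tool invocation is a `tools/call` request sent after the client's initialize handshake. The sketch below only builds that payload. `build_tool_call` is a hypothetical helper, and the argument names are taken from the `speak` signature in this README.

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC request ids should be unique within a session


def build_tool_call(tool, **arguments):
    """Return a JSON-RPC 2.0 `tools/call` request for the named MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


request = build_tool_call("speak", text="Hello world", voice="am_michael", speed=1.4)
print(json.dumps(request, indent=2))
```

Sending the request, session headers, and response parsing are left to the MCP client library. This sketch does not contact the server.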

stdio MCP (deprecated)

Deprecated. stdio MCP is still functional but is being phased out. Each client session spawns a new mod3 process, which means TTS engines cold-start on every connection (~60s for Kokoro) and state is not shared across sessions. Prefer HTTP-MCP above. A DeprecationWarning is printed to stderr at boot when this path is active. Removal is tracked in issue #11.

For users who have not migrated yet, the stdio path remains available. Add to your project's .mcp.json:

{
  "mcpServers": {
    "mod3": {
      "command": "/path/to/mod3/.venv/bin/python",
      "args": ["/path/to/mod3/server.py"]
    }
  }
}

MCP Tools

`speak(text, voice?, stream?, speed?, emotion?)`

Synthesize text and play through speakers. Returns immediately with a job ID, queue state, and estimated wait time.

speak("Hello world")                                        → default voice (eng_uk_m_davids @ 1.25x)
speak("Hello world", voice="casual_male")                   → Voxtral
speak("Hello world", voice="chatterbox", emotion=0.8)       → Chatterbox with high emotion
speak("Hello world", voice="am_michael", speed=1.4)         → Kokoro fast
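
The examples above show the voice name selecting the engine. The registry in server.py is not reproduced in this README, so the following is only a sketch of name-based routing. `VOICE_TO_ENGINE` and `resolve_engine` are hypothetical, and the table contains only the three voice-to-engine pairs these examples state.

```python
# Hypothetical routing table; only pairs named in the examples are listed.
VOICE_TO_ENGINE = {
    "am_michael": "kokoro",
    "casual_male": "voxtral",
    "chatterbox": "chatterbox",
}


def resolve_engine(voice):
    """Map a voice name to the engine that should synthesize it."""
    try:
        return VOICE_TO_ENGINE[voice]
    except KeyError:
        # Fail loudly rather than silently falling back to another engine.
        raise ValueError(f"unknown voice: {voice!r}") from None


print(resolve_engine("casual_male"))
```

A lookup keyed on the voice name keeps callers unaware of engines, which is what lets cloned voice profiles sit alongside built-in presets as ordinary voice IDs.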

`output(text, mode?, stream?)`

Unified output tool. mode selects the channel: "audio" (TTS only), "text" (dashboard chat bubble only), or "both" (simultaneous TTS + chat bubble). Defaults to "audio". Replaces separate speak-and-notify patterns with a single call.

`speech_status(job_id?, verbose?)`

Check if speech is still playing, or get metrics from the last completed job. Pass verbose=True for per-chunk detail.
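
The metrics mentioned here include TTFA (time to first audio) and RTF (real-time factor). Mod³'s exact schema is not shown in this README, so the sketch below uses the conventional definitions with hypothetical field names: TTFA is the delay until the first chunk is ready, and RTF is synthesis time divided by audio duration, so a value below 1 means faster than realtime.

```python
from dataclasses import dataclass


@dataclass
class JobTiming:
    """Hypothetical timing record; field names are illustrative, not Mod³'s schema."""
    requested_at: float    # seconds, when the request was made
    first_audio_at: float  # seconds, when the first audio chunk was ready
    finished_at: float     # seconds, when synthesis completed
    audio_seconds: float   # duration of the generated audio


def ttfa_ms(t: JobTiming) -> float:
    """Time to first audio, in milliseconds."""
    return (t.first_audio_at - t.requested_at) * 1000.0


def rtf(t: JobTiming) -> float:
    """Real-time factor: synthesis time per second of audio produced."""
    return (t.finished_at - t.requested_at) / t.audio_seconds


timing = JobTiming(requested_at=10.0, first_audio_at=10.06, finished_at=11.0, audio_seconds=4.0)
print(f"TTFA {ttfa_ms(timing):.0f} ms, RTF {rtf(timing):.2f}")
```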

`stop()`

Interrupt current speech immediately.

`vad_check()`

Check microphone for voice activity. Returns whether the user is currently speaking, enabling the agent to wait for a natural pause before responding.
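
One way an agent might use this is to poll until the user has been quiet for a short interval before speaking. The sketch below assumes a callable that returns True while the user is speaking. The return shape of vad_check() is not specified in this README, so `is_user_speaking` and `wait_for_pause` are hypothetical, as are the default timings.

```python
import time


def wait_for_pause(is_user_speaking, quiet_for=0.5, poll_every=0.1, timeout=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Return True once the user has been silent for `quiet_for` seconds, False on timeout."""
    start = clock()
    quiet_since = None  # time at which the current run of silence began
    while clock() - start < timeout:
        now = clock()
        if is_user_speaking():
            quiet_since = None          # speech resets the silence window
        elif quiet_since is None:
            quiet_since = now           # first silent poll after speech
        elif now - quiet_since >= quiet_for:
            return True                 # silence has lasted long enough
        sleep(poll_every)
    return False
```

The `clock` and `sleep` parameters exist only so the loop can be tested without real waiting. In an agent, `is_user_speaking` would wrap a call to the vad_check tool.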

`list_voices()`

List all available voices grouped by engine, with control surface tags. Includes cloned voices from the voice profile registry (~/.mod3/voices/).

`set_output_device(device?)`

List audio output devices, or switch the active one mid-session.

`diagnostics()`

Show loaded engines, active jobs, output device, and last generation metrics.

Architecture

Key modules:

server.py -- MCP tool definitions, multi-model registry, sentence chunking, non-blocking job management, queue-aware returns
http_api.py -- FastAPI HTTP server; mounts the HTTP-MCP transport at /mcp, the ACP WebSocket endpoint at /ws/acp, and per-session audio at /ws/audio/{session_id}; implements ACP session/list, session/load, session/resume, and authenticate
channels.py -- ChannelMode enum (passthrough / transcribe / agent) and composable directed-acyclic stage graph; pipeline stages are wired at startup from registered @register_stage classes
inbound.py -- Intentional pipeline stages (VAD, STT, intent classification) as @register_stage-decorated classes; consumed by the channel stage graph
bus.py -- Session-aware event bus; sessions are first-class, per-session routing replaces broadcast fan-out (ADR-082)
bus_bridge.py -- SSE bridge that forwards CogOS kernel events (identity projection, voice config) to connected dashboard and channel clients
seats.py -- Seat registration and identity claim management; register_session emits presence.started with iss/sub pairs for both user and agent identities
identity_projection_handler.py -- Handles incoming CogOS identity-projection events; updates active seat voice config from IdentityVoiceProfile
adaptive_player.py -- Callback-based audio playback with EMA arrival rate tracking, adaptive startup threshold, and structured metrics collection
voice_profiles.py / voice_profile_io.py / voice_profile_schema.py -- Voice profile registry and schema; cloned voices stored under ~/.mod3/voices/ addressable as first-class voice IDs; IdentityVoiceProfile schema mirrors the CogOS identity CRD for voice config received via projection events
chat_flow_log.py -- Structured turn lifecycle log with per-phase wall-time instrumentation and W3C traceparent propagation
dashboard/ -- Three-column browser dashboard: sessions sidebar, main chat panel, and Settings / Traces / Debug side panel with hierarchical span tree

The adaptive player is model-agnostic. Any TTS engine that produces audio chunks feeds the same pipeline.
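
To illustrate EMA arrival rate tracking with an adaptive startup threshold, here is a minimal sketch. It is not the code in adaptive_player.py. The class name, the constants, and the threshold formula are all assumptions chosen for illustration: the tracker smooths the gap between chunk arrivals and asks for a deeper startup buffer when chunks arrive more slowly than they play back.

```python
import math


class ArrivalTracker:
    """Hypothetical EMA tracker; names and constants are illustrative."""

    def __init__(self, alpha=0.25, base_chunks=2, max_chunks=8):
        self.alpha = alpha              # weight given to the newest interval
        self.base_chunks = base_chunks  # startup buffer when generation keeps up
        self.max_chunks = max_chunks    # cap so startup latency stays bounded
        self.ema_interval = None        # smoothed seconds between chunk arrivals
        self._last_arrival = None

    def on_chunk(self, arrival_time):
        """Record a chunk arrival (seconds) and update the smoothed interval."""
        if self._last_arrival is not None:
            interval = arrival_time - self._last_arrival
            if self.ema_interval is None:
                self.ema_interval = interval
            else:
                self.ema_interval = (self.alpha * interval
                                     + (1 - self.alpha) * self.ema_interval)
        self._last_arrival = arrival_time

    def startup_threshold(self, chunk_seconds):
        """Chunks to buffer before playback starts."""
        if self.ema_interval is None:
            return self.base_chunks
        # ratio > 1 means chunks arrive more slowly than they play back.
        ratio = self.ema_interval / chunk_seconds
        wanted = math.ceil(self.base_chunks * max(1.0, ratio))
        return min(self.max_chunks, wanted)


tracker = ArrivalTracker()
for arrival in (0.0, 1.0, 2.0):  # one chunk per second
    tracker.on_chunk(arrival)
print(tracker.startup_threshold(chunk_seconds=0.5))
```

Under normal load the threshold stays at its base value, and under contention it grows up to the cap, trading startup latency for fewer underruns.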

Requirements

macOS with Apple Silicon (M1/M2/M3/M4)
Python 3.12+
espeak-ng (brew install espeak-ng) -- required for Kokoro's phonemizer

Using Voice as a Modality

See skills/voice/SKILL.md for the full guide on dual-modal communication -- when to speak vs write, non-blocking patterns, reading metrics, and anti-patterns.

Voice carries the ephemeral (context, intent, tone). Text carries the persistent (code, data, decisions). Both channels are active simultaneously.

Ecosystem

Mod³ is the voice layer in the CogOS ecosystem. It integrates as a modality channel -- the kernel routes intents to Mod³ when voice output is appropriate. It also works standalone, without CogOS.

| Repo | Purpose |
|---|---|
| cogos | The daemon |
| mod3 | Voice -- this repo |
| constellation | Distributed identity and trust |
| plugins | Agent skill library |
| charts | Helm charts for deployment |

License

MIT
