Speak or Type — AIRI Answers Instantly
AIRI’s most magical moment is the first time it replies out loud. Whether you prefer voice or keyboard, the conversation loop is designed to feel quick, private, and expressive.
What You Can Do (Function-First)
Action | What Happens | Why It Feels Natural |
---|---|---|
Say “Hey AIRI…” | Local Voice Activity Detection (VAD) turns on the mic only while you speak. | Saves CPU & protects privacy—no always-on streaming. |
Ask a Question | Speech is instantly transcribed on-device; text goes to the AI brain. | You hear no “recording” lag. |
Hear the Answer | AI streams its reply; intelligent chunking feeds TTS so AIRI starts talking before the sentence is done. | Reduces dead air—conversation flows. |
Switch to Typing | Open the chat window; messages appear with Markdown formatting and code highlighting. | Great for quiet environments or sharing links/snippets. |
Emotional Cues | Hidden markers in the AI text trigger avatar expressions (e.g., happy, thinking). | Visual feedback makes the AI feel alive. |
Latency Goal: < 300 ms from end-of-speech to first audible syllable (local model default).
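The table above glosses over the timing mechanics. Below is a minimal TypeScript sketch, not AIRI's actual code, of how a VAD-gated turn and the 300 ms budget could be tracked: the `SpeechGate` class, its thresholds, and the frame timing are assumptions made for illustration.

```ts
// Illustrative sketch only (not AIRI's code): hold a "speaking" flag based on
// per-frame VAD probabilities, mark end-of-speech after a short run of silent
// frames, and measure the gap to the first audible TTS chunk.
interface TurnTimer {
  endOfSpeechAt?: number
  firstAudioAt?: number
}

class SpeechGate {
  private silentFrames = 0
  private speaking = false
  private readonly timer: TurnTimer = {}

  constructor(
    private readonly threshold = 0.5,     // VAD probability counted as speech (assumed)
    private readonly hangoverFrames = 15, // ~480 ms of silence at 32 ms frames (assumed)
  ) {}

  /** Feed one VAD probability per audio frame; returns true when a turn just ended. */
  push(probability: number): boolean {
    if (probability >= this.threshold) {
      this.speaking = true
      this.silentFrames = 0
      return false
    }
    if (!this.speaking)
      return false
    this.silentFrames += 1
    if (this.silentFrames >= this.hangoverFrames) {
      this.speaking = false
      this.silentFrames = 0
      this.timer.endOfSpeechAt = performance.now()
      return true // the caller forwards the buffered audio to STT here
    }
    return false
  }

  /** Call when the first TTS chunk starts playing; returns the latency in ms. */
  reportFirstAudio(): number | undefined {
    if (this.timer.endOfSpeechAt === undefined)
      return undefined
    this.timer.firstAudioAt = performance.now()
    return this.timer.firstAudioAt - this.timer.endOfSpeechAt // target: < 300 ms
  }
}
```

The hangover window is what keeps short mid-sentence pauses from cutting you off; shortening it trims latency at the cost of clipped endings.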
How It Works (Conceptual Mechanism)
- Voice Activity Detection (Silero-VAD) – A lightweight ONNX model scores every incoming audio frame and emits a speech/silence flag.
- Circular Buffer Recorder – When VAD signals end of speech, the buffered WAV is forwarded to Whisper (STT).
- Whisper STT (Local or Cloud) – Converts audio to text; language auto-detected.
- LLM Request – The `llm.ts` store builds a prompt using recent chat plus the Character Card, then streams the response.
- Marker Parser & Chunker – A streaming parser removes tags like `<|EMOTE_HAPPY|>` and splits the text at natural pauses (punctuation, ~120 chars); a sketch of this step follows the list.
- Text-to-Speech (TTS) – Each chunk is sent to your chosen TTS provider; audio streams back and plays immediately.
- Avatar Emotion – Removed markers dispatch events to the rendering store, switching blend-shapes or animations in sync with speech.
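The parse-and-chunk step is what lets AIRI start speaking before the reply is finished. A minimal sketch is below; it is not the real `llmmarkerParser.ts` or `tts.ts` code, and the class name, marker regex, and 120-character cutoff are assumptions based on the description above.

```ts
// Illustrative sketch only: strip <|EMOTE_*|> markers from a streamed LLM reply,
// raise emotion events for the avatar, and emit speakable chunks at punctuation
// or roughly 120 characters.
type EmotionHandler = (emotion: string) => void
type ChunkHandler = (text: string) => void

class StreamChunker {
  private pending = ''

  constructor(
    private readonly onEmotion: EmotionHandler, // e.g. dispatch to the avatar/rendering store
    private readonly onChunk: ChunkHandler,     // e.g. hand the chunk to the TTS provider
    private readonly maxChunk = 120,
  ) {}

  /** Feed each streamed delta as it arrives from the LLM. */
  write(delta: string): void {
    this.pending += delta

    // Pull out complete markers such as <|EMOTE_HAPPY|> and fire emotion events.
    this.pending = this.pending.replace(/<\|([A-Z_]+)\|>/g, (_match, name: string) => {
      this.onEmotion(name)
      return ''
    })

    // Flush at sentence-ending punctuation or once the buffer grows too long.
    for (let cut = this.findCut(); cut !== -1; cut = this.findCut()) {
      this.onChunk(this.pending.slice(0, cut + 1).trim())
      this.pending = this.pending.slice(cut + 1)
    }
  }

  /** Call when the stream ends to flush whatever is left. */
  end(): void {
    const rest = this.pending.trim()
    if (rest.length > 0)
      this.onChunk(rest)
    this.pending = ''
  }

  private findCut(): number {
    // Never cut through a partially received marker like "<|EMO".
    const markerStart = this.pending.indexOf('<|')
    const searchable = markerStart === -1 ? this.pending : this.pending.slice(0, markerStart)
    const punct = searchable.search(/[.!?。！？]/)
    if (punct !== -1)
      return punct
    return searchable.length >= this.maxChunk ? this.maxChunk - 1 : -1
  }
}
```

Cutting on punctuation first and only falling back to a hard character limit keeps the spoken chunks sounding like natural phrases rather than arbitrary slices.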
Sequence diagram (not shown): mic audio → VAD → Whisper STT → LLM stream → marker parser & chunker → TTS → avatar emotion.
Choosing Your Models
Layer | Local Default | Cloud Alternatives | Trade-Off |
---|---|---|---|
VAD | Silero-VAD (ONNX) | – | Negligible CPU, less than 1 MB model |
STT | Whisper tiny/int8 (ONNX) | OpenAI Whisper API | Local = private but uses CPU; cloud = faster on weak PCs |
LLM | Ollama (e.g., `llama3:8b`) | OpenAI GPT-4, Anthropic Claude 3 | Local = free & private; cloud = smarter but paid |
TTS | Piper VITS | ElevenLabs, Azure TTS | Local = rough voice; cloud = studio-quality |
You can mix and match these in Settings → Providers.
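For instance, a common split is local STT and TTS with a cloud LLM. The object below is purely hypothetical; the keys and provider names do not come from AIRI's real settings schema and are only meant to make the trade-off table concrete.

```ts
// Hypothetical illustration (not AIRI's actual settings schema): one sensible
// local/cloud mix, keeping audio on-device while borrowing a cloud brain.
interface ProviderSelection {
  vad: 'silero-onnx'
  stt: 'whisper-tiny-int8' | 'openai-whisper-api'
  llm: { provider: 'ollama' | 'openai' | 'anthropic', model: string }
  tts: 'piper-vits' | 'elevenlabs' | 'azure-tts'
}

const privateAudioCloudBrain: ProviderSelection = {
  vad: 'silero-onnx',
  stt: 'whisper-tiny-int8',                    // raw audio never leaves the device
  llm: { provider: 'openai', model: 'gpt-4' }, // only cleaned text goes to the cloud
  tts: 'piper-vits',                           // local voice, no audio round-trip
}
```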
Tips for Smooth Conversation
- Use a Push-to-Talk Hotkey (default `Cmd+Shift+Space`) if ambient noise keeps triggering VAD; a hotkey sketch follows this list.
- Enable Auto-Language detection for bilingual chats.
- Adjust Chunk Length in Settings → Voice to balance speed vs. natural pauses.
- Mute TTS quickly with `Esc` if you need silence.
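As a rough idea of how the push-to-talk hotkey could be wired up in a Tauri app, here is a sketch assuming the Tauri v1 `globalShortcut` JavaScript API (Tauri v2 moves this into a plugin). The `startCapture`/`stopCapture` hooks are hypothetical, and because global shortcuts only report key-down events, the sketch toggles capture instead of tracking key release.

```ts
// Sketch under stated assumptions, not AIRI's actual hotkey code.
import { register } from '@tauri-apps/api/globalShortcut'

let capturing = false

async function setupPushToTalk(
  startCapture: () => void, // hypothetical: begin buffering mic audio
  stopCapture: () => void,  // hypothetical: stop buffering and hand audio to STT
): Promise<void> {
  // "CommandOrControl+Shift+Space" resolves to Cmd+Shift+Space on macOS.
  await register('CommandOrControl+Shift+Space', () => {
    capturing = !capturing
    if (capturing)
      startCapture()
    else
      stopCapture()
  })
}
```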
Privacy Corner
- All raw audio stays on your device when using local models.
- Only cleaned text is sent to cloud LLM/TTS providers if selected.
- Toggle “Store Conversation History” off to keep chats in volatile memory only.
Related Technical Files
Functional Role | Code File | Description |
---|---|---|
VAD Front-End Composable | packages/stage-ui/src/composables/micvad.ts | Controls mic stream & speech detection |
VAD Rust Plugin | crates/tauri-plugin-ipc-audio-vad-ort/src/lib.rs | Runs Silero-VAD via ONNX |
Whisper Composable | packages/stage-ui/src/composables/whisper.ts | Sends audio to local STT plugin |
LLM Store | packages/stage-ui/src/stores/llm.ts | Streams responses from selected provider |
Marker Parser | packages/stage-ui/src/composables/llmmarkerParser.ts | Detects and strips emotion markers (e.g., EMOTE_HAPPY) in the streamed reply |
TTS Utility | packages/stage-ui/src/utils/tts.ts | Intelligent chunking & playback |
Chat Window UI | apps/stage-tamagotchi/src/components/ChatHistory.vue | Renders text chats with Markdown |