Speak or Type — AIRI Answers Instantly

AIRI’s most magical moment is the first time it replies out loud. Whether you prefer voice or keyboard, the conversation loop is designed to feel quick, private, and expressive.

What You Can Do (Function-First)

| Action | What Happens | Why It Feels Natural |
| --- | --- | --- |
| Say “Hey AIRI…” | Local Voice Activity Detection (VAD) turns on the mic only while you speak. | Saves CPU and protects privacy: no always-on streaming. |
| Ask a Question | Speech is instantly transcribed on-device; the text goes to the AI brain. | You hear no “recording” lag. |
| Hear the Answer | The AI streams its reply; intelligent chunking feeds TTS so AIRI starts talking before the sentence is done. | Reduces dead air, so the conversation flows. |
| Switch to Typing | Open the chat window; messages appear with Markdown formatting and code highlighting. | Great for quiet environments or sharing links/snippets. |
| Emotional Cues | Hidden markers in the AI text trigger avatar expressions (e.g., happy, thinking). | Visual feedback makes the AI feel alive. |

Latency Goal: < 300 ms from end-of-speech to first audible syllable (local model default).

How It Works (Conceptual Mechanism)

  1. Voice Activity Detection (Silero-VAD) – A lightweight ONNX model runs on every audio frame and emits a speech/silence flag.
  2. Circular Buffer Recorder – When VAD signals the end of speech, the buffered WAV is forwarded to Whisper (STT); this loop is sketched after the list.
  3. Whisper STT (Local or Cloud) – Converts audio to text; language auto-detected.
  4. LLM Request – The llm.ts store builds a prompt using recent chat plus the Character Card, then streams the response.
  5. Marker Parser & Chunker – A streaming parser removes tags like <|EMOTE_HAPPY|> and splits the text at natural pauses (punctuation, ~120 characters); a second sketch after the list shows this step.
  6. Text-to-Speech (TTS) – Each chunk is sent to your chosen TTS provider; audio streams back and plays immediately.
  7. Avatar Emotion – Removed markers dispatch events to the rendering store, switching blend-shapes or animations in sync with speech.
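
To make steps 1–3 concrete, here is a minimal sketch of the VAD-gated recording loop. The names `detectSpeech`, `transcribe`, and `sendToLLM` are hypothetical stand-ins; the real implementations live behind the Tauri audio plugins and the `micvad.ts` / `whisper.ts` composables.

```ts
type AudioFrame = Float32Array

// Hypothetical stand-ins for the real Silero-VAD and Whisper calls.
declare function detectSpeech(frame: AudioFrame): boolean
declare function transcribe(audio: Float32Array): Promise<string>
declare function sendToLLM(text: string): void

const frames: AudioFrame[] = []
let speaking = false

// Merge buffered frames into one contiguous sample array.
function concat(parts: AudioFrame[]): Float32Array {
  const out = new Float32Array(parts.reduce((n, p) => n + p.length, 0))
  let offset = 0
  for (const p of parts) {
    out.set(p, offset)
    offset += p.length
  }
  return out
}

// Called once per audio frame from the capture stream.
function onAudioFrame(frame: AudioFrame): void {
  if (detectSpeech(frame)) {
    speaking = true
    frames.push(frame) // record only while speech is detected
  }
  else if (speaking) {
    speaking = false // speech-to-silence edge marks the end of the utterance
    transcribe(concat(frames.splice(0))).then(sendToLLM)
  }
}
```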
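
And a minimal sketch of steps 5–6, the marker parser and chunker. `setEmotion` and `playTTS` are illustrative names rather than AIRI's actual APIs, and a production parser would also handle markers split across token boundaries:

```ts
// Illustrative stand-ins for the avatar and TTS entry points.
declare function setEmotion(name: string): void
declare function playTTS(chunk: string): void

const MARKER = /<\|EMOTE_(\w+)\|>/g
const MAX_CHUNK = 120 // flush long runs even without punctuation

let pending = ''

// Called for every token streamed back from the LLM.
function onToken(token: string): void {
  pending += token

  // Strip emotion markers and forward them to the avatar renderer.
  pending = pending.replace(MARKER, (_match: string, name: string) => {
    setEmotion(name.toLowerCase())
    return ''
  })

  // Flush a chunk at a natural pause, or once it grows too long.
  const pause = pending.search(/[.!?。！？]/)
  if (pause !== -1) {
    playTTS(pending.slice(0, pause + 1))
    pending = pending.slice(pause + 1).trimStart()
  }
  else if (pending.length > MAX_CHUNK) {
    playTTS(pending)
    pending = ''
  }
}
```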

Choosing Your Models

| Layer | Local Default | Cloud Alternatives | Trade-Off |
| --- | --- | --- | --- |
| VAD | Silero-VAD (ONNX) | (none) | Negligible CPU; model under 1 MB |
| STT | Whisper tiny/int8 (ONNX) | OpenAI Whisper API | Local = private but uses CPU; cloud = faster on weak PCs |
| LLM | Ollama (e.g., llama3:8b) | OpenAI GPT-4, Anthropic Claude 3 | Local = free and private; cloud = smarter but costs money |
| TTS | Piper VITS | ElevenLabs, Azure TTS | Local = rougher voice; cloud = studio quality |

You can mix and match providers per layer in Settings → Providers.
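
As an illustration only (the actual settings schema lives in the stage-ui stores and may differ), a mixed local/cloud setup might look like this:

```ts
// Hypothetical shape of a per-layer provider selection.
interface ProviderConfig {
  stt: { provider: 'whisper-local' | 'openai-whisper', model?: string }
  llm: { provider: 'ollama' | 'openai' | 'anthropic', model: string }
  tts: { provider: 'piper' | 'elevenlabs' | 'azure', voice?: string }
}

// A common privacy-first mix: everything local except a cloud voice.
export const config: ProviderConfig = {
  stt: { provider: 'whisper-local', model: 'tiny-int8' },
  llm: { provider: 'ollama', model: 'llama3:8b' },
  tts: { provider: 'elevenlabs', voice: 'studio-voice' },
}
```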

Tips for Smooth Conversation

  • Use a Push-to-Talk Hotkey (default Cmd+Shift+Space) if ambient noise keeps triggering VAD; see the sketch after these tips.
  • Enable Auto-Language detection for bilingual chats.
  • Adjust Chunk Length in Settings → Voice to balance speed vs. natural pauses.
  • Mute TTS quickly with Esc if you need silence.
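
For reference, rebinding a push-to-talk key in a Tauri app generally looks like the sketch below. It assumes Tauri v2's global-shortcut plugin; `startRecording` and `stopRecording` are placeholders for the mic composable's actual controls.

```ts
import { register } from '@tauri-apps/plugin-global-shortcut'

// Placeholders for the mic composable's actual start/stop controls.
declare function startRecording(): void
declare function stopRecording(): void

// Hold to talk: open the mic on key-down, close it on key-up.
// 'CmdOrCtrl' keeps the binding portable across macOS and Windows/Linux.
await register('CmdOrCtrl+Shift+Space', (event) => {
  if (event.state === 'Pressed')
    startRecording()
  else if (event.state === 'Released')
    stopRecording()
})
```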

Privacy Corner

  • All raw audio stays on your device when using local models.
  • If you select cloud LLM/TTS providers, only the cleaned text is sent to them.
  • Toggle “Store Conversation History” off to keep chats in volatile memory only.

Code Map

| Functional Role | Code File | Description |
| --- | --- | --- |
| VAD Front-End Composable | `packages/stage-ui/src/composables/micvad.ts` | Controls the mic stream and speech detection |
| VAD Rust Plugin | `crates/tauri-plugin-ipc-audio-vad-ort/src/lib.rs` | Runs Silero-VAD via ONNX |
| Whisper Composable | `packages/stage-ui/src/composables/whisper.ts` | Sends audio to the local STT plugin |
| LLM Store | `packages/stage-ui/src/stores/llm.ts` | Streams responses from the selected provider |
| Marker Parser | `packages/stage-ui/src/composables/llmmarkerParser.ts` | Detects emotion markers such as `EMOTE_HAPPY` in the streamed text |
| TTS Utility | `packages/stage-ui/src/utils/tts.ts` | Intelligent chunking and playback |
| Chat Window UI | `apps/stage-tamagotchi/src/components/ChatHistory.vue` | Renders text chats with Markdown |