Speak or Type — AIRI Answers Instantly
AIRI’s most magical moment is the first time it replies out loud. Whether you prefer voice or keyboard, the conversation loop is designed to feel quick, private, and expressive.
What You Can Do (Function-First)
Action | What Happens | Why It Feels Natural |
---|---|---|
Say “Hey AIRI…” | Local Voice Activity Detection (VAD) turns on the mic only while you speak. | Saves CPU & protects privacy—no always-on streaming. |
Ask a Question | Speech is instantly transcribed on-device; text goes to the AI brain. | You hear no “recording” lag. |
Hear the Answer | AI streams its reply; intelligent chunking feeds TTS so AIRI starts talking before the sentence is done. | Reduces dead air—conversation flows. |
Switch to Typing | Open the chat window; messages appear with Markdown formatting and code highlighting. | Great for quiet environments or sharing links/snippets. |
Emotional Cues | Hidden markers in the AI text trigger avatar expressions (e.g., happy, thinking). | Visual feedback makes the AI feel alive. |
Latency Goal: < 300 ms from end-of-speech to first audible syllable (local model default).
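The table above glosses over the timing mechanics. Below is a minimal TypeScript sketch, not AIRI's actual code, of how a VAD-gated turn and the 300 ms budget could be tracked: the `SpeechGate` class, its thresholds, and the frame timing are assumptions made for illustration.

```ts
// Illustrative sketch only (not AIRI's code): hold a "speaking" flag based on
// per-frame VAD probabilities, mark end-of-speech after a short run of silent
// frames, and measure the gap to the first audible TTS chunk.
interface TurnTimer {
  endOfSpeechAt?: number
  firstAudioAt?: number
}

class SpeechGate {
  private silentFrames = 0
  private speaking = false
  private readonly timer: TurnTimer = {}

  constructor(
    private readonly threshold = 0.5,     // VAD probability counted as speech (assumed)
    private readonly hangoverFrames = 15, // ~480 ms of silence at 32 ms frames (assumed)
  ) {}

  /** Feed one VAD probability per audio frame; returns true when a turn just ended. */
  push(probability: number): boolean {
    if (probability >= this.threshold) {
      this.speaking = true
      this.silentFrames = 0
      return false
    }
    if (!this.speaking)
      return false
    this.silentFrames += 1
    if (this.silentFrames >= this.hangoverFrames) {
      this.speaking = false
      this.silentFrames = 0
      this.timer.endOfSpeechAt = performance.now()
      return true // the caller forwards the buffered audio to STT here
    }
    return false
  }

  /** Call when the first TTS chunk starts playing; returns the latency in ms. */
  reportFirstAudio(): number | undefined {
    if (this.timer.endOfSpeechAt === undefined)
      return undefined
    this.timer.firstAudioAt = performance.now()
    return this.timer.firstAudioAt - this.timer.endOfSpeechAt // target: < 300 ms
  }
}
```

The hangover window is what keeps short mid-sentence pauses from cutting you off; shortening it trims latency at the cost of clipped endings.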
How It Works (Conceptual Mechanism)
- Voice Activity Detection (Silero-VAD) – A lightweight ONNX model scores every incoming audio frame and emits a speech/silence flag.
- Circular Buffer Recorder – When VAD signals end of speech, the buffered WAV is forwarded to Whisper (STT).
- Whisper STT (Local or Cloud) – Converts audio to text; language auto-detected.
- LLM Request – The `llm.ts` store builds a prompt using recent chat plus the Character Card, then streams the response.
- Marker Parser & Chunker – A streaming parser removes tags like `<|EMOTE_HAPPY|>` and splits the text at natural pauses (punctuation, ~120 chars); a sketch of this step follows the list.
- Text-to-Speech (TTS) – Each chunk is sent to your chosen TTS provider; audio streams back and plays immediately.
- Avatar Emotion – Removed markers dispatch events to the rendering store, switching blend-shapes or animations in sync with speech.
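The parse-and-chunk step is what lets AIRI start speaking before the reply is finished. A minimal sketch is below; it is not the real `llmmarkerParser.ts` or `tts.ts` code, and the class name, marker regex, and 120-character cutoff are assumptions based on the description above.

```ts
// Illustrative sketch only: strip <|EMOTE_*|> markers from a streamed LLM reply,
// raise emotion events for the avatar, and emit speakable chunks at punctuation
// or roughly 120 characters.
type EmotionHandler = (emotion: string) => void
type ChunkHandler = (text: string) => void

class StreamChunker {
  private pending = ''

  constructor(
    private readonly onEmotion: EmotionHandler, // e.g. dispatch to the avatar/rendering store
    private readonly onChunk: ChunkHandler,     // e.g. hand the chunk to the TTS provider
    private readonly maxChunk = 120,
  ) {}

  /** Feed each streamed delta as it arrives from the LLM. */
  write(delta: string): void {
    this.pending += delta

    // Pull out complete markers such as <|EMOTE_HAPPY|> and fire emotion events.
    this.pending = this.pending.replace(/<\|([A-Z_]+)\|>/g, (_match, name: string) => {
      this.onEmotion(name)
      return ''
    })

    // Flush at sentence-ending punctuation or once the buffer grows too long.
    for (let cut = this.findCut(); cut !== -1; cut = this.findCut()) {
      this.onChunk(this.pending.slice(0, cut + 1).trim())
      this.pending = this.pending.slice(cut + 1)
    }
  }

  /** Call when the stream ends to flush whatever is left. */
  end(): void {
    const rest = this.pending.trim()
    if (rest.length > 0)
      this.onChunk(rest)
    this.pending = ''
  }

  private findCut(): number {
    // Never cut through a partially received marker like "<|EMO".
    const markerStart = this.pending.indexOf('<|')
    const searchable = markerStart === -1 ? this.pending : this.pending.slice(0, markerStart)
    const punct = searchable.search(/[.!?。！？]/)
    if (punct !== -1)
      return punct
    return searchable.length >= this.maxChunk ? this.maxChunk - 1 : -1
  }
}
```

Cutting on punctuation first and only falling back to a hard character limit keeps the spoken chunks sounding like natural phrases rather than arbitrary slices.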
Sequence diagram (not shown): mic audio → VAD → Whisper STT → LLM stream → marker parser & chunker → TTS → avatar emotion.
Choosing Your Models
Layer | Local Default | Cloud Alternatives | Trade-Off |
---|---|---|---|
VAD | Silero-VAD (ONNX) | – | Negligible CPU, less than 1 MB model |
STT | Whisper tiny/int8 (ONNX) | OpenAI Whisper API | Local = private but uses CPU; cloud = faster on weak PCs |
LLM | Ollama (e.g., `llama3:8b`) | OpenAI GPT-4, Anthropic Claude 3 | Local = free & private; cloud = smarter but paid |
TTS | Piper VITS | ElevenLabs, Azure TTS | Local = rough voice; cloud = studio-quality |
You can mix and match these in Settings → Providers.
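For instance, a common split is local STT and TTS with a cloud LLM. The object below is purely hypothetical; the keys and provider names do not come from AIRI's real settings schema and are only meant to make the trade-off table concrete.

```ts
// Hypothetical illustration (not AIRI's actual settings schema): one sensible
// local/cloud mix, keeping audio on-device while borrowing a cloud brain.
interface ProviderSelection {
  vad: 'silero-onnx'
  stt: 'whisper-tiny-int8' | 'openai-whisper-api'
  llm: { provider: 'ollama' | 'openai' | 'anthropic', model: string }
  tts: 'piper-vits' | 'elevenlabs' | 'azure-tts'
}

const privateAudioCloudBrain: ProviderSelection = {
  vad: 'silero-onnx',
  stt: 'whisper-tiny-int8',                    // raw audio never leaves the device
  llm: { provider: 'openai', model: 'gpt-4' }, // only cleaned text goes to the cloud
  tts: 'piper-vits',                           // local voice, no audio round-trip
}
```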
Tips for Smooth Conversation
- Use a Push-to-Talk Hotkey (default `Cmd+Shift+Space`) if ambient noise keeps triggering VAD; a hotkey sketch follows this list.
- Enable Auto-Language detection for bilingual chats.
- Adjust Chunk Length in Settings → Voice to balance speed vs. natural pauses.
- Mute TTS quickly with `Esc` if you need silence.
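As a rough idea of how the push-to-talk hotkey could be wired up in a Tauri app, here is a sketch assuming the Tauri v1 `globalShortcut` JavaScript API (Tauri v2 moves this into a plugin). The `startCapture`/`stopCapture` hooks are hypothetical, and because global shortcuts only report key-down events, the sketch toggles capture instead of tracking key release.

```ts
// Sketch under stated assumptions, not AIRI's actual hotkey code.
import { register } from '@tauri-apps/api/globalShortcut'

let capturing = false

async function setupPushToTalk(
  startCapture: () => void, // hypothetical: begin buffering mic audio
  stopCapture: () => void,  // hypothetical: stop buffering and hand audio to STT
): Promise<void> {
  // "CommandOrControl+Shift+Space" resolves to Cmd+Shift+Space on macOS.
  await register('CommandOrControl+Shift+Space', () => {
    capturing = !capturing
    if (capturing)
      startCapture()
    else
      stopCapture()
  })
}
```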
Privacy Corner
- All raw audio stays on your device when using local models.
- Only cleaned text is sent to cloud LLM/TTS providers if selected.
- Toggle “Store Conversation History” off to keep chats in volatile memory only.
Related Technical Files
Functional Role | Code File | Description |
---|---|---|
VAD Front-End Composable | packages/stage-ui/src/composables/micvad.ts | Controls mic stream & speech detection |
VAD Rust Plugin | crates/tauri-plugin-ipc-audio-vad-ort/src/lib.rs | Runs Silero-VAD via ONNX |
Whisper Composable | packages/stage-ui/src/composables/whisper.ts | Sends audio to local STT plugin |
LLM Store | packages/stage-ui/src/stores/llm.ts | Streams responses from selected provider |
Marker Parser | packages/stage-ui/src/composables/llmmarkerParser.ts | Detects and strips emotion markers (e.g., EMOTE_HAPPY) in the streamed reply |
TTS Utility | packages/stage-ui/src/utils/tts.ts | Intelligent chunking & playback |
Chat Window UI | apps/stage-tamagotchi/src/components/ChatHistory.vue | Renders text chats with Markdown |