Turning Sights & Sounds into Understanding
AIRI senses the world through two low-latency, privacy-respecting pathways:
- Hearing — A local audio pipeline that detects speech, transcribes it, and feeds text to the LLM.
- Seeing — A media pipeline that describes photos & stickers so the AI can reference visual context later.
1 · Hearing — Voice Pipeline (Function-First)
Stage | What Happens | Why It Matters |
---|---|---|
Voice Activity Detection (Silero-VAD) | Runs on every 20 ms frame to flag speech vs. silence. | Mic audio is recorded only when needed → lower CPU & stronger privacy. |
Audio Buffering | Last 15 s stored in RAM; final buffer sent when VAD flips to silence. | Captures the complete sentence even if the user pauses briefly. |
On-Device STT (Whisper ONNX) | Converts WAV to text locally; language auto-detected. | Works offline; no raw audio leaves the device. |
Confidence Filter | Discards results with confidence < 0.6 or `???`. | Prevents garbage input from reaching the LLM. |
Streaming to LLM | Text becomes part of chat context; response generated. | Continues unified brain flow. |
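To make the stages concrete, here is a minimal TypeScript sketch of the same flow: VAD gating, a 15 s ring buffer, local STT, and the confidence filter. The helper names (`runVad`, `transcribe`, `sendToLlm`) and the `Transcription` shape are illustrative assumptions, not the actual AIRI composables.

```ts
// Hypothetical sketch of the hearing pipeline stages above; the declared
// helpers are stand-ins for the real VAD / Whisper / LLM plumbing.
interface Transcription {
  text: string
  confidence: number // 0..1, reported by the STT model
}

const FRAME_MS = 20 // VAD runs on 20 ms frames
const BUFFER_MS = 15_000 // keep the last 15 s of audio in RAM
const MIN_CONFIDENCE = 0.6 // discard low-confidence transcripts

declare function runVad(frame: Float32Array): Promise<boolean> // Silero-VAD: is this frame speech?
declare function transcribe(audio: Float32Array): Promise<Transcription> // Whisper ONNX, local
declare function sendToLlm(text: string): Promise<void> // hands text to the unified brain flow

export async function onMicFrame(
  frame: Float32Array,
  state: { buffer: Float32Array[], speaking: boolean },
) {
  const isSpeech = await runVad(frame)

  if (isSpeech) {
    state.speaking = true
    state.buffer.push(frame)
    // Trim the ring buffer so it never holds more than ~15 s of frames.
    const maxFrames = BUFFER_MS / FRAME_MS
    if (state.buffer.length > maxFrames)
      state.buffer.splice(0, state.buffer.length - maxFrames)
    return
  }

  // VAD flipped to silence: flush the buffered utterance to STT.
  if (state.speaking) {
    state.speaking = false
    const audio = concat(state.buffer)
    state.buffer = []

    const result = await transcribe(audio)
    // Confidence filter keeps garbage transcripts away from the LLM.
    if (result.confidence >= MIN_CONFIDENCE && result.text.trim() !== '???')
      await sendToLlm(result.text)
  }
}

function concat(chunks: Float32Array[]): Float32Array {
  const out = new Float32Array(chunks.reduce((n, c) => n + c.length, 0))
  let offset = 0
  for (const c of chunks) {
    out.set(c, offset)
    offset += c.length
  }
  return out
}
```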
Latency Benchmarks (M2 Mac, Whisper-tiny-int8):
Step | Avg Time |
---|---|
VAD detection | 12 ms |
STT transcription (5 s audio) | 420 ms |
LLM first token | 180 ms |
Total to first spoken phoneme | < 800 ms |
Concept Diagram
Accessibility & Control
- Push-to-Talk shortcut bypasses VAD when ambient noise is high.
- Language Override lets bilingual users force the STT language.
- Audio Logs are optional; if enabled, encrypted WAV files are kept for 24 h for debugging.
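For illustration, the controls above might map onto a settings object like the following; the field names here are assumptions, not AIRI's actual configuration schema.

```ts
// Illustrative settings shape for the accessibility controls above.
interface HearingSettings {
  pushToTalkKey?: string // e.g. 'F13'; while held, audio is captured and VAD is bypassed
  sttLanguage?: string // e.g. 'ja'; forces Whisper's language instead of auto-detect
  audioLogs: {
    enabled: boolean // off by default
    retentionHours: number // encrypted WAVs are purged after this window (24 h here)
  }
}

export const defaults: HearingSettings = {
  audioLogs: { enabled: false, retentionHours: 24 },
}
```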
2 · Seeing — Media Understanding Pipeline (Telegram & Future Discord)
When AIRI receives a photo or animated sticker, it “looks” before it speaks.
Stage | What Happens | Notes |
---|---|---|
Media Download | File fetched via platform API; animated WEBP → frames. | Size capped at 2 MB. |
Caption Detection | If the user added a caption, it is stored alongside the image. | Saves LLM tokens if the caption is descriptive. |
Vision-LLM Description | Sends the JPEG to Gemini Pro Vision (cloud) or local BLIP2. | Returns 1–2 sentence alt-text. |
Embedding & Storage | Text embedded with sentence-transformers and stored in the PostgreSQL media table with a vector column. | Enables similarity search later. |
Prompt Injection | On each tick, most relevant 3 image descriptions added to LLM context (if memory slot free). | Allows callbacks like “That cat picture was cute!” hours later. |
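A condensed TypeScript sketch of these stages follows. The helpers (`downloadMedia`, `describeMedia`, `embed`, `saveMedia`) are hypothetical; the real logic lives in services/telegram-bot/src/media/photo.ts listed below.

```ts
// Condensed sketch of the media pipeline: download, caption check,
// vision description, embedding, storage. Helper names are illustrative.
const MAX_MEDIA_BYTES = 2 * 1024 * 1024 // 2 MB cap

declare function downloadMedia(fileId: string): Promise<Buffer>
declare function describeMedia(buffer: Buffer): Promise<string> // vision LLM → 1–2 sentence alt-text
declare function embed(text: string): Promise<number[]> // sentence-transformers vector
declare function saveMedia(row: { description: string, embedding: number[] }): Promise<void>

export async function handleIncomingPhoto(fileId: string, caption?: string) {
  const buffer = await downloadMedia(fileId)
  if (buffer.byteLength > MAX_MEDIA_BYTES)
    return // oversized media is skipped

  // One possible token-saving strategy: reuse a descriptive user caption
  // instead of calling the vision LLM.
  const description = caption?.trim() || await describeMedia(buffer)

  // Embed and store so later prompts can pull in the most relevant descriptions.
  const embedding = await embed(description)
  await saveMedia({ description, embedding })
}
```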
Flow Diagram
Privacy Notes
- Vision requests default to local BLIP2; cloud Gemini requires explicit opt-in.
- Images deleted from DB after 30 days by TTL policy.
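As a sketch, the 30-day TTL could be enforced from a scheduled job with plain SQL; the `created_at` column name is an assumption about the schema in db/schema.ts.

```ts
// One way to purge expired media rows; run this from a periodic job.
import { Client } from 'pg'

export async function purgeExpiredMedia(connectionString: string) {
  const client = new Client({ connectionString })
  await client.connect()
  try {
    // Delete any media rows older than 30 days.
    await client.query(`DELETE FROM media WHERE created_at < now() - interval '30 days'`)
  }
  finally {
    await client.end()
  }
}
```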
Tuning & Extensibility
- Model Size Swap: Replace Whisper-tiny with Whisper-base for better accuracy; update `STT_MODEL=base` in settings.
- GPU Acceleration: ONNX Runtime automatically uses Metal / CUDA when available.
- Custom Vision Model: Implement `describeMedia(buffer)` in `media/photo.ts` to plug in any model (see the sketch below).
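Below is one possible shape for a custom `describeMedia(buffer)`: it posts the image to a self-hosted vision endpoint and returns the alt-text. The URL and response payload are assumptions; adapt them to whichever model you plug in (a BLIP2 server, an OpenAI-compatible vision API, etc.).

```ts
// Sketch of a custom describeMedia(buffer) backed by a local vision model.
// Endpoint and payload shape are assumptions, not part of AIRI itself.
export async function describeMedia(buffer: Buffer): Promise<string> {
  const response = await fetch('http://localhost:8000/describe', {
    method: 'POST',
    headers: { 'Content-Type': 'application/octet-stream' },
    body: buffer,
  })
  if (!response.ok)
    throw new Error(`vision model returned ${response.status}`)

  const { description } = await response.json() as { description: string }
  return description // 1–2 sentence alt-text, embedded and stored downstream
}
```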
Related Technical Files
Functional Role | Code File | Description |
---|---|---|
VAD Plugin (Rust) | crates/tauri-plugin-ipc-audio-vad-ort/src/lib.rs | Silero-VAD inferencing |
Mic Composable | packages/stage-ui/src/composables/micvad.ts | Streams mic → VAD plugin |
Whisper Plugin (Rust) | crates/tauri-plugin-ipc-audio-transcription-ort/src/lib.rs | Whisper STT |
STT Composable | packages/stage-ui/src/composables/whisper.ts | Front-end wrapper |
Media Handler | services/telegram-bot/src/media/photo.ts | Download, vision LLM, embed |
Media DB Schema | services/telegram-bot/src/db/schema.ts | Stores image & vector |
Context Builder | services/telegram-bot/src/llm/actions.ts | Injects vision text into prompt |