Turning Sights & Sounds into Understanding
AIRI senses the world through two low-latency, privacy-respecting pathways:
- Hearing — A local audio pipeline that detects speech, transcribes it, and feeds text to the LLM.
- Seeing — A media pipeline that describes photos & stickers so the AI can reference visual context later.
1 · Hearing — Voice Pipeline (Function-First)
Stage | What Happens | Why It Matters |
---|---|---|
Voice Activity Detection (Silero-VAD) | Runs on every 20 ms frame to flag speech vs. silence. | Mic audio is recorded only when needed → lower CPU & stronger privacy. |
Audio Buffering | Last 15 s stored in RAM; final buffer sent when VAD flips to silence. | Captures the complete sentence even if the user pauses briefly. |
On-Device STT (Whisper ONNX) | Converts WAV to text locally; language auto-detected. | Works offline; no raw audio leaves the device. |
Confidence Filter | Discards results with confidence < 0.6 or `???`. | Prevents garbage input from reaching the LLM. |
Streaming to LLM | Text becomes part of chat context; response generated. | Continues unified brain flow. |
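To make the stages concrete, here is a minimal TypeScript sketch of the same flow: VAD gating, a 15 s ring buffer, local STT, and the confidence filter. The helper names (`runVad`, `transcribe`, `sendToLlm`) and the `Transcription` shape are illustrative assumptions, not the actual AIRI composables.

```ts
// Hypothetical sketch of the hearing pipeline stages above; the declared
// helpers are stand-ins for the real VAD / Whisper / LLM plumbing.
interface Transcription {
  text: string
  confidence: number // 0..1, reported by the STT model
}

const FRAME_MS = 20 // VAD runs on 20 ms frames
const BUFFER_MS = 15_000 // keep the last 15 s of audio in RAM
const MIN_CONFIDENCE = 0.6 // discard low-confidence transcripts

declare function runVad(frame: Float32Array): Promise<boolean> // Silero-VAD: is this frame speech?
declare function transcribe(audio: Float32Array): Promise<Transcription> // Whisper ONNX, local
declare function sendToLlm(text: string): Promise<void> // hands text to the unified brain flow

export async function onMicFrame(
  frame: Float32Array,
  state: { buffer: Float32Array[], speaking: boolean },
) {
  const isSpeech = await runVad(frame)

  if (isSpeech) {
    state.speaking = true
    state.buffer.push(frame)
    // Trim the ring buffer so it never holds more than ~15 s of frames.
    const maxFrames = BUFFER_MS / FRAME_MS
    if (state.buffer.length > maxFrames)
      state.buffer.splice(0, state.buffer.length - maxFrames)
    return
  }

  // VAD flipped to silence: flush the buffered utterance to STT.
  if (state.speaking) {
    state.speaking = false
    const audio = concat(state.buffer)
    state.buffer = []

    const result = await transcribe(audio)
    // Confidence filter keeps garbage transcripts away from the LLM.
    if (result.confidence >= MIN_CONFIDENCE && result.text.trim() !== '???')
      await sendToLlm(result.text)
  }
}

function concat(chunks: Float32Array[]): Float32Array {
  const out = new Float32Array(chunks.reduce((n, c) => n + c.length, 0))
  let offset = 0
  for (const c of chunks) {
    out.set(c, offset)
    offset += c.length
  }
  return out
}
```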
Latency Benchmarks (M2 Mac, Whisper-tiny-int8):
Step | Avg Time |
---|---|
VAD detection | 12 ms |
STT transcription (5 s audio) | 420 ms |
LLM first token | 180 ms |
Total to first spoken phoneme | < 800 ms |
Concept Diagram
Accessibility & Control
- Push-to-Talk shortcut bypasses VAD when ambient noise is high.
- Language Override lets bilingual users force the STT language.
- Audio Logs are optional; if enabled, encrypted WAV files are kept for 24 h for debugging.
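For illustration, the controls above might map onto a settings object like the following; the field names here are assumptions, not AIRI's actual configuration schema.

```ts
// Illustrative settings shape for the accessibility controls above.
interface HearingSettings {
  pushToTalkKey?: string // e.g. 'F13'; while held, audio is captured and VAD is bypassed
  sttLanguage?: string // e.g. 'ja'; forces Whisper's language instead of auto-detect
  audioLogs: {
    enabled: boolean // off by default
    retentionHours: number // encrypted WAVs are purged after this window (24 h here)
  }
}

export const defaults: HearingSettings = {
  audioLogs: { enabled: false, retentionHours: 24 },
}
```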
2 · Seeing — Media Understanding Pipeline (Telegram & Future Discord)
When AIRI receives a photo or animated sticker, it “looks” before it speaks.
Stage | What Happens | Notes |
---|---|---|
Media Download | File fetched via platform API; animated WEBP → frames. | Size capped at 2 MB. |
Caption Detection | If the user added a caption, it is stored alongside the image. | Saves LLM tokens if the caption is descriptive. |
Vision-LLM Description | Sends the JPEG to Gemini Pro Vision (cloud) or local BLIP2. | Returns 1–2 sentence alt-text. |
Embedding & Storage | Text embedded with sentence-transformers and stored in the PostgreSQL media table with a vector column. | Enables similarity search later. |
Prompt Injection | On each tick, most relevant 3 image descriptions added to LLM context (if memory slot free). | Allows callbacks like “That cat picture was cute!” hours later. |
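A condensed TypeScript sketch of these stages follows. The helpers (`downloadMedia`, `describeMedia`, `embed`, `saveMedia`) are hypothetical; the real logic lives in services/telegram-bot/src/media/photo.ts listed below.

```ts
// Condensed sketch of the media pipeline: download, caption check,
// vision description, embedding, storage. Helper names are illustrative.
const MAX_MEDIA_BYTES = 2 * 1024 * 1024 // 2 MB cap

declare function downloadMedia(fileId: string): Promise<Buffer>
declare function describeMedia(buffer: Buffer): Promise<string> // vision LLM → 1–2 sentence alt-text
declare function embed(text: string): Promise<number[]> // sentence-transformers vector
declare function saveMedia(row: { description: string, embedding: number[] }): Promise<void>

export async function handleIncomingPhoto(fileId: string, caption?: string) {
  const buffer = await downloadMedia(fileId)
  if (buffer.byteLength > MAX_MEDIA_BYTES)
    return // oversized media is skipped

  // One possible token-saving strategy: reuse a descriptive user caption
  // instead of calling the vision LLM.
  const description = caption?.trim() || await describeMedia(buffer)

  // Embed and store so later prompts can pull in the most relevant descriptions.
  const embedding = await embed(description)
  await saveMedia({ description, embedding })
}
```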
Flow Diagram
Privacy Notes
- Vision requests default to local BLIP2; cloud Gemini requires explicit opt-in.
- Images deleted from DB after 30 days by TTL policy.
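As a sketch, the 30-day TTL could be enforced from a scheduled job with plain SQL; the `created_at` column name is an assumption about the schema in db/schema.ts.

```ts
// One way to purge expired media rows; run this from a periodic job.
import { Client } from 'pg'

export async function purgeExpiredMedia(connectionString: string) {
  const client = new Client({ connectionString })
  await client.connect()
  try {
    // Delete any media rows older than 30 days.
    await client.query(`DELETE FROM media WHERE created_at < now() - interval '30 days'`)
  }
  finally {
    await client.end()
  }
}
```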
Tuning & Extensibility
- Model Size Swap: Replace Whisper-tiny with Whisper-base for better accuracy; update `STT_MODEL=base` in settings.
- GPU Acceleration: ONNX Runtime automatically uses Metal / CUDA when available.
- Custom Vision Model: Implement `describeMedia(buffer)` in `media/photo.ts` to plug in any model (see the sketch below).
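Below is one possible shape for a custom `describeMedia(buffer)`: it posts the image to a self-hosted vision endpoint and returns the alt-text. The URL and response payload are assumptions; adapt them to whichever model you plug in (a BLIP2 server, an OpenAI-compatible vision API, etc.).

```ts
// Sketch of a custom describeMedia(buffer) backed by a local vision model.
// Endpoint and payload shape are assumptions, not part of AIRI itself.
export async function describeMedia(buffer: Buffer): Promise<string> {
  const response = await fetch('http://localhost:8000/describe', {
    method: 'POST',
    headers: { 'Content-Type': 'application/octet-stream' },
    body: buffer,
  })
  if (!response.ok)
    throw new Error(`vision model returned ${response.status}`)

  const { description } = await response.json() as { description: string }
  return description // 1–2 sentence alt-text, embedded and stored downstream
}
```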
Related Technical Files
Functional Role | Code File | Description |
---|---|---|
VAD Plugin (Rust) | crates/tauri-plugin-ipc-audio-vad-ort/src/lib.rs | Silero-VAD inferencing |
Mic Composable | packages/stage-ui/src/composables/micvad.ts | Streams mic → VAD plugin |
Whisper Plugin (Rust) | crates/tauri-plugin-ipc-audio-transcription-ort/src/lib.rs | Whisper STT |
STT Composable | packages/stage-ui/src/composables/whisper.ts | Front-end wrapper |
Media Handler | services/telegram-bot/src/media/photo.ts | Download, vision LLM, embed |
Media DB Schema | services/telegram-bot/src/db/schema.ts | Stores image & vector |
Context Builder | services/telegram-bot/src/llm/actions.ts | Injects vision text into prompt |