How AIRI Sees & Hears

Turning Sights & Sounds into Understanding

AIRI senses the world through two low-latency, privacy-respecting pathways:

  1. Hearing — A local audio pipeline that detects speech, transcribes it, and feeds text to the LLM.
  2. Seeing — A media pipeline that describes photos & stickers so the AI can reference visual context later.

1 · Hearing — Voice Pipeline (Function-First)

| Stage | What Happens | Why It Matters |
| --- | --- | --- |
| Voice Activity Detection (Silero-VAD) | Runs on every 20 ms frame to flag speech vs. silence. | Mic is only recorded when needed → lower CPU & stronger privacy. |
| Audio Buffering | The last 15 s are stored in RAM; the final buffer is sent when VAD flips to silence. | Captures the complete sentence even if the user pauses briefly. |
| On-Device STT (Whisper ONNX) | Converts WAV to text locally; language auto-detected. | Works offline; no raw audio leaves the device. |
| Confidence Filter | Discards results with confidence below 0.6 or containing "???". | Prevents garbage input from reaching the LLM. |
| Streaming to LLM | Text becomes part of the chat context; a response is generated. | Continues the unified brain flow. |
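To make the wiring concrete, here is a minimal TypeScript sketch of how these stages could fit together. It is illustrative only: `isSpeech`, `transcribe`, and `sendToLlm` are hypothetical stand-ins for the actual VAD plugin, Whisper plugin, and chat layer; only the values from the table above (20 ms frames, 15 s buffer, 0.6 confidence cutoff) come from the pipeline itself.

```ts
// Hypothetical types standing in for the real VAD / Whisper plugins.
type Frame = Float32Array // one 20 ms chunk of PCM samples

interface SttResult {
  text: string
  confidence: number // 0..1
}

const FRAME_MS = 20
const MAX_BUFFER_FRAMES = 15_000 / FRAME_MS // keep only the last 15 s

export function createVoicePipeline(deps: {
  isSpeech: (frame: Frame) => Promise<boolean>        // e.g. Silero-VAD
  transcribe: (audio: Frame[]) => Promise<SttResult>   // e.g. Whisper ONNX
  sendToLlm: (text: string) => void                    // append to chat context
}) {
  const buffer: Frame[] = []
  let speaking = false

  return {
    async onFrame(frame: Frame) {
      // Keep a rolling 15 s window in RAM regardless of VAD state.
      buffer.push(frame)
      if (buffer.length > MAX_BUFFER_FRAMES)
        buffer.shift()

      if (await deps.isSpeech(frame)) {
        speaking = true
        return
      }

      // VAD flipped from speech to silence: flush the buffer to STT.
      if (speaking) {
        speaking = false
        const utterance = buffer.splice(0, buffer.length)
        const { text, confidence } = await deps.transcribe(utterance)

        // Confidence filter: drop low-quality or placeholder transcripts.
        if (confidence < 0.6 || text.includes('???'))
          return

        deps.sendToLlm(text)
      }
    },
  }
}
```

The key design point is that the expensive STT call only happens on a speech-to-silence transition, never per frame.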

Latency Benchmarks (M2 Mac, Whisper-tiny-int8):

| Step | Avg Time |
| --- | --- |
| VAD detection | 12 ms |
| STT transcription (5 s audio) | 420 ms |
| LLM first token | 180 ms |
| Total to first spoken phoneme | < 800 ms |

Concept Diagram

Accessibility & Control

  • Push-to-Talk shortcut bypasses VAD when ambient noise is high (a minimal sketch follows this list).
  • Language Override lets bilingual users force the STT language.
  • Audio Logs are optional; if enabled, encrypted WAV files are kept for 24 h for debugging.
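The sketch below shows how the Push-to-Talk path could bypass VAD entirely: frames are buffered unconditionally while the shortcut is held and handed straight to STT on release. The function and dependency names are hypothetical; only the 0.6 confidence cutoff comes from the pipeline table above.

```ts
// Hypothetical push-to-talk wrapper; no VAD is involved on this path.
export function createPushToTalk(deps: {
  transcribe: (audio: Float32Array[]) => Promise<{ text: string, confidence: number }>
  sendToLlm: (text: string) => void
}) {
  const buffer: Float32Array[] = []
  let held = false

  return {
    press() {
      // Shortcut pressed: start collecting frames from scratch.
      held = true
      buffer.length = 0
    },
    onFrame(frame: Float32Array) {
      if (held)
        buffer.push(frame)
    },
    async release() {
      // Shortcut released: send everything captured to STT in one shot.
      held = false
      const { text, confidence } = await deps.transcribe(buffer.splice(0))
      if (confidence >= 0.6)
        deps.sendToLlm(text)
    },
  }
}
```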

2 · Seeing — Media Understanding Pipeline (Telegram & Future Discord)

When AIRI receives a photo or animated sticker, it “looks” before it speaks.

| Stage | What Happens | Notes |
| --- | --- | --- |
| Media Download | File fetched via the platform API; animated WEBP → frames. | Size capped at 2 MB. |
| Caption Detection | If the user added a caption, it is stored alongside the image. | Saves LLM tokens if the caption is descriptive. |
| Vision-LLM Description | The JPEG is sent to Gemini Pro Vision (cloud) or local BLIP2. | Returns 1-2 sentence alt-text. |
| Embedding & Storage | The text is embedded with sentence-transformers and stored in the PostgreSQL media table with a vector column. | Enables similarity search later. |
| Prompt Injection | On each tick, the 3 most relevant image descriptions are added to the LLM context (if a memory slot is free). | Allows callbacks like "That cat picture was cute!" hours later. |
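The sketch below shows one plausible shape for this pipeline in TypeScript. All names (`handleIncomingPhoto`, `describeImage`, `embed`, `save`) are hypothetical placeholders rather than the actual exports of media/photo.ts; only the 2 MB cap and the caption-plus-description behaviour are taken from the table above.

```ts
interface MediaRecord {
  caption?: string
  description: string
  embedding: number[]
  createdAt: Date
}

export async function handleIncomingPhoto(
  photo: { buffer: Buffer, caption?: string },
  deps: {
    describeImage: (img: Buffer) => Promise<string> // BLIP2 or Gemini Pro Vision
    embed: (text: string) => Promise<number[]>      // sentence-transformers
    save: (record: MediaRecord) => Promise<void>    // insert into the media table
  },
): Promise<void> {
  // Downloads are capped at 2 MB; anything larger is ignored.
  if (photo.buffer.byteLength > 2 * 1024 * 1024)
    return

  // Ask the vision model for 1-2 sentences of alt-text.
  const description = await deps.describeImage(photo.buffer)

  // Embed the caption (if any) together with the description so later
  // similarity searches can match either phrasing.
  const text = [photo.caption, description].filter(Boolean).join('\n')
  const embedding = await deps.embed(text)

  await deps.save({ caption: photo.caption, description, embedding, createdAt: new Date() })
}
```

Passing the vision model, embedder, and database in as dependencies keeps the handler testable and makes it easy to swap cloud and local back-ends.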

Flow Diagram

Privacy Notes

  • Vision requests default to local BLIP2; cloud Gemini requires explicit opt-in.
  • Images are deleted from the DB after 30 days by a TTL policy (see the retention sketch below).
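A retention job along these lines could enforce the 30-day rule. This is a sketch only: the table and column names (`media`, `created_at`) and the use of node-postgres are assumptions, not the project's actual setup.

```ts
import { Pool } from 'pg'

const pool = new Pool({ connectionString: process.env.DATABASE_URL })

// Delete media rows older than 30 days and report how many were removed.
export async function purgeExpiredMedia(): Promise<number> {
  const result = await pool.query(
    `DELETE FROM media WHERE created_at < now() - interval '30 days'`,
  )
  return result.rowCount ?? 0
}

// Run the purge once a day.
setInterval(() => { void purgeExpiredMedia() }, 24 * 60 * 60 * 1000)
```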

Tuning & Extensibility

  • Model Size Swap: Replace Whisper-tiny with Whisper-base for better accuracy; update STT_MODEL=base in settings.
  • GPU Acceleration: ONNX Runtime automatically uses Metal / CUDA when available.
  • Custom Vision Model: Implement describeMedia(buffer) in media/photo.ts to plug in any model (see the sketch after this list).
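As an example of the custom vision hook, the sketch below implements a describeMedia(buffer) that talks to a locally hosted, OpenAI-compatible multimodal endpoint. The endpoint URL, model name, and response handling are placeholders, not part of AIRI, and the real function in media/photo.ts may use a different signature.

```ts
// Sketch: plug a self-hosted multimodal model in behind describeMedia().
export async function describeMedia(buffer: Buffer): Promise<string> {
  const response = await fetch('http://localhost:11434/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llava', // any vision model served behind an OpenAI-style API
      messages: [{
        role: 'user',
        content: [
          { type: 'text', text: 'Describe this image in one or two sentences.' },
          {
            type: 'image_url',
            image_url: { url: `data:image/jpeg;base64,${buffer.toString('base64')}` },
          },
        ],
      }],
    }),
  })

  // Return the model's short alt-text description.
  const data = await response.json()
  return data.choices[0].message.content
}
```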

| Functional Role | Code File | Description |
| --- | --- | --- |
| VAD Plugin (Rust) | crates/tauri-plugin-ipc-audio-vad-ort/src/lib.rs | Silero-VAD inference |
| Mic Composable | packages/stage-ui/src/composables/micvad.ts | Streams mic → VAD plugin |
| Whisper Plugin (Rust) | crates/tauri-plugin-ipc-audio-transcription-ort/src/lib.rs | Whisper STT |
| STT Composable | packages/stage-ui/src/composables/whisper.ts | Front-end wrapper |
| Media Handler | services/telegram-bot/src/media/photo.ts | Download, vision LLM, embed |
| Media DB Schema | services/telegram-bot/src/db/schema.ts | Stores image & vector |
| Context Builder | services/telegram-bot/src/llm/actions.ts | Injects vision text into prompt |