gibb.eri.sh

“Most voice bots suck. I decided to build the one I actually wanted to use.”

This is the documentation for gibb.eri.sh (v0.9.0), a local-first Voice OS for macOS.

The Story

I build voice bots professionally. I’ve seen the sausage made:

  • Latency: Sending audio to the cloud takes 500ms minimum.
  • Privacy: Your voice data is training someone else’s model.
  • Context: Cloud bots don’t know you’re looking at VS Code.

I wanted a tool that felt instant, respected my privacy, and could actually do things on my computer. Since it didn’t exist, I built it.

What is it?

It’s a desktop app that sits in your menu bar. It listens (when you tell it to), transcribes in real-time, and executes Skills.

Key Capabilities

  1. Context Awareness: It polls the OS to know what app is focused. If you’re in a terminal, it enables Git tools. If you’re in Zoom, it enables Note-taking tools.
  2. Zero Latency: We use a custom Zero-Copy Audio Bus in Rust to stream microphone data directly to local ONNX models.
  3. Agent Skills: You can extend the capabilities by dropping a SKILL.md file into a folder. It supports Bash, Python, and Node scripts.

Who is this for?

Developers and Hackers. This is 0.9.0 software. It’s powerful, but it assumes you know what a “terminal” is. The goal is to eventually make it useful for everyone; right now, it’s a power tool.

Tech Stack

| Component | Tech | Why? |
|---|---|---|
| Core | Rust | Memory safety, threading, no GC pauses. |
| UI | Tauri + React | HTML/CSS is flexible, Electron is too heavy. |
| STT | Sherpa-ONNX | Best streaming accuracy on Apple Silicon. |
| Reasoning | FunctionGemma | Optimized for tool calling, runs locally. |

The Golden Path

We hold three core beliefs that drive every line of code in gibb.eri.sh. These aren’t just preferences—they’re non-negotiables that shape every architectural decision.

  1. Privacy First — Your voice never leaves localhost
  2. Zero Latency — Transcription must feel instant
  3. Rust + Tauri — We build for the metal, not the browser

These principles sometimes conflict with “easier” solutions. We choose the harder path because the result is worth it: AI for your OS that serves you, not a corporation.

The Trade-offs We Accept

| We Sacrifice | We Gain |
|---|---|
| Cloud scalability | Absolute privacy |
| Development speed | Runtime performance |
| Framework convenience | Memory efficiency |
| Model variety | Predictable latency |

The Trade-offs We Reject

  • “Just use OpenAI” — Privacy is not optional
  • “Electron is fine” — RAM is not free
  • “Good enough latency” — 500ms feels broken

Read on to understand why each principle matters and how we implement it.

Privacy First

Your voice never leaves localhost.

No OpenAI. No Google Speech API. No AWS Transcribe. No cloud anything.

If data doesn’t leave the device, it cannot be intercepted, stored, or analyzed by third parties.

Implementation

All models run on-device using:

  • ONNX Runtime (cross-platform inference)
  • Quantized int8 models
  • CoreML backend on macOS (Apple Neural Engine)
┌─────────────────────────────────────────┐
│              Your Device                │
│  ┌─────────┐    ┌─────────┐    ┌─────┐ │
│  │   Mic   │───▶│  Model  │───▶│ Text│ │
│  └─────────┘    └─────────┘    └─────┘ │
│                                         │
│         Everything stays here           │
└─────────────────────────────────────────┘
              │
              ╳  No network calls
              │

Trade-off

Users download ~500MB of model weights upfront. In exchange:

  • No API bills
  • No network round-trips (lower latency)
  • No data exfiltration possible
  • Works offline

Why Not Hybrid?

“What if we use local for drafts and cloud for final polish?”

No. This creates a false sense of privacy. Users think they’re protected, but their data still leaves the device. We reject half-measures.

The Models We Use

| Model | Size | Use Case |
|---|---|---|
| Sherpa Zipformer | ~100MB | Real-time streaming |
| Whisper Small | ~500MB | High-accuracy batch |
| Silero VAD | ~2MB | Voice activity detection |
| FunctionGemma | ~200MB | Intent recognition |

All models are open-source and can be audited.

Low Latency

Time-to-first-token should be < 200ms. Delays above this threshold are noticeable and disrupt the feedback loop.

The Problem with Traditional Architectures

Most speech-to-text systems work like this:

Mic → JavaScript → JSON → HTTP → Server → Model → HTTP → JSON → UI
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                        500-2000ms latency

Every boundary crossing adds latency:

  • JS ↔ Native: Serialization overhead
  • HTTP: Network round-trip
  • JSON: Parsing overhead

Our Implementation

Audio stays in Rust and uses shared memory pointers:

Mic → Rust → Arc<[f32]> → Model → Text → UI
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            45ms latency

Key techniques:

  • Arc<[f32]>: Shared memory pointers
  • Bounded MPSC channels for backpressure
  • Dedicated threads for inference
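A std-only sketch of these three techniques together — a single `Arc<[f32]>` allocation fanned out over bounded channels to dedicated consumer threads. (The real bus uses tokio channels; the names here are illustrative.)

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// One allocation, shared by two consumers via bounded channels.
pub fn fan_out(samples: Vec<f32>) -> (usize, usize) {
    let chunk: Arc<[f32]> = samples.into(); // single allocation of the audio
    let (tx1, rx1) = mpsc::sync_channel::<Arc<[f32]>>(8); // bounded = backpressure
    let (tx2, rx2) = mpsc::sync_channel::<Arc<[f32]>>(8);

    tx1.send(Arc::clone(&chunk)).unwrap(); // clones the pointer, not the audio
    tx2.send(Arc::clone(&chunk)).unwrap();

    // Dedicated threads stand in for the VAD and STT consumers.
    let vad = thread::spawn(move || rx1.recv().unwrap().len());
    let stt = thread::spawn(move || rx2.recv().unwrap().len());
    (vad.join().unwrap(), stt.join().unwrap())
}
```

Both consumers see the same 320 samples without a single copy of the buffer.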

Measuring Latency

We track latency at every stage using atomic counters:

pub struct PipelineStatus {
    audio_lag_ms: AtomicI64,      // Time since audio was captured
    inference_time_ms: AtomicU64, // Model execution time
    dropped_chunks: AtomicU64,    // Backpressure indicator
}

These metrics are lock-free—measuring latency doesn’t add latency.
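As a self-contained sketch of how such counters might be updated and read without locks (the recording methods here are hypothetical additions for illustration, not the app’s actual API):

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

// Mirrors the struct above; methods are illustrative.
pub struct PipelineStatus {
    pub audio_lag_ms: AtomicI64,      // Time since audio was captured
    pub inference_time_ms: AtomicU64, // Model execution time
    pub dropped_chunks: AtomicU64,    // Backpressure indicator
}

impl PipelineStatus {
    pub fn new() -> Self {
        Self {
            audio_lag_ms: AtomicI64::new(0),
            inference_time_ms: AtomicU64::new(0),
            dropped_chunks: AtomicU64::new(0),
        }
    }

    // Relaxed ordering is enough: metrics are advisory and need no
    // happens-before relationship with the audio pipeline itself.
    pub fn record_inference(&self, ms: u64) {
        self.inference_time_ms.store(ms, Ordering::Relaxed);
    }

    pub fn note_dropped(&self) -> u64 {
        self.dropped_chunks.fetch_add(1, Ordering::Relaxed) + 1
    }
}
```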

The Streaming Advantage

Traditional “batch” transcription waits for you to finish speaking, then processes everything at once. You might wait 2-3 seconds for results.

Streaming transcription processes audio continuously:

| Time | Batch Model | Streaming Model |
|---|---|---|
| 0ms | (waiting) | (waiting) |
| 100ms | (waiting) | “The” |
| 200ms | (waiting) | “The quick” |
| 500ms | (waiting) | “The quick brown” |
| 1000ms | (waiting) | “The quick brown fox” |
| 1500ms | “The quick brown fox” | “The quick brown fox jumps” |

Streaming provides immediate visual feedback.

Rust + Tauri

The Stack

| Layer | Technology | Why |
|---|---|---|
| Core Logic | Rust | Performance, safety, no GC |
| Desktop Shell | Tauri v2 | Lightweight, secure |
| UI | React | Developer familiarity |
| Inference | ONNX Runtime | Universal model format |

Why Tauri?

Tauri uses the system’s native webview instead of bundling Chromium, which reduces binary size and RAM usage. For a voice assistant that may run continuously, lower idle resource usage helps.

Note: The app requires ~500MB of model downloads on first run, so the binary size savings are offset by the ML models. The main benefit is runtime efficiency.

The Tauri Architecture

┌─────────────────────────────────────────────────┐
│                  Tauri App                       │
│  ┌───────────────────┐  ┌────────────────────┐  │
│  │   Rust Backend    │  │   WebView (UI)     │  │
│  │                   │  │                    │  │
│  │  ┌─────────────┐  │  │  React + TypeScript│  │
│  │  │ Audio Bus   │  │  │                    │  │
│  │  │ STT Engine  │◀─┼──┼─ invoke()          │  │
│  │  │ VAD         │──┼──┼─▶ events           │  │
│  │  └─────────────┘  │  │                    │  │
│  └───────────────────┘  └────────────────────┘  │
└─────────────────────────────────────────────────┘
  • Rust Backend: All heavy lifting (audio, inference, VAD)
  • WebView: Native OS webview (not bundled Chromium)
  • Communication: Tauri’s IPC (commands + events)

The UI is a Passenger

The React frontend is intentionally “dumb”:

  • It displays text from the backend
  • It sends commands (start/stop recording)
  • It never touches audio data directly

This separation means:

  1. UI bugs can’t crash the audio pipeline
  2. The UI can be replaced without touching core logic
  3. Heavy computation never blocks rendering

Security Model

Tauri uses a capability-based permission system:

// plugins/recorder/permissions/default.json
{
  "permissions": ["recorder:start", "recorder:stop"],
  "deny": ["fs:write", "shell:execute"]
}

Each plugin declares exactly what it needs. Everything else is denied by default.

Core Features

gibb.eri.sh isn’t just a transcription tool—it’s an intelligent voice interface.

Feature Overview

Hybrid Inference Engine

Choose your trade-off: instant feedback or maximum accuracy.

  • Streaming Mode: Words appear in real-time (~50ms updates)
  • Batch Mode: Higher accuracy, processed on pauses

Smart Turn Detection

Standard voice detection only hears silence. gibb.eri.sh hears completion.

  • Knows when you’re thinking vs. when you’re done
  • Uses neural analysis, not just timers
  • Configurable sensitivity profiles

Agentic Tools

A local LLM understands your intent and executes actions.

  • “What is the weather in Barcelona” → Opens browser with results
  • Runs entirely offline
  • Extensible tool system

Context Engine

The system knows what you are doing.

  • Dev Mode: Coding in VS Code? Git tools are enabled.
  • Meeting Mode: In a Zoom call? Transcription tools are enabled.
  • Implicit Context: “Summarize this” works on your current selection.

The Interface

Unified Activity Feed

All system events—transcripts, voice commands, and tool results—flow into a single linear feed. This provides a clear “log” of your interaction with the OS.

Mode Badge

A visual indicator in the header shows your current mode (Dev, Meeting, Global). You can click the badge to “pin” a specific mode, overriding automatic detection.

Feature Matrix

| Feature | Streaming | Batch | Notes |
|---|---|---|---|
| Real-time display | ✓ | Simulated | Batch shows “draft” text |
| Accuracy | Good | Excellent | Batch wins on proper nouns |
| Latency | ~50ms | ~500ms | Per-update latency |
| Languages | English | 99+ | Whisper supports many |
| Smart Turn | ✓ | ✓ | Works with both modes |
| Agentic | ✓ | ✓ | Triggers on commit |

Coming Soon

  • Speaker Diarization — “Who said what?”
  • Punctuation Restoration — Automatic commas and periods
  • Custom Wake Words — “Hey gibb.eri.sh”

Hybrid Inference Engine

gibb.eri.sh supports two modes of operation, selectable at runtime. Each has trade-offs.

Streaming Mode (Sherpa-ONNX Zipformer)

Best for: Dictation, live captioning, instant feedback

Audio ─▶ [Transducer] ─▶ Partial results every ~50ms

How It Works

The Zipformer model uses a transducer architecture:

  • Processes audio in small chunks (10-20ms)
  • Maintains internal state between chunks
  • Emits partial hypotheses continuously
  • Refines predictions as context grows
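A toy illustration of that chunked, stateful loop. This stand-in just accumulates signal energy across chunks; the real Zipformer carries neural encoder state, but the shape of the API — feed a small chunk, get back a refined partial result — is the same.

```rust
// Stand-in for a stateful streaming decoder.
pub struct StreamingState {
    samples_seen: usize,
    energy: f32,
}

impl StreamingState {
    pub fn new() -> Self {
        Self { samples_seen: 0, energy: 0.0 }
    }

    /// Feed one 10–20 ms chunk; the "hypothesis" refines as context grows.
    pub fn accept_chunk(&mut self, chunk: &[f32]) -> (usize, f32) {
        self.samples_seen += chunk.len();
        self.energy += chunk.iter().map(|s| s * s).sum::<f32>();
        (self.samples_seen, self.energy)
    }
}
```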

Characteristics

| Aspect | Value |
|---|---|
| Latency | ~50ms per update |
| Accuracy | Good (may miss proper nouns) |
| Languages | English (primary) |
| Model Size | ~100MB |

When Streaming Struggles

  • Proper nouns: “Kubernetes” might become “Cooper Netties”
  • Rare words: Technical jargon may be misheard
  • Accents: Less training data for non-standard speech

Batch Mode (Parakeet / Whisper)

Best for: Meetings, archival, accuracy-critical tasks

Audio ─▶ [VAD Buffer] ─▶ [Encoder-Decoder] ─▶ Final text on pause

How It Works

Batch models see the entire utterance before producing output:

  • VAD detects speech boundaries
  • Audio is buffered during speech
  • Model processes the complete segment
  • Result is highly accurate

Characteristics

| Aspect | Value |
|---|---|
| Latency | ~500ms after speech ends |
| Accuracy | Excellent |
| Languages | 99+ (Whisper) |
| Model Size | ~500MB |

Simulated Streaming

Users want batch accuracy with streaming feel. We fake it:

  1. Run partial inference every 500ms on the growing buffer
  2. Display “volatile” text (gray, may change)
  3. On VAD trigger, run final inference
  4. Replace volatile text with “stable” text (black, final)
Speaking: "The quick brown fox"

Time 0ms:   [           ] (buffering)
Time 500ms: [The quick  ] volatile
Time 1000ms:[The quick brown] volatile
Time 1200ms: (pause detected)
Time 1400ms:[The quick brown fox.] stable ✓
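The volatile/stable swap can be sketched as a small state machine (the types here are illustrative, not the app’s actual frontend model):

```rust
// Sketch of the display state used by simulated streaming.
#[derive(Debug, PartialEq)]
pub enum Text {
    Volatile(String), // gray, may still change
    Stable(String),   // black, final
}

pub struct Display {
    pub current: Option<Text>,
}

impl Display {
    pub fn new() -> Self {
        Self { current: None }
    }

    /// Partial inference on the growing buffer (every ~500 ms).
    pub fn partial(&mut self, hyp: &str) {
        self.current = Some(Text::Volatile(hyp.to_string()));
    }

    /// VAD detected a pause: final inference replaces the volatile text.
    pub fn commit(&mut self, fin: &str) {
        self.current = Some(Text::Stable(fin.to_string()));
    }
}
```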

Switching Modes

Users can switch modes at runtime via the Settings sheet:

// Frontend
await invoke('plugin:stt|set_mode', { mode: 'streaming' });
await invoke('plugin:stt|set_mode', { mode: 'batch' });

The backend handles the transition gracefully, draining any buffered audio.

Model Recommendations

We’ve tested many models. Here are our picks:

| Use Case | Recommended Model |
|---|---|
| General dictation | Sherpa Zipformer (streaming) |
| Meetings | Whisper Small (batch) |
| Non-English | Whisper Small (batch) |
| Low-end hardware | Sherpa Zipformer (streaming) |

Smart Turn Detection

Standard VAD detects silence. Smart Turn detects completion.

The Problem

Voice Activity Detection (VAD) detects silence. Humans detect pauses.

We pause for many reasons:

  • Thinking: “I want to… [pause] …explain something”
  • Breathing: Natural respiratory pauses
  • Emphasis: “This is… [dramatic pause] …important”
  • Completion: “That’s all I have to say.”

Standard VAD treats all pauses the same. This leads to:

  • Sentences being split mid-thought
  • Awkward commit timing
  • User frustration

The Solution

We implement a Neural Turn Detector inspired by Daily.co’s VAD 3.1 research.

Instead of just measuring silence, we analyze:

  1. Acoustic features: Pitch contour, energy decay
  2. Timing: Duration and pattern of the pause
  3. Semantic probability: Is this a likely sentence ending?

The Algorithm

if (Silence > 300ms AND Probability(EndOfSentence) > 0.5):
    Commit()
else:
    Wait()
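The same rule as a plain Rust function. `p_end` would come from the turn detector’s endpoint probability; the default values mirror the Balanced/Normal settings described below. (This is a sketch of the rule, not the crate’s actual code.)

```rust
// Commit only when the pause has outlasted the redemption window AND
// the detector believes the sentence is complete.
pub fn should_commit(silence_ms: u64, p_end: f32, redemption_ms: u64, threshold: f32) -> bool {
    silence_ms > redemption_ms && p_end > threshold
}
```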

Components

| Component | Role |
|---|---|
| Silero VAD | Detects raw silence |
| Smart Turn Model | Predicts sentence completion |
| Redemption Timer | Grace period before commit |

Implementation

The Smart Turn detector lives in crates/smart-turn:

pub struct SmartTurnV31Cpu {
    session: Mutex<Session>,  // ONNX Runtime session
    input_name: String,
    output_name: String,
}

impl TurnDetector for SmartTurnV31Cpu {
    fn predict_endpoint_probability(
        &self,
        audio_16k_mono: &[f32]
    ) -> Result<f32, TurnError> {
        // Returns probability 0.0-1.0 that speaker is done
    }
}

Configuration

Users can tune the behavior via Settings:

Redemption Time

The grace period after silence begins before we even consider committing.

| Setting | Value | Effect |
|---|---|---|
| Fast | 200ms | Quick commits, may split sentences |
| Balanced | 300ms | Default, good for most users |
| Relaxed | 500ms | Waits longer, better for slow speakers |

Sensitivity

How confident must we be that the sentence is complete?

| Setting | Threshold | Effect |
|---|---|---|
| Aggressive | 0.3 | Commits on weak signals |
| Normal | 0.5 | Balanced |
| Conservative | 0.7 | Only commits on strong endings |

The Flow

graph TD
    A[Audio Input] --> B{VAD: Speech?}
    B -->|Yes| C[Buffer Audio]
    B -->|No| D{Silence > Redemption?}
    D -->|No| C
    D -->|Yes| E[Smart Turn Analysis]
    E --> F{P(End) > Threshold?}
    F -->|Yes| G[Commit Text]
    F -->|No| C
    G --> H[Reset State]

Real-World Impact

Without Smart Turn:

User: "I think we should... [thinking pause]"
System: COMMIT → "I think we should"
User: "...consider the alternatives"
System: COMMIT → "consider the alternatives"

With Smart Turn:

User: "I think we should... [thinking pause] ...consider the alternatives"
System: (waiting, P(End) = 0.2)
System: (waiting, P(End) = 0.3)
User: [longer pause, falling intonation]
System: (P(End) = 0.7) COMMIT → "I think we should consider the alternatives"

Agentic Tools

gibb.eri.sh doesn’t just transcribe—it understands. And crucially, it understands context.

The Concept

A local LLM monitors your speech for intents. But unlike dumb assistants, gibb.eri.sh changes its capabilities based on what you are doing.

Contextual Modes

The available tools change dynamically based on your environment.

1. Global Mode (Default)

Always available.

  • Tools: web_search, app_launcher, system_control
  • Example: “Open Figma”, “Turn up the volume”, “What is quantum computing”

2. Meeting Mode

Triggered when: A meeting app (Zoom, Teams, Slack) is using the microphone.

  • Tools: transcript_marker, add_todo
  • Example: “Flag this as important”, “Add action item for Marc”

3. Dev Mode

Triggered when: An IDE (VS Code, IntelliJ, Terminal) is the active window.

  • Tools: git_voice, file_finder
  • Example: “Undo last commit”, “Find the user struct”

How It Works

The Pipeline

Context Engine ─▶ [State: Dev Mode]
                        │
                        ▼
User Speech ───▶ [Router] ───▶ Tool Registry (Filter: Dev + Global)
                                        │
                                        ▼
                                [FunctionGemma LLM]
                                (Only sees ~5 relevant tools)
                                        │
                                        ▼
                                [Executor] ─▶ git_voice

Event-Driven Architecture

The Tools plugin listens for stt:stream_commit events and combines them with the latest ContextState:

// plugins/tools/src/router.rs

// 1. Get current mode (e.g., Dev)
let mode = state.context.effective_mode();

// 2. Filter registry
let tools = registry.tools_for_mode(mode);

// 3. Build system prompt with ONLY those tools
let prompt = build_prompt(tools);

// 4. Run Inference
let result = llm.infer(prompt, user_text);

Why Dynamic Filtering?

  1. Accuracy: The LLM isn’t confused by “Book a flight” when you’re trying to “Book a meeting room”. Smaller search space = fewer hallucinations.
  2. Performance: Less text in the system prompt = faster inference.
  3. Safety: Destructive tools (like git reset) are only exposed when you are explicitly focusing on your code editor.
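A minimal sketch of the mode filter, assuming the registry is a list of (name, modes) pairs rather than the codebase’s real `Tool` trait. Global tools are always included; mode-specific tools only appear when their mode is active.

```rust
#[derive(Clone, Copy, PartialEq)]
pub enum Mode {
    Global,
    Dev,
    Meeting,
}

// Keep a tool if it is Global, or registered for the current mode.
pub fn tools_for_mode(all: &[(&'static str, &'static [Mode])], mode: Mode) -> Vec<&'static str> {
    all.iter()
        .filter(|(_, modes)| modes.contains(&Mode::Global) || modes.contains(&mode))
        .map(|(name, _)| *name)
        .collect()
}
```

With a registry of `web_search` (Global), `git_voice` (Dev), and `add_todo` (Meeting), switching to Dev mode exposes exactly two tools to the LLM.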

Context Injection

The LLM doesn’t just see your command—it sees your environment. Before every inference, we inject a context snapshot:

Current Context:
Mode: Dev
Active App: VS Code
Clipboard: "RuntimeError: Connection refused at port 8080"
Date: 2025-12-27

This enables implicit referencing:

| You say | LLM infers |
|---|---|
| “Search this error” | `web_search{query: "RuntimeError: Connection refused"}` |
| “Open that app” | Resolves from active window context |
| “What does this mean?” | Uses clipboard or selection |

What Gets Injected

  • Mode: Current mode (Global, Dev, Meeting)
  • Active App: Name of the focused application
  • Clipboard: First ~200 chars of clipboard text
  • Selection: Currently selected text (via Accessibility API)
  • Date: Current date (for scheduling-aware commands)
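A sketch of assembling that snapshot into the prompt block shown above. The struct and field names are assumptions for illustration; the clipboard is truncated to ~200 characters as described.

```rust
// Hypothetical context snapshot, mirroring the fields listed above.
pub struct ContextSnapshot {
    pub mode: &'static str,
    pub active_app: String,
    pub clipboard: String,
    pub date: String,
}

// Render the snapshot as the text block injected before every inference.
pub fn context_block(ctx: &ContextSnapshot) -> String {
    let clip: String = ctx.clipboard.chars().take(200).collect(); // first ~200 chars
    format!(
        "Current Context:\nMode: {}\nActive App: {}\nClipboard: \"{}\"\nDate: {}",
        ctx.mode, ctx.active_app, clip, ctx.date
    )
}
```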

The Magic Word: “This”

Because gibb.eri.sh knows your context, you can use deictic references:

  • User says: “Summarize this.”
  • Context Engine:
    1. Checks active app (e.g., Chrome).
    2. Grabs currently selected text (via Accessibility API).
    3. The LLM sees this in the context and fills the argument automatically.

We also support “what I just said”:

  • User says: “Create a todo from what I just said.”
  • System: Grabs the last 30 seconds of transcript history.

This allows generic commands to work across any application without specific integrations.

Feedback Loop

Tools don’t just execute—they respond. After a tool runs, the result is fed back to the LLM for summarization.

The Flow

User: "What is quantum computing?"
      │
      ▼
[FunctionGemma] → web_search{query: "quantum computing"}
      │
      ▼
[Wikipedia API] → {title: "Quantum computing", summary: "...uses qubits..."}
      │
      ▼
[FunctionGemma] → "Quantum computing uses qubits instead of classical bits,
                   enabling exponential speedups for certain problems."
      │
      ▼
[UI] → Displays summary (or speaks via TTS)

Why This Matters

  1. Accessibility: You don’t have to read raw JSON or API responses.
  2. Natural Language: Results are summarized conversationally.
  3. Composability: The model can chain thoughts based on results.

Available Tools

Global

  • System Control: Volume, Mute, Media keys.
  • App Launcher: Opens applications.
  • Web Search: Knowledge lookups (Wikipedia by default, extensible to other sources).
  • The Typer: Voice-controlled typing.
    • Smart Injection: Types short phrases char-by-char for natural interaction.
    • Transparent Paste: For long text blocks, it saves your current clipboard, pastes the content instantly via Cmd+V, and restores your original clipboard after a short delay.
    • Context Awareness: “Paste this here” knows to use the active selection as the source.

Meeting

  • Transcript Marker: Inserts [FLAG] or [TODO] tags into the transcript file.
  • Add Todo: Appends a line to your daily notes.

Development

  • Git Voice: Wraps common git commands.
  • File Finder: Uses mdfind (Spotlight) to locate files in the current project context.

Adding Custom Tools

Tools are defined in plugins/tools/src/tools/ and must implement is_available_in(mode):

impl Tool for GitVoiceTool {
    fn name(&self) -> &'static str { "git_voice" }

    fn modes(&self) -> &'static [Mode] {
        &[Mode::Dev]
    }

    // ...
}

Agent Skills

Extend gibb.eri.sh with Bash, Python, or Node.js.

The “Hands” of the Voice OS are extensible. We use the Agent Skills standard (SKILL.md) to let you add new tools without writing Rust.

How it works

  1. Drop a file: Put a SKILL.md file in ~/Library/Application Support/gibb.eri.sh/skills/.
  2. Define the tool: Describe what it does and the command to run.
  3. Speak: The LLM sees your new tool and uses it when relevant.

Example: Summarizer Skill

Create skills/summarize/SKILL.md:

---
name: super_summarizer
version: 1.0.0
description: Extract and summarize content from URLs.
---

## Tools

### extract_content

Extracts clean text from a URL.

**Command:**
```bash
npx @steipete/summarize {{source}} --extract-only
```

**Parameters:**

  • `source` (string, required): The URL.

## The Spec

We support a strict subset of the Agent Skills standard for safety.

### File Format
- **Frontmatter:** YAML with `name` and `description`.
- **Tool Blocks:** Markdown sections defining the tool name, description, command, and parameters.

### Execution Model
- **No Shell:** We execute the binary directly (`program` + `args`). No `sh -c`.
- **Interpolation:** `{{param}}` in the command block is replaced by the JSON argument from the LLM.
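
A minimal sketch of this interpolation, assuming the LLM’s JSON arguments arrive as a plain string map (no shell is involved — the result is substituted into argv):

```rust
use std::collections::HashMap;

// Replace each {{param}} placeholder with its argument value.
pub fn interpolate(arg_template: &str, params: &HashMap<String, String>) -> String {
    let mut out = arg_template.to_string();
    for (key, value) in params {
        out = out.replace(&format!("{{{{{}}}}}", key), value);
    }
    out
}
```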

### Context Awareness
You can restrict skills to specific modes by adding a `modes` field to the frontmatter:

```yaml
modes: [Dev, Global]
```

System Architecture

gibb.eri.sh is organized as a Modular Monolith—a single binary with strictly decoupled internal components.

Why Modular Monolith?

| Architecture | Pros | Cons |
|---|---|---|
| Monolith | Simple deployment, shared memory | Tight coupling, hard to test |
| Microservices | Independent scaling, isolation | Network overhead, complexity |
| Modular Monolith | Best of both | Requires discipline |

Performance of a monolith. Maintainability of services.

The Two Layers

┌─────────────────────────────────────────────────────────┐
│                     Tauri App                            │
│  ┌─────────────────────────────────────────────────────┐│
│  │                  plugins/                            ││
│  │  Adapters: Translate between crates and Tauri IPC   ││
│  │  • recorder/  • stt-worker/  • tools/               ││
│  └─────────────────────────────────────────────────────┘│
│                         │                                │
│                         ▼                                │
│  ┌─────────────────────────────────────────────────────┐│
│  │                   crates/                            ││
│  │  Pure Rust: Zero dependencies on Tauri or UI        ││
│  │  • audio/  • bus/  • context/  • stt/  • vad/       ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘

crates/ — The Engine

Pure Rust libraries with no UI dependencies:

  • Can be compiled to CLI tools
  • Can be wrapped with FFI for iOS/Android
  • Fully unit-testable

plugins/ — The Adapters

Tauri-specific glue code:

  • Exposes crate functionality as Tauri commands
  • Handles IPC serialization
  • Manages permissions

Key Design Patterns

Dependency Inversion

High-level modules don’t depend on low-level modules. Both depend on abstractions.

// crates/application doesn't know about Sherpa or Parakeet
// It only knows about the SttEngine trait
pub fn transcribe(engine: &dyn SttEngine, audio: &[f32]) -> Vec<Segment> {
    engine.transcribe(audio)
}

Strategy Pattern

Swap implementations at runtime without changing calling code.

let engine: Box<dyn SttEngine> = match config.mode {
    Mode::Streaming => Box::new(SherpaEngine::new()?),
    Mode::Batch => Box::new(ParakeetEngine::new()?),
};

Event-Driven Communication

Components communicate via events, not direct calls.

// Producer (STT Worker)
app.emit("stt:stream_commit", &segment)?;

// Consumer (Tools Plugin) - doesn't know about STT internals
app.listen("stt:stream_commit", |event| { ... });

Deep Dives

Crate Structure

The crates/ directory contains the domain logic. Each crate has a single responsibility and zero dependencies on Tauri or the UI.

Overview

crates/
├── application/     # Orchestration & State Machine
├── audio/           # Capture, AGC, Resampling
├── bus/             # Zero-copy Audio Pipeline
├── context/         # OS Awareness (Active App, Mic State)
├── detect/          # Meeting App Logic
├── events/          # Shared Event Contracts (DTOs)
├── models/          # Model Registry & Downloads
├── parakeet/        # NVIDIA Parakeet Backend
├── sherpa/          # Sherpa-ONNX Backend
├── smart-turn/      # Semantic Endpointing
├── storage/         # SQLite Persistence
├── stt/             # Engine Traits & Abstractions
├── transcript/      # Data Structures
├── turn/            # Turn Detection Traits
└── vad/             # Silero VAD Integration

Core Components

bus

The nervous system. Delivers audio from recorder to consumers. Key feature: Uses Arc<[f32]> so audio is allocated once and shared across all consumers.

context

The senses. Aggregates system state to drive the context engine.

  • Active App: Which window has focus?
  • Mic State: Is a meeting app using the hardware?
  • Mode: Derives intent (Dev, Meeting, Global).

stt

Defines the SttEngine trait. Infrastructure crates (sherpa, parakeet) implement this.

audio

Handles microphone capture and preprocessing:

  • Resampling: rubato for high-quality sample rate conversion.
  • AGC: Automatic gain control with soft-clipping.
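A toy version of the gain + soft-clip step. The crate’s actual AGC adapts its gain over time; this sketch applies a fixed gain and uses `tanh` as the soft clipper, which keeps every sample inside (-1.0, 1.0) without the harsh distortion of hard clipping.

```rust
// Apply a fixed gain, then soft-clip with tanh so loud input
// saturates smoothly instead of slamming into ±1.0.
pub fn apply_gain_soft_clip(samples: &mut [f32], gain: f32) {
    for s in samples.iter_mut() {
        *s = (*s * gain).tanh();
    }
}
```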

vad

Wraps Silero VAD for voice activity detection.


Dependency Graph

application
    ├── bus
    ├── stt (trait only)
    ├── vad (trait only)
    └── turn (trait only)

sherpa
    └── stt (implements SttEngine)

parakeet
    └── stt (implements SttEngine)

The application crate never imports sherpa or parakeet directly—only their traits.

Audio Bus

The audio bus distributes microphone data to multiple consumers (VAD, STT, visualizer) using shared memory.

Why Shared Memory?

At 16kHz mono, audio is only ~32KB/sec—not “big data.” The issue isn’t throughput, it’s latency consistency. Without shared memory, audio gets copied at each boundary (Mic → JS → Rust → Model → UI), and each copy can introduce jitter. Unpredictable delays destroy the real-time feel even if average latency is low.

Using Arc<[f32]> means one allocation, shared by all consumers. No copying, no jitter from allocations.

Design

Audio is allocated once and shared via Arc<[f32]>:

Mic → Recorder → Arc<[f32]> ─┬─▶ VAD
                             ├─▶ STT
                             └─▶ Visualizer

All consumers read the same memory.

Implementation

AudioChunk

pub struct AudioChunk {
    pub seq: u64,            // Monotonic sequence number
    pub ts_ms: i64,          // Capture timestamp
    pub sample_rate: u32,    // Always 16000 Hz
    pub samples: Arc<[f32]>, // The actual audio data
}

Arc<[f32]> is an atomically reference-counted slice. Memory is freed when the last consumer drops its reference.

AudioBus

pub struct AudioBus {
    tx: mpsc::Sender<AudioChunk>,
    config: BusConfig,
}

impl AudioBus {
    pub fn publish(&self, chunk: AudioChunk) -> Result<()> {
        self.tx.send(chunk)?;
        Ok(())
    }
}

Listener

pub struct Listener {
    rx: mpsc::Receiver<AudioChunk>,
    dropped: Arc<AtomicU64>,
}

impl Listener {
    pub async fn recv(&mut self) -> Option<AudioChunk> {
        self.rx.recv().await
    }

    pub fn drain_to_latest(&mut self) -> Option<AudioChunk> {
        // Skip old chunks, return only the newest
        let mut latest = None;
        while let Ok(chunk) = self.rx.try_recv() {
            self.dropped.fetch_add(1, Ordering::Relaxed);
            latest = Some(chunk);
        }
        latest
    }
}

Backpressure

What if STT can’t keep up with audio? Options:

  1. Block: Producer waits for consumer (bad: causes audio drops)
  2. Buffer: Queue grows unbounded (bad: uses memory, increases latency)
  3. Drop: Discard old data, keep real-time (good: for live transcription)

We use bounded channels with drop policy:

let (tx, rx) = mpsc::channel(BUFFER_SIZE); // e.g., 100 chunks

// If buffer is full, oldest chunks are available to drain

The drain_to_latest() method lets slow consumers catch up by skipping to the newest audio.
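A std-only sketch of the same drop policy (the real Listener uses tokio’s `try_recv` and counts every drained chunk; this toy counts only the chunks that were skipped in favor of a newer one):

```rust
use std::sync::mpsc;

// Slow consumer catches up by discarding everything but the newest item.
pub fn drain_to_latest<T>(rx: &mpsc::Receiver<T>) -> (Option<T>, u64) {
    let mut latest = None;
    let mut dropped = 0;
    while let Ok(item) = rx.try_recv() {
        if latest.is_some() {
            dropped += 1; // an older item is being skipped
        }
        latest = Some(item);
    }
    (latest, dropped)
}
```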

Pipeline Status

Performance metrics are tracked with atomic counters:

pub struct PipelineStatus {
    audio_lag_ms: AtomicI64,      // How far behind real-time
    inference_time_ms: AtomicU64, // Last model execution time
    dropped_chunks: AtomicU64,    // Backpressure indicator
}

Diagram

graph LR
    Mic[Microphone] -->|Raw Samples| Recorder
    Recorder -->|Arc&lt;[f32]&gt;| Bus[MPSC Channel]
    Bus -->|recv| VAD[Silero VAD]
    Bus -->|recv| STT[STT Engine]
    STT -->|Text Event| UI[Frontend]

Event System

Components communicate through events, not direct function calls. This enables loose coupling and easy extensibility.

The Contract (crates/events)

We avoid “stringly typed” programming by defining all event payloads in a shared crate.

// crates/events/src/lib.rs
#[derive(Serialize, Deserialize)]
pub struct StreamCommitEvent {
    pub text: String,
    pub confidence: f32,
}

This ensures that:

  1. Type Safety: Producers and consumers must agree on the struct definition.
  2. No Typos: Event names are constants (events::STT_STREAM_COMMIT).
  3. Versioning: Changes to the contract break the build, not runtime.

Two-Tier Architecture

We separate high-frequency data from low-frequency control:

Tier 1: Data (Rust Internal)

High-bandwidth, binary data that never leaves Rust:

| Channel | Type | Purpose |
|---|---|---|
| tokio::sync::mpsc | Bounded | Audio chunks |
| tokio::sync::broadcast | Unbounded | Control signals |

Tier 2: Control (Rust → Frontend)

Low-bandwidth metadata sent to the UI:

| Event | Payload | Frequency |
|---|---|---|
| stt:stream_commit | StreamCommitEvent | ~1/sec |
| context:changed | ContextChangedEvent | On focus change |

Event Flow

sequenceDiagram
    participant R as Recorder
    participant B as Audio Bus
    participant S as STT Worker
    participant E as Tauri Events
    participant T as Tools Plugin
    participant U as UI (React)

    R->>B: publish(Arc<[f32]>)
    B->>S: recv()
    S->>S: Inference
    S->>E: emit(StreamCommitEvent)
    par Parallel delivery
        E->>U: on(StreamCommitEvent)
        E->>T: listen(StreamCommitEvent)
    end
    T->>T: FunctionGemma Router

Developer Guide

Welcome, contributor! This guide will help you extend gibb.eri.sh.

Prerequisites

  • Rust (stable, 1.75+)
  • Node.js (20+)
  • macOS (for now—Linux/Windows coming)

Quick Start

# Clone
git clone https://github.com/mpuig/gibb.eri.sh
cd gibb.eri.sh

# Install frontend dependencies
cd apps/desktop && npm install

# Run in development mode
npm run tauri dev

Project Structure

gibb.eri.sh/
├── apps/
│   └── desktop/          # Tauri app
│       ├── src/          # React frontend
│       └── src-tauri/    # Rust backend
├── crates/               # Pure Rust libraries
├── plugins/              # Tauri plugin adapters
├── scripts/              # Build & conversion tools
└── docs/                 # This documentation

Development Workflow

Making Changes

  1. Pure logic? → Edit in crates/
  2. UI interaction? → Edit in plugins/
  3. Frontend? → Edit in apps/desktop/src/

Testing

# Run all Rust tests
cargo test --workspace

# Run a specific crate's tests
cargo test -p gibberish-bus

Building

# Debug build
cd apps/desktop && npm run tauri dev

# Release build
npm run tauri build

Guides

Code Style

Rust

  • Use rustfmt (default settings)
  • Prefer Result<T> over panics
  • Document public APIs with ///

TypeScript

  • Use Prettier (default settings)
  • Prefer functional components with hooks
  • Type everything (no any)

Getting Help

Adding Features

gibb.eri.sh is designed to be extensible. Depending on what you want to add, you have two paths: Agent Skills or Native Plugins.

Which path should I take?

| Goal | Path |
|---|---|
| Add a tool (Git, Jira, Docker, Scripts) | Agent Skill (Recommended) |
| Add a new audio processor or OS sensor | Native Crate/Plugin |
| Change the core STT/LLM logic | Native Crate |
Change the core STT/LLM logicNative Crate

1. The Easy Way: Agent Skills

If your feature involves running a CLI command or a script, do not write Rust. Use a Skill Pack. It’s faster, safer, and doesn’t require recompiling the app.


2. The Native Way: Plugins

Use this for features that need low-level OS access or high-performance data processing.

The Golden Rule

Domain logic in crates/. Tauri glue in plugins/.

Never put business logic in plugins. Plugins are thin adapters that translate between Rust and JavaScript.

Step-by-Step Example: Word Counter

Let’s add a “Native” feature that counts words in real-time.

Step 1: Create the Domain Crate

cd crates
cargo new --lib wordcount

crates/wordcount/src/lib.rs:

pub struct WordCounter {
    total: usize,
}

impl WordCounter {
    pub fn new() -> Self { Self { total: 0 } }
    pub fn add(&mut self, text: &str) -> usize {
        self.total += text.split_whitespace().count();
        self.total
    }
}

Step 2: Create the Tauri Plugin

cd plugins
cargo new --lib wordcount

plugins/wordcount/src/lib.rs:

use std::sync::Mutex;

use gibberish_events::event_names::STT_STREAM_COMMIT;
use gibberish_events::StreamCommitEvent;
use tauri::plugin::{Builder, TauriPlugin};
use tauri::{Listener, Manager, Runtime};
use wordcount::WordCounter;

pub fn init<R: Runtime>() -> TauriPlugin<R> {
    Builder::new("wordcount")
        .setup(|app, _api| {
            app.manage(Mutex::new(WordCounter::new()));

            // Listen for events using the shared contract
            app.listen_any(STT_STREAM_COMMIT, move |event| {
                if let Ok(payload) = serde_json::from_str::<StreamCommitEvent>(event.payload()) {
                    // Logic here...
                    let _ = payload;
                }
            });
            Ok(())
        })
        .build()
}

Testing Tips

Dependency Injection

Don’t use std::process::Command directly in your crates. Use the SystemEnvironment trait from plugins/tools. This allows you to mock OS calls in unit tests without actually executing code on the host.

Shared Events

Always use the gibberish-events crate for inter-plugin communication. This prevents runtime “stringly-typed” errors.

Adding Languages

gibb.eri.sh can transcribe any language for which a model exists. Here’s how to add one.

Overview

  1. Find a compatible model (CTC or Transducer)
  2. Convert to ONNX format
  3. Register in the model metadata
  4. Test!

Case Study: Adding Catalan

We added Catalan using a NeMo Conformer CTC model from Hugging Face.

Step 1: Find a Model

Good sources include the Hugging Face Hub. When evaluating a model, look for:

  • CTC or Transducer architecture (NOT encoder-decoder like Whisper)
  • 16kHz sample rate
  • Good accuracy on your target language

Step 2: Convert to ONNX

Most models are in PyTorch format. We need ONNX for Sherpa.

For NeMo Models

We provide a conversion script:

cd scripts
python export_nemo_ctc.py \
    --model "path/to/model.nemo" \
    --output "catalan-nemo-ctc" \
    --language "ca"

This produces:

  • model.onnx — The neural network
  • tokens.txt — The vocabulary

What the Script Does

import torch
import nemo.collections.asr as nemo_asr

# Load PyTorch model
model = nemo_asr.models.EncDecCTCModel.restore_from("model.nemo")

# Create dummy input for tracing
dummy_audio = torch.randn(1, 16000)  # 1 second of audio
dummy_length = torch.tensor([16000])

# Export to ONNX
torch.onnx.export(
    model,
    (dummy_audio, dummy_length),
    "model.onnx",
    input_names=["audio", "length"],
    output_names=["logits"],
    dynamic_axes={
        "audio": {0: "batch", 1: "time"},
        "length": {0: "batch"},
    },
)

# Extract vocabulary
with open("tokens.txt", "w") as f:
    for token in model.decoder.vocabulary:
        f.write(token + "\n")

Step 3: Host the Model

Upload to a public URL. Options:

  • Hugging Face Hub
  • GitHub Releases
  • S3/GCS bucket

Step 4: Register the Model

Edit crates/models/src/metadata.rs:

pub const MODELS: &[ModelMetadata] = &[
    // ... existing models
    ModelMetadata {
        id: "catalan-nemo-ctc",
        name: "NeMo Conformer (Catalan)",
        language: "ca",
        model_type: ModelType::NemoCtc,
        url: "https://huggingface.co/your-org/catalan-nemo-ctc/resolve/main/model.tar.gz",
        size_mb: 120,
        description: "Catalan speech recognition trained on Common Voice",
    },
];

Step 5: Implement the Engine (if needed)

If using an existing architecture (NeMo CTC), the engine already exists:

// crates/sherpa/src/nemo_ctc.rs
pub struct NemoCtcEngine {
    recognizer: sherpa_rs::OfflineRecognizer,
}

impl SttEngine for NemoCtcEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // ... implementation
    }
}

Step 6: Test

# Unit test
cargo test -p gibberish-sherpa nemo_ctc

# Integration test
cd apps/desktop && npm run tauri dev
# Select "NeMo Conformer (Catalan)" in Settings
# Speak in Catalan!

Model Requirements

Architecture Support

| Architecture | Supported | Notes |
|---|---|---|
| CTC | Yes | NeMo, Wav2Vec2 |
| Transducer | Yes | Zipformer, Conformer |
| Encoder-Decoder | Via Whisper | Use Whisper models directly |

Audio Format

All models must accept:

  • Sample rate: 16000 Hz
  • Channels: Mono
  • Format: Float32 PCM

Our gibberish-audio crate handles resampling automatically.

Vocabulary Format

tokens.txt should contain one token per line:

<blk>
a
b
c
...
z
'
<space>

Special tokens:

  • <blk> or <blank> — CTC blank token
  • <space> or ▁ — Word separator
  • <unk> — Unknown token

Troubleshooting

“Model produces garbage output”

Check vocabulary alignment. The token indices must match exactly.

“Model is slow”

Try quantization:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

“Model crashes on long audio”

Some models have a maximum sequence length. Chunk the audio:

const MAX_SECONDS: usize = 30;
let chunks = audio.chunks(MAX_SECONDS * 16000);

Contributing Models

If you successfully add a language:

  1. Upload to Hugging Face with a clear model card
  2. Add to MODELS in metadata.rs
  3. Submit a PR!

Contributions welcome for:

  • Spanish
  • French
  • German
  • Portuguese

Headless Engine

The core transcription engine has zero dependencies on Tauri or UI. You can use it standalone.

Why Headless?

  • CLI tools: Build command-line transcription utilities
  • Server applications: Run transcription as a service
  • Mobile apps: Wrap with FFI for iOS/Android
  • Testing: Unit test without UI overhead

Architecture

┌─────────────────────────────────────────┐
│           Your Application              │
│  ┌───────────────────────────────────┐  │
│  │     gibberish-application         │  │
│  │  (Orchestration & State Machine)  │  │
│  └───────────────────────────────────┘  │
│                   │                     │
│     ┌─────────────┼─────────────┐      │
│     ▼             ▼             ▼      │
│  ┌──────┐    ┌─────────┐   ┌───────┐  │
│  │ bus  │    │   stt   │   │  vad  │  │
│  └──────┘    └─────────┘   └───────┘  │
└─────────────────────────────────────────┘
         (No Tauri, No React)

Example: CLI Transcriber

Here’s a minimal CLI that transcribes a WAV file:

// examples/cli_transcribe.rs

use gibberish_audio::load_wav;
use gibberish_sherpa::WhisperEngine;
use gibberish_stt::SttEngine;

fn main() -> anyhow::Result<()> {
    let args: Vec<String> = std::env::args().collect();
    let wav_path = args.get(1).expect("Usage: cli_transcribe <file.wav>");

    // Load audio
    let audio = load_wav(wav_path)?;

    // Initialize engine
    let engine = WhisperEngine::new("path/to/whisper-small")?;

    // Transcribe
    let segments = engine.transcribe(&audio)?;

    // Print results
    for segment in segments {
        println!("[{:.2}s - {:.2}s] {}",
            segment.start_ms as f64 / 1000.0,
            segment.end_ms as f64 / 1000.0,
            segment.text
        );
    }

    Ok(())
}

Run it:

cargo run --example cli_transcribe recording.wav

Example: Real-Time Streaming

use gibberish_audio::{AudioCapture, AudioConfig};
use gibberish_bus::{AudioBus, AudioChunk};
use gibberish_sherpa::ZipformerEngine;
use gibberish_vad::SileroVad;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Set up audio capture
    let config = AudioConfig {
        sample_rate: 16000,
        channels: 1,
    };
    let capture = AudioCapture::new(config)?;

    // Set up bus
    let (bus, mut listener) = AudioBus::new(100);

    // Set up VAD and STT
    let mut vad = SileroVad::new()?;
    let engine = ZipformerEngine::new("path/to/zipformer")?;

    // Start capture
    capture.start(move |samples| {
        let chunk = AudioChunk::new(samples);
        let _ = bus.publish(chunk);
    })?;

    // Processing loop
    loop {
        if let Some(chunk) = listener.recv().await {
            if vad.is_speech(&chunk.samples)? {
                let result = engine.transcribe_streaming(&chunk.samples)?;
                if !result.text.is_empty() {
                    print!("{}", result.text);
                }
            }
        }
    }
}

FFI: Using from Swift/Kotlin

For mobile apps, expose a C-compatible interface:

Rust Side

// src/ffi.rs

use std::ffi::{CStr, CString};
use std::os::raw::c_char;

#[no_mangle]
pub extern "C" fn gibberish_init(model_path: *const c_char) -> *mut Engine {
    let path = unsafe { CStr::from_ptr(model_path) }.to_str().unwrap();
    let engine = Box::new(Engine::new(path).unwrap());
    Box::into_raw(engine)
}

#[no_mangle]
pub extern "C" fn gibberish_transcribe(
    engine: *mut Engine,
    audio: *const f32,
    length: usize,
) -> *mut c_char {
    let engine = unsafe { &*engine };
    let samples = unsafe { std::slice::from_raw_parts(audio, length) };

    let result = engine.transcribe(samples).unwrap();
    CString::new(result.text).unwrap().into_raw()
}

#[no_mangle]
pub extern "C" fn gibberish_free(engine: *mut Engine) {
    unsafe { drop(Box::from_raw(engine)); }
}

#[no_mangle]
pub extern "C" fn gibberish_free_string(s: *mut c_char) {
    unsafe { drop(CString::from_raw(s)); }
}

Swift Side

// Gibberish.swift

import Foundation

class Gibberish {
    private var engine: OpaquePointer?

    init(modelPath: String) {
        engine = gibberish_init(modelPath)
    }

    deinit {
        if let engine = engine {
            gibberish_free(engine)
        }
    }

    func transcribe(audio: [Float]) -> String {
        guard let engine = engine else { return "" }

        let result = audio.withUnsafeBufferPointer { ptr in
            gibberish_transcribe(engine, ptr.baseAddress, ptr.count)
        }

        defer { gibberish_free_string(result) }
        return String(cString: result!)
    }
}

Building for iOS

# Add iOS targets
rustup target add aarch64-apple-ios

# Build static library
cargo build --release --target aarch64-apple-ios

# The library will be at:
# target/aarch64-apple-ios/release/libgibberish.a

For production FFI, use UniFFI to auto-generate bindings:

# Cargo.toml
[dependencies]
uniffi = "0.25"

[build-dependencies]
uniffi = { version = "0.25", features = ["build"] }
// src/lib.rs

#[uniffi::export]
pub fn transcribe(model_path: String, audio: Vec<f32>) -> String {
    let engine = Engine::new(&model_path).unwrap();
    engine.transcribe(&audio).unwrap().text
}

UniFFI generates Swift, Kotlin, Python, and Ruby bindings automatically.

Performance Considerations

When running headless:

  1. Thread management: You control threading, not Tauri
  2. Memory: No WebView overhead (~100MB savings)
  3. Startup: No UI initialization (~500ms faster)

For servers, consider:

  • Connection pooling for engines (expensive to create)
  • Request queuing during high load
  • Graceful degradation when overloaded

Implementation Details

This section documents implementation details that affect perceived responsiveness.

Simulated Streaming

Making batch models feel real-time.

Silence Injection

Prepending silence to prevent hallucinations.

Lock-Free Metrics

Using atomics for metrics instead of mutexes.

Threading Model

Why we use std::thread instead of tokio::spawn for inference.

Audio Hygiene

Resampling and AGC for consistent input quality.

Meeting Detection

Detecting when Zoom/Teams is running.

Simulated Streaming

Batch models are more accurate but have high latency. We decouple visual feedback from final transcription by running partial inference on growing audio buffers.

The “volatile” text shown during recording isn’t a trick—it’s a valid partial hypothesis based on audio heard so far. Human brains work similarly: we predict words before hearing them fully and revise as needed.

The Problem

| Model Type | Accuracy | Latency | Feel |
|---|---|---|---|
| Streaming | Good | ~50ms | Live, responsive |
| Batch | Excellent | ~2000ms | Sluggish, frustrating |

Users are impatient. A 2-second delay feels broken. But batch models are significantly more accurate, especially for:

  • Proper nouns (“Kubernetes” vs “Cooper Netties”)
  • Rare words
  • Accented speech

The Solution

Run partial inference on the growing audio buffer every 500ms.

Time    Buffer              Display           State
─────────────────────────────────────────────────────
0ms     []                  (empty)           waiting
200ms   [audio...]          (empty)           buffering
500ms   [audio......]       "The quick"       volatile
1000ms  [audio..........]   "The quick brown" volatile
1200ms  (pause detected)    "The quick brown" processing
1400ms  (inference done)    "The quick brown fox." stable ✓

Implementation

pub struct SimulatedStreamer {
    buffer: Vec<f32>,
    engine: Box<dyn SttEngine>,
    partial_interval: Duration,
    last_partial: Instant,
}

impl SimulatedStreamer {
    pub fn push_audio(&mut self, chunk: &[f32]) -> Option<PartialResult> {
        self.buffer.extend_from_slice(chunk);

        // Emit partial every 500ms
        if self.last_partial.elapsed() >= self.partial_interval {
            self.last_partial = Instant::now();

            let result = self.engine.transcribe(&self.buffer).ok()?;
            return Some(PartialResult {
                text: result.text,
                is_final: false,
            });
        }

        None
    }

    pub fn commit(&mut self) -> FinalResult {
        // Run final inference on complete buffer
        let result = self.engine.transcribe(&self.buffer).unwrap();

        // Clear for next utterance
        self.buffer.clear();

        FinalResult {
            text: result.text,
            is_final: true,
        }
    }
}

UX: Volatile vs Stable Text

We visually distinguish draft from final:

// Frontend
function TranscriptLine({ segment }: { segment: Segment }) {
    return (
        <span className={segment.is_final ? 'text-white' : 'text-gray-500'}>
            {segment.text}
        </span>
    );
}
  • Volatile (gray): Partial hypothesis, may be revised
  • Stable (white): Final transcription

Edge Cases

Partial Overwrites

Each partial replaces the previous:

Partial 1: "The quick"
Partial 2: "The quick brown"      // Replaces partial 1
Partial 3: "The quick brown fox"  // Replaces partial 2
Final:     "The quick brown fox." // Replaces partial 3

Long Utterances

For very long speech (>30s), we chunk the buffer:

const MAX_BUFFER_SECONDS: usize = 30;

if self.buffer.len() > MAX_BUFFER_SECONDS * 16000 {
    // Force commit and start fresh
    self.commit();
}

Rapid Corrections

If the user speaks, pauses briefly, then continues, we may commit prematurely. Smart Turn detection helps, but isn’t perfect. We accept occasional mis-commits in exchange for responsiveness.

Performance

| Metric | Pure Batch | Simulated Streaming |
|---|---|---|
| Perceived latency | 2000ms | 500ms |
| Accuracy | Excellent | Excellent (same model) |
| CPU usage | Lower | Higher (repeated inference) |

The CPU trade-off is worth it for UX.

When NOT to Use

Simulated streaming adds overhead. Skip it when:

  • Processing pre-recorded files (no need for real-time feel)
  • Running on low-end hardware (CPU budget matters)
  • Accuracy is more important than speed (archival use case)

Silence Injection

The “clear throat” hack that prevents hallucinations.

The Problem

Streaming decoders maintain internal state. When speech ends, this state can get “stuck” in a loop:

User says: "Hello world"
User stops: (silence)
Model outputs: "Hello world. Thank you. Thank you. Thank you..."

The model is hallucinating. It expects more input and fills the gap with plausible-sounding garbage.

Why It Happens

Transducer models have a “joiner” network that predicts the next token based on:

  1. Acoustic features (from audio)
  2. Previous predictions (from decoder state)

During silence, acoustic features are near-zero, but the decoder state still has momentum from the previous words. The model “invents” continuations.

The Solution

Explicitly feed silence into the decoder to reset its state:

const SILENCE_DURATION_MS: usize = 100;
const SILENCE_SAMPLES: usize = SILENCE_DURATION_MS * 16; // 16 samples/ms at 16kHz

pub fn inject_silence(&mut self) {
    let silence = vec![0.0f32; SILENCE_SAMPLES];
    self.recognizer.accept_waveform(&silence);

    // Force decoder to flush
    self.recognizer.input_finished();
}

When to Inject

Trigger silence injection when:

  1. VAD detects speech-end (transition from speech to silence)
  2. A configurable grace period has passed (e.g., 300ms)
  3. Before requesting final output

impl StreamingTranscriber {
    pub fn on_vad_speech_end(&mut self) {
        // Wait for Smart Turn confirmation
        if self.smart_turn.is_likely_complete() {
            self.inject_silence();
            let final_text = self.recognizer.get_result();
            self.emit_commit(final_text);
            self.reset_state();
        }
    }
}

The “Digital Silence”

We inject zeros, not actual recorded silence. Why?

| Type | Contents | Effect |
|---|---|---|
| Recorded silence | Room noise, hum | Model might hear “words” in noise |
| Digital silence | Pure zeros | Unambiguous “nothing to hear” |

// Good: Pure digital silence
let silence = vec![0.0f32; 1600];

// Bad: Recorded silence (might contain noise)
let silence = record_ambient_audio(100);

How Much Silence?

We experimentally tuned to 100ms:

| Duration | Effect |
|---|---|
| 50ms | Sometimes not enough to reset |
| 100ms | Reliable reset, minimal delay |
| 200ms | Works but adds unnecessary latency |

const SILENCE_MS: usize = 100;

Interaction with Smart Turn

Silence injection happens after Smart Turn confirms completion:

graph TD
    A[VAD: Silence Detected] --> B{Smart Turn?}
    B -->|P(End) < 0.5| C[Keep Listening]
    B -->|P(End) >= 0.5| D[Inject Silence]
    D --> E[Get Final Result]
    E --> F[Emit Commit]
    F --> G[Reset State]

If we inject too early, we cut off the user mid-sentence.

Code

// crates/sherpa/src/streaming.rs

impl StreamingRecognizer {
    pub fn end_utterance(&mut self) -> String {
        // Inject silence to flush decoder
        let silence = vec![0.0f32; 1600]; // 100ms
        self.accept_waveform(&silence);

        // Mark input as finished
        self.input_finished();

        // Get final result
        let result = self.final_result();

        // Reset for next utterance
        self.reset();

        result
    }
}

Without Silence Injection

Input:  "The quick brown fox"
Output: "The quick brown fox jumps over the lazy dog thank you thank you"
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                              Hallucination

With Silence Injection

Input:  "The quick brown fox"
Output: "The quick brown fox"
                              (clean end)

The difference is dramatic for user experience.

Atomic Observability

The audio pipeline updates metrics frequently. Using mutexes would cause contention between the audio thread and UI thread, so we use atomic types instead.

Data Structure

use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

pub struct PipelineStatus {
    pub audio_lag_ms: AtomicI64,
    pub inference_time_ms: AtomicU64,
    pub dropped_chunks: AtomicU64,
    pub total_chunks: AtomicU64,
}

Atomic operations compile to single CPU instructions and don’t block.

Implementation

Writing (Audio Thread)

impl PipelineStatus {
    pub fn update_lag(&self, lag_ms: i64) {
        self.audio_lag_ms.store(lag_ms, Ordering::Relaxed);
    }

    pub fn record_inference(&self, time_ms: u64) {
        self.inference_time_ms.store(time_ms, Ordering::Relaxed);
    }

    pub fn increment_dropped(&self) {
        self.dropped_chunks.fetch_add(1, Ordering::Relaxed);
    }
}

Reading (UI Thread)

impl PipelineStatus {
    pub fn snapshot(&self) -> MetricsSnapshot {
        MetricsSnapshot {
            audio_lag_ms: self.audio_lag_ms.load(Ordering::Relaxed),
            inference_time_ms: self.inference_time_ms.load(Ordering::Relaxed),
            dropped_chunks: self.dropped_chunks.load(Ordering::Relaxed),
            total_chunks: self.total_chunks.load(Ordering::Relaxed),
        }
    }
}

Memory Ordering

We use Ordering::Relaxed because:

  1. We don’t need synchronization between different metrics
  2. We only care about “eventually consistent” values
  3. It’s the fastest ordering

For metrics dashboards, slightly stale data is acceptable.

Sharing Across Threads

use std::sync::Arc;

// Create shared status
let status = Arc::new(PipelineStatus::default());

// Clone for audio thread
let audio_status = Arc::clone(&status);
std::thread::spawn(move || {
    loop {
        // Update metrics without blocking
        audio_status.update_lag(compute_lag());
    }
});

// Clone for UI polling
let ui_status = Arc::clone(&status);
tokio::spawn(async move {
    loop {
        let snapshot = ui_status.snapshot();
        emit_metrics(&snapshot);
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
});

What We Track

| Metric | Type | Meaning |
|---|---|---|
| `audio_lag_ms` | i64 | Time since audio was captured |
| `inference_time_ms` | u64 | Last model execution time |
| `dropped_chunks` | u64 | Backpressure indicator |
| `total_chunks` | u64 | For calculating drop rate |

Derived Metrics

impl MetricsSnapshot {
    pub fn drop_rate(&self) -> f64 {
        if self.total_chunks == 0 {
            0.0
        } else {
            self.dropped_chunks as f64 / self.total_chunks as f64
        }
    }

    pub fn real_time_factor(&self) -> f64 {
        // RTF < 1.0 means faster than real-time
        self.inference_time_ms as f64 / 1000.0 / CHUNK_DURATION_SECONDS
    }
}

UI Display

function MetricsDisplay() {
    const [metrics, setMetrics] = useState<Metrics | null>(null);

    useEffect(() => {
        const unlisten = listen<Metrics>('metrics:update', (event) => {
            setMetrics(event.payload);
        });
        return () => { unlisten.then(f => f()); };
    }, []);

    if (!metrics) return null;

    return (
        <div className="text-xs text-gray-500">
            Latency: {metrics.audio_lag_ms}ms |
            RTF: {metrics.real_time_factor.toFixed(2)} |
            Drops: {(metrics.drop_rate * 100).toFixed(1)}%
        </div>
    );
}

Debugging Tip

When logging metrics, take a snapshot first rather than loading individual atomics separately:

let snap = status.snapshot();
debug!("Metrics: {:?}", snap);

Threading Model

We use std::thread for inference, not tokio::spawn. Here’s why.

The Mistake We Made

Our first version looked like this:

// DON'T DO THIS
tokio::spawn(async move {
    loop {
        let chunk = rx.recv().await?;
        let result = engine.transcribe(&chunk.samples)?; // BLOCKS FOR 100ms!
        app.emit("stt:update", &result)?;
    }
});

This worked… until it didn’t. Under load, the UI froze. Audio dropped. Everything felt sluggish.

The Problem

ONNX Runtime inference is CPU-bound and blocking. A single transcribe() call might take 50-200ms of pure CPU work.

Tokio’s async runtime assumes tasks yield frequently. When a task blocks for 100ms, it starves other tasks:

Task A: transcribe() ──────────────────────────────────▶ done
Task B: (waiting for audio)  ..........................  (finally runs)
Task C: (waiting for UI event) ........................  (finally runs)
                              ▲
                              100ms of nothing happening

The Tokio docs explicitly warn about this.

The Fix

Move blocking work to dedicated OS threads:

// Dedicated thread for inference
std::thread::spawn(move || {
    loop {
        // Block here - it's fine, we're on our own thread
        let chunk = rx.blocking_recv().unwrap();
        let result = engine.transcribe(&chunk.samples).unwrap();

        // Send result back to async world
        result_tx.blocking_send(result).unwrap();
    }
});

// Async task just forwards results
tokio::spawn(async move {
    while let Some(result) = result_rx.recv().await {
        app.emit("stt:update", &result).unwrap();
    }
});

Thread Allocation

| Thread | Purpose | Priority |
|---|---|---|
| Main | Tauri/UI event loop | Normal |
| Audio | cpal callback | High (OS-managed) |
| STT | ONNX inference | Normal |
| VAD | Silero inference | Normal |

We don’t set thread priorities manually—the OS scheduler handles it well enough for our needs.

Why Not spawn_blocking?

Tokio provides spawn_blocking() for blocking tasks:

tokio::task::spawn_blocking(move || {
    engine.transcribe(&samples)
}).await?

This works, but:

  1. May create a new thread when the pool has no idle worker (overhead)
  2. Limited by max_blocking_threads (defaults to 512)
  3. Pooled threads are not reused predictably

For a continuous stream of inference calls, a dedicated thread is simpler and more predictable.

Channel Selection

We need channels that bridge sync and async:

// Option 1: tokio::sync::mpsc (what we use)
let (tx, mut rx) = tokio::sync::mpsc::channel(100);
// tx.blocking_send() from sync thread
// rx.recv().await from async task

// Option 2: crossbeam + tokio wrapper
// More complex, no real benefit for our use case

Memory Considerations

Each thread has its own stack (default 2MB on macOS). With 4 threads:

  • Audio thread: ~2MB
  • STT thread: ~2MB + model memory
  • VAD thread: ~2MB + model memory
  • Main thread: ~2MB

The model memory dominates. Thread stacks are negligible.

Debugging

Thread bugs are subtle. Tools that help:

# See thread count
ps -M <pid>

# Profile with Instruments
xcrun xctrace record --template "Time Profiler" --launch ./gibberish

# Logging (add to Cargo.toml)
# tracing = "0.1"
# tracing-subscriber = "0.3"

Error Handling

Threads don’t propagate panics to the main thread. Handle errors explicitly:

std::thread::spawn(move || {
    let result = std::panic::catch_unwind(|| {
        // Inference loop
    });

    if let Err(e) = result {
        eprintln!("STT thread panicked: {:?}", e);
        // Notify main thread via channel
        error_tx.send(SttError::ThreadPanic).ok();
    }
});

Code Reference

The actual implementation lives in plugins/stt-worker/src/worker.rs.

Audio Hygiene

Bad microphones shouldn’t mean bad transcripts. We fix what we can.

The Problems

Consumer microphones vary wildly:

  • Built-in laptop mics pick up fan noise
  • USB mics have different gain settings
  • Sample rates range from 8kHz to 96kHz
  • Some mics clip, others are too quiet

Models expect clean, consistent 16kHz audio. We bridge the gap.

Resampling

All models need 16kHz mono audio. Users have everything else.

Why Sinc Interpolation?

use rubato::{FftFixedIn, Resampler};

let resampler = FftFixedIn::<f32>::new(
    input_rate,   // e.g., 44100
    16000,        // target
    chunk_size,
    2,            // sub-chunks
    1,            // channels
)?;

We use rubato’s FFT-based sinc resampling. Alternatives:

| Method | Quality | Speed | Our Use |
|---|---|---|---|
| Nearest neighbor | Terrible | Fast | Never |
| Linear | Poor | Fast | Never |
| Sinc (rubato) | Excellent | Medium | Yes |

Linear interpolation creates aliasing artifacts that sound “robotic.” Speech recognition models weren’t trained on robotic audio—they perform worse.

The CPU cost of proper resampling is negligible compared to inference.

Automatic Gain Control

The Problem

User A (quiet voice):     ▁▁▂▁▁▂▁ (signal barely visible)
User B (loud voice):      ▇▇█▇▇█▇ (signal clipping)
Model expects:            ▃▄▅▄▃▅▄ (normalized range)

Our Solution

Soft-knee compression with tanh:

const TARGET_DB: f32 = -20.0;
const ATTACK_MS: f32 = 10.0;
const RELEASE_MS: f32 = 100.0;

pub struct Agc {
    gain: f32,
    target_rms: f32,
    smoothing: f32, // derived from ATTACK_MS / RELEASE_MS
}

impl Agc {
    pub fn process(&mut self, samples: &mut [f32]) {
        let rms = calculate_rms(samples);
        let target_gain = self.target_rms / rms.max(1e-10);

        // Smooth gain changes to avoid clicks
        self.gain = lerp(self.gain, target_gain, self.smoothing);

        // Apply gain with soft clipping
        for sample in samples.iter_mut() {
            *sample = (*sample * self.gain).tanh();
        }
    }
}

The tanh function provides soft clipping—instead of hard clipping at ±1.0 (which sounds harsh), it smoothly compresses peaks.

Target Level

We target -20 dBFS. Why?

  • Leaves headroom for peaks
  • Matches typical model training data
  • Consistent across different mic gains

DC Offset Removal

Some cheap mics have DC offset—the signal “floats” above or below zero:

Bad:   ▄▅▆▅▄▅▆▅▄▅  (offset from zero)
Good:  ▃▄▅▄▃▄▅▄▃▄  (centered on zero)

We use a simple high-pass filter:

use std::f32::consts::PI;

const CUTOFF_HZ: f32 = 20.0; // Remove everything below 20Hz

pub fn remove_dc(samples: &mut [f32], state: &mut f32) {
    let alpha = 1.0 - (2.0 * PI * CUTOFF_HZ / 16000.0);
    for sample in samples.iter_mut() {
        let new_state = *sample + alpha * *state;
        *sample = new_state - *state;
        *state = new_state;
    }
}

Noise Gate

We don’t use one. Here’s why:

Noise gates cut audio below a threshold. In theory, they reduce background noise. In practice:

  1. They clip word beginnings (“hello” → “ello”)
  2. Silero VAD already handles speech detection
  3. Models are trained on noisy data and handle it fine

If the environment is so noisy that VAD triggers incorrectly, a noise gate won’t help—the user needs a better mic or quieter room.

Preprocessing Pipeline

Audio flows through these stages in order:

Mic → DC Remove → Resample → AGC → Model

Each stage is independent and stateless (except AGC’s smoothing state).

Testing

We keep a collection of “pathological” audio files:

  • Recorded at 8kHz
  • Heavy background noise
  • Extreme clipping
  • Strong DC offset

CI runs inference on these files. If accuracy drops, we investigate.

Code

  • Resampling: crates/audio/src/resample.rs
  • AGC: crates/audio/src/agc.rs
  • Pipeline: crates/audio/src/stream.rs

The Context Engine

gibb.eri.sh knows what you’re doing. Here’s how.

The Goal

To enable Context-Aware AI, we need to know the user’s state without burning the CPU.

  • Are they coding? (Enable Git tools)
  • Are they in a meeting? (Enable Transcription tools)
  • Are they looking at a specific URL? (Provide deep context)

The Implementation

We use a high-frequency polling loop in crates/context that builds a real-time snapshot of the OS state.

1. Active App Detection (Native Cocoa)

We use the macOS Cocoa API (NSWorkspace) via the objc crate to detect the focused application.

Why Native instead of AppleScript?

  • Performance: Sub-millisecond execution. No subprocess fork/exec overhead.
  • Efficiency: Negligible CPU usage even at 1s polling intervals.
  • Reliability: Directly queries the Window Server for the frontmostApplication.

2. Browser Deep Context (URL Detection)

When a supported browser (Chrome, Safari, Arc, Brave) is focused, we go deeper.

  • Mechanism: We use a targeted AppleScript call to fetch the URL of the active tab.
  • Optimization: We only trigger the AppleScript if the active application is a browser, preventing unnecessary overhead.
  • Value: This allows “Summarize this page” to work by feeding the URL directly to our extraction tools.

3. Meeting Detection (The Activity)

We monitor CoreAudio to see if known meeting apps (Zoom, Teams) are accessing the microphone.

  • Crate: crates/detect (wrapped by context)
  • Logic: is_mic_active && is_meeting_app(bundle_id)
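The predicate itself is tiny. A sketch (the bundle IDs here are illustrative; the real list lives in crates/detect):

```rust
// Illustrative bundle IDs — the authoritative list lives in crates/detect
const MEETING_APPS: &[&str] = &["us.zoom.xos", "com.microsoft.teams2"];

fn is_meeting_app(bundle_id: &str) -> bool {
    MEETING_APPS.contains(&bundle_id)
}

fn in_meeting(is_mic_active: bool, bundle_id: &str) -> bool {
    is_mic_active && is_meeting_app(bundle_id)
}

fn main() {
    assert!(in_meeting(true, "us.zoom.xos"));
    assert!(!in_meeting(false, "us.zoom.xos")); // mic off → not a meeting
    assert!(!in_meeting(true, "com.apple.Safari")); // mic on, but not a meeting app
}
```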

Privacy

  • Local Only: No context data leaves the device.
  • Targeted: We only care about specific bundle_ids. We don’t read window titles or keystrokes.
  • Incognito Awareness: We attempt to detect and ignore private browsing windows to avoid leaking sensitive URLs into the LLM context.

Clean Architecture

We use Dependency Inversion to keep the codebase maintainable. Here’s the pattern.

The Problem We Avoided

Imagine adding a new STT engine:

#![allow(unused)]
fn main() {
// BAD: Direct dependencies everywhere
match config.engine {
    Engine::Sherpa => sherpa::transcribe(&audio),
    Engine::Parakeet => parakeet::transcribe(&audio),
    Engine::NewEngine => new_engine::transcribe(&audio), // ADD HERE
}
// ... and here, and here, and here
}

Every new engine means touching multiple files. Tests break. Things get coupled.

The Solution: Trait-Based Abstraction

Define a trait. Implement it. Inject the implementation.

The SttEngine Trait

#![allow(unused)]
fn main() {
// crates/stt/src/engine.rs

pub trait SttEngine: Send + Sync {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>>;
    fn is_streaming_capable(&self) -> bool;
    fn model_name(&self) -> &str;
    fn supported_languages(&self) -> Vec<&'static str>;
}
}

Implementations

#![allow(unused)]
fn main() {
// crates/sherpa/src/zipformer.rs
impl SttEngine for ZipformerEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Sherpa-specific implementation
    }
    // ...
}

// crates/parakeet/src/lib.rs
impl SttEngine for ParakeetEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Parakeet-specific implementation
    }
    // ...
}
}

Usage

The application layer never knows which engine it’s using:

#![allow(unused)]
fn main() {
// crates/application/src/transcriber.rs

pub struct Transcriber {
    engine: Box<dyn SttEngine>,
}

impl Transcriber {
    pub fn new(engine: Box<dyn SttEngine>) -> Self {
        Self { engine }
    }

    pub fn process(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        self.engine.transcribe(audio)
    }
}
}

The Factory Pattern

How do we create the right engine at runtime?

#![allow(unused)]
fn main() {
// crates/stt/src/loader.rs

pub trait EngineLoader: Send + Sync {
    fn name(&self) -> &str;
    fn can_load(&self, model_id: &str) -> bool;
    fn load(&self, model_path: &Path) -> Result<Box<dyn SttEngine>>;
}

// Usage
pub fn create_engine(
    loaders: &[Box<dyn EngineLoader>],
    model_id: &str,
    path: &Path,
) -> Result<Box<dyn SttEngine>> {
    for loader in loaders {
        if loader.can_load(model_id) {
            return loader.load(path);
        }
    }
    Err(Error::UnknownModel(model_id.to_string()))
}
}

Adding a New Engine

Adding WhisperTurbo requires:

  1. Create crates/whisper-turbo/
  2. Implement SttEngine
  3. Implement EngineLoader
  4. Register the loader at startup

No changes to crates/application/. No changes to existing engines. No changes to the UI.

#![allow(unused)]
fn main() {
// crates/whisper-turbo/src/lib.rs

pub struct WhisperTurboEngine { /* ... */ }

impl SttEngine for WhisperTurboEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Implementation
    }
    // ...
}

pub struct WhisperTurboLoader;

impl EngineLoader for WhisperTurboLoader {
    fn name(&self) -> &str { "whisper-turbo" }

    fn can_load(&self, model_id: &str) -> bool {
        model_id.starts_with("whisper-turbo")
    }

    fn load(&self, path: &Path) -> Result<Box<dyn SttEngine>> {
        Ok(Box::new(WhisperTurboEngine::new(path)?))
    }
}
}
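Registration at startup then amounts to building the loader list and handing it to create_engine. A compressed, self-contained sketch (the traits are simplified to one method each; the real ones carry more):

```rust
use std::path::Path;

// Simplified stand-ins for the real traits in crates/stt
trait SttEngine {
    fn model_name(&self) -> &str;
}

trait EngineLoader {
    fn can_load(&self, model_id: &str) -> bool;
    fn load(&self, path: &Path) -> Result<Box<dyn SttEngine>, String>;
}

struct SherpaEngine;
impl SttEngine for SherpaEngine {
    fn model_name(&self) -> &str { "sherpa" }
}

struct SherpaLoader;
impl EngineLoader for SherpaLoader {
    fn can_load(&self, model_id: &str) -> bool { model_id.starts_with("sherpa") }
    fn load(&self, _path: &Path) -> Result<Box<dyn SttEngine>, String> {
        Ok(Box::new(SherpaEngine))
    }
}

fn create_engine(
    loaders: &[Box<dyn EngineLoader>],
    model_id: &str,
    path: &Path,
) -> Result<Box<dyn SttEngine>, String> {
    loaders
        .iter()
        .find(|l| l.can_load(model_id))
        .ok_or_else(|| format!("unknown model: {model_id}"))?
        .load(path)
}

fn main() {
    // Registered once at startup; list order decides priority on overlap
    let loaders: Vec<Box<dyn EngineLoader>> = vec![Box::new(SherpaLoader)];

    let engine = create_engine(&loaders, "sherpa-zipformer", Path::new("/tmp")).unwrap();
    assert_eq!(engine.model_name(), "sherpa");

    // Unknown model IDs fall through every loader and error out
    assert!(create_engine(&loaders, "nope", Path::new("/tmp")).is_err());
}
```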

The Dependency Graph

                    ┌─────────────────┐
                    │   application   │
                    │  (orchestration)│
                    └────────┬────────┘
                             │ depends on trait
                             ▼
                    ┌─────────────────┐
                    │       stt       │
                    │   (SttEngine)   │
                    └────────┬────────┘
                             │ implemented by
            ┌────────────────┼────────────────┐
            ▼                ▼                ▼
     ┌──────────┐     ┌──────────┐     ┌──────────┐
     │  sherpa  │     │ parakeet │     │ whisper  │
     └──────────┘     └──────────┘     └──────────┘

application never imports sherpa, parakeet, or whisper directly. It only knows SttEngine.

Testing

Trait-based design makes testing simple:

#![allow(unused)]
fn main() {
struct MockEngine {
    response: Vec<Segment>,
}

impl SttEngine for MockEngine {
    fn transcribe(&self, _audio: &[f32]) -> Result<Vec<Segment>> {
        Ok(self.response.clone())
    }
    // ...
}

#[test]
fn test_transcriber() {
    let mock = MockEngine {
        response: vec![Segment { text: "hello".into(), ..Default::default() }],
    };

    let transcriber = Transcriber::new(Box::new(mock));
    let result = transcriber.process(&[0.0; 1600]).unwrap();

    assert_eq!(result[0].text, "hello");
}
}

No model files needed. No inference overhead. Fast tests.

Other Traits

The same pattern applies elsewhere:

Trait                   Location         Implementations
SttEngine               crates/stt       Sherpa, Parakeet
VoiceActivityDetector   crates/vad       Silero
TurnDetector            crates/turn      SmartTurn, Simple
SessionStorage          crates/storage   SQLite

The Trade-Off

Trait objects have runtime cost:

  • Dynamic dispatch (vtable lookup)
  • Can’t be inlined

For inference (already 50-200ms), this overhead is negligible. We measured <1μs per trait call.

If performance mattered here, we’d use generics:

#![allow(unused)]
fn main() {
// Generic (faster, but less flexible)
pub struct Transcriber<E: SttEngine> {
    engine: E,
}

// Trait object (our choice: flexible, dynamic)
pub struct Transcriber {
    engine: Box<dyn SttEngine>,
}
}

We chose flexibility. Runtime engine switching is worth the microseconds.