gibb.eri.sh
“Most voice bots suck. I decided to build the one I actually wanted to use.”
This is the documentation for gibb.eri.sh (v0.9.0), a local-first Voice OS for macOS.
The Story
I build voice bots professionally. I’ve seen the sausage made:
- Latency: Sending audio to the cloud takes 500ms minimum.
- Privacy: Your voice data is training someone else’s model.
- Context: Cloud bots don’t know you’re looking at VS Code.
I wanted a tool that felt instant, respected my privacy, and could actually do things on my computer. Since it didn’t exist, I built it.
What is it?
It’s a desktop app that sits in your menu bar. It listens (when you tell it to), transcribes in real-time, and executes Skills.
Key Capabilities
- Context Awareness: It polls the OS to know what app is focused. If you’re in a terminal, it enables Git tools. If you’re in Zoom, it enables Note-taking tools.
- Zero Latency: We use a custom Zero-Copy Audio Bus in Rust to stream microphone data directly to local ONNX models.
- Agent Skills: You can extend the capabilities by dropping a SKILL.md file into a folder. It supports Bash, Python, and Node scripts.
Who is this for?
Developers and Hackers. This is 0.9.0 software. It’s powerful, but it assumes you know what a “terminal” is. Ideally, it will become useful for everyone, but right now, it’s a power tool.
Tech Stack
| Component | Tech | Why? |
|---|---|---|
| Core | Rust | Memory safety, threading, no GC pauses. |
| UI | Tauri + React | HTML/CSS is flexible, Electron is too heavy. |
| STT | Sherpa-ONNX | Best streaming accuracy on Apple Silicon. |
| Reasoning | FunctionGemma | Optimized for tool calling, runs locally. |
The Golden Path
We hold three core beliefs that drive every line of code in gibb.eri.sh. These aren’t just preferences—they’re non-negotiables that shape every architectural decision.
- Privacy First — Your voice never leaves localhost
- Zero Latency — Transcription must feel instant
- Rust + Tauri — We build for the metal, not the browser
These principles sometimes conflict with “easier” solutions. We choose the harder path because the result is worth it: AI for your OS that serves you, not a corporation.
The Trade-offs We Accept
| We Sacrifice | We Gain |
|---|---|
| Cloud scalability | Absolute privacy |
| Development speed | Runtime performance |
| Framework convenience | Memory efficiency |
| Model variety | Predictable latency |
The Trade-offs We Reject
- “Just use OpenAI” — Privacy is not optional
- “Electron is fine” — RAM is not free
- “Good enough latency” — 500ms feels broken
Read on to understand why each principle matters and how we implement it.
Privacy First
Your voice never leaves localhost.
No OpenAI. No Google Speech API. No AWS Transcribe. No cloud anything.
If data doesn’t leave the device, it cannot be intercepted, stored, or analyzed by third parties.
Implementation
All models run on-device using:
- ONNX Runtime (cross-platform inference)
- Quantized int8 models
- CoreML backend on macOS (Apple Neural Engine)
┌─────────────────────────────────────────┐
│ Your Device │
│ ┌─────────┐ ┌─────────┐ ┌─────┐ │
│ │ Mic │───▶│ Model │───▶│ Text│ │
│ └─────────┘ └─────────┘ └─────┘ │
│ │
│ Everything stays here │
└─────────────────────────────────────────┘
│
╳ No network calls
│
Trade-off
Users download ~500MB of model weights upfront. In exchange:
- No API bills
- No network round-trips (lower latency)
- No data exfiltration possible
- Works offline
Why Not Hybrid?
“What if we use local for drafts and cloud for final polish?”
No. This creates a false sense of privacy. Users think they’re protected, but their data still leaves the device. We reject half-measures.
The Models We Use
| Model | Size | Use Case |
|---|---|---|
| Sherpa Zipformer | ~100MB | Real-time streaming |
| Whisper Small | ~500MB | High-accuracy batch |
| Silero VAD | ~2MB | Voice activity detection |
| FunctionGemma | ~200MB | Intent recognition |
All models are open-source and can be audited.
Low Latency
Time-to-first-token should be < 200ms. Delays above this threshold are noticeable and disrupt the feedback loop.
The Problem with Traditional Architectures
Most speech-to-text systems work like this:
Mic → JavaScript → JSON → HTTP → Server → Model → HTTP → JSON → UI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
500-2000ms latency
Every boundary crossing adds latency:
- JS ↔ Native: Serialization overhead
- HTTP: Network round-trip
- JSON: Parsing overhead
Our Implementation
Audio stays in Rust and uses shared memory pointers:
Mic → Rust → Arc<[f32]> → Model → Text → UI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
45ms latency
Key techniques:
- Arc<[f32]>: Shared memory pointers
- Bounded MPSC channels for backpressure
- Dedicated threads for inference
Measuring Latency
We track latency at every stage using atomic counters:
pub struct PipelineStatus {
audio_lag_ms: AtomicI64, // Time since audio was captured
inference_time_ms: AtomicU64, // Model execution time
dropped_chunks: AtomicU64, // Backpressure indicator
}
These metrics are lock-free—measuring latency doesn’t add latency.
The Streaming Advantage
Traditional “batch” transcription waits for you to finish speaking, then processes everything at once. You might wait 2-3 seconds for results.
Streaming transcription processes audio continuously:
| Time | Batch Model | Streaming Model |
|---|---|---|
| 0ms | (waiting) | (waiting) |
| 100ms | (waiting) | “The” |
| 200ms | (waiting) | “The quick” |
| 500ms | (waiting) | “The quick brown” |
| 1000ms | (waiting) | “The quick brown fox” |
| 1500ms | “The quick brown fox” | “The quick brown fox jumps” |
Streaming provides immediate visual feedback.
Rust + Tauri
The Stack
| Layer | Technology | Why |
|---|---|---|
| Core Logic | Rust | Performance, safety, no GC |
| Desktop Shell | Tauri v2 | Lightweight, secure |
| UI | React | Developer familiarity |
| Inference | ONNX Runtime | Universal model format |
Why Tauri?
Tauri uses the system’s native webview instead of bundling Chromium, which reduces binary size and RAM usage. For a voice assistant that may run continuously, lower idle resource usage helps.
Note: The app requires ~500MB of model downloads on first run, so the binary size savings are offset by the ML models. The main benefit is runtime efficiency.
The Tauri Architecture
┌─────────────────────────────────────────────────┐
│ Tauri App │
│ ┌───────────────────┐ ┌────────────────────┐ │
│ │ Rust Backend │ │ WebView (UI) │ │
│ │ │ │ │ │
│ │ ┌─────────────┐ │ │ React + TypeScript│ │
│ │ │ Audio Bus │ │ │ │ │
│ │ │ STT Engine │◀─┼──┼─ invoke() │ │
│ │ │ VAD │──┼──┼─▶ events │ │
│ │ └─────────────┘ │ │ │ │
│ └───────────────────┘ └────────────────────┘ │
└─────────────────────────────────────────────────┘
- Rust Backend: All heavy lifting (audio, inference, VAD)
- WebView: Native OS webview (not bundled Chromium)
- Communication: Tauri’s IPC (commands + events)
The UI is a Passenger
The React frontend is intentionally “dumb”:
- It displays text from the backend
- It sends commands (start/stop recording)
- It never touches audio data directly
This separation means:
- UI bugs can’t crash the audio pipeline
- The UI can be replaced without touching core logic
- Heavy computation never blocks rendering
Security Model
Tauri uses a capability-based permission system:
// plugins/recorder/permissions/default.json
{
"permissions": ["recorder:start", "recorder:stop"],
"deny": ["fs:write", "shell:execute"]
}
Each plugin declares exactly what it needs. Everything else is denied by default.
Core Features
gibb.eri.sh isn’t just a transcription tool—it’s an intelligent voice interface.
Feature Overview
Hybrid Inference Engine
Choose your trade-off: instant feedback or maximum accuracy.
- Streaming Mode: Words appear in real-time (~50ms updates)
- Batch Mode: Higher accuracy, processed on pauses
Smart Turn Detection
Standard voice detection only hears silence. gibb.eri.sh hears completion.
- Knows when you’re thinking vs. when you’re done
- Uses neural analysis, not just timers
- Configurable sensitivity profiles
Agentic Tools
A local LLM understands your intent and executes actions.
- “What is the weather in Barcelona” → Opens browser with results
- Runs entirely offline
- Extensible tool system
Context Engine
The system knows what you are doing.
- Dev Mode: Coding in VS Code? Git tools are enabled.
- Meeting Mode: In a Zoom call? Transcription tools are enabled.
- Implicit Context: “Summarize this” works on your current selection.
The Interface
Unified Activity Feed
All system events—transcripts, voice commands, and tool results—flow into a single linear feed. This provides a clear “log” of your interaction with the OS.
Mode Badge
A visual indicator in the header shows your current mode (Dev, Meeting, Global). You can click the badge to “pin” a specific mode, overriding automatic detection.
Feature Matrix
| Feature | Streaming | Batch | Notes |
|---|---|---|---|
| Real-time display | ✓ | Simulated | Batch shows “draft” text |
| Accuracy | Good | Excellent | Batch wins on proper nouns |
| Latency | ~50ms | ~500ms | Per-update latency |
| Languages | English | 99+ | Whisper supports many |
| Smart Turn | ✓ | ✓ | Works with both modes |
| Agentic | ✓ | ✓ | Triggers on commit |
Coming Soon
- Speaker Diarization — “Who said what?”
- Punctuation Restoration — Automatic commas and periods
- Custom Wake Words — “Hey gibb.eri.sh”
Hybrid Inference Engine
gibb.eri.sh supports two modes of operation, selectable at runtime. Each has trade-offs.
Streaming Mode (Sherpa-ONNX Zipformer)
Best for: Dictation, live captioning, instant feedback
Audio ─▶ [Transducer] ─▶ Partial results every ~50ms
How It Works
The Zipformer model uses a transducer architecture:
- Processes audio in small chunks (10-20ms)
- Maintains internal state between chunks
- Emits partial hypotheses continuously
- Refines predictions as context grows
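As a sketch, the loop those steps imply looks like this. The recognizer API loosely follows the streaming recognizer shown later under Implementation Details; the mic source and names are illustrative, not the actual crate surface:

```rust
// Sketch only: chunked streaming decode with continuously refined partials.
// `StreamingRecognizer` follows the shape used in Implementation Details;
// `mic_chunks` is a stand-in for the real audio source.
fn stream_decode(
    recognizer: &mut StreamingRecognizer,
    mic_chunks: impl Iterator<Item = Vec<f32>>,
) {
    for chunk in mic_chunks {
        // 10-20ms of 16kHz mono audio per iteration
        recognizer.accept_waveform(&chunk);
        // Partial hypothesis, refined as acoustic context grows
        let partial = recognizer.get_result();
        println!("partial: {partial}");
    }
}
```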
Characteristics
| Aspect | Value |
|---|---|
| Latency | ~50ms per update |
| Accuracy | Good (may miss proper nouns) |
| Languages | English (primary) |
| Model Size | ~100MB |
When Streaming Struggles
- Proper nouns: “Kubernetes” might become “Cooper Netties”
- Rare words: Technical jargon may be misheard
- Accents: Less training data for non-standard speech
Batch Mode (Parakeet / Whisper)
Best for: Meetings, archival, accuracy-critical tasks
Audio ─▶ [VAD Buffer] ─▶ [Encoder-Decoder] ─▶ Final text on pause
How It Works
Batch models see the entire utterance before producing output:
- VAD detects speech boundaries
- Audio is buffered during speech
- Model processes the complete segment
- Result is highly accurate
Characteristics
| Aspect | Value |
|---|---|
| Latency | ~500ms after speech ends |
| Accuracy | Excellent |
| Languages | 99+ (Whisper) |
| Model Size | ~500MB |
Simulated Streaming
Users want batch accuracy with streaming feel. We fake it:
- Run partial inference every 500ms on the growing buffer
- Display “volatile” text (gray, may change)
- On VAD trigger, run final inference
- Replace volatile text with “stable” text (black, final)
Speaking: "The quick brown fox"
Time 0ms:    [ ]                      (buffering)
Time 500ms:  [The quick ]             volatile
Time 1000ms: [The quick brown]        volatile
Time 1200ms: (pause detected)
Time 1400ms: [The quick brown fox.]   stable ✓
Switching Modes
Users can switch modes at runtime via the Settings sheet:
// Frontend
await invoke('plugin:stt|set_mode', { mode: 'streaming' });
await invoke('plugin:stt|set_mode', { mode: 'batch' });
The backend handles the transition gracefully, draining any buffered audio.
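For illustration, the backend half of that command could look like the sketch below. SttState, the drain() call, and the constructor signatures are assumed names built on the Strategy pattern described in the Architecture chapter, not the shipped handler:

```rust
// Hypothetical sketch of the set_mode command (not the actual handler).
// Assumes a tokio-Mutex-guarded Box<dyn SttEngine> in Tauri managed state.
#[tauri::command]
async fn set_mode(state: tauri::State<'_, SttState>, mode: Mode) -> Result<(), String> {
    let mut engine = state.engine.lock().await;
    // Drain buffered audio through the old engine before swapping
    engine.drain().map_err(|e| e.to_string())?;
    *engine = match mode {
        Mode::Streaming => Box::new(SherpaEngine::new().map_err(|e| e.to_string())?),
        Mode::Batch => Box::new(ParakeetEngine::new().map_err(|e| e.to_string())?),
    };
    Ok(())
}
```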
Model Recommendations
We’ve tested many models. Here are our picks:
| Use Case | Recommended Model |
|---|---|
| General dictation | Sherpa Zipformer (streaming) |
| Meetings | Whisper Small (batch) |
| Non-English | Whisper Small (batch) |
| Low-end hardware | Sherpa Zipformer (streaming) |
Smart Turn Detection
Standard VAD detects silence. Smart Turn detects completion.
The Problem
Voice Activity Detection (VAD) detects silence. Humans detect pauses.
We pause for many reasons:
- Thinking: “I want to… [pause] …explain something”
- Breathing: Natural respiratory pauses
- Emphasis: “This is… [dramatic pause] …important”
- Completion: “That’s all I have to say.”
Standard VAD treats all pauses the same. This leads to:
- Sentences being split mid-thought
- Awkward commit timing
- User frustration
The Solution
We implement a Neural Turn Detector inspired by Daily.co’s Smart Turn v3.1 research.
Instead of just measuring silence, we analyze:
- Acoustic features: Pitch contour, energy decay
- Timing: Duration and pattern of the pause
- Semantic probability: Is this a likely sentence ending?
The Algorithm
if (Silence > 300ms AND Probability(EndOfSentence) > 0.5):
Commit()
else:
Wait()
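Rendered as a Rust sketch, the decision gates on silence first, then on the semantic check. The constants mirror the default Redemption Time and Sensitivity values in the tables below, and TurnDetector is the trait shown under Implementation:

```rust
// Sketch of the commit decision: redemption timer first, then Smart Turn.
fn should_commit(
    silence: std::time::Duration,
    detector: &dyn TurnDetector,
    audio_16k_mono: &[f32],
) -> bool {
    const REDEMPTION: std::time::Duration = std::time::Duration::from_millis(300);
    const THRESHOLD: f32 = 0.5;
    if silence < REDEMPTION {
        return false; // still inside the grace period
    }
    // Ask the Smart Turn model whether this sounds like a sentence ending
    detector
        .predict_endpoint_probability(audio_16k_mono)
        .map(|p| p > THRESHOLD)
        .unwrap_or(false)
}
```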
Components
| Component | Role |
|---|---|
| Silero VAD | Detects raw silence |
| Smart Turn Model | Predicts sentence completion |
| Redemption Timer | Grace period before commit |
Implementation
The Smart Turn detector lives in crates/smart-turn:
pub struct SmartTurnV31Cpu {
session: Mutex<Session>, // ONNX Runtime session
input_name: String,
output_name: String,
}
impl TurnDetector for SmartTurnV31Cpu {
fn predict_endpoint_probability(
&self,
audio_16k_mono: &[f32]
) -> Result<f32, TurnError> {
// Returns probability 0.0-1.0 that speaker is done
}
}
Configuration
Users can tune the behavior via Settings:
Redemption Time
The grace period after silence begins before we even consider committing.
| Setting | Value | Effect |
|---|---|---|
| Fast | 200ms | Quick commits, may split sentences |
| Balanced | 300ms | Default, good for most users |
| Relaxed | 500ms | Waits longer, better for slow speakers |
Sensitivity
How confident must we be that the sentence is complete?
| Setting | Threshold | Effect |
|---|---|---|
| Aggressive | 0.3 | Commits on weak signals |
| Normal | 0.5 | Balanced |
| Conservative | 0.7 | Only commits on strong endings |
The Flow
graph TD
A[Audio Input] --> B{VAD: Speech?}
B -->|Yes| C[Buffer Audio]
B -->|No| D{Silence > Redemption?}
D -->|No| C
D -->|Yes| E[Smart Turn Analysis]
E --> F{P(End) > Threshold?}
F -->|Yes| G[Commit Text]
F -->|No| C
G --> H[Reset State]
Real-World Impact
Without Smart Turn:
User: "I think we should... [thinking pause]"
System: COMMIT → "I think we should"
User: "...consider the alternatives"
System: COMMIT → "consider the alternatives"
With Smart Turn:
User: "I think we should... [thinking pause] ...consider the alternatives"
System: (waiting, P(End) = 0.2)
System: (waiting, P(End) = 0.3)
User: [longer pause, falling intonation]
System: (P(End) = 0.7) COMMIT → "I think we should consider the alternatives"
Agentic Tools
gibb.eri.sh doesn’t just transcribe—it understands. And crucially, it understands context.
The Concept
A local LLM monitors your speech for intents. But unlike dumb assistants, gibb.eri.sh changes its capabilities based on what you are doing.
Contextual Modes
The available tools change dynamically based on your environment.
1. Global Mode (Default)
Always available.
- Tools: web_search, app_launcher, system_control
- Example: “Open Figma”, “Turn up the volume”, “What is quantum computing”
2. Meeting Mode
Triggered when: A meeting app (Zoom, Teams, Slack) is using the microphone.
- Tools: transcript_marker, add_todo
- Example: “Flag this as important”, “Add action item for Marc”
3. Dev Mode
Triggered when: An IDE (VS Code, IntelliJ, Terminal) is the active window.
- Tools: git_voice, file_finder
- Example: “Undo last commit”, “Find the user struct”
How It Works
The Pipeline
Context Engine ─▶ [State: Dev Mode]
│
▼
User Speech ───▶ [Router] ───▶ Tool Registry (Filter: Dev + Global)
│
▼
[FunctionGemma LLM]
(Only sees ~5 relevant tools)
│
▼
[Executor] ─▶ git_voice
Event-Driven Architecture
The Tools plugin listens for stt:stream_commit events and combines them with the latest ContextState:
// plugins/tools/src/router.rs
// 1. Get current mode (e.g., Dev)
let mode = state.context.effective_mode();
// 2. Filter registry
let tools = registry.tools_for_mode(mode);
// 3. Build system prompt with ONLY those tools
let prompt = build_prompt(tools);
// 4. Run Inference
let result = llm.infer(prompt, user_text);
Why Dynamic Filtering?
- Accuracy: The LLM isn’t confused by “Book a flight” when you’re trying to “Book a meeting room”. Smaller search space = fewer hallucinations.
- Performance: Less text in the system prompt = faster inference.
- Safety: Destructive tools (like git reset) are only exposed when you are explicitly focusing on your code editor.
Context Injection
The LLM doesn’t just see your command—it sees your environment. Before every inference, we inject a context snapshot:
Current Context:
Mode: Dev
Active App: VS Code
Clipboard: "RuntimeError: Connection refused at port 8080"
Date: 2025-12-27
This enables implicit referencing:
| You say | LLM infers |
|---|---|
| “Search this error” | web_search{query: "RuntimeError: Connection refused"} |
| “Open that app” | Resolves from active window context |
| “What does this mean?” | Uses clipboard or selection |
What Gets Injected
- Mode: Current mode (Global, Dev, Meeting)
- Active App: Name of the focused application
- Clipboard: First ~200 chars of clipboard text
- Selection: Currently selected text (via Accessibility API)
- Date: Current date (for scheduling-aware commands)
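As a sketch (field names are illustrative, not the actual crates/context API), building that preamble is plain string assembly:

```rust
// Illustrative only: serialize a context snapshot into the prompt preamble.
struct ContextSnapshot {
    mode: String,
    active_app: String,
    clipboard: String,
    date: String,
}

fn render_snapshot(ctx: &ContextSnapshot) -> String {
    // Truncate clipboard to the first ~200 chars, as described above
    let clip: String = ctx.clipboard.chars().take(200).collect();
    format!(
        "Current Context:\n Mode: {}\n Active App: {}\n Clipboard: \"{}\"\n Date: {}",
        ctx.mode, ctx.active_app, clip, ctx.date
    )
}
```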
The Magic Word: “This”
Because gibb.eri.sh knows your context, you can use deictic references:
- User says: “Summarize this.”
- Context Engine:
- Checks active app (e.g., Chrome).
- Grabs currently selected text (via Accessibility API).
- The LLM sees this in the context and fills the argument automatically.
We also support “what I just said”:
- User says: “Create a todo from what I just said.”
- System: Grabs the last 30 seconds of transcript history.
This allows generic commands to work across any application without specific integrations.
Feedback Loop
Tools don’t just execute—they respond. After a tool runs, the result is fed back to the LLM for summarization.
The Flow
User: "What is quantum computing?"
│
▼
[FunctionGemma] → web_search{query: "quantum computing"}
│
▼
[Wikipedia API] → {title: "Quantum computing", summary: "...uses qubits..."}
│
▼
[FunctionGemma] → "Quantum computing uses qubits instead of classical bits,
enabling exponential speedups for certain problems."
│
▼
[UI] → Displays summary (or speaks via TTS)
Why This Matters
- Accessibility: You don’t have to read raw JSON or API responses.
- Natural Language: Results are summarized conversationally.
- Composability: The model can chain thoughts based on results.
Available Tools
Global
- System Control: Volume, Mute, Media keys.
- App Launcher: Opens applications.
- Web Search: Knowledge lookups (Wikipedia by default, extensible to other sources).
- The Typer: Voice-controlled typing.
- Smart Injection: Types short phrases char-by-char for natural interaction.
- Transparent Paste: For long text blocks, it saves your current clipboard, pastes the content instantly via Cmd+V, and restores your original clipboard after a short delay.
- Context Awareness: “Paste this here” knows to use the active selection as the source.
Meeting
- Transcript Marker: Inserts [FLAG] or [TODO] tags into the transcript file.
- Add Todo: Appends a line to your daily notes.
Development
- Git Voice: Wraps common git commands.
- File Finder: Uses mdfind (Spotlight) to locate files in the current project context.
Adding Custom Tools
Tools are defined in plugins/tools/src/tools/ and must implement is_available_in(mode):
impl Tool for GitVoiceTool {
fn name(&self) -> &'static str { "git_voice" }
fn modes(&self) -> &'static [Mode] {
&[Mode::Dev]
}
// ...
}
Agent Skills
Extend gibb.eri.sh with Bash, Python, or Node.js.
The “Hands” of the Voice OS are extensible. We use the Agent Skills standard (SKILL.md) to let you add new tools without writing Rust.
How it works
- Drop a file: Put a SKILL.md file in ~/Library/Application Support/gibb.eri.sh/skills/.
- Define the tool: Describe what it does and the command to run.
- Speak: The LLM sees your new tool and uses it when relevant.
Example: Summarizer Skill
Create skills/summarize/SKILL.md:
---
name: super_summarizer
version: 1.0.0
description: Extract and summarize content from URLs.
---

## Tools

### extract_content
Extracts clean text from a URL.

**Command:**
```bash
npx @steipete/summarize {{source}} --extract-only
```

**Parameters:**
- `source` (string, required): The URL.

The Spec
We support a strict subset of the Agent Skills standard for safety.
File Format
- Frontmatter: YAML with name and description.
- Tool Blocks: Markdown sections defining the tool name, description, command, and parameters.
Execution Model
- No Shell: We execute the binary directly (program + args). No sh -c.
- Interpolation: {{param}} in the command block is replaced by the JSON argument from the LLM.
Context Awareness
You can restrict skills to specific modes by adding a modes field to the frontmatter:
```yaml
modes: [Dev, Global]
```
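To make the execution model concrete, here is a sketch of interpolation plus direct execution under these rules. It is illustrative, not the actual skills runtime, and the naive whitespace split stands in for real argument handling:

```rust
// Sketch: replace {{param}} placeholders, then exec the program directly.
// No `sh -c`, so shell metacharacters in arguments stay inert.
use std::process::{Command, Output};

fn run_tool(template: &str, args: &serde_json::Value) -> std::io::Result<Output> {
    let mut rendered = template.to_string();
    if let Some(obj) = args.as_object() {
        for (key, value) in obj {
            let placeholder = format!("{{{{{key}}}}}"); // e.g. "{{source}}"
            rendered = rendered.replace(&placeholder, value.as_str().unwrap_or_default());
        }
    }
    let mut parts = rendered.split_whitespace(); // naive tokenization for the sketch
    let program = parts.next().expect("empty command");
    Command::new(program).args(parts).output()
}
```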
System Architecture
gibb.eri.sh is organized as a Modular Monolith—a single binary with strictly decoupled internal components.
Why Modular Monolith?
| Architecture | Pros | Cons |
|---|---|---|
| Monolith | Simple deployment, shared memory | Tight coupling, hard to test |
| Microservices | Independent scaling, isolation | Network overhead, complexity |
| Modular Monolith | Best of both | Requires discipline |
Performance of a monolith. Maintainability of services.
The Two Layers
┌─────────────────────────────────────────────────────────┐
│ Tauri App │
│ ┌─────────────────────────────────────────────────────┐│
│ │ plugins/ ││
│ │ Adapters: Translate between crates and Tauri IPC ││
│ │ • recorder/ • stt-worker/ • tools/ ││
│ └─────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ crates/ ││
│ │ Pure Rust: Zero dependencies on Tauri or UI ││
│ │ • audio/ • bus/ • context/ • stt/ • vad/ ││
│ └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘
crates/ — The Engine
Pure Rust libraries with no UI dependencies:
- Can be compiled to CLI tools
- Can be wrapped with FFI for iOS/Android
- Fully unit-testable
plugins/ — The Adapters
Tauri-specific glue code:
- Exposes crate functionality as Tauri commands
- Handles IPC serialization
- Manages permissions
Key Design Patterns
Dependency Inversion
High-level modules don’t depend on low-level modules. Both depend on abstractions.
// crates/application doesn't know about Sherpa or Parakeet
// It only knows about the SttEngine trait
pub fn transcribe(engine: &dyn SttEngine, audio: &[f32]) -> Vec<Segment> {
engine.transcribe(audio)
}
Strategy Pattern
Swap implementations at runtime without changing calling code.
let engine: Box<dyn SttEngine> = match config.mode {
Mode::Streaming => Box::new(SherpaEngine::new()?),
Mode::Batch => Box::new(ParakeetEngine::new()?),
};
Event-Driven Communication
Components communicate via events, not direct calls.
// Producer (STT Worker)
app.emit("stt:stream_commit", &segment)?;
// Consumer (Tools Plugin) - doesn't know about STT internals
app.listen("stt:stream_commit", |event| { ... });
Deep Dives
- Crate Structure — What each crate does
- Audio Bus — How audio is distributed to consumers
- Event System — How components communicate
Crate Structure
The crates/ directory contains the domain logic. Each crate has a single responsibility and zero dependencies on Tauri or the UI.
Overview
crates/
├── application/ # Orchestration & State Machine
├── audio/ # Capture, AGC, Resampling
├── bus/ # Zero-copy Audio Pipeline
├── context/ # OS Awareness (Active App, Mic State)
├── detect/ # Meeting App Logic
├── events/ # Shared Event Contracts (DTOs)
├── models/ # Model Registry & Downloads
├── parakeet/ # NVIDIA Parakeet Backend
├── sherpa/ # Sherpa-ONNX Backend
├── smart-turn/ # Semantic Endpointing
├── storage/ # SQLite Persistence
├── stt/ # Engine Traits & Abstractions
├── transcript/ # Data Structures
├── turn/ # Turn Detection Traits
└── vad/ # Silero VAD Integration
Core Components
bus
The nervous system. Delivers audio from recorder to consumers.
Key feature: Uses Arc<[f32]> so audio is allocated once and shared across all consumers.
context
The senses. Aggregates system state to drive the context engine.
- Active App: Which window has focus?
- Mic State: Is a meeting app using the hardware?
- Mode: Derives intent (Dev, Meeting, Global).
stt
Defines the SttEngine trait. Infrastructure crates (sherpa, parakeet) implement this.
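Reconstructed from how it is used throughout these docs (this is a sketch, not the verbatim source), the trait is roughly:

```rust
// Approximate shape of the SttEngine trait, inferred from usage elsewhere.
pub trait SttEngine: Send + Sync {
    fn transcribe(&self, audio_16k_mono: &[f32]) -> anyhow::Result<Vec<Segment>>;
}

// Segments carry timing plus text, per the CLI example in the Developer Guide.
pub struct Segment {
    pub start_ms: u64,
    pub end_ms: u64,
    pub text: String,
}
```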
audio
Handles microphone capture and preprocessing:
- Resampling: rubato for high-quality sample rate conversion.
- AGC: Automatic gain control with soft-clipping.
vad
Wraps Silero VAD for voice activity detection.
Dependency Graph
application
├── bus
├── stt (trait only)
├── vad (trait only)
└── turn (trait only)
sherpa
└── stt (implements SttEngine)
parakeet
└── stt (implements SttEngine)
The application crate never imports sherpa or parakeet directly—only their traits.
Audio Bus
The audio bus distributes microphone data to multiple consumers (VAD, STT, visualizer) using shared memory.
Why Shared Memory?
At 16kHz mono, audio is only ~32KB/sec—not “big data.” The issue isn’t throughput, it’s latency consistency. Without shared memory, audio gets copied at each boundary (Mic → JS → Rust → Model → UI), and each copy can introduce jitter. Unpredictable delays destroy the real-time feel even if average latency is low.
Using Arc<[f32]> means one allocation, shared by all consumers. No copying, no jitter from allocations.
Design
Audio is allocated once and shared via Arc<[f32]>:
Mic → Recorder → Arc<[f32]> ─┬─▶ VAD
├─▶ STT
└─▶ Visualizer
All consumers read the same memory.
Implementation
AudioChunk
pub struct AudioChunk {
pub seq: u64, // Monotonic sequence number
pub ts_ms: i64, // Capture timestamp
pub sample_rate: u32, // Always 16000 Hz
pub samples: Arc<[f32]>, // The actual audio data
}
Arc<[f32]> is an atomically reference-counted slice. Memory is freed when the last consumer drops its reference.
AudioBus
pub struct AudioBus {
tx: mpsc::Sender<AudioChunk>,
config: BusConfig,
}
impl AudioBus {
pub fn publish(&self, chunk: AudioChunk) -> Result<()> {
self.tx.send(chunk)?;
Ok(())
}
}
Listener
pub struct Listener {
rx: mpsc::Receiver<AudioChunk>,
dropped: Arc<AtomicU64>,
}
impl Listener {
pub async fn recv(&mut self) -> Option<AudioChunk> {
self.rx.recv().await
}
pub fn drain_to_latest(&mut self) -> Option<AudioChunk> {
// Skip old chunks, return only the newest
let mut latest = None;
while let Ok(chunk) = self.rx.try_recv() {
self.dropped.fetch_add(1, Ordering::Relaxed);
latest = Some(chunk);
}
latest
}
}
Backpressure
What if STT can’t keep up with audio? Options:
- Block: Producer waits for consumer (bad: causes audio drops)
- Buffer: Queue grows unbounded (bad: uses memory, increases latency)
- Drop: Discard old data, keep real-time (good: for live transcription)
We use bounded channels with drop policy:
let (tx, rx) = mpsc::channel(BUFFER_SIZE); // e.g., 100 chunks
// If buffer is full, oldest chunks are available to drain
The drain_to_latest() method lets slow consumers catch up by skipping to the newest audio.
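Usage sketch: a consumer that only needs the freshest audio, such as a level meter or waveform display (draw_waveform is a hypothetical stand-in):

```rust
// Skip the backlog and render only the newest chunk.
async fn visualizer_loop(mut listener: Listener) {
    loop {
        if let Some(chunk) = listener.drain_to_latest() {
            draw_waveform(&chunk.samples);
        }
        // ~30 fps is plenty for a meter; skipped chunks count as dropped
        tokio::time::sleep(std::time::Duration::from_millis(33)).await;
    }
}

fn draw_waveform(_samples: &[f32]) { /* hypothetical renderer */ }
```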
Pipeline Status
Performance metrics are tracked with atomic counters:
pub struct PipelineStatus {
audio_lag_ms: AtomicI64, // How far behind real-time
inference_time_ms: AtomicU64, // Last model execution time
dropped_chunks: AtomicU64, // Backpressure indicator
}
Diagram
graph LR
Mic[Microphone] -->|Raw Samples| Recorder
Recorder -->|Arc<[f32]>| Bus[MPSC Channel]
Bus -->|recv| VAD[Silero VAD]
Bus -->|recv| STT[STT Engine]
STT -->|Text Event| UI[Frontend]
Event System
Components communicate through events, not direct function calls. This enables loose coupling and easy extensibility.
The Contract (crates/events)
We avoid “stringly typed” programming by defining all event payloads in a shared crate.
// crates/events/src/lib.rs
#[derive(Serialize, Deserialize)]
pub struct StreamCommitEvent {
pub text: String,
pub confidence: f32,
}
This ensures that:
- Type Safety: Producers and consumers must agree on the struct definition.
- No Typos: Event names are constants (events::STT_STREAM_COMMIT).
- Versioning: Changes to the contract break the build, not runtime.
Two-Tier Architecture
We separate high-frequency data from low-frequency control:
Tier 1: Data (Rust Internal)
High-bandwidth, binary data that never leaves Rust:
| Channel | Type | Purpose |
|---|---|---|
| tokio::sync::mpsc | Bounded | Audio chunks |
| tokio::sync::broadcast | Bounded (lagging receivers skip) | Control signals |
Tier 2: Control (Rust → Frontend)
Low-bandwidth metadata sent to the UI:
| Event | Payload | Frequency |
|---|---|---|
| stt:stream_commit | StreamCommitEvent | ~1/sec |
| context:changed | ContextChangedEvent | On focus change |
Event Flow
sequenceDiagram
participant R as Recorder
participant B as Audio Bus
participant S as STT Worker
participant E as Tauri Events
participant T as Tools Plugin
participant U as UI (React)
R->>B: publish(Arc<[f32]>)
B->>S: recv()
S->>S: Inference
S->>E: emit(StreamCommitEvent)
par Parallel delivery
E->>U: on(StreamCommitEvent)
E->>T: listen(StreamCommitEvent)
end
T->>T: FunctionGemma Router
Developer Guide
Welcome, contributor! This guide will help you extend gibb.eri.sh.
Prerequisites
- Rust (stable, 1.75+)
- Node.js (20+)
- macOS (for now—Linux/Windows coming)
Quick Start
# Clone
git clone https://github.com/mpuig/gibb.eri.sh
cd gibb.eri.sh
# Install frontend dependencies
cd apps/desktop && npm install
# Run in development mode
npm run tauri dev
Project Structure
gibb.eri.sh/
├── apps/
│ └── desktop/ # Tauri app
│ ├── src/ # React frontend
│ └── src-tauri/ # Rust backend
├── crates/ # Pure Rust libraries
├── plugins/ # Tauri plugin adapters
├── scripts/ # Build & conversion tools
└── docs/ # This documentation
Development Workflow
Making Changes
- Pure logic? → Edit in crates/
- UI interaction? → Edit in plugins/
- Frontend? → Edit in apps/desktop/src/
Testing
# Run all Rust tests
cargo test --workspace
# Run a specific crate's tests
cargo test -p gibberish-bus
Building
# Debug build
cd apps/desktop && npm run tauri dev
# Release build
npm run tauri build
Guides
- Adding Features — The proper way to extend functionality
- Adding Languages — Support new languages via NeMo CTC
- Headless Engine — Use the core without UI
Code Style
Rust
- Use rustfmt (default settings)
- Prefer Result<T> over panics
- Document public APIs with ///
TypeScript
- Use Prettier (default settings)
- Prefer functional components with hooks
- Type everything (no any)
Getting Help
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Adding Features
gibb.eri.sh is designed to be extensible. Depending on what you want to add, you have two paths: Agent Skills or Native Plugins.
Which path should I take?
| Goal | Path |
|---|---|
| Add a tool (Git, Jira, Docker, Scripts) | Agent Skill (Recommended) |
| Add a new audio processor or OS sensor | Native Crate/Plugin |
| Change the core STT/LLM logic | Native Crate |
1. The Easy Way: Agent Skills
If your feature involves running a CLI command or a script, do not write Rust. Use a Skill Pack. It’s faster, safer, and doesn’t require recompiling the app.
2. The Native Way: Plugins
Use this for features that need low-level OS access or high-performance data processing.
The Golden Rule
Domain logic in crates/. Tauri glue in plugins/.
Never put business logic in plugins. Plugins are thin adapters that translate between Rust and JavaScript.
Step-by-Step Example: Word Counter
Let’s add a “Native” feature that counts words in real-time.
Step 1: Create the Domain Crate
cd crates
cargo new --lib wordcount
crates/wordcount/src/lib.rs:
pub struct WordCounter {
total: usize,
}
impl WordCounter {
pub fn new() -> Self { Self { total: 0 } }
pub fn add(&mut self, text: &str) -> usize {
self.total += text.split_whitespace().count();
self.total
}
}
Step 2: Create the Tauri Plugin
cd plugins
cargo new --lib wordcount
plugins/wordcount/src/lib.rs:
use std::sync::Mutex;
use tauri::plugin::{Builder, TauriPlugin};
use tauri::{Listener, Manager, Runtime};
use wordcount::WordCounter; // the domain crate from Step 1
use gibberish_events::event_names::STT_STREAM_COMMIT;
use gibberish_events::StreamCommitEvent;
pub fn init<R: Runtime>() -> TauriPlugin<R> {
Builder::new("wordcount")
.setup(|app, _api| {
app.manage(Mutex::new(WordCounter::new()));
// Listen for events using the shared contract
app.listen_any(STT_STREAM_COMMIT, move |event| {
if let Ok(payload) = serde_json::from_str::<StreamCommitEvent>(event.payload()) {
// Logic here...
}
});
Ok(())
})
.build()
}
Testing Tips
Dependency Injection
Don’t use std::process::Command directly in your crates. Use the SystemEnvironment trait from plugins/tools. This allows you to mock OS calls in unit tests without actually executing code on the host.
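A sketch of the pattern (names assumed, since the trait lives in plugins/tools):

```rust
// Illustrative shape of the SystemEnvironment abstraction described above.
pub trait SystemEnvironment: Send + Sync {
    fn run(&self, program: &str, args: &[&str]) -> anyhow::Result<String>;
}

// Production: actually executes the process.
struct RealEnv;
impl SystemEnvironment for RealEnv {
    fn run(&self, program: &str, args: &[&str]) -> anyhow::Result<String> {
        let out = std::process::Command::new(program).args(args).output()?;
        Ok(String::from_utf8_lossy(&out.stdout).into_owned())
    }
}

// Tests: returns canned output without touching the host.
struct MockEnv(String);
impl SystemEnvironment for MockEnv {
    fn run(&self, _program: &str, _args: &[&str]) -> anyhow::Result<String> {
        Ok(self.0.clone())
    }
}
```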
Shared Events
Always use the gibberish-events crate for inter-plugin communication. This prevents runtime “stringly-typed” errors.
Adding Languages
gibb.eri.sh can transcribe any language for which a model exists. Here’s how to add one.
Overview
- Find a compatible model (CTC or Transducer)
- Convert to ONNX format
- Register in the model metadata
- Test!
Case Study: Adding Catalan
We added Catalan using a NeMo Conformer CTC model from Hugging Face.
Step 1: Find a Model
Good sources: the Hugging Face Hub (where the Catalan model above came from).
Look for:
- CTC or Transducer architecture (NOT encoder-decoder like Whisper)
- 16kHz sample rate
- Good accuracy on your target language
Step 2: Convert to ONNX
Most models are in PyTorch format. We need ONNX for Sherpa.
For NeMo Models
We provide a conversion script:
cd scripts
python export_nemo_ctc.py \
--model "path/to/model.nemo" \
--output "catalan-nemo-ctc" \
--language "ca"
This produces:
- model.onnx — The neural network
- tokens.txt — The vocabulary
What the Script Does
import nemo.collections.asr as nemo_asr
import torch
# Load PyTorch model
model = nemo_asr.models.EncDecCTCModel.restore_from("model.nemo")
# Create dummy input for tracing
dummy_audio = torch.randn(1, 16000) # 1 second of audio
dummy_length = torch.tensor([16000])
# Export to ONNX
torch.onnx.export(
model,
(dummy_audio, dummy_length),
"model.onnx",
input_names=["audio", "length"],
output_names=["logits"],
dynamic_axes={
"audio": {0: "batch", 1: "time"},
"length": {0: "batch"},
},
)
# Extract vocabulary
with open("tokens.txt", "w") as f:
for token in model.decoder.vocabulary:
f.write(token + "\n")
Step 3: Host the Model
Upload to a public URL. Options:
- Hugging Face Hub
- GitHub Releases
- S3/GCS bucket
Step 4: Register the Model
Edit crates/models/src/metadata.rs:
pub const MODELS: &[ModelMetadata] = &[
// ... existing models
ModelMetadata {
id: "catalan-nemo-ctc",
name: "NeMo Conformer (Catalan)",
language: "ca",
model_type: ModelType::NemoCtc,
url: "https://huggingface.co/your-org/catalan-nemo-ctc/resolve/main/model.tar.gz",
size_mb: 120,
description: "Catalan speech recognition trained on Common Voice",
},
];
Step 5: Implement the Engine (if needed)
If using an existing architecture (NeMo CTC), the engine already exists:
// crates/sherpa/src/nemo_ctc.rs
pub struct NemoCtcEngine {
recognizer: sherpa_rs::OfflineRecognizer,
}
impl SttEngine for NemoCtcEngine {
fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
// ... implementation
}
}
Step 6: Test
# Unit test
cargo test -p gibberish-sherpa nemo_ctc
# Integration test
cd apps/desktop && npm run tauri dev
# Select "NeMo Conformer (Catalan)" in Settings
# Speak in Catalan!
Model Requirements
Architecture Support
| Architecture | Supported | Notes |
|---|---|---|
| CTC | ✓ | NeMo, Wav2Vec2 |
| Transducer | ✓ | Zipformer, Conformer |
| Encoder-Decoder | Via Whisper | Use Whisper models directly |
Audio Format
All models must accept:
- Sample rate: 16000 Hz
- Channels: Mono
- Format: Float32 PCM
Our gibberish-audio crate handles resampling automatically.
Vocabulary Format
tokens.txt should contain one token per line:
<blk>
a
b
c
...
z
'
<space>
Special tokens:
- <blk> or <blank> — CTC blank token
- <space> or ▁ — Word separator
- <unk> — Unknown token
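Loading the file is trivial; a hypothetical loader (not the actual crate code) for reference, where the token id is the line number:

```rust
// Sketch: tokens.txt → index-addressable vocabulary.
fn load_tokens(path: &str) -> std::io::Result<Vec<String>> {
    Ok(std::fs::read_to_string(path)?
        .lines()
        .map(str::to_owned)
        .collect())
}
```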
Troubleshooting
“Model produces garbage output”
Check vocabulary alignment. The token indices must match exactly.
“Model is slow”
Try quantization:
python -m onnxruntime.quantization.quantize \
--input model.onnx \
--output model_int8.onnx \
--quant_format QDQ
“Model crashes on long audio”
Some models have maximum sequence length. Chunk the audio:
const MAX_SECONDS: usize = 30;
let chunks = audio.chunks(MAX_SECONDS * 16000);
Contributing Models
If you successfully add a language:
- Upload to Hugging Face with a clear model card
- Add to MODELS in metadata.rs
- Submit a PR!
Contributions welcome for:
- Spanish
- French
- German
- Portuguese
Headless Engine
The core transcription engine has zero dependencies on Tauri or UI. You can use it standalone.
Why Headless?
- CLI tools: Build command-line transcription utilities
- Server applications: Run transcription as a service
- Mobile apps: Wrap with FFI for iOS/Android
- Testing: Unit test without UI overhead
Architecture
┌─────────────────────────────────────────┐
│ Your Application │
│ ┌───────────────────────────────────┐ │
│ │ gibberish-application │ │
│ │ (Orchestration & State Machine) │ │
│ └───────────────────────────────────┘ │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────┐ ┌─────────┐ ┌───────┐ │
│ │ bus │ │ stt │ │ vad │ │
│ └──────┘ └─────────┘ └───────┘ │
└─────────────────────────────────────────┘
(No Tauri, No React)
Example: CLI Transcriber
Here’s a minimal CLI that transcribes a WAV file:
// examples/cli_transcribe.rs
use gibberish_audio::load_wav;
use gibberish_sherpa::WhisperEngine;
use gibberish_stt::SttEngine;
fn main() -> anyhow::Result<()> {
let args: Vec<String> = std::env::args().collect();
let wav_path = args.get(1).expect("Usage: cli_transcribe <file.wav>");
// Load audio
let audio = load_wav(wav_path)?;
// Initialize engine
let engine = WhisperEngine::new("path/to/whisper-small")?;
// Transcribe
let segments = engine.transcribe(&audio)?;
// Print results
for segment in segments {
println!("[{:.2}s - {:.2}s] {}",
segment.start_ms as f64 / 1000.0,
segment.end_ms as f64 / 1000.0,
segment.text
);
}
Ok(())
}
Run it:
cargo run --example cli_transcribe recording.wav
Example: Real-Time Streaming
use gibberish_audio::{AudioCapture, AudioConfig};
use gibberish_bus::{AudioBus, AudioChunk};
use gibberish_sherpa::ZipformerEngine;
use gibberish_vad::SileroVad;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Set up audio capture
let config = AudioConfig {
sample_rate: 16000,
channels: 1,
};
let capture = AudioCapture::new(config)?;
// Set up bus
let (bus, mut listener) = AudioBus::new(100);
// Set up VAD and STT
let mut vad = SileroVad::new()?;
let engine = ZipformerEngine::new("path/to/zipformer")?;
// Start capture
capture.start(move |samples| {
let chunk = AudioChunk::new(samples);
let _ = bus.publish(chunk);
})?;
// Processing loop
loop {
if let Some(chunk) = listener.recv().await {
if vad.is_speech(&chunk.samples)? {
let result = engine.transcribe_streaming(&chunk.samples)?;
if !result.text.is_empty() {
print!("{}", result.text);
}
}
}
}
}
FFI: Using from Swift/Kotlin
For mobile apps, expose a C-compatible interface:
Rust Side
// src/ffi.rs
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
#[no_mangle]
pub extern "C" fn gibberish_init(model_path: *const c_char) -> *mut Engine {
let path = unsafe { CStr::from_ptr(model_path) }.to_str().unwrap();
let engine = Box::new(Engine::new(path).unwrap());
Box::into_raw(engine)
}
#[no_mangle]
pub extern "C" fn gibberish_transcribe(
engine: *mut Engine,
audio: *const f32,
length: usize,
) -> *mut c_char {
let engine = unsafe { &*engine };
let samples = unsafe { std::slice::from_raw_parts(audio, length) };
let result = engine.transcribe(samples).unwrap();
CString::new(result.text).unwrap().into_raw()
}
#[no_mangle]
pub extern "C" fn gibberish_free(engine: *mut Engine) {
unsafe { drop(Box::from_raw(engine)); }
}
#[no_mangle]
pub extern "C" fn gibberish_free_string(s: *mut c_char) {
unsafe { drop(CString::from_raw(s)); }
}
Swift Side
// Gibberish.swift
import Foundation
class Gibberish {
private var engine: OpaquePointer?
init(modelPath: String) {
engine = gibberish_init(modelPath)
}
deinit {
if let engine = engine {
gibberish_free(engine)
}
}
func transcribe(audio: [Float]) -> String {
guard let engine = engine else { return "" }
let result = audio.withUnsafeBufferPointer { ptr in
gibberish_transcribe(engine, ptr.baseAddress, ptr.count)
}
defer { gibberish_free_string(result) }
return String(cString: result!)
}
}
Building for iOS
# Add iOS targets
rustup target add aarch64-apple-ios
# Build static library
cargo build --release --target aarch64-apple-ios
# The library will be at:
# target/aarch64-apple-ios/release/libgibberish.a
Using UniFFI (Recommended)
For production FFI, use UniFFI to auto-generate bindings:
# Cargo.toml
[dependencies]
uniffi = "0.25"
[build-dependencies]
uniffi = { version = "0.25", features = ["build"] }
// src/lib.rs
#[uniffi::export]
pub fn transcribe(model_path: String, audio: Vec<f32>) -> String {
let engine = Engine::new(&model_path).unwrap();
engine.transcribe(&audio).unwrap().text
}
UniFFI generates Swift, Kotlin, Python, and Ruby bindings automatically.
Performance Considerations
When running headless:
- Thread management: You control threading, not Tauri
- Memory: No WebView overhead (~100MB savings)
- Startup: No UI initialization (~500ms faster)
For servers, consider:
- Connection pooling for engines (expensive to create)
- Request queuing during high load
- Graceful degradation when overloaded
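A minimal sketch of the pooling idea, since engines are expensive to construct (names and policy are illustrative):

```rust
// Reuse pre-built engines across requests instead of constructing per call.
// acquire() returning None is the signal to queue or shed load.
use std::sync::Mutex;

pub struct EnginePool {
    engines: Mutex<Vec<Box<dyn SttEngine>>>,
}

impl EnginePool {
    pub fn acquire(&self) -> Option<Box<dyn SttEngine>> {
        self.engines.lock().unwrap().pop()
    }
    pub fn release(&self, engine: Box<dyn SttEngine>) {
        self.engines.lock().unwrap().push(engine);
    }
}
```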
Implementation Details
This section documents implementation details that affect perceived responsiveness.
Simulated Streaming
Making batch models feel real-time.
Silence Injection
Prepending silence to prevent hallucinations.
Lock-Free Metrics
Using atomics for metrics instead of mutexes.
Threading Model
Why we use std::thread instead of tokio::spawn for inference.
Audio Hygiene
Resampling and AGC for consistent input quality.
Meeting Detection
Detecting when Zoom/Teams is running.
Simulated Streaming
Batch models are more accurate but have high latency. We decouple visual feedback from final transcription by running partial inference on growing audio buffers.
The “volatile” text shown during recording isn’t a trick—it’s a valid partial hypothesis based on audio heard so far. Human brains work similarly: we predict words before hearing them fully and revise as needed.
The Problem
| Model Type | Accuracy | Latency | Feel |
|---|---|---|---|
| Streaming | Good | ~50ms | Live, responsive |
| Batch | Excellent | ~2000ms | Sluggish, frustrating |
Users are impatient. A 2-second delay feels broken. But batch models are significantly more accurate, especially for:
- Proper nouns (“Kubernetes” vs “Cooper Netties”)
- Rare words
- Accented speech
The Solution
Run partial inference on the growing audio buffer every 500ms.
Time Buffer Display State
─────────────────────────────────────────────────────
0ms [] (empty) waiting
200ms [audio...] (empty) buffering
500ms [audio......] "The quick" volatile
1000ms [audio..........] "The quick brown" volatile
1200ms (pause detected) "The quick brown" processing
1400ms (inference done) "The quick brown fox." stable ✓
Implementation
pub struct SimulatedStreamer {
buffer: Vec<f32>,
engine: Box<dyn SttEngine>,
partial_interval: Duration,
last_partial: Instant,
}
impl SimulatedStreamer {
pub fn push_audio(&mut self, chunk: &[f32]) -> Option<PartialResult> {
self.buffer.extend_from_slice(chunk);
// Emit partial every 500ms
if self.last_partial.elapsed() >= self.partial_interval {
self.last_partial = Instant::now();
let result = self.engine.transcribe(&self.buffer).ok()?;
return Some(PartialResult {
text: result.text,
is_final: false,
});
}
None
}
pub fn commit(&mut self) -> FinalResult {
// Run final inference on complete buffer
let result = self.engine.transcribe(&self.buffer).unwrap();
// Clear for next utterance
self.buffer.clear();
FinalResult {
text: result.text,
is_final: true,
}
}
}
UX: Volatile vs Stable Text
We visually distinguish draft from final:
// Frontend
function TranscriptLine({ segment }: { segment: Segment }) {
return (
<span className={segment.is_final ? 'text-white' : 'text-gray-500'}>
{segment.text}
</span>
);
}
- Volatile (gray): Partial hypothesis, may be revised
- Stable (white): Final transcription
Edge Cases
Partial Overwrites
Each partial replaces the previous:
Partial 1: "The quick"
Partial 2: "The quick brown" // Replaces partial 1
Partial 3: "The quick brown fox" // Replaces partial 2
Final: "The quick brown fox." // Replaces partial 3
Long Utterances
For very long speech (>30s), we chunk the buffer:
const MAX_BUFFER_SECONDS: usize = 30;
if self.buffer.len() > MAX_BUFFER_SECONDS * 16000 {
// Force commit and start fresh
self.commit();
}
Rapid Corrections
If the user speaks, pauses briefly, then continues, we may commit prematurely. Smart Turn detection helps, but isn’t perfect. We accept occasional mis-commits in exchange for responsiveness.
Performance
| Metric | Pure Batch | Simulated Streaming |
|---|---|---|
| Perceived latency | 2000ms | 500ms |
| Accuracy | Excellent | Excellent (same model) |
| CPU usage | Lower | Higher (repeated inference) |
The CPU trade-off is worth it for UX.
When NOT to Use
Simulated streaming adds overhead. Skip it when:
- Processing pre-recorded files (no need for real-time feel)
- Running on low-end hardware (CPU budget matters)
- Accuracy is more important than speed (archival use case)
Silence Injection
The “clear throat” hack that prevents hallucinations.
The Problem
Streaming decoders maintain internal state. When speech ends, this state can get “stuck” in a loop:
User says: "Hello world"
User stops: (silence)
Model outputs: "Hello world. Thank you. Thank you. Thank you..."
The model is hallucinating. It expects more input and fills the gap with plausible-sounding garbage.
Why It Happens
Transducer models have a “joiner” network that predicts the next token based on:
- Acoustic features (from audio)
- Previous predictions (from decoder state)
During silence, acoustic features are near-zero, but the decoder state still has momentum from the previous words. The model “invents” continuations.
The Solution
Explicitly feed silence into the decoder to reset its state:
const SILENCE_DURATION_MS: usize = 100;
const SILENCE_SAMPLES: usize = SILENCE_DURATION_MS * 16; // 16 samples/ms at 16kHz
pub fn inject_silence(&mut self) {
let silence = vec![0.0f32; SILENCE_SAMPLES];
self.recognizer.accept_waveform(&silence);
// Force decoder to flush
self.recognizer.input_finished();
}
When to Inject
Trigger silence injection when:
- VAD detects speech-end (transition from speech to silence)
- A configurable grace period has passed (e.g., 300ms)
- Before requesting final output
impl StreamingTranscriber {
pub fn on_vad_speech_end(&mut self) {
// Wait for Smart Turn confirmation
if self.smart_turn.is_likely_complete() {
self.inject_silence();
let final_text = self.recognizer.get_result();
self.emit_commit(final_text);
self.reset_state();
}
}
}
The “Digital Silence”
We inject zeros, not actual recorded silence. Why?
| Type | Contents | Effect |
|---|---|---|
| Recorded silence | Room noise, hum | Model might hear “words” in noise |
| Digital silence | Pure zeros | Unambiguous “nothing to hear” |
// Good: Pure digital silence
let silence = vec![0.0f32; 1600];
// Bad: Recorded silence (might contain noise)
let silence = record_ambient_audio(100);
How Much Silence?
We experimentally tuned to 100ms:
| Duration | Effect |
|---|---|
| 50ms | Sometimes not enough to reset |
| 100ms | Reliable reset, minimal delay |
| 200ms | Works but adds unnecessary latency |
const SILENCE_MS: usize = 100;
Interaction with Smart Turn
Silence injection happens after Smart Turn confirms completion:
graph TD
A[VAD: Silence Detected] --> B{Smart Turn?}
B -->|P(End) < 0.5| C[Keep Listening]
B -->|P(End) >= 0.5| D[Inject Silence]
D --> E[Get Final Result]
E --> F[Emit Commit]
F --> G[Reset State]
If we inject too early, we cut off the user mid-sentence.
Code
// crates/sherpa/src/streaming.rs
impl StreamingRecognizer {
pub fn end_utterance(&mut self) -> String {
// Inject silence to flush decoder
let silence = vec![0.0f32; 1600]; // 100ms
self.accept_waveform(&silence);
// Mark input as finished
self.input_finished();
// Get final result
let result = self.final_result();
// Reset for next utterance
self.reset();
result
}
}
Without Silence Injection
Input: "The quick brown fox"
Output: "The quick brown fox jumps over the lazy dog thank you thank you"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hallucination
With Silence Injection
Input: "The quick brown fox"
Output: "The quick brown fox"
(clean end)
The difference is dramatic for user experience.
Atomic Observability
The audio pipeline updates metrics frequently. Using mutexes would cause contention between the audio thread and UI thread, so we use atomic types instead.
Data Structure
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
pub struct PipelineStatus {
pub audio_lag_ms: AtomicI64,
pub inference_time_ms: AtomicU64,
pub dropped_chunks: AtomicU64,
pub total_chunks: AtomicU64,
}
Atomic operations compile to single CPU instructions and don’t block.
Implementation
Writing (Audio Thread)
impl PipelineStatus {
pub fn update_lag(&self, lag_ms: i64) {
self.audio_lag_ms.store(lag_ms, Ordering::Relaxed);
}
pub fn record_inference(&self, time_ms: u64) {
self.inference_time_ms.store(time_ms, Ordering::Relaxed);
}
pub fn increment_dropped(&self) {
self.dropped_chunks.fetch_add(1, Ordering::Relaxed);
}
}
Reading (UI Thread)
impl PipelineStatus {
pub fn snapshot(&self) -> MetricsSnapshot {
MetricsSnapshot {
audio_lag_ms: self.audio_lag_ms.load(Ordering::Relaxed),
inference_time_ms: self.inference_time_ms.load(Ordering::Relaxed),
dropped_chunks: self.dropped_chunks.load(Ordering::Relaxed),
total_chunks: self.total_chunks.load(Ordering::Relaxed),
}
}
}
Memory Ordering
We use Ordering::Relaxed because:
- We don’t need synchronization between different metrics
- We only care about “eventually consistent” values
- It’s the fastest ordering
For metrics dashboards, slightly stale data is acceptable.
Sharing Across Threads
use std::sync::Arc;
// Create shared status
let status = Arc::new(PipelineStatus::default());
// Clone for audio thread
let audio_status = Arc::clone(&status);
std::thread::spawn(move || {
loop {
// Update metrics without blocking
audio_status.update_lag(compute_lag());
}
});
// Clone for UI polling
let ui_status = Arc::clone(&status);
tokio::spawn(async move {
loop {
let snapshot = ui_status.snapshot();
emit_metrics(&snapshot);
tokio::time::sleep(Duration::from_millis(100)).await;
}
});
What We Track
| Metric | Type | Meaning |
|---|---|---|
| audio_lag_ms | i64 | Time since audio was captured |
| inference_time_ms | u64 | Last model execution time |
| dropped_chunks | u64 | Backpressure indicator |
| total_chunks | u64 | For calculating drop rate |
Derived Metrics
impl MetricsSnapshot {
pub fn drop_rate(&self) -> f64 {
if self.total_chunks == 0 {
0.0
} else {
self.dropped_chunks as f64 / self.total_chunks as f64
}
}
pub fn real_time_factor(&self) -> f64 {
// RTF < 1.0 means faster than real-time
self.inference_time_ms as f64 / 1000.0 / CHUNK_DURATION_SECONDS
}
}
UI Display
function MetricsDisplay() {
const [metrics, setMetrics] = useState<Metrics | null>(null);
useEffect(() => {
const unlisten = listen<Metrics>('metrics:update', (event) => {
setMetrics(event.payload);
});
return () => { unlisten.then(f => f()); };
}, []);
if (!metrics) return null;
return (
<div className="text-xs text-gray-500">
Latency: {metrics.audio_lag_ms}ms |
RTF: {metrics.real_time_factor.toFixed(2)} |
Drops: {(metrics.drop_rate * 100).toFixed(1)}%
</div>
);
}
Debugging Tip
When logging metrics, take a snapshot first rather than loading individual atomics separately:
let snap = status.snapshot();
debug!("Metrics: {:?}", snap);
Threading Model
We use std::thread for inference, not tokio::spawn. Here’s why.
The Mistake We Made
Our first version looked like this:
// DON'T DO THIS
tokio::spawn(async move {
loop {
let chunk = rx.recv().await?;
let result = engine.transcribe(&chunk.samples)?; // BLOCKS FOR 100ms!
app.emit("stt:update", &result)?;
}
});
This worked… until it didn’t. Under load, the UI froze. Audio dropped. Everything felt sluggish.
The Problem
ONNX Runtime inference is CPU-bound and blocking. A single transcribe() call might take 50-200ms of pure CPU work.
Tokio’s async runtime assumes tasks yield frequently. When a task blocks for 100ms, it starves other tasks:
Task A: transcribe() ──────────────────────────────────▶ done
Task B: (waiting for audio) .......................... (finally runs)
Task C: (waiting for UI event) ........................ (finally runs)
▲
100ms of nothing happening
The Tokio docs explicitly warn about this.
The Fix
Move blocking work to dedicated OS threads:
// Dedicated thread for inference
std::thread::spawn(move || {
loop {
// Block here - it's fine, we're on our own thread
let chunk = rx.blocking_recv().unwrap();
let result = engine.transcribe(&chunk.samples).unwrap();
// Send result back to async world
result_tx.blocking_send(result).unwrap();
}
});
// Async task just forwards results
tokio::spawn(async move {
while let Some(result) = result_rx.recv().await {
app.emit("stt:update", &result)?;
}
});
Thread Allocation
| Thread | Purpose | Priority |
|---|---|---|
| Main | Tauri/UI event loop | Normal |
| Audio | cpal callback | High (OS-managed) |
| STT | ONNX inference | Normal |
| VAD | Silero inference | Normal |
We don’t set thread priorities manually—the OS scheduler handles it well enough for our needs.
Why Not spawn_blocking?
Tokio provides spawn_blocking() for blocking tasks:
tokio::task::spawn_blocking(move || {
engine.transcribe(&samples)
}).await?
This works, but:
- May spawn a new thread per call if no pooled thread is idle (overhead)
- Limited by max_blocking_threads (defaults to 512)
- Threads are pooled but not reused predictably
For a continuous stream of inference calls, a dedicated thread is simpler and more predictable.
Channel Selection
We need channels that bridge sync and async:
// Option 1: tokio::sync::mpsc (what we use)
let (tx, mut rx) = tokio::sync::mpsc::channel(100);
// tx.blocking_send() from sync thread
// rx.recv().await from async task
// Option 2: crossbeam + tokio wrapper
// More complex, no real benefit for our use case
Memory Considerations
Each thread has its own stack (default 2MB on macOS). With 4 threads:
- Audio thread: ~2MB
- STT thread: ~2MB + model memory
- VAD thread: ~2MB + model memory
- Main thread: ~2MB
The model memory dominates. Thread stacks are negligible.
Debugging
Thread bugs are subtle. Tools that help:
# See thread count
ps -M <pid>
# Profile with Instruments
xcrun xctrace record --template "Time Profiler" --launch ./gibberish
# Logging (add to Cargo.toml)
# tracing = "0.1"
# tracing-subscriber = "0.3"
Error Handling
Threads don’t propagate panics to the main thread. Handle errors explicitly:
std::thread::spawn(move || {
let result = std::panic::catch_unwind(|| {
// Inference loop
});
if let Err(e) = result {
eprintln!("STT thread panicked: {:?}", e);
// Notify main thread via channel
error_tx.send(SttError::ThreadPanic).ok();
}
});
Code Reference
The actual implementation lives in plugins/stt-worker/src/worker.rs.
Audio Hygiene
Bad microphones shouldn’t mean bad transcripts. We fix what we can.
The Problems
Consumer microphones vary wildly:
- Built-in laptop mics pick up fan noise
- USB mics have different gain settings
- Sample rates range from 8kHz to 96kHz
- Some mics clip, others are too quiet
Models expect clean, consistent 16kHz audio. We bridge the gap.
Resampling
All models need 16kHz mono audio. Users have everything else.
Why Sinc Interpolation?
use rubato::{FftFixedIn, Resampler};

let resampler = FftFixedIn::<f32>::new(
    input_rate, // e.g., 44100
    16000,      // target
    chunk_size,
    2,          // sub-chunks
    1,          // channels
)?;
We use rubato’s FFT-based sinc resampling. Alternatives:
| Method | Quality | Speed | Our Use |
|---|---|---|---|
| Nearest neighbor | Terrible | Fast | Never |
| Linear | Poor | Fast | Never |
| Sinc (rubato) | Excellent | Medium | Yes |
Linear interpolation creates aliasing artifacts that sound “robotic.” Speech recognition models weren’t trained on robotic audio—they perform worse.
The CPU cost of proper resampling is negligible compared to inference.
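Feeding the resampler looks roughly like this. A sketch, assuming recent rubato's `Resampler` trait (`input_frames_next` / `process`); `mono` is a hypothetical input buffer, not a name from our codebase:

let mut resampler = FftFixedIn::<f32>::new(44100, 16000, 1024, 2, 1)?;
let needed = resampler.input_frames_next();       // frames required this call
// rubato takes one slice per channel; we're mono, so one inner slice
let out = resampler.process(&[&mono[..needed]], None)?;
let samples_16k: &[f32] = &out[0];                // ready for the model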
Automatic Gain Control
The Problem
User A (quiet voice): ▁▁▂▁▁▂▁ (signal barely visible)
User B (loud voice): ▇▇█▇▇█▇ (signal clipping)
Model expects: ▃▄▅▄▃▅▄ (normalized range)
Our Solution
Soft-knee compression with tanh:
const TARGET_DB: f32 = -20.0;
const ATTACK_MS: f32 = 10.0;
const RELEASE_MS: f32 = 100.0;

pub struct Agc {
    gain: f32,
    target_rms: f32,
    smoothing: f32, // 0..1, derived from the attack/release times above
}

impl Agc {
    pub fn process(&mut self, samples: &mut [f32]) {
        let rms = calculate_rms(samples);
        let target_gain = self.target_rms / rms.max(1e-10);
        // Smooth gain changes to avoid clicks
        self.gain = lerp(self.gain, target_gain, self.smoothing);
        // Apply gain with soft clipping
        for sample in samples.iter_mut() {
            *sample = (*sample * self.gain).tanh();
        }
    }
}

// Root-mean-square level of a chunk
fn calculate_rms(samples: &[f32]) -> f32 {
    (samples.iter().map(|s| s * s).sum::<f32>() / samples.len().max(1) as f32).sqrt()
}

// Linear interpolation between the current and target gain
fn lerp(a: f32, b: f32, t: f32) -> f32 {
    a + (b - a) * t
}
The tanh function provides soft clipping—instead of hard clipping at ±1.0 (which sounds harsh), it smoothly compresses peaks.
Target Level
We target -20 dBFS. Why?
- Leaves headroom for peaks
- Matches typical model training data
- Consistent across different mic gains
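The constant maps to the linear `target_rms` used in the AGC above via 10^(dB/20):

// -20 dBFS as a linear RMS target
let target_rms = 10f32.powf(TARGET_DB / 20.0); // 10^(-1) = 0.1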
DC Offset Removal
Some cheap mics have DC offset—the signal “floats” above or below zero:
Bad: ▄▅▆▅▄▅▆▅▄▅ (offset from zero)
Good: ▃▄▅▄▃▄▅▄▃▄ (centered on zero)
We use a simple high-pass filter:
use std::f32::consts::PI;

const CUTOFF_HZ: f32 = 20.0; // Remove everything below 20Hz

pub fn remove_dc(samples: &mut [f32], state: &mut f32) {
    // One-pole DC blocker: alpha ≈ 0.992 for a 20Hz cutoff at 16kHz
    let alpha = 1.0 - (2.0 * PI * CUTOFF_HZ / 16000.0);
    for sample in samples.iter_mut() {
        let new_state = *sample + alpha * *state;
        *sample = new_state - *state;
        *state = new_state;
    }
}
Noise Gate
We don’t use one. Here’s why:
Noise gates cut audio below a threshold. In theory, they reduce background noise. In practice:
- They clip word beginnings (“hello” → “ello”)
- Silero VAD already handles speech detection
- Models are trained on noisy data and handle it fine
If the environment is so noisy that VAD triggers incorrectly, a noise gate won’t help—the user needs a better mic or quieter room.
Preprocessing Pipeline
Audio flows through these stages in order:
Mic → DC Remove → Resample → AGC → Model
Each stage is independent and nearly stateless (the AGC keeps its smoothed gain; the DC filter keeps one sample of state).
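Wired together, the pipeline is a few lines. A sketch under assumed names (`AudioPipeline` and `anyhow::Result` are ours for illustration, not the actual types in `crates/audio/src/stream.rs`):

pub struct AudioPipeline {
    dc_state: f32,
    resampler: FftFixedIn<f32>,
    agc: Agc,
}

impl AudioPipeline {
    // Assumes mic_chunk matches the resampler's expected chunk size
    pub fn process(&mut self, mic_chunk: &mut [f32]) -> anyhow::Result<Vec<f32>> {
        remove_dc(mic_chunk, &mut self.dc_state);                     // 1. center on zero
        let mut out = self.resampler.process(&[&*mic_chunk], None)?;  // 2. to 16kHz
        let mut samples = out.remove(0);                              // mono: channel 0
        self.agc.process(&mut samples);                               // 3. normalize level
        Ok(samples)                                                   // 4. hand to the model
    }
}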
Testing
We keep a collection of “pathological” audio files:
- Recorded at 8kHz
- Heavy background noise
- Extreme clipping
- Strong DC offset
CI runs inference on these files. If accuracy drops, we investigate.
Code
- Resampling: `crates/audio/src/resample.rs`
- AGC: `crates/audio/src/agc.rs`
- Pipeline: `crates/audio/src/stream.rs`
The Context Engine
gibb.eri.sh knows what you’re doing. Here’s how.
The Goal
To enable Context-Aware AI, we need to know the user’s state without burning the CPU.
- Are they coding? (Enable Git tools)
- Are they in a meeting? (Enable Transcription tools)
- Are they looking at a specific URL? (Provide deep context)
The Implementation
We use a high-frequency polling loop in `crates/context` that builds a real-time snapshot of the OS state.
1. Active App Detection (Native Cocoa)
We use the macOS Cocoa API (NSWorkspace) via the objc crate to detect the focused application.
Why Native instead of AppleScript?
- Performance: Sub-millisecond execution. No subprocess fork/exec overhead.
- Efficiency: Negligible CPU usage even at 1s polling intervals.
- Reliability: Directly queries the Window Server for the `frontmostApplication`.
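The query itself is a couple of Objective-C messages. A sketch assuming the `objc` crate's `msg_send!` macros; the real code lives in `crates/context`:

use objc::{class, msg_send, sel, sel_impl};
use objc::runtime::Object;
use std::ffi::CStr;

pub fn frontmost_bundle_id() -> Option<String> {
    unsafe {
        // [NSWorkspace sharedWorkspace].frontmostApplication.bundleIdentifier
        let workspace: *mut Object = msg_send![class!(NSWorkspace), sharedWorkspace];
        let app: *mut Object = msg_send![workspace, frontmostApplication];
        if app.is_null() { return None; }
        let ns_str: *mut Object = msg_send![app, bundleIdentifier];
        if ns_str.is_null() { return None; }
        let utf8: *const std::os::raw::c_char = msg_send![ns_str, UTF8String];
        Some(CStr::from_ptr(utf8).to_string_lossy().into_owned())
    }
}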
2. Browser Deep Context (URL Detection)
When a supported browser (Chrome, Safari, Arc, Brave) is focused, we go deeper.
- Mechanism: We use a targeted AppleScript call to fetch the `URL` of the active tab.
- Optimization: We only trigger the AppleScript if the active application is a browser, preventing unnecessary overhead.
- Value: This allows “Summarize this page” to work by feeding the URL directly to our extraction tools.
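The AppleScript is one line per browser. A sketch shelling out to `osascript` with Chrome's dialect (Safari phrases it differently, and the function name is ours):

use std::process::Command;

fn chrome_active_url() -> Option<String> {
    let out = Command::new("osascript")
        .arg("-e")
        .arg(r#"tell application "Google Chrome" to get URL of active tab of front window"#)
        .output()
        .ok()?;
    if !out.status.success() { return None; }
    Some(String::from_utf8_lossy(&out.stdout).trim().to_string())
}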
3. Meeting Detection (The Activity)
We monitor CoreAudio to see if known meeting apps (Zoom, Teams) are accessing the microphone.
- Crate: `crates/detect` (wrapped by `context`)
- Logic: `is_mic_active && is_meeting_app(bundle_id)`
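The check reduces to a bundle-ID allowlist. A sketch; the IDs shown are illustrative, and the full list lives in `crates/detect`:

// Illustrative bundle IDs, not the complete set we ship
const MEETING_APPS: &[&str] = &["us.zoom.xos", "com.microsoft.teams2"];

fn is_meeting_app(bundle_id: &str) -> bool {
    MEETING_APPS.contains(&bundle_id)
}

fn in_meeting(is_mic_active: bool, bundle_id: &str) -> bool {
    is_mic_active && is_meeting_app(bundle_id)
}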
Privacy
- Local Only: No context data leaves the device.
- Targeted: We only care about specific `bundle_id`s. We don't read window titles or keystrokes.
- Incognito Awareness: We attempt to detect and ignore private browsing windows to avoid leaking sensitive URLs into the LLM context.
Clean Architecture
We use Dependency Inversion to keep the codebase maintainable. Here’s the pattern.
The Problem We Avoided
Imagine adding a new STT engine:
// BAD: Direct dependencies everywhere
match config.engine {
    Engine::Sherpa => sherpa::transcribe(&audio),
    Engine::Parakeet => parakeet::transcribe(&audio),
    Engine::NewEngine => new_engine::transcribe(&audio), // ADD HERE
}
// ... and here, and here, and here
Every new engine means touching multiple files. Tests break. Things get coupled.
The Solution: Trait-Based Abstraction
Define a trait. Implement it. Inject the implementation.
The SttEngine Trait
// crates/stt/src/engine.rs
pub trait SttEngine: Send + Sync {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>>;
    fn is_streaming_capable(&self) -> bool;
    fn model_name(&self) -> &str;
    fn supported_languages(&self) -> Vec<&'static str>;
}
Implementations
// crates/sherpa/src/zipformer.rs
impl SttEngine for ZipformerEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Sherpa-specific implementation
    }
    // ...
}

// crates/parakeet/src/lib.rs
impl SttEngine for ParakeetEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Parakeet-specific implementation
    }
    // ...
}
Usage
The application layer never knows which engine it’s using:
// crates/application/src/transcriber.rs
pub struct Transcriber {
    engine: Box<dyn SttEngine>,
}

impl Transcriber {
    pub fn new(engine: Box<dyn SttEngine>) -> Self {
        Self { engine }
    }

    pub fn process(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        self.engine.transcribe(audio)
    }
}
The Factory Pattern
How do we create the right engine at runtime?
// crates/stt/src/loader.rs
pub trait EngineLoader: Send + Sync {
    fn name(&self) -> &str;
    fn can_load(&self, model_id: &str) -> bool;
    fn load(&self, model_path: &Path) -> Result<Box<dyn SttEngine>>;
}

// Usage
pub fn create_engine(
    loaders: &[Box<dyn EngineLoader>],
    model_id: &str,
    path: &Path,
) -> Result<Box<dyn SttEngine>> {
    for loader in loaders {
        if loader.can_load(model_id) {
            return loader.load(path);
        }
    }
    Err(Error::UnknownModel(model_id.to_string()))
}
Adding a New Engine
Adding WhisperTurbo requires:
- Create `crates/whisper-turbo/`
- Implement `SttEngine`
- Implement `EngineLoader`
- Register the loader at startup
No changes to crates/application/. No changes to existing engines. No changes to the UI.
// crates/whisper-turbo/src/lib.rs
pub struct WhisperTurboEngine { /* ... */ }

impl SttEngine for WhisperTurboEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Implementation
    }
    // ...
}

pub struct WhisperTurboLoader;

impl EngineLoader for WhisperTurboLoader {
    fn name(&self) -> &str { "whisper-turbo" }
    fn can_load(&self, model_id: &str) -> bool {
        model_id.starts_with("whisper-turbo")
    }
    fn load(&self, path: &Path) -> Result<Box<dyn SttEngine>> {
        Ok(Box::new(WhisperTurboEngine::new(path)?))
    }
}
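Registration at startup is then the only touch point. A sketch, matching the `create_engine` signature above; `SherpaLoader` and `ParakeetLoader` are illustrative names:

let loaders: Vec<Box<dyn EngineLoader>> = vec![
    Box::new(SherpaLoader),
    Box::new(ParakeetLoader),
    Box::new(WhisperTurboLoader), // the one new line
];
let engine = create_engine(&loaders, &config.model_id, &config.model_path)?;
let transcriber = Transcriber::new(engine);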
The Dependency Graph
┌─────────────────┐
│ application │
│ (orchestration)│
└────────┬────────┘
│ depends on trait
▼
┌─────────────────┐
│ stt │
│ (SttEngine) │
└────────┬────────┘
│ implemented by
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ sherpa │ │ parakeet │ │ whisper │
└──────────┘ └──────────┘ └──────────┘
application never imports sherpa, parakeet, or whisper directly. It only knows SttEngine.
Testing
Trait-based design makes testing simple:
struct MockEngine {
    response: Vec<Segment>,
}

impl SttEngine for MockEngine {
    fn transcribe(&self, _audio: &[f32]) -> Result<Vec<Segment>> {
        Ok(self.response.clone())
    }
    // ...
}

#[test]
fn test_transcriber() {
    let mock = MockEngine {
        response: vec![Segment { text: "hello".into(), ..Default::default() }],
    };
    let transcriber = Transcriber::new(Box::new(mock));
    let result = transcriber.process(&[0.0; 1600]).unwrap();
    assert_eq!(result[0].text, "hello");
}
No model files needed. No inference overhead. Fast tests.
Other Traits
The same pattern applies elsewhere:
| Trait | Location | Implementations |
|---|---|---|
| `SttEngine` | `crates/stt` | Sherpa, Parakeet |
| `VoiceActivityDetector` | `crates/vad` | Silero |
| `TurnDetector` | `crates/turn` | SmartTurn, Simple |
| `SessionStorage` | `crates/storage` | SQLite |
The Trade-Off
Trait objects have runtime cost:
- Dynamic dispatch (vtable lookup)
- Can’t be inlined
For inference (already 50-200ms), this overhead is negligible. We measured <1μs per trait call.
If performance mattered here, we’d use generics:
// Generic (faster, but less flexible) - renamed here so the two
// alternatives can sit side by side
pub struct GenericTranscriber<E: SttEngine> {
    engine: E,
}

// Trait object (our choice: flexible, dynamic)
pub struct Transcriber {
    engine: Box<dyn SttEngine>,
}
We chose flexibility. Runtime engine switching is worth the microseconds.