gibb.eri.sh
“Most voice bots suck. I decided to build the one I actually wanted to use.”
This is the documentation for gibb.eri.sh (v0.9.0), a local-first Voice OS for macOS.
The Story
I build voice bots professionally. I’ve seen the sausage made:
- Latency: Sending audio to the cloud takes 500ms minimum.
- Privacy: Your voice data is training someone else’s model.
- Context: Cloud bots don’t know you’re looking at VS Code.
I wanted a tool that felt instant, respected my privacy, and could actually do things on my computer. Since it didn’t exist, I built it.
What is it?
It’s a desktop app that sits in your menu bar. It listens (when you tell it to), transcribes in real-time, and executes Skills.
Key Capabilities
- Context Awareness: It polls the OS to know what app is focused. If you’re in a terminal, it enables Git tools. If you’re in Zoom, it enables Note-taking tools.
- Zero Latency: We use a custom Zero-Copy Audio Bus in Rust to stream microphone data directly to local ONNX models.
- Agent Skills: You can extend the capabilities by dropping a SKILL.md file into a folder. It supports Bash, Python, and Node scripts.
Who is this for?
Developers and Hackers. This is 0.9.0 software. It’s powerful, but it assumes you know what a “terminal” is. Ideally, it will become useful for everyone, but right now, it’s a power tool.
Tech Stack
| Component | Tech | Why? |
|---|---|---|
| Core | Rust | Memory safety, threading, no GC pauses. |
| UI | Tauri + React | HTML/CSS is flexible, Electron is too heavy. |
| STT | Sherpa-ONNX | Best streaming accuracy on Apple Silicon. |
| Reasoning | FunctionGemma | Optimized for tool calling, runs locally. |
The Golden Path
We hold three core beliefs that drive every line of code in gibb.eri.sh. These aren’t just preferences—they’re non-negotiables that shape every architectural decision.
- Privacy First — Your voice never leaves localhost
- Zero Latency — Transcription must feel instant
- Rust + Tauri — We build for the metal, not the browser
These principles sometimes conflict with “easier” solutions. We choose the harder path because the result is worth it: AI for your OS that serves you, not a corporation.
The Trade-offs We Accept
| We Sacrifice | We Gain |
|---|---|
| Cloud scalability | Absolute privacy |
| Development speed | Runtime performance |
| Framework convenience | Memory efficiency |
| Model variety | Predictable latency |
The Trade-offs We Reject
- “Just use OpenAI” — Privacy is not optional
- “Electron is fine” — RAM is not free
- “Good enough latency” — 500ms feels broken
Read on to understand why each principle matters and how we implement it.
Privacy First
Your voice never leaves localhost.
No OpenAI. No Google Speech API. No AWS Transcribe. No cloud anything.
If data doesn’t leave the device, it cannot be intercepted, stored, or analyzed by third parties.
Implementation
All models run on-device using:
- ONNX Runtime (cross-platform inference)
- Quantized int8 models
- CoreML backend on macOS (Apple Neural Engine)
┌─────────────────────────────────────────┐
│ Your Device │
│ ┌─────────┐ ┌─────────┐ ┌─────┐ │
│ │ Mic │───▶│ Model │───▶│ Text│ │
│ └─────────┘ └─────────┘ └─────┘ │
│ │
│ Everything stays here │
└─────────────────────────────────────────┘
│
╳ No network calls
│
Trade-off
Users download ~500MB of model weights upfront. In exchange:
- No API bills
- No network round-trips (lower latency)
- No data exfiltration possible
- Works offline
Why Not Hybrid?
“What if we use local for drafts and cloud for final polish?”
No. This creates a false sense of privacy. Users think they’re protected, but their data still leaves the device. We reject half-measures.
The Models We Use
| Model | Size | Use Case |
|---|---|---|
| Sherpa Zipformer | ~100MB | Real-time streaming |
| Whisper Small | ~500MB | High-accuracy batch |
| Silero VAD | ~2MB | Voice activity detection |
| FunctionGemma | ~200MB | Intent recognition |
All models are open-source and can be audited.
Low Latency
Time-to-first-token should be < 200ms. Delays above this threshold are noticeable and disrupt the feedback loop.
The Problem with Traditional Architectures
Most speech-to-text systems work like this:
Mic → JavaScript → JSON → HTTP → Server → Model → HTTP → JSON → UI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
500-2000ms latency
Every boundary crossing adds latency:
- JS ↔ Native: Serialization overhead
- HTTP: Network round-trip
- JSON: Parsing overhead
Our Implementation
Audio stays in Rust and uses shared memory pointers:
Mic → Rust → Arc<[f32]> → Model → Text → UI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
45ms latency
Key techniques:
- Arc<[f32]>: Shared memory pointers
- Bounded MPSC channels for backpressure
- Dedicated threads for inference
Measuring Latency
We track latency at every stage using atomic counters:
pub struct PipelineStatus {
audio_lag_ms: AtomicI64, // Time since audio was captured
inference_time_ms: AtomicU64, // Model execution time
dropped_chunks: AtomicU64, // Backpressure indicator
}
These metrics are lock-free—measuring latency doesn’t add latency.
The Streaming Advantage
Traditional “batch” transcription waits for you to finish speaking, then processes everything at once. You might wait 2-3 seconds for results.
Streaming transcription processes audio continuously:
| Time | Batch Model | Streaming Model |
|---|---|---|
| 0ms | (waiting) | (waiting) |
| 100ms | (waiting) | “The” |
| 200ms | (waiting) | “The quick” |
| 500ms | (waiting) | “The quick brown” |
| 1000ms | (waiting) | “The quick brown fox” |
| 1500ms | “The quick brown fox” | “The quick brown fox jumps” |
Streaming provides immediate visual feedback.
Rust + Tauri
The Stack
| Layer | Technology | Why |
|---|---|---|
| Core Logic | Rust | Performance, safety, no GC |
| Desktop Shell | Tauri v2 | Lightweight, secure |
| UI | React | Developer familiarity |
| Inference | ONNX Runtime | Universal model format |
Why Tauri?
Tauri uses the system’s native webview instead of bundling Chromium, which reduces binary size and RAM usage. For a voice assistant that may run continuously, lower idle resource usage helps.
Note: The app requires ~500MB of model downloads on first run, so the binary size savings are offset by the ML models. The main benefit is runtime efficiency.
The Tauri Architecture
┌─────────────────────────────────────────────────┐
│ Tauri App │
│ ┌───────────────────┐ ┌────────────────────┐ │
│ │ Rust Backend │ │ WebView (UI) │ │
│ │ │ │ │ │
│ │ ┌─────────────┐ │ │ React + TypeScript│ │
│ │ │ Audio Bus │ │ │ │ │
│ │ │ STT Engine │◀─┼──┼─ invoke() │ │
│ │ │ VAD │──┼──┼─▶ events │ │
│ │ └─────────────┘ │ │ │ │
│ └───────────────────┘ └────────────────────┘ │
└─────────────────────────────────────────────────┘
- Rust Backend: All heavy lifting (audio, inference, VAD)
- WebView: Native OS webview (not bundled Chromium)
- Communication: Tauri’s IPC (commands + events)
The UI is a Passenger
The React frontend is intentionally “dumb”:
- It displays text from the backend
- It sends commands (start/stop recording)
- It never touches audio data directly
This separation means:
- UI bugs can’t crash the audio pipeline
- The UI can be replaced without touching core logic
- Heavy computation never blocks rendering
Security Model
Tauri uses a capability-based permission system:
// plugins/recorder/permissions/default.json
{
"permissions": ["recorder:start", "recorder:stop"],
"deny": ["fs:write", "shell:execute"]
}
Each plugin declares exactly what it needs. Everything else is denied by default.
Core Features
gibb.eri.sh isn’t just a transcription tool—it’s an intelligent voice interface.
Feature Overview
Hybrid Inference Engine
Choose your trade-off: instant feedback or maximum accuracy.
- Streaming Mode: Words appear in real-time (~50ms updates)
- Batch Mode: Higher accuracy, processed on pauses
Smart Turn Detection
Standard voice detection only hears silence. gibb.eri.sh hears completion.
- Knows when you’re thinking vs. when you’re done
- Uses neural analysis, not just timers
- Configurable sensitivity profiles
Agentic Tools
A local LLM understands your intent and executes actions.
- “What is the weather in Barcelona” → Opens browser with results
- Runs entirely offline
- Extensible tool system
Context Engine
The system knows what you are doing.
- Dev Mode: Coding in VS Code? Git tools are enabled.
- Meeting Mode: In a Zoom call? Transcription tools are enabled.
- Implicit Context: “Summarize this” works on your current selection.
The Interface
Unified Activity Feed
All system events—transcripts, voice commands, and tool results—flow into a single linear feed. This provides a clear “log” of your interaction with the OS.
Mode Badge
A visual indicator in the header shows your current mode (Dev, Meeting, Global). You can click the badge to “pin” a specific mode, overriding automatic detection.
Feature Matrix
| Feature | Streaming | Batch | Notes |
|---|---|---|---|
| Real-time display | ✓ | Simulated | Batch shows “draft” text |
| Accuracy | Good | Excellent | Batch wins on proper nouns |
| Latency | ~50ms | ~500ms | Per-update latency |
| Languages | English | 99+ | Whisper supports many |
| Smart Turn | ✓ | ✓ | Works with both modes |
| Agentic | ✓ | ✓ | Triggers on commit |
Coming Soon
- Speaker Diarization — “Who said what?”
- Punctuation Restoration — Automatic commas and periods
- Custom Wake Words — “Hey gibb.eri.sh”
Hybrid Inference Engine
gibb.eri.sh supports two modes of operation, selectable at runtime. Each has trade-offs.
Streaming Mode (Sherpa-ONNX Zipformer)
Best for: Dictation, live captioning, instant feedback
Audio ─▶ [Transducer] ─▶ Partial results every ~50ms
How It Works
The Zipformer model uses a transducer architecture:
- Processes audio in small chunks (10-20ms)
- Maintains internal state between chunks
- Emits partial hypotheses continuously
- Refines predictions as context grows
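As a sketch, the loop those steps imply looks like this. The recognizer API loosely follows the streaming recognizer shown later under Implementation Details; the mic source and names are illustrative, not the actual crate surface:

```rust
// Sketch only: chunked streaming decode with continuously refined partials.
// `StreamingRecognizer` follows the shape used in Implementation Details;
// `mic_chunks` is a stand-in for the real audio source.
fn stream_decode(
    recognizer: &mut StreamingRecognizer,
    mic_chunks: impl Iterator<Item = Vec<f32>>,
) {
    for chunk in mic_chunks {
        // 10-20ms of 16kHz mono audio per iteration
        recognizer.accept_waveform(&chunk);
        // Partial hypothesis, refined as acoustic context grows
        let partial = recognizer.get_result();
        println!("partial: {partial}");
    }
}
```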
Characteristics
| Aspect | Value |
|---|---|
| Latency | ~50ms per update |
| Accuracy | Good (may miss proper nouns) |
| Languages | English (primary) |
| Model Size | ~100MB |
When Streaming Struggles
- Proper nouns: “Kubernetes” might become “Cooper Netties”
- Rare words: Technical jargon may be misheard
- Accents: Less training data for non-standard speech
Batch Mode (Parakeet / Whisper)
Best for: Meetings, archival, accuracy-critical tasks
Audio ─▶ [VAD Buffer] ─▶ [Encoder-Decoder] ─▶ Final text on pause
How It Works
Batch models see the entire utterance before producing output:
- VAD detects speech boundaries
- Audio is buffered during speech
- Model processes the complete segment
- Result is highly accurate
Characteristics
| Aspect | Value |
|---|---|
| Latency | ~500ms after speech ends |
| Accuracy | Excellent |
| Languages | 99+ (Whisper) |
| Model Size | ~500MB |
Simulated Streaming
Users want batch accuracy with streaming feel. We fake it:
- Run partial inference every 500ms on the growing buffer
- Display “volatile” text (gray, may change)
- On VAD trigger, run final inference
- Replace volatile text with “stable” text (black, final)
Speaking: "The quick brown fox"
Time 0ms:    [ ]                      (buffering)
Time 500ms:  [The quick ]             volatile
Time 1000ms: [The quick brown]        volatile
Time 1200ms: (pause detected)
Time 1400ms: [The quick brown fox.]   stable ✓
Switching Modes
Users can switch modes at runtime via the Settings sheet:
// Frontend
await invoke('plugin:stt|set_mode', { mode: 'streaming' });
await invoke('plugin:stt|set_mode', { mode: 'batch' });
The backend handles the transition gracefully, draining any buffered audio.
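For illustration, the backend half of that command could look like the sketch below. SttState, the drain() call, and the constructor signatures are assumed names built on the Strategy pattern described in the Architecture chapter, not the shipped handler:

```rust
// Hypothetical sketch of the set_mode command (not the actual handler).
// Assumes a tokio-Mutex-guarded Box<dyn SttEngine> in Tauri managed state.
#[tauri::command]
async fn set_mode(state: tauri::State<'_, SttState>, mode: Mode) -> Result<(), String> {
    let mut engine = state.engine.lock().await;
    // Drain buffered audio through the old engine before swapping
    engine.drain().map_err(|e| e.to_string())?;
    *engine = match mode {
        Mode::Streaming => Box::new(SherpaEngine::new().map_err(|e| e.to_string())?),
        Mode::Batch => Box::new(ParakeetEngine::new().map_err(|e| e.to_string())?),
    };
    Ok(())
}
```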
Model Recommendations
We’ve tested many models. Here are our picks:
| Use Case | Recommended Model |
|---|---|
| General dictation | Sherpa Zipformer (streaming) |
| Meetings | Whisper Small (batch) |
| Non-English | Whisper Small (batch) |
| Low-end hardware | Sherpa Zipformer (streaming) |
Smart Turn Detection
Standard VAD detects silence. Smart Turn detects completion.
The Problem
Voice Activity Detection (VAD) detects silence. Humans detect pauses.
We pause for many reasons:
- Thinking: “I want to… [pause] …explain something”
- Breathing: Natural respiratory pauses
- Emphasis: “This is… [dramatic pause] …important”
- Completion: “That’s all I have to say.”
Standard VAD treats all pauses the same. This leads to:
- Sentences being split mid-thought
- Awkward commit timing
- User frustration
The Solution
We implement a Neural Turn Detector inspired by Daily.co’s Smart Turn v3.1 research.
Instead of just measuring silence, we analyze:
- Acoustic features: Pitch contour, energy decay
- Timing: Duration and pattern of the pause
- Semantic probability: Is this a likely sentence ending?
The Algorithm
if (Silence > 300ms AND Probability(EndOfSentence) > 0.5):
Commit()
else:
Wait()
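Rendered as a Rust sketch, the decision gates on silence first, then on the semantic check. The constants mirror the default Redemption Time and Sensitivity values in the tables below, and TurnDetector is the trait shown under Implementation:

```rust
// Sketch of the commit decision: redemption timer first, then Smart Turn.
fn should_commit(
    silence: std::time::Duration,
    detector: &dyn TurnDetector,
    audio_16k_mono: &[f32],
) -> bool {
    const REDEMPTION: std::time::Duration = std::time::Duration::from_millis(300);
    const THRESHOLD: f32 = 0.5;
    if silence < REDEMPTION {
        return false; // still inside the grace period
    }
    // Ask the Smart Turn model whether this sounds like a sentence ending
    detector
        .predict_endpoint_probability(audio_16k_mono)
        .map(|p| p > THRESHOLD)
        .unwrap_or(false)
}
```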
Components
| Component | Role |
|---|---|
| Silero VAD | Detects raw silence |
| Smart Turn Model | Predicts sentence completion |
| Redemption Timer | Grace period before commit |
Implementation
The Smart Turn detector lives in crates/smart-turn:
pub struct SmartTurnV31Cpu {
session: Mutex<Session>, // ONNX Runtime session
input_name: String,
output_name: String,
}
impl TurnDetector for SmartTurnV31Cpu {
fn predict_endpoint_probability(
&self,
audio_16k_mono: &[f32]
) -> Result<f32, TurnError> {
// Returns probability 0.0-1.0 that speaker is done
}
}
Configuration
Users can tune the behavior via Settings:
Redemption Time
The grace period after silence begins before we even consider committing.
| Setting | Value | Effect |
|---|---|---|
| Fast | 200ms | Quick commits, may split sentences |
| Balanced | 300ms | Default, good for most users |
| Relaxed | 500ms | Waits longer, better for slow speakers |
Sensitivity
How confident must we be that the sentence is complete?
| Setting | Threshold | Effect |
|---|---|---|
| Aggressive | 0.3 | Commits on weak signals |
| Normal | 0.5 | Balanced |
| Conservative | 0.7 | Only commits on strong endings |
The Flow
graph TD
A[Audio Input] --> B{VAD: Speech?}
B -->|Yes| C[Buffer Audio]
B -->|No| D{Silence > Redemption?}
D -->|No| C
D -->|Yes| E[Smart Turn Analysis]
E --> F{P(End) > Threshold?}
F -->|Yes| G[Commit Text]
F -->|No| C
G --> H[Reset State]
Real-World Impact
Without Smart Turn:
User: "I think we should... [thinking pause]"
System: COMMIT → "I think we should"
User: "...consider the alternatives"
System: COMMIT → "consider the alternatives"
With Smart Turn:
User: "I think we should... [thinking pause] ...consider the alternatives"
System: (waiting, P(End) = 0.2)
System: (waiting, P(End) = 0.3)
User: [longer pause, falling intonation]
System: (P(End) = 0.7) COMMIT → "I think we should consider the alternatives"
Agentic Tools
gibb.eri.sh doesn’t just transcribe—it understands. And crucially, it understands context.
The Concept
A local LLM monitors your speech for intents. But unlike dumb assistants, gibb.eri.sh changes its capabilities based on what you are doing.
Contextual Modes
The available tools change dynamically based on your environment.
1. Global Mode (Default)
Always available.
- Tools: web_search, app_launcher, system_control
- Example: “Open Figma”, “Turn up the volume”, “What is quantum computing”
2. Meeting Mode
Triggered when: A meeting app (Zoom, Teams, Slack) is using the microphone.
- Tools: transcript_marker, add_todo
- Example: “Flag this as important”, “Add action item for Marc”
3. Dev Mode
Triggered when: An IDE (VS Code, IntelliJ, Terminal) is the active window.
- Tools: git_voice, file_finder
- Example: “Undo last commit”, “Find the user struct”
How It Works
The Pipeline
Context Engine ─▶ [State: Dev Mode]
│
▼
User Speech ───▶ [Router] ───▶ Tool Registry (Filter: Dev + Global)
│
▼
[FunctionGemma LLM]
(Only sees ~5 relevant tools)
│
▼
[Executor] ─▶ git_voice
Event-Driven Architecture
The Tools plugin listens for stt:stream_commit events and combines them with the latest ContextState:
// plugins/tools/src/router.rs
// 1. Get current mode (e.g., Dev)
let mode = state.context.effective_mode();
// 2. Filter registry
let tools = registry.tools_for_mode(mode);
// 3. Build system prompt with ONLY those tools
let prompt = build_prompt(tools);
// 4. Run Inference
let result = llm.infer(prompt, user_text);
Why Dynamic Filtering?
- Accuracy: The LLM isn’t confused by “Book a flight” when you’re trying to “Book a meeting room”. Smaller search space = fewer hallucinations.
- Performance: Less text in the system prompt = faster inference.
- Safety: Destructive tools (like git reset) are only exposed when you are explicitly focusing on your code editor.
Context Injection
The LLM doesn’t just see your command—it sees your environment. Before every inference, we inject a context snapshot:
Current Context:
Mode: Dev
Active App: VS Code
Clipboard: "RuntimeError: Connection refused at port 8080"
Date: 2025-12-27
This enables implicit referencing:
| You say | LLM infers |
|---|---|
| “Search this error” | web_search{query: "RuntimeError: Connection refused"} |
| “Open that app” | Resolves from active window context |
| “What does this mean?” | Uses clipboard or selection |
What Gets Injected
- Mode: Current mode (Global, Dev, Meeting)
- Active App: Name of the focused application
- Clipboard: First ~200 chars of clipboard text
- Selection: Currently selected text (via Accessibility API)
- Date: Current date (for scheduling-aware commands)
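As a sketch (field names are illustrative, not the actual crates/context API), building that preamble is plain string assembly:

```rust
// Illustrative only: serialize a context snapshot into the prompt preamble.
struct ContextSnapshot {
    mode: String,
    active_app: String,
    clipboard: String,
    date: String,
}

fn render_snapshot(ctx: &ContextSnapshot) -> String {
    // Truncate clipboard to the first ~200 chars, as described above
    let clip: String = ctx.clipboard.chars().take(200).collect();
    format!(
        "Current Context:\n Mode: {}\n Active App: {}\n Clipboard: \"{}\"\n Date: {}",
        ctx.mode, ctx.active_app, clip, ctx.date
    )
}
```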
The Magic Word: “This”
Because gibb.eri.sh knows your context, you can use deictic references:
- User says: “Summarize this.”
- Context Engine:
- Checks active app (e.g., Chrome).
- Grabs currently selected text (via Accessibility API).
- The LLM sees this in the context and fills the argument automatically.
We also support “what I just said”:
- User says: “Create a todo from what I just said.”
- System: Grabs the last 30 seconds of transcript history.
This allows generic commands to work across any application without specific integrations.
Feedback Loop
Tools don’t just execute—they respond. After a tool runs, the result is fed back to the LLM for summarization.
The Flow
User: "What is quantum computing?"
│
▼
[FunctionGemma] → web_search{query: "quantum computing"}
│
▼
[Wikipedia API] → {title: "Quantum computing", summary: "...uses qubits..."}
│
▼
[FunctionGemma] → "Quantum computing uses qubits instead of classical bits,
enabling exponential speedups for certain problems."
│
▼
[UI] → Displays summary (or speaks via TTS)
Why This Matters
- Accessibility: You don’t have to read raw JSON or API responses.
- Natural Language: Results are summarized conversationally.
- Composability: The model can chain thoughts based on results.
Available Tools
Global
- System Control: Volume, Mute, Media keys.
- App Launcher: Opens applications.
- Web Search: Knowledge lookups (Wikipedia by default, extensible to other sources).
- The Typer: Voice-controlled typing.
- Smart Injection: Types short phrases char-by-char for natural interaction.
- Transparent Paste: For long text blocks, it saves your current clipboard, pastes the content instantly via Cmd+V, and restores your original clipboard after a short delay.
- Context Awareness: “Paste this here” knows to use the active selection as the source.
Meeting
- Transcript Marker: Inserts [FLAG] or [TODO] tags into the transcript file.
- Add Todo: Appends a line to your daily notes.
Development
- Git Voice: Wraps common git commands.
- File Finder: Uses mdfind (Spotlight) to locate files in the current project context.
Adding Custom Tools
Tools are defined in plugins/tools/src/tools/ and must implement is_available_in(mode):
impl Tool for GitVoiceTool {
fn name(&self) -> &'static str { "git_voice" }
fn modes(&self) -> &'static [Mode] {
&[Mode::Dev]
}
// ...
}
Agent Skills
Extend gibb.eri.sh with Bash, Python, or Node.js.
The “Hands” of the Voice OS are extensible. We use the Agent Skills standard (SKILL.md) to let you add new tools without writing Rust.
How it works
- Drop a file: Put a SKILL.md file in ~/Library/Application Support/gibb.eri.sh/skills/.
- Define the tool: Describe what it does and the command to run.
- Speak: The LLM sees your new tool and uses it when relevant.
Example: Summarizer Skill
Create skills/summarize/SKILL.md:
---
name: super_summarizer
version: 1.0.0
description: Extract and summarize content from URLs.
---

## Tools

### extract_content
Extracts clean text from a URL.

**Command:**
```bash
npx @steipete/summarize {{source}} --extract-only
```

**Parameters:**
- `source` (string, required): The URL.

The Spec
We support a strict subset of the Agent Skills standard for safety.
File Format
- Frontmatter: YAML with name and description.
- Tool Blocks: Markdown sections defining the tool name, description, command, and parameters.
Execution Model
- No Shell: We execute the binary directly (program + args). No sh -c.
- Interpolation: {{param}} in the command block is replaced by the JSON argument from the LLM.
Context Awareness
You can restrict skills to specific modes by adding a modes field to the frontmatter:
```yaml
modes: [Dev, Global]
```
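To make the execution model concrete, here is a sketch of interpolation plus direct execution under these rules. It is illustrative, not the actual skills runtime, and the naive whitespace split stands in for real argument handling:

```rust
// Sketch: replace {{param}} placeholders, then exec the program directly.
// No `sh -c`, so shell metacharacters in arguments stay inert.
use std::process::{Command, Output};

fn run_tool(template: &str, args: &serde_json::Value) -> std::io::Result<Output> {
    let mut rendered = template.to_string();
    if let Some(obj) = args.as_object() {
        for (key, value) in obj {
            let placeholder = format!("{{{{{key}}}}}"); // e.g. "{{source}}"
            rendered = rendered.replace(&placeholder, value.as_str().unwrap_or_default());
        }
    }
    let mut parts = rendered.split_whitespace(); // naive tokenization for the sketch
    let program = parts.next().expect("empty command");
    Command::new(program).args(parts).output()
}
```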
System Architecture
gibb.eri.sh is organized as a Modular Monolith—a single binary with strictly decoupled internal components.
Why Modular Monolith?
| Architecture | Pros | Cons |
|---|---|---|
| Monolith | Simple deployment, shared memory | Tight coupling, hard to test |
| Microservices | Independent scaling, isolation | Network overhead, complexity |
| Modular Monolith | Best of both | Requires discipline |
Performance of a monolith. Maintainability of services.
The Two Layers
┌─────────────────────────────────────────────────────────┐
│ Tauri App │
│ ┌─────────────────────────────────────────────────────┐│
│ │ plugins/ ││
│ │ Adapters: Translate between crates and Tauri IPC ││
│ │ • recorder/ • stt-worker/ • tools/ ││
│ └─────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ crates/ ││
│ │ Pure Rust: Zero dependencies on Tauri or UI ││
│ │ • audio/ • bus/ • context/ • stt/ • vad/ ││
│ └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘
crates/ — The Engine
Pure Rust libraries with no UI dependencies:
- Can be compiled to CLI tools
- Can be wrapped with FFI for iOS/Android
- Fully unit-testable
plugins/ — The Adapters
Tauri-specific glue code:
- Exposes crate functionality as Tauri commands
- Handles IPC serialization
- Manages permissions
Key Design Patterns
Dependency Inversion
High-level modules don’t depend on low-level modules. Both depend on abstractions.
// crates/application doesn't know about Sherpa or Parakeet
// It only knows about the SttEngine trait
pub fn transcribe(engine: &dyn SttEngine, audio: &[f32]) -> Vec<Segment> {
engine.transcribe(audio)
}
Strategy Pattern
Swap implementations at runtime without changing calling code.
let engine: Box<dyn SttEngine> = match config.mode {
Mode::Streaming => Box::new(SherpaEngine::new()?),
Mode::Batch => Box::new(ParakeetEngine::new()?),
};
Event-Driven Communication
Components communicate via events, not direct calls.
// Producer (STT Worker)
app.emit("stt:stream_commit", &segment)?;
// Consumer (Tools Plugin) - doesn't know about STT internals
app.listen("stt:stream_commit", |event| { ... });
Deep Dives
- Crate Structure — What each crate does
- Audio Bus — How audio is distributed to consumers
- Event System — How components communicate
Crate Structure
The crates/ directory contains the domain logic. Each crate has a single responsibility and zero dependencies on Tauri or the UI.
Overview
crates/
├── application/ # Orchestration & State Machine
├── audio/ # Capture, AGC, Resampling
├── bus/ # Zero-copy Audio Pipeline
├── context/ # OS Awareness (Active App, Mic State)
├── detect/ # Meeting App Logic
├── events/ # Shared Event Contracts (DTOs)
├── models/ # Model Registry & Downloads
├── parakeet/ # NVIDIA Parakeet Backend
├── sherpa/ # Sherpa-ONNX Backend
├── smart-turn/ # Semantic Endpointing
├── storage/ # SQLite Persistence
├── stt/ # Engine Traits & Abstractions
├── transcript/ # Data Structures
├── turn/ # Turn Detection Traits
└── vad/ # Silero VAD Integration
Core Components
bus
The nervous system. Delivers audio from recorder to consumers.
Key feature: Uses Arc<[f32]> so audio is allocated once and shared across all consumers.
context
The senses. Aggregates system state to drive the context engine.
- Active App: Which window has focus?
- Mic State: Is a meeting app using the hardware?
- Mode: Derives intent (Dev, Meeting, Global).
stt
Defines the SttEngine trait. Infrastructure crates (sherpa, parakeet) implement this.
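Reconstructed from how it is used throughout these docs (this is a sketch, not the verbatim source), the trait is roughly:

```rust
// Approximate shape of the SttEngine trait, inferred from usage elsewhere.
pub trait SttEngine: Send + Sync {
    fn transcribe(&self, audio_16k_mono: &[f32]) -> anyhow::Result<Vec<Segment>>;
}

// Segments carry timing plus text, per the CLI example in the Developer Guide.
pub struct Segment {
    pub start_ms: u64,
    pub end_ms: u64,
    pub text: String,
}
```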
audio
Handles microphone capture and preprocessing:
- Resampling: rubato for high-quality sample rate conversion.
- AGC: Automatic gain control with soft-clipping.
vad
Wraps Silero VAD for voice activity detection.
Dependency Graph
application
├── bus
├── stt (trait only)
├── vad (trait only)
└── turn (trait only)
sherpa
└── stt (implements SttEngine)
parakeet
└── stt (implements SttEngine)
The application crate never imports sherpa or parakeet directly—only their traits.
Audio Bus
The audio bus distributes microphone data to multiple consumers (VAD, STT, visualizer) using shared memory.
Why Shared Memory?
At 16kHz mono, audio is only ~32KB/sec—not “big data.” The issue isn’t throughput, it’s latency consistency. Without shared memory, audio gets copied at each boundary (Mic → JS → Rust → Model → UI), and each copy can introduce jitter. Unpredictable delays destroy the real-time feel even if average latency is low.
Using Arc<[f32]> means one allocation, shared by all consumers. No copying, no jitter from allocations.
Design
Audio is allocated once and shared via Arc<[f32]>:
Mic → Recorder → Arc<[f32]> ─┬─▶ VAD
├─▶ STT
└─▶ Visualizer
All consumers read the same memory.
Implementation
AudioChunk
pub struct AudioChunk {
pub seq: u64, // Monotonic sequence number
pub ts_ms: i64, // Capture timestamp
pub sample_rate: u32, // Always 16000 Hz
pub samples: Arc<[f32]>, // The actual audio data
}
Arc<[f32]> is an atomically reference-counted slice. Memory is freed when the last consumer drops its reference.
AudioBus
pub struct AudioBus {
tx: mpsc::Sender<AudioChunk>,
config: BusConfig,
}
impl AudioBus {
pub fn publish(&self, chunk: AudioChunk) -> Result<()> {
self.tx.send(chunk)?;
Ok(())
}
}
Listener
pub struct Listener {
rx: mpsc::Receiver<AudioChunk>,
dropped: Arc<AtomicU64>,
}
impl Listener {
pub async fn recv(&mut self) -> Option<AudioChunk> {
self.rx.recv().await
}
pub fn drain_to_latest(&mut self) -> Option<AudioChunk> {
// Skip old chunks, return only the newest
let mut latest = None;
while let Ok(chunk) = self.rx.try_recv() {
self.dropped.fetch_add(1, Ordering::Relaxed);
latest = Some(chunk);
}
latest
}
}
Backpressure
What if STT can’t keep up with audio? Options:
- Block: Producer waits for consumer (bad: causes audio drops)
- Buffer: Queue grows unbounded (bad: uses memory, increases latency)
- Drop: Discard old data, keep real-time (good: for live transcription)
We use bounded channels with drop policy:
let (tx, rx) = mpsc::channel(BUFFER_SIZE); // e.g., 100 chunks
// If buffer is full, oldest chunks are available to drain
The drain_to_latest() method lets slow consumers catch up by skipping to the newest audio.
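Usage sketch: a consumer that only needs the freshest audio, such as a level meter or waveform display (draw_waveform is a hypothetical stand-in):

```rust
// Skip the backlog and render only the newest chunk.
async fn visualizer_loop(mut listener: Listener) {
    loop {
        if let Some(chunk) = listener.drain_to_latest() {
            draw_waveform(&chunk.samples);
        }
        // ~30 fps is plenty for a meter; skipped chunks count as dropped
        tokio::time::sleep(std::time::Duration::from_millis(33)).await;
    }
}

fn draw_waveform(_samples: &[f32]) { /* hypothetical renderer */ }
```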
Pipeline Status
Performance metrics are tracked with atomic counters:
pub struct PipelineStatus {
audio_lag_ms: AtomicI64, // How far behind real-time
inference_time_ms: AtomicU64, // Last model execution time
dropped_chunks: AtomicU64, // Backpressure indicator
}
Diagram
graph LR
Mic[Microphone] -->|Raw Samples| Recorder
Recorder -->|Arc<[f32]>| Bus[MPSC Channel]
Bus -->|recv| VAD[Silero VAD]
Bus -->|recv| STT[STT Engine]
STT -->|Text Event| UI[Frontend]
Event System
Components communicate through events, not direct function calls. This enables loose coupling and easy extensibility.
The Contract (crates/events)
We avoid “stringly typed” programming by defining all event payloads in a shared crate.
// crates/events/src/lib.rs
#[derive(Serialize, Deserialize)]
pub struct StreamCommitEvent {
pub text: String,
pub confidence: f32,
}
This ensures that:
- Type Safety: Producers and consumers must agree on the struct definition.
- No Typos: Event names are constants (events::STT_STREAM_COMMIT).
- Versioning: Changes to the contract break the build, not runtime.
Two-Tier Architecture
We separate high-frequency data from low-frequency control:
Tier 1: Data (Rust Internal)
High-bandwidth, binary data that never leaves Rust:
| Channel | Type | Purpose |
|---|---|---|
| tokio::sync::mpsc | Bounded | Audio chunks |
| tokio::sync::broadcast | Bounded (lagging receivers skip) | Control signals |
Tier 2: Control (Rust → Frontend)
Low-bandwidth metadata sent to the UI:
| Event | Payload | Frequency |
|---|---|---|
| stt:stream_commit | StreamCommitEvent | ~1/sec |
| context:changed | ContextChangedEvent | On focus change |
Event Flow
sequenceDiagram
participant R as Recorder
participant B as Audio Bus
participant S as STT Worker
participant E as Tauri Events
participant T as Tools Plugin
participant U as UI (React)
R->>B: publish(Arc<[f32]>)
B->>S: recv()
S->>S: Inference
S->>E: emit(StreamCommitEvent)
par Parallel delivery
E->>U: on(StreamCommitEvent)
E->>T: listen(StreamCommitEvent)
end
T->>T: FunctionGemma Router
Developer Guide
Welcome, contributor! This guide will help you extend gibb.eri.sh.
Prerequisites
- Rust (stable, 1.75+)
- Node.js (20+)
- macOS (for now—Linux/Windows coming)
Quick Start
# Clone
git clone https://github.com/mpuig/gibb.eri.sh
cd gibb.eri.sh
# Install frontend dependencies
cd apps/desktop && npm install
# Run in development mode
npm run tauri dev
Project Structure
gibb.eri.sh/
├── apps/
│ └── desktop/ # Tauri app
│ ├── src/ # React frontend
│ └── src-tauri/ # Rust backend
├── crates/ # Pure Rust libraries
├── plugins/ # Tauri plugin adapters
├── scripts/ # Build & conversion tools
└── docs/ # This documentation
Development Workflow
Making Changes
- Pure logic? → Edit in crates/
- UI interaction? → Edit in plugins/
- Frontend? → Edit in apps/desktop/src/
Testing
# Run all Rust tests
cargo test --workspace
# Run a specific crate's tests
cargo test -p gibberish-bus
Building
# Debug build
cd apps/desktop && npm run tauri dev
# Release build
npm run tauri build
Guides
- Adding Features — The proper way to extend functionality
- Adding Languages — Support new languages via NeMo CTC
- Headless Engine — Use the core without UI
Code Style
Rust
- Use rustfmt (default settings)
- Prefer Result<T> over panics
- Document public APIs with ///
TypeScript
- Use Prettier (default settings)
- Prefer functional components with hooks
- Type everything (no any)
Getting Help
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Adding Features
gibb.eri.sh is designed to be extensible. Depending on what you want to add, you have two paths: Agent Skills or Native Plugins.
Which path should I take?
| Goal | Path |
|---|---|
| Add a tool (Git, Jira, Docker, Scripts) | Agent Skill (Recommended) |
| Add a new audio processor or OS sensor | Native Crate/Plugin |
| Change the core STT/LLM logic | Native Crate |
1. The Easy Way: Agent Skills
If your feature involves running a CLI command or a script, do not write Rust. Use a Skill Pack. It’s faster, safer, and doesn’t require recompiling the app.
2. The Native Way: Plugins
Use this for features that need low-level OS access or high-performance data processing.
The Golden Rule
Domain logic in crates/. Tauri glue in plugins/.
Never put business logic in plugins. Plugins are thin adapters that translate between Rust and JavaScript.
Step-by-Step Example: Word Counter
Let’s add a “Native” feature that counts words in real-time.
Step 1: Create the Domain Crate
cd crates
cargo new --lib wordcount
crates/wordcount/src/lib.rs:
pub struct WordCounter {
total: usize,
}
impl WordCounter {
pub fn new() -> Self { Self { total: 0 } }
pub fn add(&mut self, text: &str) -> usize {
self.total += text.split_whitespace().count();
self.total
}
}
Step 2: Create the Tauri Plugin
cd plugins
cargo new --lib wordcount
plugins/wordcount/src/lib.rs:
use std::sync::Mutex;
use tauri::plugin::{Builder, TauriPlugin};
use tauri::{Listener, Manager, Runtime};
use wordcount::WordCounter; // the domain crate from Step 1
use gibberish_events::event_names::STT_STREAM_COMMIT;
use gibberish_events::StreamCommitEvent;
pub fn init<R: Runtime>() -> TauriPlugin<R> {
Builder::new("wordcount")
.setup(|app, _api| {
app.manage(Mutex::new(WordCounter::new()));
// Listen for events using the shared contract
app.listen_any(STT_STREAM_COMMIT, move |event| {
if let Ok(payload) = serde_json::from_str::<StreamCommitEvent>(event.payload()) {
// Logic here...
}
});
Ok(())
})
.build()
}
Testing Tips
Dependency Injection
Don’t use std::process::Command directly in your crates. Use the SystemEnvironment trait from plugins/tools. This allows you to mock OS calls in unit tests without actually executing code on the host.
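A sketch of the pattern (names assumed, since the trait lives in plugins/tools):

```rust
// Illustrative shape of the SystemEnvironment abstraction described above.
pub trait SystemEnvironment: Send + Sync {
    fn run(&self, program: &str, args: &[&str]) -> anyhow::Result<String>;
}

// Production: actually executes the process.
struct RealEnv;
impl SystemEnvironment for RealEnv {
    fn run(&self, program: &str, args: &[&str]) -> anyhow::Result<String> {
        let out = std::process::Command::new(program).args(args).output()?;
        Ok(String::from_utf8_lossy(&out.stdout).into_owned())
    }
}

// Tests: returns canned output without touching the host.
struct MockEnv(String);
impl SystemEnvironment for MockEnv {
    fn run(&self, _program: &str, _args: &[&str]) -> anyhow::Result<String> {
        Ok(self.0.clone())
    }
}
```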
Shared Events
Always use the gibberish-events crate for inter-plugin communication. This prevents runtime “stringly-typed” errors.
Adding Languages
gibb.eri.sh can transcribe any language for which a model exists. Here’s how to add one.
Overview
- Find a compatible model (CTC or Transducer)
- Convert to ONNX format
- Register in the model metadata
- Test!
Case Study: Adding Catalan
We added Catalan using a NeMo Conformer CTC model from Hugging Face.
Step 1: Find a Model
Good sources: the Hugging Face Hub (where the Catalan model above came from).
Look for:
- CTC or Transducer architecture (NOT encoder-decoder like Whisper)
- 16kHz sample rate
- Good accuracy on your target language
Step 2: Convert to ONNX
Most models are in PyTorch format. We need ONNX for Sherpa.
For NeMo Models
We provide a conversion script:
cd scripts
python export_nemo_ctc.py \
--model "path/to/model.nemo" \
--output "catalan-nemo-ctc" \
--language "ca"
This produces:
- model.onnx — The neural network
- tokens.txt — The vocabulary
What the Script Does
import nemo.collections.asr as nemo_asr
import torch
# Load PyTorch model
model = nemo_asr.models.EncDecCTCModel.restore_from("model.nemo")
# Create dummy input for tracing
dummy_audio = torch.randn(1, 16000) # 1 second of audio
dummy_length = torch.tensor([16000])
# Export to ONNX
torch.onnx.export(
model,
(dummy_audio, dummy_length),
"model.onnx",
input_names=["audio", "length"],
output_names=["logits"],
dynamic_axes={
"audio": {0: "batch", 1: "time"},
"length": {0: "batch"},
},
)
# Extract vocabulary
with open("tokens.txt", "w") as f:
for token in model.decoder.vocabulary:
f.write(token + "\n")
Step 3: Host the Model
Upload to a public URL. Options:
- Hugging Face Hub
- GitHub Releases
- S3/GCS bucket
Step 4: Register the Model
Edit crates/models/src/metadata.rs:
pub const MODELS: &[ModelMetadata] = &[
// ... existing models
ModelMetadata {
id: "catalan-nemo-ctc",
name: "NeMo Conformer (Catalan)",
language: "ca",
model_type: ModelType::NemoCtc,
url: "https://huggingface.co/your-org/catalan-nemo-ctc/resolve/main/model.tar.gz",
size_mb: 120,
description: "Catalan speech recognition trained on Common Voice",
},
];
Step 5: Implement the Engine (if needed)
If using an existing architecture (NeMo CTC), the engine already exists:
// crates/sherpa/src/nemo_ctc.rs
pub struct NemoCtcEngine {
recognizer: sherpa_rs::OfflineRecognizer,
}
impl SttEngine for NemoCtcEngine {
fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
// ... implementation
}
}
Step 6: Test
# Unit test
cargo test -p gibberish-sherpa nemo_ctc
# Integration test
cd apps/desktop && npm run tauri dev
# Select "NeMo Conformer (Catalan)" in Settings
# Speak in Catalan!
Model Requirements
Architecture Support
| Architecture | Supported | Notes |
|---|---|---|
| CTC | ✓ | NeMo, Wav2Vec2 |
| Transducer | ✓ | Zipformer, Conformer |
| Encoder-Decoder | Via Whisper | Use Whisper models directly |
Audio Format
All models must accept:
- Sample rate: 16000 Hz
- Channels: Mono
- Format: Float32 PCM
Our gibberish-audio crate handles resampling automatically.
Vocabulary Format
tokens.txt should contain one token per line:
<blk>
a
b
c
...
z
'
<space>
Special tokens:
- <blk> or <blank> — CTC blank token
- <space> or ▁ — Word separator
- <unk> — Unknown token
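Loading the file is trivial; a hypothetical loader (not the actual crate code) for reference, where the token id is the line number:

```rust
// Sketch: tokens.txt → index-addressable vocabulary.
fn load_tokens(path: &str) -> std::io::Result<Vec<String>> {
    Ok(std::fs::read_to_string(path)?
        .lines()
        .map(str::to_owned)
        .collect())
}
```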
Troubleshooting
“Model produces garbage output”
Check vocabulary alignment. The token indices must match exactly.
“Model is slow”
Try quantization:
python -m onnxruntime.quantization.quantize \
--input model.onnx \
--output model_int8.onnx \
--quant_format QDQ
“Model crashes on long audio”
Some models have maximum sequence length. Chunk the audio:
const MAX_SECONDS: usize = 30;
let chunks = audio.chunks(MAX_SECONDS * 16000);
Contributing Models
If you successfully add a language:
- Upload to Hugging Face with a clear model card
- Add to MODELS in metadata.rs
- Submit a PR!
Contributions welcome for:
- Spanish
- French
- German
- Portuguese
Headless Engine
The core transcription engine has zero dependencies on Tauri or UI. You can use it standalone.
Why Headless?
- CLI tools: Build command-line transcription utilities
- Server applications: Run transcription as a service
- Mobile apps: Wrap with FFI for iOS/Android
- Testing: Unit test without UI overhead
Architecture
┌─────────────────────────────────────────┐
│ Your Application │
│ ┌───────────────────────────────────┐ │
│ │ gibberish-application │ │
│ │ (Orchestration & State Machine) │ │
│ └───────────────────────────────────┘ │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────┐ ┌─────────┐ ┌───────┐ │
│ │ bus │ │ stt │ │ vad │ │
│ └──────┘ └─────────┘ └───────┘ │
└─────────────────────────────────────────┘
(No Tauri, No React)
Example: CLI Transcriber
Here’s a minimal CLI that transcribes a WAV file:
// examples/cli_transcribe.rs
use gibberish_audio::load_wav;
use gibberish_sherpa::WhisperEngine;
use gibberish_stt::SttEngine;
fn main() -> anyhow::Result<()> {
let args: Vec<String> = std::env::args().collect();
let wav_path = args.get(1).expect("Usage: cli_transcribe <file.wav>");
// Load audio
let audio = load_wav(wav_path)?;
// Initialize engine
let engine = WhisperEngine::new("path/to/whisper-small")?;
// Transcribe
let segments = engine.transcribe(&audio)?;
// Print results
for segment in segments {
println!("[{:.2}s - {:.2}s] {}",
segment.start_ms as f64 / 1000.0,
segment.end_ms as f64 / 1000.0,
segment.text
);
}
Ok(())
}
Run it:
cargo run --example cli_transcribe recording.wav
Example: Real-Time Streaming
use gibberish_audio::{AudioCapture, AudioConfig};
use gibberish_bus::{AudioBus, AudioChunk};
use gibberish_sherpa::ZipformerEngine;
use gibberish_vad::SileroVad;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Set up audio capture
let config = AudioConfig {
sample_rate: 16000,
channels: 1,
};
let capture = AudioCapture::new(config)?;
// Set up bus
let (bus, mut listener) = AudioBus::new(100);
// Set up VAD and STT
let mut vad = SileroVad::new()?;
let engine = ZipformerEngine::new("path/to/zipformer")?;
// Start capture
capture.start(move |samples| {
let chunk = AudioChunk::new(samples);
let _ = bus.publish(chunk);
})?;
// Processing loop
loop {
if let Some(chunk) = listener.recv().await {
if vad.is_speech(&chunk.samples)? {
let result = engine.transcribe_streaming(&chunk.samples)?;
if !result.text.is_empty() {
print!("{}", result.text);
}
}
}
}
}
FFI: Using from Swift/Kotlin
For mobile apps, expose a C-compatible interface:
Rust Side
// src/ffi.rs
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
#[no_mangle]
pub extern "C" fn gibberish_init(model_path: *const c_char) -> *mut Engine {
let path = unsafe { CStr::from_ptr(model_path) }.to_str().unwrap();
let engine = Box::new(Engine::new(path).unwrap());
Box::into_raw(engine)
}
#[no_mangle]
pub extern "C" fn gibberish_transcribe(
engine: *mut Engine,
audio: *const f32,
length: usize,
) -> *mut c_char {
let engine = unsafe { &*engine };
let samples = unsafe { std::slice::from_raw_parts(audio, length) };
let result = engine.transcribe(samples).unwrap();
CString::new(result.text).unwrap().into_raw()
}
#[no_mangle]
pub extern "C" fn gibberish_free(engine: *mut Engine) {
unsafe { drop(Box::from_raw(engine)); }
}
#[no_mangle]
pub extern "C" fn gibberish_free_string(s: *mut c_char) {
unsafe { drop(CString::from_raw(s)); }
}
Swift Side
// Gibberish.swift
import Foundation
class Gibberish {
private var engine: OpaquePointer?
init(modelPath: String) {
engine = gibberish_init(modelPath)
}
deinit {
if let engine = engine {
gibberish_free(engine)
}
}
func transcribe(audio: [Float]) -> String {
guard let engine = engine else { return "" }
let result = audio.withUnsafeBufferPointer { ptr in
gibberish_transcribe(engine, ptr.baseAddress, ptr.count)
}
defer { gibberish_free_string(result) }
return String(cString: result!)
}
}
Building for iOS
# Add iOS targets
rustup target add aarch64-apple-ios
# Build static library
cargo build --release --target aarch64-apple-ios
# The library will be at:
# target/aarch64-apple-ios/release/libgibberish.a
Using UniFFI (Recommended)
For production FFI, use UniFFI to auto-generate bindings:
# Cargo.toml
[dependencies]
uniffi = "0.25"
[build-dependencies]
uniffi = { version = "0.25", features = ["build"] }
// src/lib.rs
#[uniffi::export]
pub fn transcribe(model_path: String, audio: Vec<f32>) -> String {
let engine = Engine::new(&model_path).unwrap();
engine.transcribe(&audio).unwrap().text
}
UniFFI generates Swift, Kotlin, Python, and Ruby bindings automatically.
Performance Considerations
When running headless:
- Thread management: You control threading, not Tauri
- Memory: No WebView overhead (~100MB savings)
- Startup: No UI initialization (~500ms faster)
For servers, consider:
- Connection pooling for engines (expensive to create)
- Request queuing during high load
- Graceful degradation when overloaded
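A minimal sketch of the pooling idea, since engines are expensive to construct (names and policy are illustrative):

```rust
// Reuse pre-built engines across requests instead of constructing per call.
// acquire() returning None is the signal to queue or shed load.
use std::sync::Mutex;

pub struct EnginePool {
    engines: Mutex<Vec<Box<dyn SttEngine>>>,
}

impl EnginePool {
    pub fn acquire(&self) -> Option<Box<dyn SttEngine>> {
        self.engines.lock().unwrap().pop()
    }
    pub fn release(&self, engine: Box<dyn SttEngine>) {
        self.engines.lock().unwrap().push(engine);
    }
}
```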
Implementation Details
This section documents implementation details that affect perceived responsiveness.
Simulated Streaming
Making batch models feel real-time.
Silence Injection
Prepending silence to prevent hallucinations.
Lock-Free Metrics
Using atomics for metrics instead of mutexes.
Threading Model
Why we use std::thread instead of tokio::spawn for inference.
Audio Hygiene
Resampling and AGC for consistent input quality.
Meeting Detection
Detecting when Zoom/Teams is running.
Simulated Streaming
Batch models are more accurate but have high latency. We decouple visual feedback from final transcription by running partial inference on growing audio buffers.
The “volatile” text shown during recording isn’t a trick—it’s a valid partial hypothesis based on audio heard so far. Human brains work similarly: we predict words before hearing them fully and revise as needed.
The Problem
| Model Type | Accuracy | Latency | Feel |
|---|---|---|---|
| Streaming | Good | ~50ms | Live, responsive |
| Batch | Excellent | ~2000ms | Sluggish, frustrating |
Users are impatient. A 2-second delay feels broken. But batch models are significantly more accurate, especially for:
- Proper nouns (“Kubernetes” vs “Cooper Netties”)
- Rare words
- Accented speech
The Solution
Run partial inference on the growing audio buffer every 500ms.
Time Buffer Display State
─────────────────────────────────────────────────────
0ms [] (empty) waiting
200ms [audio...] (empty) buffering
500ms [audio......] "The quick" volatile
1000ms [audio..........] "The quick brown" volatile
1200ms (pause detected) "The quick brown" processing
1400ms (inference done) "The quick brown fox." stable ✓
Implementation
pub struct SimulatedStreamer {
buffer: Vec<f32>,
engine: Box<dyn SttEngine>,
partial_interval: Duration,
last_partial: Instant,
}
impl SimulatedStreamer {
pub fn push_audio(&mut self, chunk: &[f32]) -> Option<PartialResult> {
self.buffer.extend_from_slice(chunk);
// Emit partial every 500ms
if self.last_partial.elapsed() >= self.partial_interval {
self.last_partial = Instant::now();
let result = self.engine.transcribe(&self.buffer).ok()?;
return Some(PartialResult {
text: result.text,
is_final: false,
});
}
None
}
pub fn commit(&mut self) -> FinalResult {
// Run final inference on complete buffer
let result = self.engine.transcribe(&self.buffer).unwrap();
// Clear for next utterance
self.buffer.clear();
FinalResult {
text: result.text,
is_final: true,
}
}
}
UX: Volatile vs Stable Text
We visually distinguish draft from final:
// Frontend
function TranscriptLine({ segment }: { segment: Segment }) {
return (
<span className={segment.is_final ? 'text-white' : 'text-gray-500'}>
{segment.text}
</span>
);
}
- Volatile (gray): Partial hypothesis, may be revised
- Stable (white): Final transcription
Edge Cases
Partial Overwrites
Each partial replaces the previous:
Partial 1: "The quick"
Partial 2: "The quick brown" // Replaces partial 1
Partial 3: "The quick brown fox" // Replaces partial 2
Final: "The quick brown fox." // Replaces partial 3
Long Utterances
For very long speech (>30s), we chunk the buffer:
const MAX_BUFFER_SECONDS: usize = 30;
if self.buffer.len() > MAX_BUFFER_SECONDS * 16000 {
// Force commit and start fresh
self.commit();
}
Rapid Corrections
If the user speaks, pauses briefly, then continues, we may commit prematurely. Smart Turn detection helps, but isn’t perfect. We accept occasional mis-commits in exchange for responsiveness.
Performance
| Metric | Pure Batch | Simulated Streaming |
|---|---|---|
| Perceived latency | 2000ms | 500ms |
| Accuracy | Excellent | Excellent (same model) |
| CPU usage | Lower | Higher (repeated inference) |
The CPU trade-off is worth it for UX.
When NOT to Use
Simulated streaming adds overhead. Skip it when:
- Processing pre-recorded files (no need for real-time feel)
- Running on low-end hardware (CPU budget matters)
- Accuracy is more important than speed (archival use case)
Silence Injection
The “clear throat” hack that prevents hallucinations.
The Problem
Streaming decoders maintain internal state. When speech ends, this state can get “stuck” in a loop:
User says: "Hello world"
User stops: (silence)
Model outputs: "Hello world. Thank you. Thank you. Thank you..."
The model is hallucinating. It expects more input and fills the gap with plausible-sounding garbage.
Why It Happens
Transducer models have a “joiner” network that predicts the next token based on:
- Acoustic features (from audio)
- Previous predictions (from decoder state)
During silence, acoustic features are near-zero, but the decoder state still has momentum from the previous words. The model “invents” continuations.
The Solution
Explicitly feed silence into the decoder to reset its state:
const SILENCE_DURATION_MS: usize = 100;
const SILENCE_SAMPLES: usize = SILENCE_DURATION_MS * 16; // 16 samples/ms at 16kHz
pub fn inject_silence(&mut self) {
let silence = vec![0.0f32; SILENCE_SAMPLES];
self.recognizer.accept_waveform(&silence);
// Force decoder to flush
self.recognizer.input_finished();
}
When to Inject
Trigger silence injection when:
- VAD detects speech-end (transition from speech to silence)
- A configurable grace period has passed (e.g., 300ms)
- Before requesting final output
impl StreamingTranscriber {
pub fn on_vad_speech_end(&mut self) {
// Wait for Smart Turn confirmation
if self.smart_turn.is_likely_complete() {
self.inject_silence();
let final_text = self.recognizer.get_result();
self.emit_commit(final_text);
self.reset_state();
}
}
}
The “Digital Silence”
We inject zeros, not actual recorded silence. Why?
| Type | Contents | Effect |
|---|---|---|
| Recorded silence | Room noise, hum | Model might hear “words” in noise |
| Digital silence | Pure zeros | Unambiguous “nothing to hear” |
// Good: Pure digital silence
let silence = vec![0.0f32; 1600];
// Bad: Recorded silence (might contain noise)
let silence = record_ambient_audio(100);
How Much Silence?
We experimentally tuned to 100ms:
| Duration | Effect |
|---|---|
| 50ms | Sometimes not enough to reset |
| 100ms | Reliable reset, minimal delay |
| 200ms | Works but adds unnecessary latency |
const SILENCE_MS: usize = 100;
Interaction with Smart Turn
Silence injection happens after Smart Turn confirms completion:
graph TD
A[VAD: Silence Detected] --> B{Smart Turn?}
B -->|P(End) < 0.5| C[Keep Listening]
B -->|P(End) >= 0.5| D[Inject Silence]
D --> E[Get Final Result]
E --> F[Emit Commit]
F --> G[Reset State]
If we inject too early, we cut off the user mid-sentence.
Code
// crates/sherpa/src/streaming.rs
impl StreamingRecognizer {
pub fn end_utterance(&mut self) -> String {
// Inject silence to flush decoder
let silence = vec![0.0f32; 1600]; // 100ms
self.accept_waveform(&silence);
// Mark input as finished
self.input_finished();
// Get final result
let result = self.final_result();
// Reset for next utterance
self.reset();
result
}
}
Without Silence Injection
Input: "The quick brown fox"
Output: "The quick brown fox jumps over the lazy dog thank you thank you"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hallucination
With Silence Injection
Input: "The quick brown fox"
Output: "The quick brown fox"
(clean end)
The difference is dramatic for user experience.
Atomic Observability
The audio pipeline updates metrics frequently. Using mutexes would cause contention between the audio thread and UI thread, so we use atomic types instead.
Data Structure
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
pub struct PipelineStatus {
pub audio_lag_ms: AtomicI64,
pub inference_time_ms: AtomicU64,
pub dropped_chunks: AtomicU64,
pub total_chunks: AtomicU64,
}
Atomic operations compile to single CPU instructions and don’t block.
Implementation
Writing (Audio Thread)
impl PipelineStatus {
pub fn update_lag(&self, lag_ms: i64) {
self.audio_lag_ms.store(lag_ms, Ordering::Relaxed);
}
pub fn record_inference(&self, time_ms: u64) {
self.inference_time_ms.store(time_ms, Ordering::Relaxed);
}
pub fn increment_dropped(&self) {
self.dropped_chunks.fetch_add(1, Ordering::Relaxed);
}
}
Reading (UI Thread)
impl PipelineStatus {
pub fn snapshot(&self) -> MetricsSnapshot {
MetricsSnapshot {
audio_lag_ms: self.audio_lag_ms.load(Ordering::Relaxed),
inference_time_ms: self.inference_time_ms.load(Ordering::Relaxed),
dropped_chunks: self.dropped_chunks.load(Ordering::Relaxed),
total_chunks: self.total_chunks.load(Ordering::Relaxed),
}
}
}
Memory Ordering
We use Ordering::Relaxed because:
- We don’t need synchronization between different metrics
- We only care about “eventually consistent” values
- It’s the fastest ordering
For metrics dashboards, slightly stale data is acceptable.
Sharing Across Threads
use std::sync::Arc;
// Create shared status
let status = Arc::new(PipelineStatus::default());
// Clone for audio thread
let audio_status = Arc::clone(&status);
std::thread::spawn(move || {
loop {
// Update metrics without blocking
audio_status.update_lag(compute_lag());
}
});
// Clone for UI polling
let ui_status = Arc::clone(&status);
tokio::spawn(async move {
loop {
let snapshot = ui_status.snapshot();
emit_metrics(&snapshot);
tokio::time::sleep(Duration::from_millis(100)).await;
}
});
What We Track
| Metric | Type | Meaning |
|---|---|---|
| audio_lag_ms | i64 | Time since audio was captured |
| inference_time_ms | u64 | Last model execution time |
| dropped_chunks | u64 | Backpressure indicator |
| total_chunks | u64 | For calculating drop rate |
Derived Metrics
impl MetricsSnapshot {
pub fn drop_rate(&self) -> f64 {
if self.total_chunks == 0 {
0.0
} else {
self.dropped_chunks as f64 / self.total_chunks as f64
}
}
pub fn real_time_factor(&self) -> f64 {
// RTF < 1.0 means faster than real-time
self.inference_time_ms as f64 / 1000.0 / CHUNK_DURATION_SECONDS
}
}
UI Display
function MetricsDisplay() {
const [metrics, setMetrics] = useState<Metrics | null>(null);
useEffect(() => {
const unlisten = listen<Metrics>('metrics:update', (event) => {
setMetrics(event.payload);
});
return () => { unlisten.then(f => f()); };
}, []);
if (!metrics) return null;
return (
<div className="text-xs text-gray-500">
Latency: {metrics.audio_lag_ms}ms |
RTF: {metrics.real_time_factor.toFixed(2)} |
Drops: {(metrics.drop_rate * 100).toFixed(1)}%
</div>
);
}
Debugging Tip
When logging metrics, take a snapshot first rather than loading individual atomics separately:
let snap = status.snapshot();
debug!("Metrics: {:?}", snap);
Threading Model
We use std::thread for inference, not tokio::spawn. Here’s why.
The Mistake We Made
Our first version looked like this:
// DON'T DO THIS
tokio::spawn(async move {
loop {
let chunk = rx.recv().await?;
let result = engine.transcribe(&chunk.samples)?; // BLOCKS FOR 100ms!
app.emit("stt:update", &result)?;
}
});
This worked… until it didn’t. Under load, the UI froze. Audio dropped. Everything felt sluggish.
The Problem
ONNX Runtime inference is CPU-bound and blocking. A single transcribe() call might take 50-200ms of pure CPU work.
Tokio’s async runtime assumes tasks yield frequently. When a task blocks for 100ms, it starves other tasks:
Task A: transcribe() ──────────────────────────────────▶ done
Task B: (waiting for audio) .......................... (finally runs)
Task C: (waiting for UI event) ........................ (finally runs)
▲
100ms of nothing happening
The Tokio docs explicitly warn about this.
The Fix
Move blocking work to dedicated OS threads:
// Dedicated thread for inference
std::thread::spawn(move || {
loop {
// Block here - it's fine, we're on our own thread
let chunk = rx.blocking_recv().unwrap();
let result = engine.transcribe(&chunk.samples).unwrap();
// Send result back to async world
result_tx.blocking_send(result).unwrap();
}
});
// Async task just forwards results
tokio::spawn(async move {
while let Some(result) = result_rx.recv().await {
app.emit("stt:update", &result)?;
}
});
Thread Allocation
| Thread | Purpose | Priority |
|---|---|---|
| Main | Tauri/UI event loop | Normal |
| Audio | cpal callback | High (OS-managed) |
| STT | ONNX inference | Normal |
| VAD | Silero inference | Normal |
We don’t set thread priorities manually—the OS scheduler handles it well enough for our needs.
Why Not spawn_blocking?
Tokio provides spawn_blocking() for blocking tasks:
tokio::task::spawn_blocking(move || {
engine.transcribe(&samples)
}).await?
This works, but:
- May spawn a new thread per call if no pooled thread is idle (overhead)
- Limited by max_blocking_threads (defaults to 512)
- Threads are pooled but not reused predictably
For a continuous stream of inference calls, a dedicated thread is simpler and more predictable.
Channel Selection
We need channels that bridge sync and async:
// Option 1: tokio::sync::mpsc (what we use)
let (tx, mut rx) = tokio::sync::mpsc::channel(100);
// tx.blocking_send() from sync thread
// rx.recv().await from async task
// Option 2: crossbeam + tokio wrapper
// More complex, no real benefit for our use case
Memory Considerations
Each thread has its own stack (default 2MB on macOS). With 4 threads:
- Audio thread: ~2MB
- STT thread: ~2MB + model memory
- VAD thread: ~2MB + model memory
- Main thread: ~2MB
The model memory dominates. Thread stacks are negligible.
Debugging
Thread bugs are subtle. Tools that help:
# See thread count
ps -M <pid>
# Profile with Instruments
xcrun xctrace record --template "Time Profiler" --launch ./gibberish
# Logging (add to Cargo.toml)
# tracing = "0.1"
# tracing-subscriber = "0.3"
Error Handling
Threads don’t propagate panics to the main thread. Handle errors explicitly:
std::thread::spawn(move || {
let result = std::panic::catch_unwind(|| {
// Inference loop
});
if let Err(e) = result {
eprintln!("STT thread panicked: {:?}", e);
// Notify main thread via channel
error_tx.send(SttError::ThreadPanic).ok();
}
});
Code Reference
The actual implementation lives in plugins/stt-worker/src/worker.rs.
Audio Hygiene
Bad microphones shouldn’t mean bad transcripts. We fix what we can.
The Problems
Consumer microphones vary wildly:
- Built-in laptop mics pick up fan noise
- USB mics have different gain settings
- Sample rates range from 8kHz to 96kHz
- Some mics clip, others are too quiet
Models expect clean, consistent 16kHz audio. We bridge the gap.
Resampling
All models need 16kHz mono audio. Users have everything else.
Why Sinc Interpolation?
use rubato::{FftFixedIn, Resampler};

let resampler = FftFixedIn::<f32>::new(
    input_rate, // e.g., 44100
    16000,      // target
    chunk_size,
    2,          // sub-chunks
    1,          // channels
)?;
We use rubato’s FFT-based sinc resampling. Alternatives:
| Method | Quality | Speed | Our Use |
|---|---|---|---|
| Nearest neighbor | Terrible | Fast | Never |
| Linear | Poor | Fast | Never |
| Sinc (rubato) | Excellent | Medium | Yes |
Linear interpolation creates aliasing artifacts that sound “robotic.” Speech recognition models weren’t trained on robotic audio—they perform worse.
The CPU cost of proper resampling is negligible compared to inference.
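Feeding the resampler looks roughly like this. A sketch, assuming recent rubato's `Resampler` trait (`input_frames_next` / `process`); `mono` is a hypothetical input buffer, not a name from our codebase:

let mut resampler = FftFixedIn::<f32>::new(44100, 16000, 1024, 2, 1)?;
let needed = resampler.input_frames_next();       // frames required this call
// rubato takes one slice per channel; we're mono, so one inner slice
let out = resampler.process(&[&mono[..needed]], None)?;
let samples_16k: &[f32] = &out[0];                // ready for the model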
Automatic Gain Control
The Problem
User A (quiet voice): ▁▁▂▁▁▂▁ (signal barely visible)
User B (loud voice): ▇▇█▇▇█▇ (signal clipping)
Model expects: ▃▄▅▄▃▅▄ (normalized range)
Our Solution
Soft-knee compression with tanh:
const TARGET_DB: f32 = -20.0;
const ATTACK_MS: f32 = 10.0;
const RELEASE_MS: f32 = 100.0;

pub struct Agc {
    gain: f32,
    target_rms: f32,
    smoothing: f32, // 0..1, derived from the attack/release times above
}

impl Agc {
    pub fn process(&mut self, samples: &mut [f32]) {
        let rms = calculate_rms(samples);
        let target_gain = self.target_rms / rms.max(1e-10);
        // Smooth gain changes to avoid clicks
        self.gain = lerp(self.gain, target_gain, self.smoothing);
        // Apply gain with soft clipping
        for sample in samples.iter_mut() {
            *sample = (*sample * self.gain).tanh();
        }
    }
}

// Root-mean-square level of a chunk
fn calculate_rms(samples: &[f32]) -> f32 {
    (samples.iter().map(|s| s * s).sum::<f32>() / samples.len().max(1) as f32).sqrt()
}

// Linear interpolation between the current and target gain
fn lerp(a: f32, b: f32, t: f32) -> f32 {
    a + (b - a) * t
}
The tanh function provides soft clipping—instead of hard clipping at ±1.0 (which sounds harsh), it smoothly compresses peaks.
Target Level
We target -20 dBFS. Why?
- Leaves headroom for peaks
- Matches typical model training data
- Consistent across different mic gains
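The constant maps to the linear `target_rms` used in the AGC above via 10^(dB/20):

// -20 dBFS as a linear RMS target
let target_rms = 10f32.powf(TARGET_DB / 20.0); // 10^(-1) = 0.1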
DC Offset Removal
Some cheap mics have DC offset—the signal “floats” above or below zero:
Bad: ▄▅▆▅▄▅▆▅▄▅ (offset from zero)
Good: ▃▄▅▄▃▄▅▄▃▄ (centered on zero)
We use a simple high-pass filter:
use std::f32::consts::PI;

const CUTOFF_HZ: f32 = 20.0; // Remove everything below 20Hz

pub fn remove_dc(samples: &mut [f32], state: &mut f32) {
    // One-pole DC blocker: alpha ≈ 0.992 for a 20Hz cutoff at 16kHz
    let alpha = 1.0 - (2.0 * PI * CUTOFF_HZ / 16000.0);
    for sample in samples.iter_mut() {
        let new_state = *sample + alpha * *state;
        *sample = new_state - *state;
        *state = new_state;
    }
}
Noise Gate
We don’t use one. Here’s why:
Noise gates cut audio below a threshold. In theory, they reduce background noise. In practice:
- They clip word beginnings (“hello” → “ello”)
- Silero VAD already handles speech detection
- Models are trained on noisy data and handle it fine
If the environment is so noisy that VAD triggers incorrectly, a noise gate won’t help—the user needs a better mic or quieter room.
Preprocessing Pipeline
Audio flows through these stages in order:
Mic → DC Remove → Resample → AGC → Model
Each stage is independent and nearly stateless (the AGC keeps its smoothed gain; the DC filter keeps one sample of state).
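Wired together, the pipeline is a few lines. A sketch under assumed names (`AudioPipeline` and `anyhow::Result` are ours for illustration, not the actual types in `crates/audio/src/stream.rs`):

pub struct AudioPipeline {
    dc_state: f32,
    resampler: FftFixedIn<f32>,
    agc: Agc,
}

impl AudioPipeline {
    // Assumes mic_chunk matches the resampler's expected chunk size
    pub fn process(&mut self, mic_chunk: &mut [f32]) -> anyhow::Result<Vec<f32>> {
        remove_dc(mic_chunk, &mut self.dc_state);                     // 1. center on zero
        let mut out = self.resampler.process(&[&*mic_chunk], None)?;  // 2. to 16kHz
        let mut samples = out.remove(0);                              // mono: channel 0
        self.agc.process(&mut samples);                               // 3. normalize level
        Ok(samples)                                                   // 4. hand to the model
    }
}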
Testing
We keep a collection of “pathological” audio files:
- Recorded at 8kHz
- Heavy background noise
- Extreme clipping
- Strong DC offset
CI runs inference on these files. If accuracy drops, we investigate.
Code
- Resampling: `crates/audio/src/resample.rs`
- AGC: `crates/audio/src/agc.rs`
- Pipeline: `crates/audio/src/stream.rs`
The Context Engine
gibb.eri.sh knows what you’re doing. Here’s how.
The Goal
To enable Context-Aware AI, we need to know the user’s state without burning the CPU.
- Are they coding? (Enable Git tools)
- Are they in a meeting? (Enable Transcription tools)
- Are they looking at a specific URL? (Provide deep context)
The Implementation
We use a high-frequency polling loop in `crates/context` that builds a real-time snapshot of the OS state.
1. Active App Detection (Native Cocoa)
We use the macOS Cocoa API (NSWorkspace) via the objc crate to detect the focused application.
Why Native instead of AppleScript?
- Performance: Sub-millisecond execution. No subprocess fork/exec overhead.
- Efficiency: Negligible CPU usage even at 1s polling intervals.
- Reliability: Directly queries the Window Server for the `frontmostApplication`.
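The query itself is a couple of Objective-C messages. A sketch assuming the `objc` crate's `msg_send!` macros; the real code lives in `crates/context`:

use objc::{class, msg_send, sel, sel_impl};
use objc::runtime::Object;
use std::ffi::CStr;

pub fn frontmost_bundle_id() -> Option<String> {
    unsafe {
        // [NSWorkspace sharedWorkspace].frontmostApplication.bundleIdentifier
        let workspace: *mut Object = msg_send![class!(NSWorkspace), sharedWorkspace];
        let app: *mut Object = msg_send![workspace, frontmostApplication];
        if app.is_null() { return None; }
        let ns_str: *mut Object = msg_send![app, bundleIdentifier];
        if ns_str.is_null() { return None; }
        let utf8: *const std::os::raw::c_char = msg_send![ns_str, UTF8String];
        Some(CStr::from_ptr(utf8).to_string_lossy().into_owned())
    }
}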
2. Browser Deep Context (URL Detection)
When a supported browser (Chrome, Safari, Arc, Brave) is focused, we go deeper.
- Mechanism: We use a targeted AppleScript call to fetch the `URL` of the active tab.
- Optimization: We only trigger the AppleScript if the active application is a browser, preventing unnecessary overhead.
- Value: This allows “Summarize this page” to work by feeding the URL directly to our extraction tools.
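The AppleScript is one line per browser. A sketch shelling out to `osascript` with Chrome's dialect (Safari phrases it differently, and the function name is ours):

use std::process::Command;

fn chrome_active_url() -> Option<String> {
    let out = Command::new("osascript")
        .arg("-e")
        .arg(r#"tell application "Google Chrome" to get URL of active tab of front window"#)
        .output()
        .ok()?;
    if !out.status.success() { return None; }
    Some(String::from_utf8_lossy(&out.stdout).trim().to_string())
}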
3. Meeting Detection (The Activity)
We monitor CoreAudio to see if known meeting apps (Zoom, Teams) are accessing the microphone.
- Crate: `crates/detect` (wrapped by `context`)
- Logic: `is_mic_active && is_meeting_app(bundle_id)`
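The check reduces to a bundle-ID allowlist. A sketch; the IDs shown are illustrative, and the full list lives in `crates/detect`:

// Illustrative bundle IDs, not the complete set we ship
const MEETING_APPS: &[&str] = &["us.zoom.xos", "com.microsoft.teams2"];

fn is_meeting_app(bundle_id: &str) -> bool {
    MEETING_APPS.contains(&bundle_id)
}

fn in_meeting(is_mic_active: bool, bundle_id: &str) -> bool {
    is_mic_active && is_meeting_app(bundle_id)
}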
Privacy
- Local Only: No context data leaves the device.
- Targeted: We only care about specific `bundle_id`s. We don't read window titles or keystrokes.
- Incognito Awareness: We attempt to detect and ignore private browsing windows to avoid leaking sensitive URLs into the LLM context.
Clean Architecture
We use Dependency Inversion to keep the codebase maintainable. Here’s the pattern.
The Problem We Avoided
Imagine adding a new STT engine:
// BAD: Direct dependencies everywhere
match config.engine {
    Engine::Sherpa => sherpa::transcribe(&audio),
    Engine::Parakeet => parakeet::transcribe(&audio),
    Engine::NewEngine => new_engine::transcribe(&audio), // ADD HERE
}
// ... and here, and here, and here
Every new engine means touching multiple files. Tests break. Things get coupled.
The Solution: Trait-Based Abstraction
Define a trait. Implement it. Inject the implementation.
The SttEngine Trait
// crates/stt/src/engine.rs
pub trait SttEngine: Send + Sync {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>>;
    fn is_streaming_capable(&self) -> bool;
    fn model_name(&self) -> &str;
    fn supported_languages(&self) -> Vec<&'static str>;
}
Implementations
// crates/sherpa/src/zipformer.rs
impl SttEngine for ZipformerEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Sherpa-specific implementation
    }
    // ...
}

// crates/parakeet/src/lib.rs
impl SttEngine for ParakeetEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Parakeet-specific implementation
    }
    // ...
}
Usage
The application layer never knows which engine it’s using:
// crates/application/src/transcriber.rs
pub struct Transcriber {
    engine: Box<dyn SttEngine>,
}

impl Transcriber {
    pub fn new(engine: Box<dyn SttEngine>) -> Self {
        Self { engine }
    }

    pub fn process(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        self.engine.transcribe(audio)
    }
}
The Factory Pattern
How do we create the right engine at runtime?
// crates/stt/src/loader.rs
pub trait EngineLoader: Send + Sync {
    fn name(&self) -> &str;
    fn can_load(&self, model_id: &str) -> bool;
    fn load(&self, model_path: &Path) -> Result<Box<dyn SttEngine>>;
}

// Usage
pub fn create_engine(
    loaders: &[Box<dyn EngineLoader>],
    model_id: &str,
    path: &Path,
) -> Result<Box<dyn SttEngine>> {
    for loader in loaders {
        if loader.can_load(model_id) {
            return loader.load(path);
        }
    }
    Err(Error::UnknownModel(model_id.to_string()))
}
Adding a New Engine
Adding WhisperTurbo requires:
- Create `crates/whisper-turbo/`
- Implement `SttEngine`
- Implement `EngineLoader`
- Register the loader at startup
No changes to crates/application/. No changes to existing engines. No changes to the UI.
// crates/whisper-turbo/src/lib.rs
pub struct WhisperTurboEngine { /* ... */ }

impl SttEngine for WhisperTurboEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Vec<Segment>> {
        // Implementation
    }
    // ...
}

pub struct WhisperTurboLoader;

impl EngineLoader for WhisperTurboLoader {
    fn name(&self) -> &str { "whisper-turbo" }
    fn can_load(&self, model_id: &str) -> bool {
        model_id.starts_with("whisper-turbo")
    }
    fn load(&self, path: &Path) -> Result<Box<dyn SttEngine>> {
        Ok(Box::new(WhisperTurboEngine::new(path)?))
    }
}
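Registration at startup is then the only touch point. A sketch, matching the `create_engine` signature above; `SherpaLoader` and `ParakeetLoader` are illustrative names:

let loaders: Vec<Box<dyn EngineLoader>> = vec![
    Box::new(SherpaLoader),
    Box::new(ParakeetLoader),
    Box::new(WhisperTurboLoader), // the one new line
];
let engine = create_engine(&loaders, &config.model_id, &config.model_path)?;
let transcriber = Transcriber::new(engine);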
The Dependency Graph
┌─────────────────┐
│ application │
│ (orchestration)│
└────────┬────────┘
│ depends on trait
▼
┌─────────────────┐
│ stt │
│ (SttEngine) │
└────────┬────────┘
│ implemented by
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ sherpa │ │ parakeet │ │ whisper │
└──────────┘ └──────────┘ └──────────┘
application never imports sherpa, parakeet, or whisper directly. It only knows SttEngine.
Testing
Trait-based design makes testing simple:
struct MockEngine {
    response: Vec<Segment>,
}

impl SttEngine for MockEngine {
    fn transcribe(&self, _audio: &[f32]) -> Result<Vec<Segment>> {
        Ok(self.response.clone())
    }
    // ...
}

#[test]
fn test_transcriber() {
    let mock = MockEngine {
        response: vec![Segment { text: "hello".into(), ..Default::default() }],
    };
    let transcriber = Transcriber::new(Box::new(mock));
    let result = transcriber.process(&[0.0; 1600]).unwrap();
    assert_eq!(result[0].text, "hello");
}
No model files needed. No inference overhead. Fast tests.
Other Traits
The same pattern applies elsewhere:
| Trait | Location | Implementations |
|---|---|---|
| `SttEngine` | `crates/stt` | Sherpa, Parakeet |
| `VoiceActivityDetector` | `crates/vad` | Silero |
| `TurnDetector` | `crates/turn` | SmartTurn, Simple |
| `SessionStorage` | `crates/storage` | SQLite |
The Trade-Off
Trait objects have runtime cost:
- Dynamic dispatch (vtable lookup)
- Can’t be inlined
For inference (already 50-200ms), this overhead is negligible. We measured <1μs per trait call.
If performance mattered here, we’d use generics:
// Generic (faster, but less flexible) - renamed here so the two
// alternatives can sit side by side
pub struct GenericTranscriber<E: SttEngine> {
    engine: E,
}

// Trait object (our choice: flexible, dynamic)
pub struct Transcriber {
    engine: Box<dyn SttEngine>,
}
We chose flexibility. Runtime engine switching is worth the microseconds.