Hybrid Inference Engine

gibb.eri.sh supports two modes of operation, selectable at runtime. Each has trade-offs.

Streaming Mode (Sherpa-ONNX Zipformer)

Best for: Dictation, live captioning, instant feedback

Audio ─▶ [Transducer] ─▶ Partial results every ~50ms

How It Works

The Zipformer model uses a transducer architecture:

Processes audio in small chunks (10-20ms)
Maintains internal state between chunks
Emits partial hypotheses continuously
Refines predictions as context grows

Characteristics

Aspect	Value
Latency	~50ms per update
Accuracy	Good (may miss proper nouns)
Languages	English (primary)
Model Size	~100MB

When Streaming Struggles

Proper nouns: “Kubernetes” might become “Cooper Netties”
Rare words: Technical jargon may be misheard
Accents: Less training data for non-standard speech

Batch Mode (Parakeet / Whisper)

Best for: Meetings, archival, accuracy-critical tasks

Audio ─▶ [VAD Buffer] ─▶ [Encoder-Decoder] ─▶ Final text on pause

How It Works

Batch models see the entire utterance before producing output:

VAD detects speech boundaries
Audio is buffered during speech
Model processes the complete segment
Result is highly accurate

Characteristics

Aspect	Value
Latency	~500ms after speech ends
Accuracy	Excellent
Languages	99+ (Whisper)
Model Size	~500MB

Simulated Streaming

Users want batch accuracy with streaming feel. We fake it:

Run partial inference every 500ms on the growing buffer
Display “volatile” text (gray, may change)
On VAD trigger, run final inference
Replace volatile text with “stable” text (black, final)

Speaking: "The quick brown fox"

Time 0ms:   [           ] (buffering)
Time 500ms: [The quick  ] volatile
Time 1000ms:[The quick brown] volatile
Time 1200ms: (pause detected)
Time 1400ms:[The quick brown fox.] stable ✓

Switching Modes

Users can switch modes at runtime via the Settings sheet:

// Frontend
await invoke('plugin:stt|set_mode', { mode: 'streaming' });
await invoke('plugin:stt|set_mode', { mode: 'batch' });

The backend handles the transition gracefully, draining any buffered audio.

Model Recommendations

We’ve tested many models. Here are our picks:

Use Case	Recommended Model
General dictation	Sherpa Zipformer (streaming)
Meetings	Whisper Small (batch)
Non-English	Whisper Small (batch)
Low-end hardware	Sherpa Zipformer (streaming)

Keyboard shortcuts

Gibberish Documentation