
Simulated Streaming

Batch models are more accurate but have high latency. We decouple visual feedback from final transcription by running partial inference on growing audio buffers.

The “volatile” text shown during recording isn’t a trick—it’s a valid partial hypothesis based on audio heard so far. Human brains work similarly: we predict words before hearing them fully and revise as needed.

The Problem

Model Type    Accuracy     Latency    Feel
─────────────────────────────────────────────────────
Streaming     Good         ~50ms      Live, responsive
Batch         Excellent    ~2000ms    Sluggish, frustrating

Users are impatient. A 2-second delay feels broken. But batch models are significantly more accurate, especially for:

  • Proper nouns (“Kubernetes” vs “Cooper Netties”)
  • Rare words
  • Accented speech

The Solution

Run partial inference on the growing audio buffer every 500ms.

Time    Buffer              Display           State
─────────────────────────────────────────────────────
0ms     []                  (empty)           waiting
200ms   [audio...]          (empty)           buffering
500ms   [audio......]       "The quick"       volatile
1000ms  [audio..........]   "The quick brown" volatile
1200ms  (pause detected)    "The quick brown" processing
1400ms  (inference done)    "The quick brown fox." stable ✓

Implementation

use std::time::{Duration, Instant};

pub struct SimulatedStreamer {
    buffer: Vec<f32>,
    engine: Box<dyn SttEngine>,
    partial_interval: Duration,
    last_partial: Instant,
}

impl SimulatedStreamer {
    pub fn push_audio(&mut self, chunk: &[f32]) -> Option<PartialResult> {
        self.buffer.extend_from_slice(chunk);

        // Emit a partial every 500ms
        if self.last_partial.elapsed() >= self.partial_interval {
            self.last_partial = Instant::now();

            let result = self.engine.transcribe(&self.buffer).ok()?;
            return Some(PartialResult {
                text: result.text,
                is_final: false,
            });
        }

        None
    }

    pub fn commit(&mut self) -> FinalResult {
        // Run final inference on the complete buffer
        let result = self
            .engine
            .transcribe(&self.buffer)
            .expect("final transcription failed");

        // Clear for the next utterance
        self.buffer.clear();

        FinalResult {
            text: result.text,
            is_final: true,
        }
    }
}
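A minimal usage sketch of the pattern, self-contained so it compiles on its own: `MockEngine` is a hypothetical stand-in for a real model (it emits one word per half-second of 16 kHz audio), and `PartialResult`/`FinalResult` are collapsed to plain strings for brevity. A zero interval makes every push emit a partial, which a real configuration would not do.

```rust
use std::time::{Duration, Instant};

struct Transcription { text: String }

trait SttEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Transcription, ()>;
}

// Mock engine (assumption, not the real model): one word per 8000 samples,
// i.e. per 0.5 s at 16 kHz, so longer buffers yield longer hypotheses.
struct MockEngine;
impl SttEngine for MockEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Transcription, ()> {
        let words = ["The", "quick", "brown", "fox"];
        let n = (audio.len() / 8000).min(words.len());
        Ok(Transcription { text: words[..n].join(" ") })
    }
}

struct SimulatedStreamer {
    buffer: Vec<f32>,
    engine: Box<dyn SttEngine>,
    partial_interval: Duration,
    last_partial: Instant,
}

impl SimulatedStreamer {
    fn push_audio(&mut self, chunk: &[f32]) -> Option<String> {
        self.buffer.extend_from_slice(chunk);
        if self.last_partial.elapsed() >= self.partial_interval {
            self.last_partial = Instant::now();
            return self.engine.transcribe(&self.buffer).ok().map(|t| t.text);
        }
        None
    }

    fn commit(&mut self) -> String {
        let text = self.engine.transcribe(&self.buffer).unwrap().text;
        self.buffer.clear();
        text
    }
}

fn main() {
    let mut s = SimulatedStreamer {
        buffer: Vec::new(),
        engine: Box::new(MockEngine),
        partial_interval: Duration::ZERO, // demo only: partial on every push
        last_partial: Instant::now(),
    };
    let chunk = vec![0.0f32; 8000]; // 0.5 s of audio at 16 kHz
    for _ in 0..3 {
        if let Some(partial) = s.push_audio(&chunk) {
            println!("partial: {partial}"); // grows each time
        }
    }
    println!("final: {}", s.commit());
}
```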

UX: Volatile vs Stable Text

We visually distinguish draft from final:

// Frontend
function TranscriptLine({ segment }: { segment: Segment }) {
    return (
        <span className={segment.is_final ? 'text-white' : 'text-gray-500'}>
            {segment.text}
        </span>
    );
}

  • Volatile (gray): Partial hypothesis, may be revised
  • Stable (white): Final transcription

Edge Cases

Partial Overwrites

Each partial replaces the previous:

Partial 1: "The quick"
Partial 2: "The quick brown"      // Replaces partial 1
Partial 3: "The quick brown fox"  // Replaces partial 2
Final:     "The quick brown fox." // Replaces partial 3
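The overwrite semantics can be sketched as a tiny transcript model (names here are illustrative, not the project's API): each partial replaces the volatile text wholesale, and a final result promotes it to stable.

```rust
// Sketch: a transcript with a stable (committed) part and a volatile
// (draft) part. Partials overwrite the volatile text; finals lock it in.
#[derive(Debug, PartialEq)]
struct Transcript {
    stable: String,   // final text (white in the UI)
    volatile: String, // current partial hypothesis (gray in the UI)
}

impl Transcript {
    fn new() -> Self {
        Transcript { stable: String::new(), volatile: String::new() }
    }

    // A new partial replaces the previous volatile text wholesale.
    fn apply_partial(&mut self, text: &str) {
        self.volatile = text.to_string();
    }

    // A final result is appended to the stable text and clears the draft.
    fn apply_final(&mut self, text: &str) {
        if !self.stable.is_empty() {
            self.stable.push(' ');
        }
        self.stable.push_str(text);
        self.volatile.clear();
    }
}

fn main() {
    let mut t = Transcript::new();
    t.apply_partial("The quick");
    t.apply_partial("The quick brown");    // replaces, does not append
    t.apply_final("The quick brown fox."); // locks in, clears the draft
    println!("{t:?}");
}
```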

Long Utterances

For very long speech (>30s), we chunk the buffer:

const SAMPLE_RATE: usize = 16_000; // 16 kHz mono
const MAX_BUFFER_SECONDS: usize = 30;

if self.buffer.len() > MAX_BUFFER_SECONDS * SAMPLE_RATE {
    // Force a commit and start fresh
    self.commit();
}

Rapid Corrections

If the user speaks, pauses briefly, then continues, we may commit prematurely. Smart Turn detection helps, but isn’t perfect. We accept occasional mis-commits in exchange for responsiveness.
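One simple mitigation, sketched here as an assumption rather than the project's Smart Turn logic, is a "hangover" gate: instead of committing on the first silent frame, require several consecutive silent frames, trading a little latency for fewer premature commits.

```rust
// Sketch: require N consecutive silent VAD frames before committing.
// Any speech frame resets the countdown. (Illustrative only; Smart Turn
// detection is more sophisticated than a fixed silence threshold.)
struct PauseGate {
    silent_frames: u32,
    threshold: u32, // e.g. 20 frames x 20 ms = 400 ms of silence
}

impl PauseGate {
    fn new(threshold: u32) -> Self {
        PauseGate { silent_frames: 0, threshold }
    }

    // Feed one VAD decision per frame; returns true when it's safe to commit.
    fn on_frame(&mut self, is_speech: bool) -> bool {
        if is_speech {
            self.silent_frames = 0; // speech resumed: brief pause, don't commit
            false
        } else {
            self.silent_frames += 1;
            self.silent_frames >= self.threshold
        }
    }
}

fn main() {
    let mut gate = PauseGate::new(3);
    // speech, speech, brief pause (2 frames), speech, long pause (3 frames)
    let frames = [true, true, false, false, true, false, false, false];
    for (i, &speech) in frames.iter().enumerate() {
        if gate.on_frame(speech) {
            println!("commit after frame {i}"); // only the long pause commits
        }
    }
}
```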

Performance

Metric               Pure Batch    Simulated Streaming
──────────────────────────────────────────────────────
Perceived latency    2000ms        500ms
Accuracy             Excellent     Excellent (same model)
CPU usage            Lower         Higher (repeated inference)

The CPU trade-off is worth it for UX.
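The "higher" CPU cost can be made concrete with a back-of-envelope calculation, assuming inference time scales roughly linearly with buffer length: each partial re-transcribes the whole prefix, so a 10-second utterance with partials every 0.5 s processes 0.5 + 1.0 + … + 9.5 seconds of audio in partials, plus a 10-second final pass.

```rust
// Rough model of repeated-inference cost: total seconds of audio the
// model processes across all partial passes plus the final pass.
// (Assumes linear scaling; real engines have fixed per-call overhead too.)
fn total_audio_processed(utterance_secs: f64, interval_secs: f64) -> f64 {
    let mut total = 0.0;
    let mut t = interval_secs;
    while t < utterance_secs {
        total += t; // partial pass over the prefix [0, t)
        t += interval_secs;
    }
    total + utterance_secs // final pass over the whole buffer
}

fn main() {
    // 10 s utterance, 0.5 s partial interval:
    // 95 s of partials + 10 s final = 105 s of audio processed,
    // roughly 10x the compute of a single batch pass.
    println!("{}", total_audio_processed(10.0, 0.5));
}
```

This quadratic growth in total work is also why the 30-second buffer cap above matters: it bounds how large each prefix can get.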

When NOT to Use

Simulated streaming adds overhead. Skip it when:

  • Processing pre-recorded files (no need for real-time feel)
  • Running on low-end hardware (CPU budget matters)
  • Accuracy is more important than speed (archival use case)