
Simulated Streaming

Batch models are more accurate but have high latency. We decouple visual feedback from final transcription by running partial inference on growing audio buffers.

The “volatile” text shown during recording isn’t a trick—it’s a valid partial hypothesis based on audio heard so far. Human brains work similarly: we predict words before hearing them fully and revise as needed.

The Problem

Model Type    Accuracy     Latency    Feel
─────────────────────────────────────────────────────
Streaming     Good         ~50ms      Live, responsive
Batch         Excellent    ~2000ms    Sluggish, frustrating

Users are impatient. A 2-second delay feels broken. But batch models are significantly more accurate, especially for:

  • Proper nouns (“Kubernetes” vs “Cooper Netties”)
  • Rare words
  • Accented speech

The Solution

Run partial inference on the growing audio buffer every 500ms.

Time    Buffer              Display           State
─────────────────────────────────────────────────────
0ms     []                  (empty)           waiting
200ms   [audio...]          (empty)           buffering
500ms   [audio......]       "The quick"       volatile
1000ms  [audio..........]   "The quick brown" volatile
1200ms  (pause detected)    "The quick brown" processing
1400ms  (inference done)    "The quick brown fox." stable ✓

Implementation

use std::time::{Duration, Instant};

pub struct SimulatedStreamer {
    buffer: Vec<f32>,
    engine: Box<dyn SttEngine>,
    partial_interval: Duration,
    last_partial: Instant,
}

impl SimulatedStreamer {
    pub fn push_audio(&mut self, chunk: &[f32]) -> Option<PartialResult> {
        self.buffer.extend_from_slice(chunk);

        // Emit a partial every 500ms
        if self.last_partial.elapsed() >= self.partial_interval {
            self.last_partial = Instant::now();

            let result = self.engine.transcribe(&self.buffer).ok()?;
            return Some(PartialResult {
                text: result.text,
                is_final: false,
            });
        }

        None
    }

    pub fn commit(&mut self) -> FinalResult {
        // Run final inference on the complete buffer
        let result = self
            .engine
            .transcribe(&self.buffer)
            .expect("final transcription failed");

        // Clear for the next utterance
        self.buffer.clear();

        FinalResult {
            text: result.text,
            is_final: true,
        }
    }
}
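A minimal usage sketch of the pattern, self-contained so it compiles on its own: `MockEngine` is a hypothetical stand-in for a real model (it emits one word per half-second of 16 kHz audio), and `PartialResult`/`FinalResult` are collapsed to plain strings for brevity. A zero interval makes every push emit a partial, which a real configuration would not do.

```rust
use std::time::{Duration, Instant};

struct Transcription { text: String }

trait SttEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Transcription, ()>;
}

// Mock engine (assumption, not the real model): one word per 8000 samples,
// i.e. per 0.5 s at 16 kHz, so longer buffers yield longer hypotheses.
struct MockEngine;
impl SttEngine for MockEngine {
    fn transcribe(&self, audio: &[f32]) -> Result<Transcription, ()> {
        let words = ["The", "quick", "brown", "fox"];
        let n = (audio.len() / 8000).min(words.len());
        Ok(Transcription { text: words[..n].join(" ") })
    }
}

struct SimulatedStreamer {
    buffer: Vec<f32>,
    engine: Box<dyn SttEngine>,
    partial_interval: Duration,
    last_partial: Instant,
}

impl SimulatedStreamer {
    fn push_audio(&mut self, chunk: &[f32]) -> Option<String> {
        self.buffer.extend_from_slice(chunk);
        if self.last_partial.elapsed() >= self.partial_interval {
            self.last_partial = Instant::now();
            return self.engine.transcribe(&self.buffer).ok().map(|t| t.text);
        }
        None
    }

    fn commit(&mut self) -> String {
        let text = self.engine.transcribe(&self.buffer).unwrap().text;
        self.buffer.clear();
        text
    }
}

fn main() {
    let mut s = SimulatedStreamer {
        buffer: Vec::new(),
        engine: Box::new(MockEngine),
        partial_interval: Duration::ZERO, // demo only: partial on every push
        last_partial: Instant::now(),
    };
    let chunk = vec![0.0f32; 8000]; // 0.5 s of audio at 16 kHz
    for _ in 0..3 {
        if let Some(partial) = s.push_audio(&chunk) {
            println!("partial: {partial}"); // grows each time
        }
    }
    println!("final: {}", s.commit());
}
```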

UX: Volatile vs Stable Text

We visually distinguish draft from final:

// Frontend
function TranscriptLine({ segment }: { segment: Segment }) {
    return (
        <span className={segment.is_final ? 'text-white' : 'text-gray-500'}>
            {segment.text}
        </span>
    );
}

  • Volatile (gray): Partial hypothesis, may be revised
  • Stable (white): Final transcription

Edge Cases

Partial Overwrites

Each partial replaces the previous:

Partial 1: "The quick"
Partial 2: "The quick brown"      // Replaces partial 1
Partial 3: "The quick brown fox"  // Replaces partial 2
Final:     "The quick brown fox." // Replaces partial 3
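The overwrite semantics can be sketched as a tiny transcript model (names here are illustrative, not the project's API): each partial replaces the volatile text wholesale, and a final result promotes it to stable.

```rust
// Sketch: a transcript with a stable (committed) part and a volatile
// (draft) part. Partials overwrite the volatile text; finals lock it in.
#[derive(Debug, PartialEq)]
struct Transcript {
    stable: String,   // final text (white in the UI)
    volatile: String, // current partial hypothesis (gray in the UI)
}

impl Transcript {
    fn new() -> Self {
        Transcript { stable: String::new(), volatile: String::new() }
    }

    // A new partial replaces the previous volatile text wholesale.
    fn apply_partial(&mut self, text: &str) {
        self.volatile = text.to_string();
    }

    // A final result is appended to the stable text and clears the draft.
    fn apply_final(&mut self, text: &str) {
        if !self.stable.is_empty() {
            self.stable.push(' ');
        }
        self.stable.push_str(text);
        self.volatile.clear();
    }
}

fn main() {
    let mut t = Transcript::new();
    t.apply_partial("The quick");
    t.apply_partial("The quick brown");    // replaces, does not append
    t.apply_final("The quick brown fox."); // locks in, clears the draft
    println!("{t:?}");
}
```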

Long Utterances

For very long speech (>30s), we chunk the buffer:

const SAMPLE_RATE: usize = 16_000; // 16 kHz mono
const MAX_BUFFER_SECONDS: usize = 30;

if self.buffer.len() > MAX_BUFFER_SECONDS * SAMPLE_RATE {
    // Force a commit and start fresh
    self.commit();
}

Rapid Corrections

If the user speaks, pauses briefly, then continues, we may commit prematurely. Smart Turn detection helps, but isn’t perfect. We accept occasional mis-commits in exchange for responsiveness.
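One simple mitigation, sketched here as an assumption rather than the project's Smart Turn logic, is a "hangover" gate: instead of committing on the first silent frame, require several consecutive silent frames, trading a little latency for fewer premature commits.

```rust
// Sketch: require N consecutive silent VAD frames before committing.
// Any speech frame resets the countdown. (Illustrative only; Smart Turn
// detection is more sophisticated than a fixed silence threshold.)
struct PauseGate {
    silent_frames: u32,
    threshold: u32, // e.g. 20 frames x 20 ms = 400 ms of silence
}

impl PauseGate {
    fn new(threshold: u32) -> Self {
        PauseGate { silent_frames: 0, threshold }
    }

    // Feed one VAD decision per frame; returns true when it's safe to commit.
    fn on_frame(&mut self, is_speech: bool) -> bool {
        if is_speech {
            self.silent_frames = 0; // speech resumed: brief pause, don't commit
            false
        } else {
            self.silent_frames += 1;
            self.silent_frames >= self.threshold
        }
    }
}

fn main() {
    let mut gate = PauseGate::new(3);
    // speech, speech, brief pause (2 frames), speech, long pause (3 frames)
    let frames = [true, true, false, false, true, false, false, false];
    for (i, &speech) in frames.iter().enumerate() {
        if gate.on_frame(speech) {
            println!("commit after frame {i}"); // only the long pause commits
        }
    }
}
```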

Performance

Metric               Pure Batch    Simulated Streaming
──────────────────────────────────────────────────────
Perceived latency    2000ms        500ms
Accuracy             Excellent     Excellent (same model)
CPU usage            Lower         Higher (repeated inference)

The CPU trade-off is worth it for UX.
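The "higher" CPU cost can be made concrete with a back-of-envelope calculation, assuming inference time scales roughly linearly with buffer length: each partial re-transcribes the whole prefix, so a 10-second utterance with partials every 0.5 s processes 0.5 + 1.0 + … + 9.5 seconds of audio in partials, plus a 10-second final pass.

```rust
// Rough model of repeated-inference cost: total seconds of audio the
// model processes across all partial passes plus the final pass.
// (Assumes linear scaling; real engines have fixed per-call overhead too.)
fn total_audio_processed(utterance_secs: f64, interval_secs: f64) -> f64 {
    let mut total = 0.0;
    let mut t = interval_secs;
    while t < utterance_secs {
        total += t; // partial pass over the prefix [0, t)
        t += interval_secs;
    }
    total + utterance_secs // final pass over the whole buffer
}

fn main() {
    // 10 s utterance, 0.5 s partial interval:
    // 95 s of partials + 10 s final = 105 s of audio processed,
    // roughly 10x the compute of a single batch pass.
    println!("{}", total_audio_processed(10.0, 0.5));
}
```

This quadratic growth in total work is also why the 30-second buffer cap above matters: it bounds how large each prefix can get.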

When NOT to Use

Simulated streaming adds overhead. Skip it when:

  • Processing pre-recorded files (no need for real-time feel)
  • Running on low-end hardware (CPU budget matters)
  • Accuracy is more important than speed (archival use case)