Smart Turn Detection

Standard VAD detects silence. Smart Turn detects completion.

The Problem

Voice Activity Detection (VAD) only detects silence. Human listeners also interpret what a pause means.

We pause for many reasons:

  • Thinking: “I want to… [pause] …explain something”
  • Breathing: Natural respiratory pauses
  • Emphasis: “This is… [dramatic pause] …important”
  • Completion: “That’s all I have to say.”

Standard VAD treats all pauses the same. This leads to:

  • Sentences being split mid-thought
  • Awkward commit timing
  • User frustration

The Solution

We implement a Neural Turn Detector inspired by Daily.co’s Smart Turn v3.1 research.

Instead of just measuring silence, we analyze:

  1. Acoustic features: Pitch contour, energy decay
  2. Timing: Duration and pattern of the pause
  3. Semantic probability: Is this a likely sentence ending?

The Algorithm

if (Silence > 300ms AND Probability(EndOfSentence) > 0.5):
    Commit()
else:
    Wait()
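
The rule above can be sketched in Rust as a small pure function. This is a minimal illustration, not the code in crates/smart-turn: the names `Decision` and `decide` are hypothetical, and the 300 ms and 0.5 values are passed in as parameters because they are configurable (see Configuration).

```rust
use std::time::Duration;

/// What to do after inspecting the current pause.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Commit,
    Wait,
}

/// Hypothetical helper: commit only when BOTH conditions hold.
/// `min_silence` and `threshold` stand in for the 300 ms / 0.5 defaults.
pub fn decide(
    silence: Duration,
    p_end: f32,
    min_silence: Duration,
    threshold: f32,
) -> Decision {
    if silence > min_silence && p_end > threshold {
        Decision::Commit
    } else {
        Decision::Wait
    }
}
```

Note that both comparisons are strict, matching the pseudocode: a pause of exactly 300 ms does not commit, however high the probability.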

Components

| Component | Role |
|---|---|
| Silero VAD | Detects raw silence |
| Smart Turn Model | Predicts sentence completion |
| Redemption Timer | Grace period before commit |

Implementation

The Smart Turn detector lives in crates/smart-turn:

pub struct SmartTurnV31Cpu {
    session: Mutex<Session>,  // ONNX Runtime session
    input_name: String,
    output_name: String,
}

impl TurnDetector for SmartTurnV31Cpu {
    fn predict_endpoint_probability(
        &self,
        audio_16k_mono: &[f32]
    ) -> Result<f32, TurnError> {
        // Returns probability 0.0-1.0 that speaker is done
    }
}
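
To show how a caller might use this interface without loading an ONNX model, here is a sketch with a stand-in detector. The trait shape is inferred from the impl above. The `TurnError::EmptyAudio` variant, `FixedDetector`, and `is_turn_complete` are assumptions made for illustration and may not match the real crate.

```rust
/// Assumed error type; the real `TurnError` is not shown in this chapter.
#[derive(Debug, PartialEq)]
pub enum TurnError {
    EmptyAudio,
}

/// Trait shape inferred from the `impl TurnDetector` above.
pub trait TurnDetector {
    fn predict_endpoint_probability(
        &self,
        audio_16k_mono: &[f32],
    ) -> Result<f32, TurnError>;
}

/// Hypothetical stand-in that always reports the same probability.
pub struct FixedDetector(pub f32);

impl TurnDetector for FixedDetector {
    fn predict_endpoint_probability(
        &self,
        audio_16k_mono: &[f32],
    ) -> Result<f32, TurnError> {
        if audio_16k_mono.is_empty() {
            return Err(TurnError::EmptyAudio);
        }
        // Keep the result inside the documented 0.0-1.0 range.
        Ok(self.0.clamp(0.0, 1.0))
    }
}

/// Caller-side check: does the detector consider the turn finished?
pub fn is_turn_complete(
    detector: &dyn TurnDetector,
    audio: &[f32],
    threshold: f32,
) -> Result<bool, TurnError> {
    Ok(detector.predict_endpoint_probability(audio)? > threshold)
}
```

Depending on the trait rather than on `SmartTurnV31Cpu` directly keeps the commit logic testable with a fake detector like this one.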

Configuration

Users can tune the behavior via Settings:

Redemption Time

The grace period: how long silence must last before we even consider committing.

| Setting | Value | Effect |
|---|---|---|
| Fast | 200ms | Quick commits, may split sentences |
| Balanced | 300ms | Default, good for most users |
| Relaxed | 500ms | Waits longer, better for slow speakers |
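
One way to represent these presets in code is an enum that maps each setting to a `Duration`. This is a sketch only: `RedemptionTime`, `duration`, and `elapsed` are hypothetical names, and the actual Settings type may be structured differently.

```rust
use std::time::Duration;

/// Hypothetical preset enum mirroring the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum RedemptionTime {
    Fast,
    #[default]
    Balanced, // documented default
    Relaxed,
}

impl RedemptionTime {
    /// Grace period associated with each preset.
    pub fn duration(self) -> Duration {
        match self {
            Self::Fast => Duration::from_millis(200),
            Self::Balanced => Duration::from_millis(300),
            Self::Relaxed => Duration::from_millis(500),
        }
    }

    /// True once the silence has outlasted the grace period.
    pub fn elapsed(self, silence: Duration) -> bool {
        silence > self.duration()
    }
}
```

For example, a 250 ms pause has already outlasted the Fast grace period but not the Relaxed one.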

Sensitivity

How confident must we be that the sentence is complete?

| Setting | Threshold | Effect |
|---|---|---|
| Aggressive | 0.3 | Commits on weak signals |
| Normal | 0.5 | Balanced |
| Conservative | 0.7 | Only commits on strong endings |
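
The sensitivity presets can be sketched the same way, as an enum that yields the probability threshold. `Sensitivity`, `threshold`, and `accepts` are illustrative names rather than the project's actual API.

```rust
/// Hypothetical preset enum mirroring the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sensitivity {
    Aggressive,
    Normal,
    Conservative,
}

impl Sensitivity {
    /// Minimum end-of-sentence probability required to commit.
    pub fn threshold(self) -> f32 {
        match self {
            Self::Aggressive => 0.3,
            Self::Normal => 0.5,
            Self::Conservative => 0.7,
        }
    }

    /// True when the model's probability clears this preset's threshold.
    pub fn accepts(self, p_end: f32) -> bool {
        p_end > self.threshold()
    }
}
```

A prediction of 0.4 would therefore commit under Aggressive but keep waiting under Normal and Conservative.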

The Flow

graph TD
    A[Audio Input] --> B{VAD: Speech?}
    B -->|Yes| C[Buffer Audio]
    B -->|No| D{Silence > Redemption?}
    D -->|No| C
    D -->|Yes| E[Smart Turn Analysis]
    E --> F{"P(End) > Threshold?"}
    F -->|Yes| G[Commit Text]
    F -->|No| C
    G --> H[Reset State]
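
The flow above can be approximated by a small per-frame state machine. This is a simplified sketch under stated assumptions: `TurnGate`, `Action`, and `on_frame` are hypothetical names, the VAD result arrives as a plain `bool`, and the Smart Turn model is represented by a closure so that it only runs once the redemption time has passed.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Buffer, // keep accumulating audio
    Commit, // emit the text and reset
}

/// Hypothetical gate that tracks how long the current silence has lasted.
pub struct TurnGate {
    redemption: Duration,
    threshold: f32,
    silence: Duration,
}

impl TurnGate {
    pub fn new(redemption: Duration, threshold: f32) -> Self {
        Self { redemption, threshold, silence: Duration::ZERO }
    }

    /// Process one frame. `p_end` is invoked only after the silence
    /// exceeds the redemption time, so the model is not run on every frame.
    pub fn on_frame(
        &mut self,
        is_speech: bool,
        frame_len: Duration,
        p_end: impl FnOnce() -> f32,
    ) -> Action {
        if is_speech {
            self.silence = Duration::ZERO; // speech interrupts the pause
            return Action::Buffer;
        }
        self.silence += frame_len;
        if self.silence <= self.redemption {
            return Action::Buffer; // still inside the grace period
        }
        if p_end() > self.threshold {
            self.silence = Duration::ZERO; // reset state after commit
            Action::Commit
        } else {
            Action::Buffer
        }
    }
}
```

With 100 ms frames, a 300 ms redemption time, and a 0.5 threshold, the first three silent frames are buffered without consulting the model; from the fourth frame on, the gate commits as soon as the predicted probability exceeds 0.5.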

Real-World Impact

Without Smart Turn:

User: "I think we should... [thinking pause]"
System: COMMIT → "I think we should"
User: "...consider the alternatives"
System: COMMIT → "consider the alternatives"

With Smart Turn:

User: "I think we should... [thinking pause] ...consider the alternatives"
System: (waiting, P(End) = 0.2)
System: (waiting, P(End) = 0.3)
User: [longer pause, falling intonation]
System: (P(End) = 0.7) COMMIT → "I think we should consider the alternatives"