Audio Hygiene
Bad microphones shouldn’t mean bad transcripts. We fix what we can.
The Problems
Consumer microphones vary wildly:
- Built-in laptop mics pick up fan noise
- USB mics have different gain settings
- Sample rates range from 8kHz to 96kHz
- Some mics clip, others are too quiet
Models expect clean, consistent 16kHz audio. We bridge the gap.
Resampling
All models need 16kHz mono audio. Users have everything else.
Why Sinc Interpolation?
#![allow(unused)]
fn main() {
use rubato::{FftFixedIn, Resampler};
let resampler = FftFixedIn::<f32>::new(
input_rate, // e.g., 44100
16000, // target
chunk_size,
2, // sub-chunks
1, // channels
)?;
}
We use rubato’s FFT-based sinc resampling. Alternatives:
| Method | Quality | Speed | Our Use |
|---|---|---|---|
| Nearest neighbor | Terrible | Fast | Never |
| Linear | Poor | Fast | Never |
| Sinc (rubato) | Excellent | Medium | Yes |
Linear interpolation creates aliasing artifacts that sound “robotic.” Speech recognition models weren’t trained on robotic audio—they perform worse.
The CPU cost of proper resampling is negligible compared to inference.
Automatic Gain Control
The Problem
User A (quiet voice): ▁▁▂▁▁▂▁ (signal barely visible)
User B (loud voice): ▇▇█▇▇█▇ (signal clipping)
Model expects: ▃▄▅▄▃▅▄ (normalized range)
Our Solution
Soft-knee compression with tanh:
#![allow(unused)]
fn main() {
const TARGET_DB: f32 = -20.0;
const ATTACK_MS: f32 = 10.0;
const RELEASE_MS: f32 = 100.0;
pub struct Agc {
gain: f32,
target_rms: f32,
}
impl Agc {
pub fn process(&mut self, samples: &mut [f32]) {
let rms = calculate_rms(samples);
let target_gain = self.target_rms / rms.max(1e-10);
// Smooth gain changes to avoid clicks
self.gain = lerp(self.gain, target_gain, self.smoothing);
// Apply gain with soft clipping
for sample in samples.iter_mut() {
*sample = (*sample * self.gain).tanh();
}
}
}
}
The tanh function provides soft clipping—instead of hard clipping at ±1.0 (which sounds harsh), it smoothly compresses peaks.
Target Level
We target -20 dBFS. Why?
- Leaves headroom for peaks
- Matches typical model training data
- Consistent across different mic gains
DC Offset Removal
Some cheap mics have DC offset—the signal “floats” above or below zero:
Bad: ▄▅▆▅▄▅▆▅▄▅ (offset from zero)
Good: ▃▄▅▄▃▄▅▄▃▄ (centered on zero)
We use a simple high-pass filter:
#![allow(unused)]
fn main() {
const CUTOFF_HZ: f32 = 20.0; // Remove everything below 20Hz
pub fn remove_dc(samples: &mut [f32], state: &mut f32) {
let alpha = 1.0 - (2.0 * PI * CUTOFF_HZ / 16000.0);
for sample in samples.iter_mut() {
let new_state = *sample + alpha * *state;
*sample = new_state - *state;
*state = new_state;
}
}
}
Noise Gate
We don’t use one. Here’s why:
Noise gates cut audio below a threshold. In theory, they reduce background noise. In practice:
- They clip word beginnings (“hello” → “ello”)
- Silero VAD already handles speech detection
- Models are trained on noisy data and handle it fine
If the environment is so noisy that VAD triggers incorrectly, a noise gate won’t help—the user needs a better mic or quieter room.
Preprocessing Pipeline
Audio flows through these stages in order:
Mic → DC Remove → Resample → AGC → Model
Each stage is independent and stateless (except AGC’s smoothing state).
Testing
We keep a collection of “pathological” audio files:
- Recorded at 8kHz
- Heavy background noise
- Extreme clipping
- Strong DC offset
CI runs inference on these files. If accuracy drops, we investigate.
Code
- Resampling:
crates/audio/src/resample.rs - AGC:
crates/audio/src/agc.rs - Pipeline:
crates/audio/src/stream.rs