Audio Format: PCM16LE
Voquii uses PCM16LE (Pulse Code Modulation, 16-bit, Little-Endian) as its primary audio format. This is the same uncompressed 16-bit linear PCM encoding used on audio CDs, applied here at a sample rate chosen for voice.
| Characteristic | Benefit |
|---|---|
| Uncompressed | Zero quality loss from compression artifacts |
| 16-bit Depth | 96dB dynamic range—captures whispers to shouts |
| Industry Standard | Supported by virtually every audio tool and platform |
| Low Latency | No encoding/decoding delays |
| Processing Friendly | Optimal for real-time AI analysis |
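
To make the encoding concrete, the sketch below converts floating-point samples in the range -1.0 to 1.0 into PCM16LE bytes using Python's standard `struct` module, and derives the roughly 96 dB figure from the bit depth. The function names are illustrative and are not part of any Voquii API.

```python
import math
import struct

def float_to_pcm16le(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip out-of-range input
        ints.append(int(round(s * 32767)))  # scale to the signed 16-bit range
    # "<" selects little-endian byte order, "h" a signed 16-bit integer
    return struct.pack(f"<{len(ints)}h", *ints)

def dynamic_range_db(bit_depth):
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)

pcm = float_to_pcm16le([0.0, 0.25, -1.0])
print(pcm.hex())                       # 000000200180
print(round(dynamic_range_db(16), 1))  # 96.3
```

Each sample occupies two bytes with the low-order byte first, which is what "little-endian" refers to.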
Quality Comparison
| Format | Bit Rate | Quality | Latency | Use Case |
|---|---|---|---|---|
| PCM16LE | 256 kbps (16 kHz mono) | Excellent | Lowest | Real-time voice AI |
| MP3 | 128 kbps | Good | High | Music streaming |
| Opus | 32-64 kbps | Very Good | Low | VoIP, WebRTC |
| G.711 | 64 kbps | Acceptable | Low | Traditional telephony |
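
The PCM bit rate in the table follows directly from sample rate, bit depth, and channel count. The sketch below shows the arithmetic, assuming mono audio at 16 kHz; the helper name is illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels=1):
    """Raw PCM bit rate in kilobits per second (1 kbps = 1000 bits/s)."""
    return sample_rate_hz * bit_depth * channels / 1000

print(pcm_bitrate_kbps(16_000, 16))  # 256.0
print(pcm_bitrate_kbps(8_000, 8))    # 64.0, the rate of G.711 (8 kHz, 8 bits per sample)
```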
Sample Rate: 16kHz
Voquii processes audio at 16kHz (16,000 samples per second)—the sweet spot for voice applications that balances quality with efficiency.
Captures All Speech
Every phoneme clearly distinguishable, natural voice timbre preserved, emotion and tone intact.
Efficient Processing
About 33% less data than 24kHz, faster ASR processing, lower bandwidth requirements.
Industry Standard
Matches Deepgram's optimal input, widely supported by TTS providers, a wideband rate supported by Opus (the voice codec used in WebRTC).
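
The trade-off between sample rates can be checked with a few lines of arithmetic. The sketch below relies on the Nyquist limit (sampled audio can represent frequencies up to half its sample rate) and assumes 16-bit mono PCM; the helper names are illustrative.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at a given sample rate."""
    return sample_rate_hz / 2

def bytes_per_second(sample_rate_hz, bytes_per_sample=2, channels=1):
    """Raw data rate of uncompressed PCM."""
    return sample_rate_hz * bytes_per_sample * channels

def relative_savings(lower_hz, higher_hz):
    """Fraction of data saved by sampling at lower_hz instead of higher_hz."""
    return 1 - lower_hz / higher_hz

for rate in (8_000, 16_000, 24_000):
    print(rate, nyquist_hz(rate), bytes_per_second(rate))

print(f"{relative_savings(16_000, 24_000):.1%}")  # 33.3%
```

At 16kHz the representable band extends to 8 kHz, twice what 8kHz narrowband audio can carry.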
Quality Impact
| Metric | 8kHz (Narrowband) | 16kHz (Wideband) |
|---|---|---|
| ASR Accuracy | 85-90% | 95%+ |
| Speaker Recognition | Limited | Excellent |
| Emotion Detection | Poor | Good |
| Processing Clarity | Muffled | Crystal clear |
Real-Time Audio Streaming
Real-time streaming is the foundation of natural AI conversations. Voquii processes audio as it's generated—no waiting for complete utterances, no awkward pauses, no buffering delays.
| Aspect | Batch Processing | Real-Time Streaming |
|---|---|---|
| Wait Time | 2-5 seconds | <300ms |
| User Experience | Robotic, stilted | Natural, fluid |
| Memory Usage | High (full audio) | Low (chunks) |
| Interruption | Not possible | Instant barge-in |
Frame-Based Processing
Audio is processed in small, consistent frames for optimal performance:
- Frame Duration: 20ms
- Samples per Frame: 320 (at 16kHz)
- Bytes per Frame: 640 (PCM16LE)
- Frames per Second: 50
- Processing Budget: <20ms per frame
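
The frame figures above can be derived and applied as follows. This is a minimal sketch of a buffer that regroups arbitrarily sized incoming chunks into complete 20ms frames; the `FrameBuffer` class is hypothetical and only illustrates the arithmetic, not Voquii's internal implementation.

```python
SAMPLE_RATE_HZ = 16_000
FRAME_MS = 20
BYTES_PER_SAMPLE = 2  # 16-bit PCM

SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 320
BYTES_PER_FRAME = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE  # 640
FRAMES_PER_SECOND = 1000 // FRAME_MS                    # 50

class FrameBuffer:
    """Accumulates PCM bytes and emits only complete frames."""

    def __init__(self):
        self._pending = bytearray()

    def push(self, chunk: bytes) -> list[bytes]:
        """Add a chunk and return every complete frame now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= BYTES_PER_FRAME:
            frames.append(bytes(self._pending[:BYTES_PER_FRAME]))
            del self._pending[:BYTES_PER_FRAME]
        return frames  # any remainder stays buffered for the next push

buf = FrameBuffer()
print(len(buf.push(bytes(1000))))  # 1 (360 bytes remain buffered)
print(len(buf.push(bytes(1000))))  # 2 (80 bytes remain buffered)
```

Keeping the remainder between calls means frames stay aligned to sample boundaries even when the network delivers chunks of uneven size.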
WebSocket Audio Transport
Voquii uses WebSocket connections for persistent, bidirectional audio streaming—the optimal choice for real-time voice applications.
| Feature | Benefit |
|---|---|
| Persistent Connection | No reconnection overhead |
| Bidirectional | Client and server both send audio over one connection |
| Low Latency | Minimal protocol overhead |
| Full-Duplex | Both directions can transmit at the same time |
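
As an illustration of streaming over a persistent connection, the sketch below sends 20ms frames at a real-time pace. It is transport-agnostic: `send` is a stand-in for a WebSocket client's awaitable binary send method, and no Voquii endpoint, authentication scheme, or message format is assumed.

```python
import asyncio
import time

FRAME_SECONDS = 0.020  # one 20ms frame

async def stream_frames(frames, send, frame_seconds=FRAME_SECONDS):
    """Send frames at real-time pace through an async `send(bytes)` callable."""
    start = time.monotonic()
    for i, frame in enumerate(frames):
        # Schedule against the start time so timing errors do not accumulate.
        delay = start + i * frame_seconds - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        await send(frame)

async def demo():
    sent = []

    async def fake_send(data):  # placeholder for a real WebSocket send
        sent.append(data)

    await stream_frames([bytes(640)] * 5, fake_send)
    return sent

sent = asyncio.run(demo())
print(len(sent))  # 5
```

Because the connection is full-duplex, a receiving task can run alongside this sender in the same event loop.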
Codec Support
Voquii supports multiple codecs for different use cases:
| Codec | Where Used | Bit Rate | Use Case |
|---|---|---|---|
| PCM16LE | Internal | 256 kbps | Processing |
| Opus | Client | 32-64 kbps | Web/mobile |
| G.711 μ-law | PSTN | 64 kbps | Phone calls (North America, Japan) |
| G.711 A-law | PSTN | 64 kbps | Phone calls (Europe and most other regions) |
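
To show what moving between these codecs involves, the sketch below encodes 16-bit linear PCM samples as G.711 μ-law using the widely used bias-and-segment formulation. How Voquii transcodes internally is not stated here, so treat this as an illustration only. G.711 runs at 8 kHz, so 16kHz audio would need resampling first, which is not shown.

```python
import struct

BIAS = 0x84   # 132, added before the segment search
CLIP = 32635  # largest magnitude that still fits after adding the bias

def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # G.711 transmits the bitwise complement of the assembled byte.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def pcm16le_to_ulaw(pcm: bytes) -> bytes:
    """Encode a PCM16LE buffer; a trailing odd byte is ignored."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[:count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)

print(hex(linear_to_ulaw(0)))        # 0xff
print(len(pcm16le_to_ulaw(bytes(640))))  # 320
```

Each 16-bit sample becomes one byte, which is why G.711 at 8 kHz comes to 64 kbps.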
Audio Quality Assurance
Voquii continuously monitors and optimizes audio quality:
- MOS (Mean Opinion Score): 4.2/5.0 average (Good-Excellent)
- Jitter: 12ms average
- Packet Loss: <0.3%
- End-to-End Latency: ~285ms
Why Audio Quality Matters
| Metric | Poor Audio | Quality Audio (Voquii) |
|---|---|---|
| ASR Accuracy | 75-85% | 95%+ |
| Repeat Requests | 25% of utterances | <5% |
| Call Duration | +30% longer | Baseline |
| Customer Satisfaction | 65% | 92% |
| First Call Resolution | 60% | 85% |
"We switched from a competitor with noticeable audio compression. The improvement in ASR accuracy alone saved us $40K/month in misrouted calls and callbacks."
— VP Operations, Insurance Company
