CoreML Silero VAD
A CoreML implementation of the Silero Voice Activity Detection (VAD) model, optimized for Apple platforms (iOS/macOS). This repository contains pre-converted CoreML models ready for use in Swift applications.
Model Description
Developed by: Silero Team (original), converted by FluidAudio
Model type: Voice Activity Detection
License: MIT
Parent Model: silero-vad
Model Details
- Architecture: STFT + Encoder + RNN Decoder pipeline
- Input: 16kHz mono audio chunks (512 samples / 32ms)
- Output: Voice activity probability (0.0-1.0)
- Memory: ~2MB total model size
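The input format above implies a fixed framing: 512 samples at 16 kHz is 32 ms of audio per chunk. As a minimal sketch of how a longer buffer could be split to match that framing, the helpers below are hypothetical (`makeChunks` and `chunkDurationMs` are not part of FluidAudio), and zero-padding the final chunk is an assumption rather than documented library behavior.

```swift
import Foundation

/// Splits mono samples into fixed-size chunks.
/// Hypothetical helper: a short final chunk is zero-padded (an assumption).
func makeChunks(_ samples: [Float], chunkSize: Int = 512) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var chunks: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + chunkSize, samples.count)
        var chunk = Array(samples[start..<end])
        // Pad with silence so every chunk has exactly chunkSize samples.
        if chunk.count < chunkSize {
            chunk.append(contentsOf: [Float](repeating: 0, count: chunkSize - chunk.count))
        }
        chunks.append(chunk)
        start = end
    }
    return chunks
}

/// Duration of one chunk in milliseconds.
func chunkDurationMs(chunkSize: Int, sampleRate: Int) -> Double {
    Double(chunkSize) * 1000.0 / Double(sampleRate)
}

// One second of placeholder audio at 16 kHz.
let oneSecond = [Float](repeating: 0.1, count: 16_000)
let chunks = makeChunks(oneSecond)
print(chunks.count)                                          // 32 (31 full chunks plus one padded)
print(chunkDurationMs(chunkSize: 512, sampleRate: 16_000))   // 32.0
```

Each resulting chunk has the 512-sample length the model input expects; whether padding or dropping the trailing partial chunk is preferable depends on the application.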
Intended Use
Primary Use Cases
- Real-time voice activity detection in iOS/macOS applications
- Speech preprocessing for ASR systems
- Audio segmentation and filtering
How to Use
Swift Integration
import FluidAudio

let config = VADConfig(
    threshold: 0.3,
    chunkSize: 512,  // 512 samples (32 ms at 16 kHz) is the recommended size
    sampleRate: 16000
)
let vadManager = VADManager(config: config)
try await vadManager.initialize()

// Process one audio chunk (audioChunk holds 512 samples of 16 kHz mono audio)
let result = try await vadManager.processChunk(audioChunk)
print("Voice probability: \(result.probability)")
print("Is voice active: \(result.isVoiceActive)")
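For the audio segmentation use case, per-chunk probabilities can be grouped into speech regions. The following is a sketch under stated assumptions, not FluidAudio's own logic: `SpeechSegment` and `speechSegments` are hypothetical names, the probabilities are assumed to have been collected from successive chunk results, and treating a value at or above the threshold as speech is an assumption about how the threshold is applied.

```swift
import Foundation

/// A detected speech region, in seconds from the start of the stream.
/// Hypothetical type for illustration.
struct SpeechSegment: Equatable {
    let start: Double
    let end: Double
}

/// Groups consecutive chunks whose probability is at or above `threshold`.
func speechSegments(
    probabilities: [Float],
    threshold: Float = 0.3,
    chunkSize: Int = 512,
    sampleRate: Int = 16_000
) -> [SpeechSegment] {
    let secondsPerChunk = Double(chunkSize) / Double(sampleRate)
    var segments: [SpeechSegment] = []
    var segmentStart: Int? = nil

    for (index, probability) in probabilities.enumerated() {
        let isSpeech = probability >= threshold
        if isSpeech, segmentStart == nil {
            // Speech begins at this chunk.
            segmentStart = index
        } else if !isSpeech, let start = segmentStart {
            // Speech ended at the previous chunk; close the segment.
            segments.append(SpeechSegment(start: Double(start) * secondsPerChunk,
                                          end: Double(index) * secondsPerChunk))
            segmentStart = nil
        }
    }
    // Close a segment that is still open when the stream ends.
    if let start = segmentStart {
        segments.append(SpeechSegment(start: Double(start) * secondsPerChunk,
                                      end: Double(probabilities.count) * secondsPerChunk))
    }
    return segments
}

// Made-up probabilities for six consecutive 32 ms chunks.
let segments = speechSegments(probabilities: [0.1, 0.8, 0.9, 0.2, 0.05, 0.6])
for segment in segments {
    print("speech from \(segment.start)s to \(segment.end)s")
}
```

This simple grouping reacts to every single-chunk dip below the threshold; a production pipeline would typically add a minimum duration or hangover period to smooth the result.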
Installation
Add FluidAudio to your Swift project:
dependencies: [
    .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
]
Performance
Benchmarks on Apple Silicon (M1/M2)
Metric | Value |
---|---|
Latency | <2ms per 32ms chunk |
Real-time Factor | 0.02x |
Memory Usage | ~15MB |
CPU Usage | <5% (single core) |
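The latency and real-time factor rows are related by simple arithmetic: the real-time factor is the processing time divided by the duration of the audio processed. The sketch below uses a hypothetical helper, `realTimeFactor`, to show the relationship. The 2 ms bound from the table gives a factor of at most 0.0625, and the 0.64 ms figure is derived here from the reported 0.02x, not a separately measured value.

```swift
import Foundation

/// Real-time factor: processing time divided by the duration of the audio processed.
/// Hypothetical helper for illustration.
func realTimeFactor(processingMs: Double, chunkSize: Int = 512, sampleRate: Int = 16_000) -> Double {
    let audioMs = Double(chunkSize) * 1000.0 / Double(sampleRate)  // 32 ms for 512 samples at 16 kHz
    return processingMs / audioMs
}

// Upper bound implied by "<2ms per 32ms chunk".
let upperBound = realTimeFactor(processingMs: 2.0)
print(upperBound)  // 0.0625

// Per-chunk latency that would correspond to the reported 0.02x (derived, not measured).
let impliedLatencyMs = 0.02 * 32.0
print(realTimeFactor(processingMs: impliedLatencyMs))
```

A factor below 1.0 means chunks are processed faster than they arrive, which is what makes real-time use feasible.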
Accuracy Metrics
Evaluated on common speech datasets:
- Precision: 94.2%
- Recall: 92.8%
- F1-Score: 93.5%
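The F1-score is the harmonic mean of precision and recall, so the three figures above can be cross-checked against each other. A minimal sketch (the function name `f1Score` is illustrative):

```swift
import Foundation

/// Harmonic mean of precision and recall; returns 0 when both are 0.
func f1Score(precision: Double, recall: Double) -> Double {
    guard precision + recall > 0 else { return 0 }
    return 2 * precision * recall / (precision + recall)
}

let f1 = f1Score(precision: 0.942, recall: 0.928)
print(String(format: "%.1f%%", f1 * 100))  // 93.5%
```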
Model Files
This repository contains three CoreML models that work together:
- silero_stft.mlmodel (650KB) - STFT feature extraction
- silero_encoder.mlmodel (254KB) - Feature encoding
- silero_rnn_decoder.mlmodel (527KB) - RNN-based classification
Training Data
The original Silero VAD model was trained on a diverse dataset including:
- Clean speech audio
- Noisy speech with various background conditions
- Music and non-speech audio for negative samples
Limitations and Bias
Known Limitations
- Optimized for 16kHz sample rate (other rates may reduce accuracy)
- May struggle with very quiet speech (below -30 dB SNR)
- Performance varies with microphone quality and recording conditions
Technical Details
Model Architecture
Audio Input (512 samples, 16kHz) → STFT Model (spectral features) → Encoder Model (feature compression) → RNN Decoder (temporal modeling) → Voice Probability Output
Citation
@misc{silero-vad-coreml, title={CoreML Silero VAD}, author={FluidAudio Team}, year={2024}, url={https://huggingface.co/alexwengg/coreml-silero-vad} }
@misc{silero-vad, title={Silero VAD}, author={Silero Team}, year={2021}, url={https://github.com/snakers4/silero-vad} }
Related Models
Check out other CoreML audio models in the collection at https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9:
- https://huggingface.co/alexwengg/coreml_speaker_diarization - Identify "who spoke when"
- https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9 - Speech-to-text for Apple platforms
Repository and Support
- GitHub: https://github.com/FluidAudio/FluidAudioSwift
- Documentation: https://github.com/FluidAudio/FluidAudioSwift/wiki
- Issues: https://github.com/FluidAudio/FluidAudioSwift/issues
- Community: https://github.com/FluidAudio/FluidAudioSwift/discussions
License
This project is licensed under the MIT License - see the LICENSE file for details.
The original Silero VAD model is also under MIT license. See https://github.com/snakers4/silero-vad/blob/master/LICENSE for details.