CoreML Silero VAD

A CoreML implementation of the Silero Voice Activity Detection (VAD) model, optimized for Apple platforms (iOS/macOS). This repository contains pre-converted CoreML models ready for use in Swift applications.

Model Description

Developed by: Silero Team (original), converted by FluidAudio

Model type: Voice Activity Detection

License: MIT

Parent Model: silero-vad

Model Details

  • Architecture: STFT + Encoder + RNN Decoder pipeline
  • Input: 16kHz mono audio chunks (512 samples / 32ms)
  • Output: Voice activity probability (0.0-1.0)
  • Memory: ~2MB total model size
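The 512-sample input window can be illustrated with a small chunking helper. This is a minimal sketch, not part of FluidAudio: `makeChunks` is a hypothetical name, and zero-padding the final partial chunk is an assumption about how leftover samples might be handled.

```swift
/// Splits 16 kHz mono samples into fixed-size chunks.
/// Hypothetical helper; zero-padding the last chunk is an assumption.
func makeChunks(_ samples: [Float], chunkSize: Int = 512) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var chunks: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + chunkSize, samples.count)
        var chunk = Array(samples[start..<end])
        // Pad the trailing chunk so every chunk has exactly `chunkSize` samples.
        if chunk.count < chunkSize {
            chunk.append(contentsOf: [Float](repeating: 0, count: chunkSize - chunk.count))
        }
        chunks.append(chunk)
        start = end
    }
    return chunks
}

// One second of silence at 16 kHz: 31 full chunks plus one padded chunk.
let oneSecond = [Float](repeating: 0, count: 16_000)
let chunks = makeChunks(oneSecond)
print(chunks.count)                      // 32
print(Double(512 * 1000) / 16_000.0)     // 32.0 (milliseconds per chunk)
```

Each resulting chunk covers 32 ms of audio, which matches the input size listed above.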

Intended Use

Primary Use Cases

  • Real-time voice activity detection in iOS/macOS applications
  • Speech preprocessing for ASR systems
  • Audio segmentation and filtering

How to Use

Swift Integration

import FluidAudio

let config = VADConfig(
    threshold: 0.3,
    chunkSize: 512, // 512 samples (32 ms at 16 kHz) is the recommended size
    sampleRate: 16000
)

let vadManager = VADManager(config: config)
try await vadManager.initialize()

// Process one audio chunk (audioChunk holds 512 samples of 16 kHz mono audio)
let result = try await vadManager.processChunk(audioChunk)
print("Voice probability: \(result.probability)")
print("Is voice active: \(result.isVoiceActive)")
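For the audio segmentation use case, per-chunk probabilities can be grouped into speech segments. The following is a sketch under stated assumptions: `SpeechSegment` and `speechSegments` are hypothetical helpers (not FluidAudio API), the probabilities are assumed to have been collected from successive chunk results, and a chunk is treated as speech when its probability is at or above the threshold.

```swift
/// A contiguous run of speech, in seconds. Hypothetical type, not part of FluidAudio.
struct SpeechSegment: Equatable {
    let start: Double
    let end: Double
}

/// Groups consecutive chunks whose probability meets `threshold` into segments.
func speechSegments(
    probabilities: [Float],
    threshold: Float = 0.3,
    chunkSize: Int = 512,
    sampleRate: Int = 16_000
) -> [SpeechSegment] {
    let chunkDuration = Double(chunkSize) / Double(sampleRate)
    var segments: [SpeechSegment] = []
    var runStart: Int? = nil

    for (index, probability) in probabilities.enumerated() {
        let isSpeech = probability >= threshold
        if isSpeech, runStart == nil {
            runStart = index                       // a speech run begins
        } else if !isSpeech, let start = runStart {
            segments.append(SpeechSegment(start: Double(start) * chunkDuration,
                                          end: Double(index) * chunkDuration))
            runStart = nil                         // the speech run ends
        }
    }
    // Close a segment that is still open when the input ends.
    if let start = runStart {
        segments.append(SpeechSegment(start: Double(start) * chunkDuration,
                                      end: Double(probabilities.count) * chunkDuration))
    }
    return segments
}

// Example probabilities (made up for illustration): two speech runs.
let probs: [Float] = [0.05, 0.10, 0.80, 0.90, 0.20, 0.60]
for segment in speechSegments(probabilities: probs) {
    print("speech from \(segment.start)s to \(segment.end)s")
}
```

A production pipeline would likely also merge segments separated by very short gaps and drop very short segments; those steps are omitted here to keep the sketch small.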

Installation

Add FluidAudio to your Swift project:

dependencies: [
    .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
]

Performance

Benchmarks on Apple Silicon (M1/M2)

Metric              Value
Latency             <2 ms per 32 ms chunk
Real-time Factor    0.02x
Memory Usage        ~15 MB
CPU Usage           <5% (single core)
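To relate the latency and real-time factor figures, the real-time factor is taken here to be processing time divided by the duration of the audio processed. The sketch below only does that arithmetic; `realTimeFactor` is a hypothetical helper and no new measurements are implied.

```swift
/// Real-time factor: processing time divided by the duration of audio processed.
func realTimeFactor(processingMs: Double, audioMs: Double) -> Double {
    return processingMs / audioMs
}

let chunkMs = 512.0 * 1000.0 / 16_000.0   // 32 ms of audio per chunk

// Upper bound implied by a latency of 2 ms per 32 ms chunk.
print(realTimeFactor(processingMs: 2.0, audioMs: chunkMs))   // 0.0625

// Per-chunk processing time implied by a real-time factor of 0.02 (about 0.64 ms).
print(0.02 * chunkMs)
```

Both figures are consistent: a real-time factor of 0.02 corresponds to roughly 0.64 ms per chunk, which is within the stated bound of under 2 ms.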

Accuracy Metrics

Evaluated on common speech datasets:

  • Precision: 94.2%
  • Recall: 92.8%
  • F1-Score: 93.5%
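The F1 score is the harmonic mean of precision and recall, so the third figure can be checked from the first two. A minimal sketch, with `f1Score` as an illustrative helper name:

```swift
/// F1 is the harmonic mean of precision and recall.
func f1Score(precision: Double, recall: Double) -> Double {
    guard precision + recall > 0 else { return 0 }
    return 2 * precision * recall / (precision + recall)
}

let f1 = f1Score(precision: 0.942, recall: 0.928)
print((f1 * 1000).rounded() / 10)   // 93.5
```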

Model Files

This repository contains three CoreML models that work together:

  • silero_stft.mlmodel (650KB) - STFT feature extraction
  • silero_encoder.mlmodel (254KB) - Feature encoding
  • silero_rnn_decoder.mlmodel (527KB) - RNN-based classification

Training Data

The original Silero VAD model was trained on a diverse dataset including:

  • Clean speech audio
  • Noisy speech with various background conditions
  • Music and non-speech audio for negative samples

Limitations and Bias

Known Limitations

  • Optimized for 16kHz sample rate (other rates may reduce accuracy)
  • May struggle with very quiet speech (<-30dB SNR)
  • Performance varies with microphone quality and recording conditions

Technical Details

Model Architecture

Audio Input (512 samples, 16kHz)
        ↓
STFT Model (spectral features)
        ↓
Encoder Model (feature compression)
        ↓
RNN Decoder (temporal modeling)
        ↓
Voice Probability Output

Citation

@misc{silero-vad-coreml,
  title={CoreML Silero VAD},
  author={FluidAudio Team},
  year={2024},
  url={https://huggingface.co/alexwengg/coreml-silero-vad}
}

@misc{silero-vad,
  title={Silero VAD},
  author={Silero Team},
  year={2021},
  url={https://github.com/snakers4/silero-vad}
}

Related Models

Check out other CoreML audio models in this collection: https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9

Repository and Support

License

This project is licensed under the MIT License - see the LICENSE file for details.

The original Silero VAD model is also released under the MIT license. See https://github.com/snakers4/silero-vad/blob/master/LICENSE for details.
