# Speaker Diarization CoreML Models

State-of-the-art speaker diarization models optimized for the Apple Neural Engine, powering real-time, on-device identification of who spoke when, with research-competitive accuracy.

## Model Description

This repository contains speaker diarization models converted to CoreML and optimized for Apple devices (macOS 13.0+, iOS 16.0+). They enable efficient on-device speaker diarization with minimal power consumption while maintaining state-of-the-art accuracy.

## Usage

For full documentation, see the FluidAudio SDK: https://github.com/FluidInference/FluidAudio

### With FluidAudio SDK (Recommended)

**Installation**

Add FluidAudio to your project using Swift Package Manager:

```swift
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.2"),
],
```
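If your package has multiple targets, also declare the product on the target that uses it. A minimal sketch; the target name is a placeholder, and the product name is assumed to match the package name:

```swift
.target(
    name: "MyApp",  // placeholder: your target's name
    dependencies: [
        .product(name: "FluidAudio", package: "FluidAudio")
    ]
)
```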
Then run diarization over 16 kHz mono samples:

```swift
import FluidAudio

Task {
    let diarizer = DiarizerManager()
    try await diarizer.initialize()

    let audioSamples: [Float] = [] // replace with your 16 kHz mono audio
    let result = try await diarizer.performCompleteDiarization(
        audioSamples,
        sampleRate: 16000
    )

    for segment in result.segments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
```


### Direct CoreML Usage
```swift
import CoreML

// Load the model
let model = try! SpeakerDiarizationModel(configuration: MLModelConfiguration())

// Prepare input (16 kHz audio)
let input = SpeakerDiarizationModelInput(audioSamples: audioArray)

// Run inference
let output = try! model.prediction(input: input)
```
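If you load the downloaded model bundle directly rather than through an Xcode-generated class, `MLModel(contentsOf:configuration:)` works as well. A minimal sketch; the file path is a placeholder, and `computeUnits = .all` lets CoreML schedule supported operations on the ANE:

```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .all  // allow CPU, GPU, and Neural Engine

// Placeholder path: point this at the compiled .mlmodelc from this repo
let modelURL = URL(fileURLWithPath: "SpeakerDiarizationModel.mlmodelc")
let rawModel = try MLModel(contentsOf: modelURL, configuration: config)
```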

## Acknowledgments

These CoreML models are based on excellent work from:

- sherpa-onnx - foundational diarization algorithms
- pyannote-audio - state-of-the-art diarization research
- wespeaker - speaker embedding techniques

## Key Features

- **Apple Neural Engine Optimized**: Runs in FP16 on the ANE for maximum efficiency with no accuracy trade-off
- **Real-time Processing**: RTF of 0.02x (50x faster than real-time); see the measurement sketch after this list
- **Research-Competitive**: DER of 17.7% on the AMI benchmark
- **Power Efficient**: Designed for maximum performance per watt
- **Privacy-First**: All processing happens on-device
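
As a sanity check on your own hardware, the real-time factor is simply processing time divided by audio duration. A minimal sketch, reusing the `diarizer` and `audioSamples` from the usage example above:

```swift
let start = Date()
_ = try await diarizer.performCompleteDiarization(audioSamples, sampleRate: 16000)
let processingTime = Date().timeIntervalSince(start)

// RTF = processing time / audio duration; 0.02x means 50x faster than real time
let audioDuration = Double(audioSamples.count) / 16_000.0
print("RTF: \(processingTime / audioDuration)x")
```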

## Intended Uses & Limitations

### Intended Uses

- **Meeting Transcription**: Real-time speaker identification in meetings
- **Voice Assistants**: Multi-speaker conversation understanding
- **Media Production**: Automated speaker labeling for podcasts and interviews
- **Research**: Academic research in speaker diarization
- **Privacy-Focused Applications**: On-device processing without cloud dependencies

### Limitations

- Optimized for 16 kHz audio input
- Best performance with clear audio (no heavy background noise)
- May struggle with heavily overlapping speech
- Requires Apple devices with CoreML support

## Technical Specifications

- **Input**: 16 kHz mono audio (see the resampling sketch after this list)
- **Output**: Speaker segments with timestamps and speaker IDs
- **Framework**: CoreML (converted from PyTorch)
- **Optimization**: Apple Neural Engine (ANE) optimized operations
- **Precision**: FP32 on CPU/GPU, FP16 on ANE
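
Source audio often arrives at 44.1 kHz or 48 kHz, so it must be downmixed and resampled before inference. A minimal sketch using AVFoundation's `AVAudioConverter`; the function name and error handling are illustrative, not part of the SDK:

```swift
import AVFoundation

// Hypothetical helper, not part of the SDK: loads an audio file and converts
// it to the 16 kHz mono Float32 samples the models expect.
func loadSamples16kMono(from url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let targetFormat = AVAudioFormat(
        commonFormat: .pcmFormatFloat32,
        sampleRate: 16_000,
        channels: 1,
        interleaved: false
    )!
    let converter = AVAudioConverter(from: file.processingFormat, to: targetFormat)!

    // Read the whole file into a source buffer.
    let source = AVAudioPCMBuffer(
        pcmFormat: file.processingFormat,
        frameCapacity: AVAudioFrameCount(file.length)
    )!
    try file.read(into: source)

    // Size the output for the sample-rate ratio, with a little slack.
    let ratio = targetFormat.sampleRate / file.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(source.frameLength) * ratio) + 1_024
    let output = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity)!

    // Feed the source buffer once, then signal end of stream.
    var fed = false
    var convertError: NSError?
    converter.convert(to: output, error: &convertError) { _, status in
        if fed {
            status.pointee = .endOfStream
            return nil
        }
        fed = true
        status.pointee = .haveData
        return source
    }
    if let convertError { throw convertError }

    // Copy channel 0 out of the converted buffer.
    let channel = output.floatChannelData![0]
    return Array(UnsafeBufferPointer(start: channel, count: Int(output.frameLength)))
}
```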

## Training Data

These models are converted from open-source variants trained on diverse speaker diarization datasets. The original models were trained on:

- Multi-speaker conversation datasets
- Various acoustic conditions
- Multiple languages and accents

Note: Specific training data details depend on the original open-source model variant.
