# Speaker Diarization CoreML Models
State-of-the-art speaker diarization models optimized for Apple Neural Engine, powering real-time on-device speaker separation with research-competitive performance.
## Model Description
This repository contains CoreML-optimized speaker diarization models specifically converted and optimized for Apple devices (macOS 13.0+, iOS 16.0+). These models enable efficient on-device speaker diarization with minimal power consumption while maintaining state-of-the-art accuracy.
## Usage
See the [FluidAudio SDK](https://github.com/FluidInference/FluidAudio) for more details.
### With FluidAudio SDK (Recommended)
**Installation**

Add FluidAudio to your project using Swift Package Manager:
```swift
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.2"),
],
```
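If your package defines explicit targets, the library also needs to appear in the consuming target's dependencies. A minimal sketch; the target name is hypothetical, and the product name is assumed to match the package name:

```swift
targets: [
    .executableTarget(
        name: "MyApp",  // hypothetical target name
        dependencies: [
            // product name assumed to match the package name
            .product(name: "FluidAudio", package: "FluidAudio")
        ]
    ),
]
```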
```swift
import FluidAudio

Task {
    let diarizer = DiarizerManager()
    try await diarizer.initialize()

    let audioSamples: [Float] = []  // your 16kHz mono samples (see the loading sketch below)

    let result = try await diarizer.performCompleteDiarization(
        audioSamples,
        sampleRate: 16000
    )

    for segment in result.segments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
```
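The SDK consumes raw 16kHz mono samples. Below is a minimal sketch of loading and resampling an audio file with AVFoundation; `loadAudioSamples` is a hypothetical helper for illustration, not an SDK API:

```swift
import AVFoundation

// Hypothetical helper (not part of the SDK): reads an audio file and
// returns 16 kHz mono Float32 samples suitable for the diarizer.
func loadAudioSamples(from url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    guard
        let inputBuffer = AVAudioPCMBuffer(
            pcmFormat: file.processingFormat,
            frameCapacity: AVAudioFrameCount(file.length)
        )
    else { return [] }
    try file.read(into: inputBuffer)

    // Target format: 16 kHz, mono, non-interleaved Float32.
    guard
        let targetFormat = AVAudioFormat(
            commonFormat: .pcmFormatFloat32,
            sampleRate: 16_000,
            channels: 1,
            interleaved: false
        ),
        let converter = AVAudioConverter(from: file.processingFormat, to: targetFormat)
    else { return [] }

    let ratio = 16_000 / file.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(inputBuffer.frameLength) * ratio) + 1
    guard let outputBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity)
    else { return [] }

    // Feed the whole input buffer once, then signal end of stream.
    var consumed = false
    var conversionError: NSError?
    let status = converter.convert(to: outputBuffer, error: &conversionError) { _, outStatus in
        if consumed {
            outStatus.pointee = .endOfStream
            return nil
        }
        consumed = true
        outStatus.pointee = .haveData
        return inputBuffer
    }
    if status == .error, let conversionError { throw conversionError }

    guard let channel = outputBuffer.floatChannelData?[0] else { return [] }
    return Array(UnsafeBufferPointer(start: channel, count: Int(outputBuffer.frameLength)))
}
```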
### Direct CoreML Usage
```swift
import CoreML

// Load the model
let model = try! SpeakerDiarizationModel(configuration: MLModelConfiguration())

// Prepare input (16kHz audio)
let input = SpeakerDiarizationModelInput(audioSamples: audioArray)

// Run inference
let output = try! model.prediction(input: input)
```
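Core ML decides at load time which compute units to use. To bias execution toward the Apple Neural Engine (where these models run in FP16, per the specs below), you can restrict the allowed compute units via `MLModelConfiguration`; a sketch reusing the generated `SpeakerDiarizationModel` class from the snippet above:

```swift
import CoreML

let config = MLModelConfiguration()
// Restrict execution to CPU + Neural Engine (available on macOS 13 / iOS 16);
// Core ML falls back to the CPU on machines without an ANE.
config.computeUnits = .cpuAndNeuralEngine
// Alternatively, let Core ML choose freely among CPU, GPU, and ANE:
// config.computeUnits = .all

let model = try! SpeakerDiarizationModel(configuration: config)
```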
## Acknowledgments
These CoreML models are based on excellent work from:
- sherpa-onnx - Foundational diarization algorithms
- pyannote-audio - State-of-the-art diarization research
- wespeaker - Speaker embedding techniques
## Key Features
- Apple Neural Engine Optimized: Operations tuned for the ANE, so the efficiency gains come without sacrificing accuracy
- Real-time Processing: RTF of 0.02x, i.e. roughly 50x faster than real time (an hour of audio processes in about 72 seconds)
- Research-Competitive: DER of 17.7% on the AMI benchmark
- Power Efficient: Designed for maximum performance per watt
- Privacy-First: All processing happens on-device
## Intended Uses & Limitations
### Intended Uses
- Meeting Transcription: Real-time speaker identification in meetings
- Voice Assistants: Multi-speaker conversation understanding
- Media Production: Automated speaker labeling for podcasts/interviews
- Research: Academic research in speaker diarization
- Privacy-Focused Applications: On-device processing without cloud dependencies
### Limitations
- Optimized for 16kHz audio input
- Best performance with clear audio (no heavy background noise)
- May struggle with heavily overlapping speech
- Requires Apple devices with CoreML support
## Technical Specifications
- Input: 16kHz mono audio
- Output: Speaker segments with timestamps and IDs
- Framework: CoreML (converted from PyTorch)
- Optimization: Apple Neural Engine (ANE) optimized operations
- Precision: FP32 on CPU/GPU, FP16 on ANE
## Training Data
These models are converted from open-source variants trained on diverse speaker diarization datasets. The original models were trained on:
- Multi-speaker conversation datasets
- Various acoustic conditions
- Multiple languages and accents
Note: Specific training data details depend on the original open-source model variant.
## Base Model

- pyannote/speaker-diarization-3.1