---
language:
- en
tags:
- audio
- music
- codec
- neural-audio
- audio-compression
- transformers
pipeline_tag: audio-to-audio
library_name: transformers
inference: true
---

# XCodec Mini - Neural Audio Codec

## Model Description

XCodec Mini is a state-of-the-art neural audio codec designed for high-quality music compression and reconstruction. It combines semantic and acoustic encoding approaches to achieve efficient compression while maintaining audio quality.

### Key Features

- **Dual Encoding Architecture**
  - Semantic encoder for high-level musical features
  - Acoustic encoder for detailed sound information
  - Multi-scale processing for efficient compression
- **Advanced Compression**
  - Multiple codebooks for flexible quality/size tradeoff
  - Support for 44.1 kHz high-fidelity audio
  - Separate processing paths for vocals and instrumentals
- **Technical Specifications**
  - Input: Raw audio at 44.1 kHz
  - Output: Compressed representations and reconstructed audio
  - Model Size: [Add total size]
  - Compression Ratio: [Add typical ratio]

## Intended Uses

- High-quality music compression
- Audio archival and storage
- Music streaming applications
- Audio processing pipelines

## Training Data

The model was trained on a diverse dataset of music, including:

- Various genres and styles
- Vocal and instrumental tracks
- High-quality studio recordings

## Performance and Limitations

### Strengths

- High-quality audio reconstruction
- Efficient compression ratios
- Separate handling of vocals and instrumentals
- Support for high sample rates

### Limitations

- Computationally intensive for real-time applications
- Requires significant GPU memory
- Best suited for offline processing
- May introduce artifacts in extreme compression settings

## Technical Specifications

### Model Architecture

1. **Semantic Encoder**
   - Based on HuBERT architecture
   - Captures high-level musical features
   - Outputs semantic tokens
2. **Acoustic Encoder**
   - Multi-scale convolutional architecture
   - Processes detailed sound information
   - Generates acoustic tokens
3. **Dual Decoders**
   - Separate decoders for vocals and instrumentals
   - Multi-stage reconstruction process
   - Quality-focused design

### Input Requirements

- Audio Format: WAV/MP3
- Sample Rate: 44.1 kHz
- Channels: Mono/Stereo
- Bit Depth: 16-bit

### Output Format

- Reconstructed Audio: 44.1 kHz WAV
- Intermediate Representations: Compressed tokens

## Usage Guidelines

### Hardware Requirements

- GPU: NVIDIA GPU with 8 GB+ VRAM
- RAM: 16 GB+ recommended
- Storage: SSD recommended for faster processing

### Software Requirements

- Python 3.8+
- PyTorch 2.0+
- CUDA 11.0+
- Additional dependencies listed in installation guide

## Ethical Considerations

- **Copyright**: Users should ensure they have proper rights to process copyrighted material
- **Attribution**: Proper attribution should be given when using this model
- **Data Privacy**: Consider data privacy implications when processing sensitive audio

## Additional Information

### Model Weights

The model requires several checkpoint files:

- Semantic Encoder
- Vocal Decoder
- Instrumental Decoder
- Final Checkpoint

### Contact

For issues and questions, please use the GitHub repository's issue tracker.
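
### Example: Validating Input Audio

The input requirements listed in this card (44.1 kHz sample rate, mono or stereo, 16-bit) can be checked before a file is passed to the codec. The snippet below is a minimal sketch, not part of XCodec Mini: the helper name `check_wav` is hypothetical, and it assumes the input is an uncompressed PCM WAV file. It relies only on Python's standard `wave` module, which cannot read MP3, so MP3 inputs would need to be decoded by another tool first.

```python
import wave

# Values taken from the "Input Requirements" list in this card.
EXPECTED_SAMPLE_RATE = 44_100   # Hz
EXPECTED_SAMPLE_WIDTH = 2       # bytes per sample, i.e. 16-bit
ALLOWED_CHANNELS = (1, 2)       # mono or stereo


def check_wav(path):
    """Hypothetical helper: list the ways a PCM WAV file deviates
    from the card's input requirements. An empty list means it matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()

    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(
            f"bit depth is {width * 8}-bit, expected {EXPECTED_SAMPLE_WIDTH * 8}-bit"
        )
    if channels not in ALLOWED_CHANNELS:
        problems.append(
            f"file has {channels} channels, expected mono or stereo"
        )
    return problems
```

Calling `check_wav` on a file path returns an empty list when the file already meets the requirements. Otherwise each entry describes one mismatch, which makes it clear whether the file needs resampling or a bit-depth conversion before processing.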