---
library_name: transformers
tags:
- EchoLLaMA
license: apache-2.0
datasets:
- AquaLabs/Spatial-DPO-Dataset
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---

# EchoLLaMA: 3D-to-Speech with Multimodal AI

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-EchoLLaMA--1B-yellow)](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Orpheus--3B--0.1--ft--Elise-blue)](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)
[![Hugging Face](https://img.shields.io/badge/Dataset-Spatial--DPO--Dataset-green)](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/)

## Overview

EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) to generate rich textual descriptions of 3D scenes.

## Model Architecture

The EchoLLaMA pipeline integrates four specialized models:

1. **Image Analysis**:
   - DETR (DEtection TRansformer) for object detection
   - MiDaS for monocular depth estimation
   - Moondream for holistic image captioning
2. **Text Generation**:
   - LLaMA-3.2-1B-Instruct fine-tuned with DPO
3. **Speech Synthesis**:
   - Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset
4.
   **Speech Recognition**:
   - SpeechRecognition package for transcribing user speech input

## Key Features

- **3D Object Detection Matrix**: Constructs a grid-based representation of detected objects with spatial coordinates
- **Depth-Aware Scene Understanding**: Incorporates relative depth values to capture 3D relationships
- **Natural Language Generation**: Produces coherent and contextually rich descriptions
- **High-Quality Speech Synthesis**: Converts textual descriptions into natural-sounding speech

## Training Details

### LLaMA Model

The LLaMA-3.2-1B-Instruct model was fine-tuned using:

- **Technique**: Direct Preference Optimization (DPO) with LoRA
- **Dataset**: 2000 samples from COCO 2017 processed with DETR and Moondream
- **Chosen Responses**: Generated by DeepSeek-V3-0324
- **Rejected Responses**: Generated by the base LLaMA-3.2-1B-Instruct (before fine-tuning)
- **Training Parameters**:
  - LoRA Rank: 8
  - β (DPO): 0.1
  - Learning Rate: 2×10⁻⁵ with cosine decay
  - Batch Size: 16 (with 2×8 gradient accumulation)
  - Sequence Length: 8192
- **Hardware**: 2×T4 GPU
- **Training Time**: 1 hour 40 minutes

### Orpheus Model

The Orpheus-3B-0.1-ft TTS model was fine-tuned using:

- **Technique**: Low-Rank Adaptation (LoRA)
- **Dataset**: Elise English speech dataset
- **Training Parameters**:
  - LoRA Rank (r): 64
  - LoRA Alpha (α): 64
  - LoRA Dropout: 0
  - Learning Rate: 2×10⁻⁴
- **Hardware**: 2×T4 GPU
- **Training Time**: 47 minutes

## Usage

### Installation

```bash
# Clone the repository
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git
cd EchoLLaMA-Pipeline
```

Then run the Jupyter notebook in the repository.

## Pipeline Flow

1. The image is processed with DETR for object detection and MiDaS for depth estimation
2. Moondream generates a caption describing the image content
3. The object detection matrix and caption are combined into a prompt
4. LLaMA-3.2-1B-Instruct generates a detailed textual description
5.
   Orpheus-3B-0.1-ft converts the text into speech

## Dataset

The training dataset contains 1999 samples, each consisting of:

- An image-derived prompt with the object detection matrix and caption
- A chosen response from DeepSeek-V3-0324
- A rejected response from LLaMA-3.2-1B-Instruct

You can access the dataset at [AquaLabs/Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/).

## Model Weights

- LLaMA-3.2-1B-Instruct (fine-tuned): [AquaLabs/EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
- Orpheus-3B-0.1-ft (fine-tuned): [AquaLabs/Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)

## Contributors

- Ahmet Erdem Pamuk - [GitHub](https://github.com/ahmeterdempmk) | [Hugging Face](https://huggingface.co/ahmeterdempmk)
- Emir Kaan Özdemir - [GitHub](https://github.com/emirkaanozdemr) | [Hugging Face](https://huggingface.co/emirkaanozdemr)
- Şuayp Talha Kocabay - [GitHub](https://github.com/suayptalha) | [Hugging Face](https://huggingface.co/suayptalha)

## License

This project is licensed under the Apache-2.0 License. Details are provided in the [paper]().
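## Quick Start

As a minimal sketch of the pipeline flow described above, the snippet below assembles an object-matrix-plus-caption prompt for the fine-tuned description model. The `build_prompt` helper and its exact field layout are illustrative assumptions, not the exact template used in training (see the Spatial-DPO-Dataset for the real format); the commented `transformers` call shows how the hosted model itself would be invoked.

```python
from typing import Iterable, Tuple


def build_prompt(objects: Iterable[Tuple[str, float, float, float]], caption: str) -> str:
    """Assemble a scene-description prompt from (label, x, y, depth) tuples
    and an image caption.

    Hypothetical prompt layout for illustration only; the actual template
    is defined by the Spatial-DPO-Dataset used for DPO training.
    """
    rows = "; ".join(
        f"{label} ({x:.2f}, {y:.2f}, depth {d:.2f})" for label, x, y, d in objects
    )
    return (
        f"Objects detected (x, y, relative depth): {rows}\n"
        f"Caption: {caption}\n"
        "Describe the 3D scene in detail:"
    )


# To generate a description with the fine-tuned model (requires
# `pip install transformers torch` and access to the Hugging Face Hub):
#
#   from transformers import pipeline
#   generator = pipeline("text-generation", model="AquaLabs/EchoLLaMA-1B")
#   out = generator(prompt, max_new_tokens=200)
#   print(out[0]["generated_text"])

prompt = build_prompt(
    [("person", 0.42, 0.55, 0.31), ("dog", 0.61, 0.70, 0.28)],
    "A person walking a dog in a park.",
)
print(prompt)
```

The resulting text can then be passed to the Orpheus TTS model for speech synthesis, completing the 3D-to-speech loop.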