---
library_name: transformers
tags:
- EchoLLaMA
license: apache-2.0
datasets:
- AquaLabs/Spatial-DPO-Dataset
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---

# EchoLLaMA: 3D-to-Speech with Multimodal AI

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-EchoLLaMA--1B-yellow)](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Orpheus--3B--0.1--ft--Elise-blue)](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)
[![Hugging Face](https://img.shields.io/badge/Dataset-Spatial--DPO--Dataset-green)](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/)

## Overview

EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) to generate rich textual descriptions of 3D scenes.

## Model Architecture

The EchoLLaMA pipeline integrates four specialized models:

1. **Image Analysis**:
   - DETR (DEtection TRansformer) for object detection
   - MiDaS for monocular depth estimation
   - Moondream for holistic image captioning
2. **Text Generation**:
   - LLaMA-3.2-1B-Instruct fine-tuned with DPO
3. **Speech Synthesis**:
   - Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset
4.
   **Speech Recognition**:
   - SpeechRecognition package for transcribing user speech input

## Key Features

- **3D Object Detection Matrix**: Constructs a grid-based representation of detected objects with spatial coordinates
- **Depth-Aware Scene Understanding**: Incorporates relative depth values to capture 3D relationships
- **Natural Language Generation**: Produces coherent and contextually rich descriptions
- **High-Quality Speech Synthesis**: Converts textual descriptions into natural-sounding speech

## Training Details

### LLaMA Model

The LLaMA-3.2-1B-Instruct model was fine-tuned using:

- **Technique**: Direct Preference Optimization (DPO) with LoRA
- **Dataset**: 2000 samples from COCO 2017 processed with DETR and Moondream
- **Chosen Responses**: Generated by DeepSeek-V3-0324
- **Rejected Responses**: Generated by the base LLaMA-3.2-1B-Instruct (before fine-tuning)
- **Training Parameters**:
  - LoRA Rank: 8
  - β (DPO): 0.1
  - Learning Rate: 2×10⁻⁵ with cosine decay
  - Batch Size: 16 (with 2×8 gradient accumulation)
  - Sequence Length: 8192
- **Hardware**: 2×T4 GPU
- **Training Time**: 1 hour 40 minutes

### Orpheus Model

The Orpheus-3B-0.1-ft TTS model was fine-tuned using:

- **Technique**: Low-Rank Adaptation (LoRA)
- **Dataset**: Elise English speech dataset
- **Training Parameters**:
  - LoRA Rank (r): 64
  - LoRA Alpha (α): 64
  - LoRA Dropout: 0
  - Learning Rate: 2×10⁻⁴
- **Hardware**: 2×T4 GPU
- **Training Time**: 47 minutes

## Usage

### Installation

```bash
# Clone the repository
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git
cd EchoLLaMA-Pipeline
```

Then run the Jupyter notebook in the repository.

## Pipeline Flow

1. The image is processed with DETR for object detection and MiDaS for depth estimation
2. Moondream generates a caption describing the image content
3. The object detection matrix and caption are combined into a prompt
4. LLaMA-3.2-1B-Instruct generates a detailed textual description
5.
   Orpheus-3B-0.1-ft converts the text into speech

## Dataset

The training dataset contains 1999 samples, each consisting of:

- An image-derived prompt with the object detection matrix and caption
- A chosen response from DeepSeek-V3-0324
- A rejected response from LLaMA-3.2-1B-Instruct

You can access the dataset at [AquaLabs/Spatial-DPO-Dataset](https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset/).

## Model Weights

- LLaMA-3.2-1B-Instruct (fine-tuned): [AquaLabs/EchoLLaMA-1B](https://huggingface.co/AquaLabs/EchoLLaMA-1B)
- Orpheus-3B-0.1-ft (fine-tuned): [AquaLabs/Orpheus-3B-0.1-ft-Elise](https://huggingface.co/AquaLabs/Orpheus-3B-0.1-ft-Elise)

## Contributors

- Ahmet Erdem Pamuk - [GitHub](https://github.com/ahmeterdempmk) | [Hugging Face](https://huggingface.co/ahmeterdempmk)
- Emir Kaan Özdemir - [GitHub](https://github.com/emirkaanozdemr) | [Hugging Face](https://huggingface.co/emirkaanozdemr)
- Şuayp Talha Kocabay - [GitHub](https://github.com/suayptalha) | [Hugging Face](https://huggingface.co/suayptalha)

## License

This project is licensed under the Apache-2.0 License. Details are provided in the [paper]().
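## Quick Start

As a minimal sketch of the pipeline flow described above, the snippet below assembles an object-matrix-plus-caption prompt for the fine-tuned description model. The `build_prompt` helper and its exact field layout are illustrative assumptions, not the exact template used in training (see the Spatial-DPO-Dataset for the real format); the commented `transformers` call shows how the hosted model itself would be invoked.

```python
from typing import Iterable, Tuple


def build_prompt(objects: Iterable[Tuple[str, float, float, float]], caption: str) -> str:
    """Assemble a scene-description prompt from (label, x, y, depth) tuples
    and an image caption.

    Hypothetical prompt layout for illustration only; the actual template
    is defined by the Spatial-DPO-Dataset used for DPO training.
    """
    rows = "; ".join(
        f"{label} ({x:.2f}, {y:.2f}, depth {d:.2f})" for label, x, y, d in objects
    )
    return (
        f"Objects detected (x, y, relative depth): {rows}\n"
        f"Caption: {caption}\n"
        "Describe the 3D scene in detail:"
    )


# To generate a description with the fine-tuned model (requires
# `pip install transformers torch` and access to the Hugging Face Hub):
#
#   from transformers import pipeline
#   generator = pipeline("text-generation", model="AquaLabs/EchoLLaMA-1B")
#   out = generator(prompt, max_new_tokens=200)
#   print(out[0]["generated_text"])

prompt = build_prompt(
    [("person", 0.42, 0.55, 0.31), ("dog", 0.61, 0.70, 0.28)],
    "A person walking a dog in a park.",
)
print(prompt)
```

The resulting text can then be passed to the Orpheus TTS model for speech synthesis, completing the 3D-to-speech loop.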