	---
	library_name: transformers
	license: mit
	datasets:
	- slprl/sTinyStories
	language:
	- en
	base_model:
	- Qwen/Qwen2.5-7B
	pipeline_tag: audio-to-audio
	---

	# Scaling Analysis of Interleaved Speech-Text Language Models

	The model was presented in the paper [Scaling Analysis of Interleaved Speech-Text Language Models](https://arxiv.org/abs/2504.02398).

	# Paper abstract
	Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. They predict that SLMs require much more compute and data
	compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from
	pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - _Do interleaved SLMs scale more efficiently than textless-SLMs?_
	In this paper we answer a resounding _yes!_ We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the
	scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the
	scaling-dynamics are significantly different than textless-SLMs, suggesting one should allocate notably more of the compute budget for
	increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential.
	Results suggest that our scaled-up model achieves comparable performance with leading models on speech semantic metrics while using less
	compute and data than other approaches.

	# Model Card for SIMS-7B
	This is a Speech Language Model (SLM) trained for generating speech or text continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz) given speech-text prompts.


	## Model Details

	### Model Description
	This Speech Language Model, introduced in ["Scaling Analysis of Interleaved Speech-Text Language Models"](https://arxiv.org/abs/2504.02398), focuses on scaling analysis of interleaved speech-text SLMs.
	It was fine-tuned from [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) by extending its vocabulary with 500 speech tokens extracted from
	the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

	- Developed by: [SLP-RL](https://huggingface.co/slprl)
	- Model type: SpeechLM
	- License: MIT
	- Finetuned from model: [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
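
The vocabulary extension described above can be pictured as appending 500 new ids after the text vocabulary. The sketch below is illustrative only: the text vocabulary size (`TEXT_VOCAB_SIZE`), the id layout, and all helper names are assumptions made for this example and are not taken from the SlamKit code, so the real tokenizer may name or place the speech tokens differently. The 25 units-per-second rate is inferred from the `25hz` in the tokenizer's name and ignores any unit deduplication.

```python
NUM_SPEECH_UNITS = 500     # from this model card: 500 speech tokens
UNITS_PER_SECOND = 25      # inferred from "mhubert-25hz"; before any deduplication
TEXT_VOCAB_SIZE = 150_000  # ASSUMPTION: placeholder, not the real Qwen2.5 vocabulary size


def unit_to_token_id(unit: int, text_vocab_size: int = TEXT_VOCAB_SIZE) -> int:
    """Map a speech unit index to a hypothetical id placed after the text vocabulary."""
    if not 0 <= unit < NUM_SPEECH_UNITS:
        raise ValueError(f"unit must be in [0, {NUM_SPEECH_UNITS}), got {unit}")
    return text_vocab_size + unit


def is_speech_token(token_id: int, text_vocab_size: int = TEXT_VOCAB_SIZE) -> bool:
    """Return True if the id falls in the (hypothetical) appended speech range."""
    return text_vocab_size <= token_id < text_vocab_size + NUM_SPEECH_UNITS


def token_id_to_unit(token_id: int, text_vocab_size: int = TEXT_VOCAB_SIZE) -> int:
    """Inverse of unit_to_token_id; rejects ids outside the speech range."""
    if not is_speech_token(token_id, text_vocab_size):
        raise ValueError(f"token id {token_id} is not a speech token")
    return token_id - text_vocab_size


def expected_num_units(seconds: float) -> int:
    """Approximate number of speech units for an audio clip of the given length."""
    return round(seconds * UNITS_PER_SECOND)
```

Under these assumptions a two-second prompt corresponds to roughly `expected_num_units(2.0)` speech ids, and any id below `TEXT_VOCAB_SIZE` is an ordinary text token, which is what lets speech and text be interleaved in a single sequence.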

	### Model Sources

	- Repository: [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
	- Paper: [https://arxiv.org/abs/2504.02398](https://arxiv.org/abs/2504.02398)
	- Demo: [https://pages.cs.huji.ac.il/adiyoss-lab/sims/](https://pages.cs.huji.ac.il/adiyoss-lab/sims/)

	## Uses
	This base SpeechLM can be used to generate continuations for speech segments, to generate cross-modal continuations (e.g. a text continuation of a speech prompt), or as a base for further tuning. See the _SlamKit_
	[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/sims/) for some generation examples.

	### Out-of-Scope Use
	This model was trained on diverse speech datasets; as such, its outputs should not be treated as factual in any way.


	## How to Get Started with the Model
	We refer users to the official [SlamKit repository](https://github.com/slp-rl/slamkit) for full usage explanations.
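
For orientation only, here is a minimal sketch of loading the weights with 🤗transformers, which this card lists as its library. It assumes that the checkpoint follows the standard causal-LM layout and that the repository id is `slprl/SIMS-7B` (inferred from the organisation and model names, not confirmed by this card). The helper name `load_sims` is hypothetical. Building speech prompts and decoding generated units back to audio still require the SlamKit tooling.

```python
# ASSUMPTION: repository id inferred from the organisation and model names on this card.
REPO_ID = "slprl/SIMS-7B"


def load_sims(repo_id: str = REPO_ID, device_map: str = "auto"):
    """Load the checkpoint as a causal LM (assumes a standard transformers layout)."""
    # Imported lazily so that defining this helper does not require torch/transformers.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
        device_map=device_map,       # "auto" placement requires the `accelerate` package
    )
    model.eval()  # inference mode: disables dropout
    return model
```

Calling `load_sims()` downloads the full 7B checkpoint, so it needs substantial disk space and GPU memory.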


	## Training Details
	We highly encourage users to read the [paper](https://arxiv.org/abs/2504.02398) for full training details.


	### Compute Infrastructure
	#### Hardware
	This model was trained using 8 Nvidia H100 GPUs.

	#### Software
	The model was trained using the [SlamKit](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
	easy and efficient training of Speech Language Models.

	## Citation

	BibTeX:
	```
	@misc{maimon2025scaling,
	title={Scaling Analysis of Interleaved Speech-Text Language Models},
	author={Gallil Maimon and Michael Hassid and Amit Roth and Yossi Adi},
	year={2025},
	eprint={2504.02398},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2504.02398},
	}
	```