
PyTorch Implementation of Audio Flamingo 2

Zhifeng Kong, Arushi Goel, João Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, Bryan Catanzaro

[paper] [GitHub]

This repo contains the PyTorch implementation of the Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding. Audio Flamingo 2 Sound-CoT (3B) significantly improves chain-of-thought (CoT) reasoning and is comparable to several 7B reasoning baselines on reasoning benchmarks. It is fine-tuned from our previous Audio Flamingo 2.

  • We introduce AF-Reasoning-Eval, a sound reasoning benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices.

  • We introduce AF-CoT-Train with 1.24M CoT reasoning traces to advance the field of audio understanding.

  • Audio Flamingo 2 Sound-CoT shows strong reasoning abilities on several sound reasoning benchmarks, despite being small (3B) and trained exclusively on public datasets.

Usage

The inference script is almost the same as for Audio Flamingo 2. The only difference is that a special prompt (Output the answer with <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION> tags.) is appended after the input question. For instance, in Audio Flamingo 2, the input is

Based on the given audio, identify the source of the church bells. Choose the correct option from the following options:\n(A) Church\n(B) School\n(C) Clock Tower\n(D) Fire Station.

In Audio Flamingo 2 Sound-CoT, the input is

Based on the given audio, identify the source of the church bells. Choose the correct option from the following options:\n(A) Church\n(B) School\n(C) Clock Tower\n(D) Fire Station. Output the answer with <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION> tags.
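The prompt change above, and reading back the tagged response, can be sketched as follows. This is a minimal helper, not part of the released inference script; the closing-tag convention (`</SUMMARY>` etc.) is an assumption and may need adjusting to match the model's actual output format.

```python
import re

# Exact CoT instruction suffix from the model card.
COT_SUFFIX = " Output the answer with <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION> tags."

def make_cot_prompt(question: str) -> str:
    """Turn an Audio Flamingo 2 question into a Sound-CoT question."""
    return question.rstrip() + COT_SUFFIX

def parse_cot_output(text: str) -> dict:
    """Extract each CoT section from the model's response.

    Assumes the model emits paired tags like <REASONING>...</REASONING>;
    missing sections are simply omitted from the result.
    """
    sections = {}
    for tag in ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if m:
            sections[tag] = m.group(1).strip()
    return sections
```

For the church-bells example, `make_cot_prompt("...(D) Fire Station.")` reproduces the Sound-CoT input shown above, and `parse_cot_output` lets downstream code read just the `CONCLUSION` section when only the final answer is needed.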

License

  • The code in this repo is released under the MIT license.
  • The checkpoints are for non-commercial use only (see NVIDIA OneWay Noncommercial License). They are also subject to the Qwen Research license, the Terms of Use of the data generated by OpenAI, and the original licenses accompanying each training dataset.
  • Notice: Audio Flamingo 2 Sound-CoT is built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.

Citation

  • Audio Flamingo 2
@inproceedings{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=xWu5qpDK6U}
}
  • Audio Flamingo
@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}
