(BLIP) Image question answering: given an image, answer a question on this image (VILT) Image segmentation: given an image and a prompt, output the segmentation mask of that prompt (CLIPSeg) Speech to text: given an audio recording of a person talking, transcribe the speech into text (Whisper) Text to speech: convert text to speech (SpeechT5) Zero-shot text classification: given a text and a list of labels, identify to which label the text corresponds the most (BART) Text summarization: summarize a long text in one or a few sentences (BART) Translation: translate the text into a given language (NLLB) These tools have an integration in transformers, and can be used manually as well, for example: from transformers import load_tool tool = load_tool("text-to-speech") audio = tool("This is a text to speech tool") Custom tools While we identify a curated set of tools, we strongly believe that the main value provided by this implementation is the ability to quickly create and share custom tools.