Qwen3-Audio will be the next evolution in the Qwen series of large audio-language models (in addition to Qwen3-VL and Qwen3-Math). Qwen3-Audio will be designed to handle diverse audio signal inputs, offering enhanced audio analysis and generating direct textual responses to spoken instructions. The model will feature two advanced modes of audio interaction:
- Voice chat: Users will be able to engage in seamless voice interactions with Qwen3-Audio, eliminating the need for text input.
- Audio analysis: Users will be able to provide both audio and text instructions for in-depth analysis during interactions.
The Qwen3-Audio lineup is expected to include Qwen3-Audio-7B and Qwen3-Audio-7B-Instruct, encompassing both a pretrained model and a chat-focused version.

Requirements
The code for Qwen3-Audio will be in the latest Hugging Face transformers, and we advise you to build from source with the following command:
pip install git+https://github.com/huggingface/transformers
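As a quick sanity check (a sketch, not part of the official instructions), you can confirm which transformers version is importable before proceeding; a source build typically reports a version string with a dev suffix:

```python
import importlib.metadata
import importlib.util

def transformers_version():
    """Return the installed transformers version string, or None when the
    package is not importable (in which case run the pip command above)."""
    if importlib.util.find_spec("transformers") is None:
        return None
    return importlib.metadata.version("transformers")

print(transformers_version())
```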
Quickstart
Below is a code snippet illustrating how to load both the processor and the model, along with instructions for running the pretrained model for content generation.
Voice Chat Inference
In the voice chat mode, users can freely engage in voice interactions with Qwen3-Audio without text input:
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen3AudioForConditionalGeneration, AutoProcessor

# Load the processor and the instruction-tuned model.
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-Audio-7B-Instruct")
model = Qwen3AudioForConditionalGeneration.from_pretrained("Qwen/Qwen3-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Audio/audio/translate_to_chinese.wav"},
    ]},
]

# Render the conversation into the model's chat template.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load every audio clip referenced in the conversation, resampled to the
# feature extractor's sampling rate.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele["audio_url"]).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

# Generate, then strip the prompt tokens to keep only the new response.
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
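The audio analysis mode mixes audio and text elements within a single user turn. Below is a minimal sketch of such a conversation structure, assuming the same message schema as the voice chat example (the URL and prompt are placeholders, not real assets), with a small helper that collects the audio URLs to load in the order the chat template references them:

```python
# Audio-analysis-style conversation: one user turn combining an audio
# clip with a text instruction. The URL below is a placeholder.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/clip.wav"},
        {"type": "text", "text": "What sound is this, and where might it occur?"},
    ]},
]

def collect_audio_urls(conversation):
    """Gather every audio_url so each clip can be loaded (e.g. with
    librosa, as in the voice chat example) in template order."""
    urls = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    urls.append(ele["audio_url"])
    return urls

print(collect_audio_urls(conversation))
```

The rest of the pipeline (apply_chat_template, processor, generate, batch_decode) is unchanged from the voice chat example; only the conversation content differs.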
Explore Qwen models on Hugging Face.