Qwen2.5-Max

We firmly believe that continuously scaling both data size and model size leads to significant improvements in model intelligence. However, the research and industry communities have limited experience in effectively scaling extremely large models, whether dense or Mixture-of-Experts (MoE). Many critical details of this scaling process were only disclosed with the recent release of DeepSeek V3.

In parallel, we have been developing Qwen2.5-Max, a large-scale MoE model that has been pretrained on over 20 trillion tokens and further refined through curated Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) methodologies. Today, we are excited to share the performance results of Qwen2.5-Max and announce the availability of its API through Alibaba Cloud. We also invite you to explore Qwen2.5-Max on Qwen Chat!

Performance

To evaluate Qwen2.5-Max, we benchmarked it against leading proprietary and open-weight models across a range of industry-relevant benchmarks. These include MMLU-Pro, which tests knowledge through college-level problems; LiveCodeBench, for coding capabilities; LiveBench, for general capabilities; GPQA-Diamond, which assesses graduate-level scientific reasoning; and Arena-Hard, which approximates human preferences. We report performance scores for both base and instruct models.

We began by comparing the performance of instruct models, which are suitable for downstream applications such as chat and coding. The results demonstrate how Qwen2.5-Max performs alongside state-of-the-art models, including DeepSeek V3, GPT-4o, and Claude-3.5-Sonnet.

Qwen2.5-Max outperforms DeepSeek V3 in benchmarks such as Arena-Hard, LiveBench, LiveCodeBench, and GPQA-Diamond while demonstrating competitive results in other assessments, including MMLU-Pro.

For base model comparisons, we were unable to access proprietary models such as GPT-4o and Claude-3.5-Sonnet. Instead, we evaluated Qwen2.5-Max against DeepSeek V3 (a leading open-weight MoE model), Llama-3.1-405B (the largest open-weight dense model), and Qwen2.5-72B (another top-performing open-weight dense model). The results showcase the strengths of our base models across most benchmarks, reinforcing our confidence that advancements in post-training techniques will further enhance the next version of Qwen2.5-Max.

Use Qwen2.5-Max

Qwen2.5-Max is now available on Qwen Chat, allowing users to chat with the model, interact with artifacts, search, and more.

The API for Qwen2.5-Max (model name: qwen-max-2025-01-25) is also available. To use it, simply register an Alibaba Cloud account, activate the Alibaba Cloud Model Studio service, and generate an API key through the console.
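Once you have a key, a typical setup is to install the OpenAI Python client and expose the key as an environment variable. This is a minimal sketch; the variable name `API_KEY` is only a convention that matches the Python example below, and the placeholder key value is hypothetical:

```shell
# Install the OpenAI Python client used by the example below.
pip install -U openai

# Expose the key to the example. API_KEY is an arbitrary name that must
# match what the code reads via os.getenv("API_KEY").
export API_KEY="your-dashscope-api-key"
```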

Since Qwen APIs are OpenAI-API compatible, users can follow standard OpenAI API usage practices. Below is an example of how to use Qwen2.5-Max in Python:

from openai import OpenAI
import os

# Read the API key from the environment and point the client at the
# OpenAI-compatible Alibaba Cloud Model Studio endpoint.
client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Send a chat request to Qwen2.5-Max.
completion = client.chat.completions.create(
    model="qwen-max-2025-01-25",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Which number is larger, 9.11 or 9.8?"},
    ],
)

# Print the assistant's reply text.
print(completion.choices[0].message.content)

Future Work

Our commitment to scaling both data and model size reflects our dedication to pushing the boundaries of model intelligence. We aim to further enhance the thinking and reasoning capabilities of large language models through the innovative application of scaled reinforcement learning. This pursuit holds the potential to enable our models to surpass human intelligence, unlocking new frontiers of knowledge and understanding.

Read other articles in our Blog.