Qwen3 LLM

Qwen3, the next generation of the Qwen LLM series, brings a new level of advancement in natural language understanding and generation. Building on the success of its predecessors, Qwen3 models are trained on larger datasets, built on enhanced architectures, and refined with superior fine-tuning, enabling them to handle even more complex reasoning, language understanding, and generation tasks. They also support longer context lengths, allowing them to generate longer, more coherent responses and manage more intricate conversational flows.

Qwen3 represents the newest generation of large language models in the Qwen series, offering a full range of dense and mixture-of-experts (MoE) models. Built on extensive training, Qwen3 introduces major breakthroughs in reasoning, instruction-following, agent capabilities, and multilingual support, featuring:

  • Support for over 100 languages and dialects, with robust performance in multilingual instruction following and translation.
  • Unique ability to seamlessly switch between thinking mode (for complex logical reasoning, mathematics, and coding) and non-thinking mode (for efficient, general-purpose conversation) within a single model, optimizing performance across a wide range of tasks.
  • Significant improvements in reasoning, outperforming both previous QwQ models (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) across mathematics, code generation, and commonsense logical reasoning.
  • Exceptional human preference alignment, excelling at creative writing, role-playing, multi-turn conversations, and instruction following, delivering a more natural, engaging, and immersive dialogue experience.
  • Advanced agent capabilities, allowing precise interaction with external tools in both thinking and non-thinking modes, achieving state-of-the-art results in complex agent-driven tasks among open-source models.

Qwen3-235B-A22B delivers competitive performance across benchmarks in coding, mathematics, general capabilities, and more, standing alongside other leading models like DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. Meanwhile, the smaller MoE model, Qwen3-30B-A3B, surpasses QwQ-32B, despite having only one-tenth the number of activated parameters. Even the compact Qwen3-4B matches the performance level of the much larger Qwen2.5-72B-Instruct.

Qwen3-235B-A22B, a large model featuring 235 billion total parameters with 22 billion activated parameters, and Qwen3-30B-A3B, a smaller MoE model with 30 billion total parameters and 3 billion activated parameters, are both available. In addition, six dense models — Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B — are also released as open-weight models under the Apache 2.0 license.

Post-trained models like Qwen3-30B-A3B, along with their pre-trained versions (e.g., Qwen3-30B-A3B-Base), are now available on platforms such as Hugging Face, ModelScope, and Kaggle. For deployment, we recommend frameworks like SGLang and vLLM. For local use, tools such as Ollama, LMStudio, MLX, llama.cpp, and KTransformers are highly recommended. These options make it easy for users to integrate Qwen3 into their workflows across research, development, and production environments.
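
As a quick illustration, the sketch below queries a locally served Qwen3 model through the OpenAI-compatible API that both SGLang and vLLM expose. The endpoint URL, API key, and model name are placeholder assumptions; substitute whatever your own deployment uses.

```python
# Minimal sketch: chatting with a locally served Qwen3 model over the
# OpenAI-compatible API exposed by SGLang or vLLM.
# The endpoint, API key, and model name below are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local SGLang/vLLM endpoint
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",           # replace with the model you actually deployed
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```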

We believe that the release and open-sourcing of Qwen3 will drive significant progress in the research and development of large foundation models. Our mission is to empower researchers, developers, and organizations worldwide to create innovative solutions with these state-of-the-art models.

You can also experience Qwen3 firsthand through Qwen Chat Web (chat.qwen.ai) and the Qwen mobile app!

Key Features

Hybrid Thinking Modes

Qwen3 models introduce a hybrid approach to problem-solving by supporting two distinct modes:

  • Thinking Mode: In this mode, the model engages in step-by-step reasoning before providing an answer, making it ideal for complex problems that demand deeper analysis.
  • Non-Thinking Mode: This mode delivers rapid, near-instant responses, suited for simpler tasks where speed is prioritized over detailed reasoning.

This flexibility allows users to adjust how much “thinking” the model applies based on the specific task. Complex challenges can be approached with extended reasoning, while straightforward queries can be addressed immediately.

Importantly, the integration of these two modes significantly improves the model’s ability to manage its thinking budget. Qwen3 delivers scalable and smooth performance gains that grow with the allocated computational reasoning budget, making it easier for users to configure task-specific budgets and strike the right balance between cost efficiency and inference quality.
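
For a concrete sense of how the switch works in practice, here is a minimal sketch using Hugging Face transformers. It assumes the enable_thinking flag supported by the Qwen3 chat template; the model id and generation settings are illustrative choices, not requirements.

```python
# Minimal sketch of toggling thinking mode with Hugging Face transformers.
# Assumes the Qwen3 chat template's enable_thinking flag; the model id is an
# example, not a requirement.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are there below 30?"}]

# Thinking mode: the model emits step-by-step reasoning before its final answer.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set to False for fast, non-thinking responses
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```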


Multilingual Support

Qwen3 models support 119 languages and dialects, greatly expanding their usability for global applications. This extensive multilingual capability allows users around the world to harness the full potential of Qwen3 across diverse linguistic and cultural contexts.

Language Families and Coverage:

  • Indo-European: English, French, Portuguese, German, Romanian, Swedish, Danish, Bulgarian, Russian, Czech, Greek, Ukrainian, Spanish, Dutch, Slovak, Croatian, Polish, Lithuanian, Norwegian Bokmål, Norwegian Nynorsk, Persian, Slovenian, Gujarati, Latvian, Italian, Occitan, Nepali, Marathi, Belarusian, Serbian, Luxembourgish, Venetian, Assamese, Welsh, Silesian, Asturian, Chhattisgarhi, Awadhi, Maithili, Bhojpuri, Sindhi, Irish, Faroese, Hindi, Punjabi, Bengali, Oriya, Tajik, Eastern Yiddish, Lombard, Ligurian, Sicilian, Friulian, Sardinian, Galician, Catalan, Icelandic, Tosk Albanian, Limburgish, Dari, Afrikaans, Macedonian, Sinhala, Urdu, Magahi, Bosnian, Armenian
  • Sino-Tibetan: Chinese (Simplified, Traditional, Cantonese), Burmese
  • Afro-Asiatic: Arabic (Standard, Najdi, Levantine, Egyptian, Moroccan, Mesopotamian, Ta’izzi-Adeni, Tunisian), Hebrew, Maltese
  • Austronesian: Indonesian, Malay, Tagalog, Cebuano, Javanese, Sundanese, Minangkabau, Balinese, Banjar, Pangasinan, Iloko, Waray (Philippines)
  • Dravidian: Tamil, Telugu, Kannada, Malayalam
  • Turkic: Turkish, North Azerbaijani, Northern Uzbek, Kazakh, Bashkir, Tatar
  • Tai-Kadai: Thai, Lao
  • Uralic: Finnish, Estonian, Hungarian
  • Austroasiatic: Vietnamese, Khmer
  • Other: Japanese, Korean, Georgian, Basque, Haitian, Papiamento, Kabuverdianu, Tok Pisin, Swahili

Improved Agentic Capabilities

Qwen3 models have been significantly optimized for coding and agent-driven tasks. In addition, support for MCP (Model Context Protocol) has been further strengthened.

Below, we provide examples demonstrating how Qwen3 models reason, interact with their environment, and perform in complex agentic workflows.
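
As one such example, the hedged sketch below wires a Qwen3 model into the Qwen-Agent library together with an MCP tool configuration. The model name, local endpoint, and the specific MCP servers (time and fetch) are illustrative assumptions; adapt them to your own deployment and tools.

```python
# Sketch of an agentic workflow with the Qwen-Agent library and MCP tools.
# The endpoint, model name, and MCP server entries are illustrative assumptions.
from qwen_agent.agents import Assistant

# LLM configuration: point Qwen-Agent at an OpenAI-compatible endpoint
# (for example, a local SGLang or vLLM server).
llm_cfg = {
    "model": "Qwen3-30B-A3B",
    "model_server": "http://localhost:8000/v1",  # assumed local endpoint
    "api_key": "EMPTY",
}

# Tools: MCP servers are launched as subprocesses; code_interpreter is a
# built-in tool shipped with Qwen-Agent.
tools = [
    {
        "mcpServers": {
            "time": {
                "command": "uvx",
                "args": ["mcp-server-time", "--local-timezone=UTC"],
            },
            "fetch": {"command": "uvx", "args": ["mcp-server-fetch"]},
        }
    },
    "code_interpreter",
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{"role": "user", "content": "What time is it in Tokyo right now?"}]
responses = []
for responses in bot.run(messages=messages):
    pass  # bot.run streams incremental responses; keep only the final one
print(responses)
```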


Pre-training

For Qwen3, the pretraining dataset has been significantly expanded compared to Qwen2.5. While Qwen2.5 was trained on 18 trillion tokens, Qwen3 uses nearly double that amount — approximately 36 trillion tokens — spanning 119 languages and dialects.

To build this large-scale dataset, we sourced data not only from the web but also from PDF-like documents. Text extraction from documents was performed using Qwen2.5-VL, while Qwen2.5 was used to enhance the quality of the extracted content. To enrich the dataset with math and coding examples, synthetic data was generated using Qwen2.5-Math and Qwen2.5-Coder, including textbooks, Q&A pairs, and code snippets.

The pretraining process follows three stages:

  • Stage 1 (S1): The model was pretrained on over 30 trillion tokens with a context length of 4K tokens, establishing strong fundamental language skills and general knowledge.
  • Stage 2 (S2): The dataset was further refined by increasing the proportion of knowledge-intensive content, such as STEM topics, coding challenges, and reasoning tasks. The model then underwent pretraining on an additional 5 trillion tokens.
  • Final Stage: High-quality, long-context data was used to extend the model’s context window to 32K tokens, ensuring it can effectively process much longer inputs.

Thanks to improvements in model architecture, expanded training data, and more efficient training techniques, Qwen3 dense base models now match — and in some cases surpass — the performance of larger Qwen2.5 base models.

For example, Qwen3-1.7B/4B/8B/14B/32B-Base models perform on par with Qwen2.5-3B/7B/14B/32B/72B-Base models, respectively. In particular, Qwen3 dense base models show notable advantages over Qwen2.5 models in STEM, coding, and reasoning tasks.

Meanwhile, Qwen3-MoE base models achieve comparable performance to Qwen2.5 dense models while utilizing only 10% of the active parameters — delivering significant savings in both training and inference costs.

Post-training

Hybrid Model Training Pipeline

To develop a hybrid model capable of both step-by-step reasoning and rapid response generation, the Alibaba development team designed a four-stage training pipeline:

  1. Long Chain-of-Thought (CoT) Cold Start: In this initial stage, the model was fine-tuned on a wide variety of long CoT datasets spanning tasks such as mathematics, coding, logical reasoning, and STEM challenges. This training established the model’s foundational reasoning capabilities.
  2. Reasoning-Based Reinforcement Learning (RL): The second stage focused on enhancing the model’s exploration and exploitation abilities by scaling up computational resources and applying rule-based reward mechanisms during reinforcement learning.
  3. Thinking Mode Fusion: In this stage, non-thinking (rapid response) capabilities were integrated into the reasoning model. This was achieved by fine-tuning on a mixture of long CoT data and standard instruction-tuning datasets, generated by the enhanced model from Stage 2. This fusion enabled seamless switching between deep reasoning and fast response modes.
  4. General Reinforcement Learning (General RL): Finally, reinforcement learning was applied across more than 20 general-domain tasks, further improving the model’s overall capabilities and mitigating undesired behaviors. These tasks included instruction following, format adherence, agentic behaviors, and more.