Qwen
Integrations
- DashScope API
- vLLM / SGLang
- Ollama / llama.cpp
- Hugging Face
- ModelScope
- Qwen-Agent (MCP)
Pricing Details
- Open-source models under Apache 2.0.
- DashScope API: Qwen3-Max starts at $1.20/M input tokens.
- Context Caching (Cache Read) offers ~80% discount ($0.24/M).
- Batch API provides a 50% discount (see the cost sketch below).
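A minimal sketch of how these list prices combine for a given workload; whether the Batch discount stacks with Cache Read pricing is an assumption here, so confirm totals against the official DashScope price page:

```python
# Rough input-cost estimator for DashScope Qwen3-Max, using the rates listed above.
# Output-token pricing and volume tiers are not covered and should be taken from
# the official price page.

BASE_INPUT_PER_M = 1.20   # USD per 1M fresh input tokens (Qwen3-Max, listed above)
CACHE_READ_PER_M = 0.24   # ~80% discount on cached (repeated) input tokens
BATCH_DISCOUNT = 0.50     # Batch API halves the applicable rate (stacking assumed)

def estimate_input_cost(total_tokens: int, cached_tokens: int = 0, batch: bool = False) -> float:
    """Estimate input-token cost in USD for one workload."""
    fresh = total_tokens - cached_tokens
    cost = fresh / 1e6 * BASE_INPUT_PER_M + cached_tokens / 1e6 * CACHE_READ_PER_M
    return cost * BATCH_DISCOUNT if batch else cost

# Example: 10M input tokens, 7M served from cache, submitted via the Batch API.
print(f"${estimate_input_cost(10_000_000, cached_tokens=7_000_000, batch=True):.2f}")
```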
Features
- Dense Transformer Family (0.6B to 32B) under Apache 2.0
- Sparse MoE: Qwen3-Max (1T+), 235B-A22B, 30B-A3B
- Unified Thinking Mode (In-context CoT)
- 128K - 1M Context Window via YaRN
- 36 Trillion Token Multilingual Corpus (119 languages)
- OpenAI-Compatible API with Context Caching
- Native MCP Support & Qwen-Agent Framework
- Qwen3-Omni & VL Multimodal Capabilities
Description
Qwen: Dual-Architecture & Unified Reasoning Audit
As of January 2026, Qwen3 has matured into a multimodal powerhouse. The architecture spans from mobile-ready 0.6B dense models to trillion-parameter MoE clusters (Qwen3-Max). The ecosystem is defined by its Unified Thinking Mode, which uses special tokens (<think>, token ID 151667) to perform internal reasoning before generating the final response 📑.
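As a concrete illustration, a minimal sketch of post-processing that separates the <think> block from the user-facing answer; it assumes the serving stack returns the markers verbatim rather than stripping them or exposing a separate reasoning field:

```python
# Minimal sketch: separating the internal reasoning trace from the final answer
# in raw Qwen3 output. Assumes the literal <think>...</think> markers are present.
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a raw completion string."""
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()              # model answered without a thinking block
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()      # everything after </think> is the user-facing reply
    return reasoning, answer

trace, answer = split_thinking("<think>Factor 91 = 7 x 13.</think>91 is not prime.")
print(trace)   # -> Factor 91 = 7 x 13.
print(answer)  # -> 91 is not prime.
```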
Model Orchestration & Hybrid Thinking
The 2026 architecture eliminates the need for separate reasoning-only model variants. A single model manages both 'fast' and 'slow' thinking via runtime parameters, optimizing compute allocation based on task complexity 📑.
- Expert Specialization: Qwen3-235B-A22B utilizes 128 experts with zero shared-expert overhead, resulting in superior STEM performance (92.3% on AIME'25) while maintaining the inference speed of a 22B model 📑.
- Operational Scenario: Multi-Step Reasoning & Tool Use:
Input: High-complexity mathematical proof or codebase bug report 📑.
Process: The model triggers 'Thinking Mode' via /think, performs long-form CoT, and uses the Qwen-Agent framework with MCP integration to execute code or search documentation 🧠 (see the sketch after this list).
Output: Verified reasoning trace followed by a production-ready solution or patch 📑.
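A hedged sketch of toggling this behaviour at request time with Hugging Face transformers; the enable_thinking flag and the /think soft switch follow Qwen3's published usage examples, and the checkpoint name is purely illustrative:

```python
# Sketch of switching between 'slow' (thinking) and 'fast' answers at request time.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # illustrative checkpoint; any Qwen3 model should behave the same
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Prove that the sum of two even numbers is even. /think"}]

# enable_thinking=True lets the chat template emit the <think> scaffold;
# set it to False (or append /no_think to the prompt) for low-latency fast answers.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```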
Infrastructure & API Management
DashScope API provides regionalized, OpenAI-compatible endpoints with native Context Caching support, reducing costs for repeated tokens by up to 80% 📑.
- Omni-Modal Ingestion: Qwen3-Omni (released Sept 2025) processes text, image, audio, and video inputs with native audio/text output, operating through a unified cross-modal attention architecture 📑.
- Edge Deployment: Optimized for local execution via SGLang (≥0.4.6) and vLLM (≥0.9.0), supporting the specialized --reasoning-parser qwen3 flag for clean response streaming 📑 (see the client sketch below).
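A sketch of the OpenAI-compatible access pattern described above; the DashScope base_url, the model name, and the reasoning_content field are assumptions drawn from compatible-mode and reasoning-parser conventions, so verify them against your endpoint:

```python
# Sketch: calling a Qwen3 endpoint through the OpenAI-compatible interface.
# Works the same against DashScope compatible-mode or a local vLLM/SGLang server.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # or your local server URL
)

resp = client.chat.completions.create(
    model="qwen3-max",  # illustrative model name; check the model list for your region
    messages=[{"role": "user", "content": "Summarise the YaRN context-extension method."}],
)

msg = resp.choices[0].message
# Servers started with --reasoning-parser qwen3 typically return the trace separately.
print(getattr(msg, "reasoning_content", None))
print(msg.content)
```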
Evaluation Guidance
Technical evaluators should verify the following architectural characteristics:
- Thinking Budget Tuning: Adjust temperature=0.6 and min_p=0 when using Thinking Mode to maximize reasoning quality, as per the official generation_config.json specs 📑 (see the sampling sketch after this list).
- Quantization Impact on MoE: Audit the performance of KTransformers or llama.cpp quantizations for the 235B model, as expert routing logic is sensitive to bit-depth precision 🧠.
- Cache Retention Logic: Request details on geographic cache persistence policies (Global vs US endpoints) for sensitive enterprise data 🌑.
- YaRN 1M Context Fidelity: Test 'needle-in-a-haystack' retrieval for models 8B and above when using the 1M token extension before production deployment 🧠.
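A minimal sketch of applying the Thinking Mode sampling settings above with vLLM's offline API; the checkpoint name is illustrative, and the exact recommended values should be cross-checked against the model's generation_config.json:

```python
# Sketch: evaluation run with the Thinking Mode sampling settings from the list above.
from vllm import LLM, SamplingParams

sampling = SamplingParams(
    temperature=0.6,   # recommended for Thinking Mode (greedy decoding is discouraged)
    top_p=0.95,
    min_p=0.0,
    max_tokens=4096,
)

llm = LLM(model="Qwen/Qwen3-30B-A3B")  # illustrative checkpoint
outputs = llm.chat(
    [[{"role": "user", "content": "How many primes are there below 100? /think"}]],
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```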
Release History
General release of the Qwen3 model series (dense models from 0.6B to 32B plus MoE variants up to 235B-A22B). Introduction of Qwen3.5, a further refined version with improved reasoning and safety alignment.
Early access release of Qwen3, featuring a new architecture and a significantly increased parameter count (up to 235B total, 22B active via MoE). Demonstrates state-of-the-art performance across multiple tasks.
Qwen2.5-VL released, building on Qwen2.5 with enhanced visual understanding and multimodal interaction. Improved detail recognition in images.
Qwen2.5 released, featuring improved instruction following and conversational abilities. Expanded multilingual support, including better performance in European languages.
Qwen2-VL released, combining the Qwen2 language model with visual capabilities. Improved multimodal reasoning and generation.
Qwen2 released with 7B and 72B parameter models. Enhanced reasoning and coding abilities. Improved performance on various benchmarks.
Released Qwen1.5, offering 0.5B, 1.8B, 4B, 7B, and 14B parameter models. Improved performance and efficiency. Support for longer context lengths.
Introduction of Qwen-VL, a multimodal model combining language and visual understanding. Supports image input and reasoning.
Initial release of the Qwen series, featuring a 7B parameter model. Strong Chinese and English language capabilities. Open-sourced.
Tool Pros and Cons
Pros
- Excellent Chinese performance
- Versatile API deployment
- Wide range of model sizes
- Strong English support
- Cost-effective open-source
- Rapid development
- Good content generation
- Multimodal support
Cons
- Commercial API costs
- Resource-intensive open-source
- VL capabilities still maturing