# Qwen3-Coder-Next NVFP4

This is an NVFP4-quantized version of Qwen/Qwen3-Coder-Next (80B-A3B), a state-of-the-art code generation model built on a hybrid DeltaNet + attention + Mixture-of-Experts architecture. Quantization reduces the model size from ~149 GB (BF16) to 45 GB, a roughly 70% reduction, while maintaining strong performance on code generation tasks. The model is optimized for deployment with vLLM, supports context lengths up to 262,144 tokens, and runs very efficiently on NVIDIA Blackwell GPUs, which provide native FP4 support.
- Original model: https://huggingface.co/Qwen/Qwen3-Coder-Next
- NVFP4 model: https://huggingface.co/GadflyII/Qwen3-Coder-Next-NVFP4

## Model Specifications & Architecture
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Coder-Next |
| Architecture | Qwen3NextForCausalLM (Hybrid DeltaNet + Attention + MoE) |
| Parameters | 80B total, 3B activated per token |
| Experts | 512 total, 10 activated + 1 shared |
| Layers | 48 |
| Context Length | 262,144 tokens (256K) |
| Quantization | NVFP4 (FP4 weights + FP4 activations) |
| Size | 45GB (down from ~149GB BF16, 70% reduction) |
| Format | compressed-tensors |
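A quick way to sanity-check these numbers against the published checkpoint is to load only its config with transformers. This is a minimal sketch; the MoE attribute names are assumed from the Qwen3-Next config class and may differ across transformers versions, so they are read with `getattr`.

```python
from transformers import AutoConfig

# Load just the config (no weights) from the quantized repo.
config = AutoConfig.from_pretrained("GadflyII/Qwen3-Coder-Next-NVFP4")

print(config.architectures)            # expected: ["Qwen3NextForCausalLM"]
print(config.num_hidden_layers)        # expected: 48
print(config.max_position_embeddings)  # expected: 262144

# MoE fields (attribute names assumed; getattr keeps the script running
# even if a field is named differently in your transformers version).
print(getattr(config, "num_experts", None))          # expected: 512
print(getattr(config, "num_experts_per_tok", None))  # expected: 10
```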
## NVFP4 Quantization Configuration
Quantized using llmcompressor 0.9.0.1 with the following configuration:
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048
DATASET = "HuggingFaceH4/ultrachat_200k" (train_sft)
moe_calibrate_all_experts = True
# Layers kept in BF16
ignore = [
"lm_head",
"re:.*mlp.gate$", # MoE router gates
"re:.*mlp.shared_expert_gate$", # Shared expert gates
"re:.*linear_attn.*", # DeltaNet linear attention
]
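For reference, the settings above map onto llmcompressor's `oneshot` flow roughly as sketched below. This is a reconstruction, not the exact script used to produce this checkpoint; in particular, the `moe_calibrate_all_experts` keyword and the `NVFP4` scheme name should be verified against the installed llmcompressor version.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Coder-Next"
SAVE_DIR = "Qwen3-Coder-Next-NVFP4"
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: a small ultrachat_200k sample rendered with the chat template.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# NVFP4 on all Linear layers, except the modules listed in `ignore` above,
# which stay in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    # Route calibration data through every expert; assumed to be a oneshot
    # keyword in llmcompressor 0.9.x -- check your installed version.
    moe_calibrate_all_experts=True,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```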
## Performance Benchmarks & Evaluation

### MMLU-Pro
| Model | Accuracy | Delta |
|---|---|---|
| BF16 | 52.90% | - |
| NVFP4 | 51.27% | -1.63% |
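The card does not state which evaluation harness or settings produced these numbers. As a rough starting point for reproduction, lm-evaluation-harness can run MMLU-Pro against a vLLM backend; the snippet below is an assumed setup (parallelism and context length mirror the serving example further down), not the configuration actually used.

```python
import lm_eval

# MMLU-Pro via lm-evaluation-harness with vLLM as the inference backend.
# The model_args values are assumptions, not the authors' settings.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=GadflyII/Qwen3-Coder-Next-NVFP4,"
        "tensor_parallel_size=2,max_model_len=8192"
    ),
    tasks=["mmlu_pro"],
    batch_size="auto",
)
print(results["results"])
```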
### Context Length Testing

Successfully tested at context lengths up to 128K tokens with an FP8 KV cache; longer contexts could not be tested due to insufficient VRAM.
## vLLM Integration & Usage Guide

Requires vLLM with NVFP4 support (0.16.0 or later) and Transformers 5.0.0 or later.
# vLLM Serving
vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--kv-cache-dtype fp8
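Once the server is running it exposes an OpenAI-compatible API (port 8000 by default). A minimal client example, assuming the server was started with the command above:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/Qwen3-Coder-Next-NVFP4",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether a string is a palindrome.",
        }
    ],
    max_tokens=512,
    temperature=0.2,
)
print(response.choices[0].message.content)
```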
## License & Citation Information
Apache 2.0 (same as base model)
## Credits & Acknowledgments
- Qwen Team for the base model
- RedHatAI for the quantization approach reference
- vLLM Project for llmcompressor