Local model execution commands for vLLM on dual NVIDIA RTX 6000 Pro GPUs
This guide provides detailed instructions for setting up and running large language models (LLMs) locally with vLLM on a system equipped with dual NVIDIA RTX 6000 Pro GPUs. It covers installation of both the stable and nightly builds of vLLM, a fix for a common GPU-to-GPU communication problem, and the exact serve commands for several high-performance models, tuned for speed and memory utilization on this hardware.
Ground rules: we want speed (tens or hundreds of tokens per second) and everything fitting into the available VRAM (2 × 96 GB).
How to install vLLM stable
Prerequisites: Ubuntu 24.04 and the proper NVIDIA drivers
mkdir vllm
cd vllm
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
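To verify the install and that PyTorch sees both GPUs, a quick sanity check (run inside the activated venv; on this rig the device count should be 2):
python -c "import vllm, torch; print(vllm.__version__, torch.cuda.device_count())"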
How to install vLLM nightly
Prerequisites: Ubuntu 24.04 and the proper NVIDIA drivers
mkdir vllm-nightly
cd vllm-nightly
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
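To confirm you actually got a nightly wheel rather than the latest stable release, check the installed version (nightly builds typically carry a dev suffix):
uv pip show vllm | grep -i version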
How to download models
mkdir /models
cd /models
uv venv --python 3.12 --seed
source .venv/bin/activate
pip install huggingface_hub
# To download a model, after going to /models and running source .venv/bin/activate:
mkdir /models/awq
hf download cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit --local-dir /models/awq/cyankiwi-Devstral-2-123B-Instruct-2512-AWQ-4bit
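The serve commands below assume models live under per-format directories such as /models/awq, /models/original, /models/gptq, and /models/nvfp4. For example, to fetch the FP8 GLM-4.5-Air weights used later in this guide:
mkdir -p /models/original
hf download zai-org/GLM-4.5-Air-FP8 --local-dir /models/original/GLM-4.5-Air-FP8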
If setting tensor-parallel-size 2 fails in vLLM
I spent two months debugging why I could not start vLLM with tp 2 (--tensor-parallel-size 2). It kept hanging because the two GPUs could not communicate with each other, and the only output in the terminal was:
[shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
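A quick first diagnostic is the GPU topology matrix, which shows how the two cards are connected and how they can reach each other:
nvidia-smi topo -m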
Here is my hardware:
CPU: AMD Ryzen 9 7950X3D 16-Core Processor
Motherboard: ROG CROSSHAIR X670E HERO
GPU: Dual NVIDIA RTX Pro 6000 (each at 96 GB VRAM)
RAM: 192 GB DDR5 5200
And here was the solution:
sudo vi /etc/default/grub
At the end of GRUB_CMDLINE_LINUX_DEFAULT add amd_iommu=on iommu=pt like so:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=on iommu=pt"
sudo update-grub
Then reboot so the new kernel parameters take effect.
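After rebooting, confirm the kernel actually picked up the new parameters before retrying vLLM:
cat /proc/cmdline   # should now include amd_iommu=on iommu=pt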
Devstral 2 123B
cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit
vLLM version tested: vllm-nightly on December 25th, 2025
hf download cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit \
--local-dir /models/awq/cyankiwi-Devstral-2-123B-Instruct-2512-AWQ-4bit
vllm serve \
/models/awq/cyankiwi-Devstral-2-123B-Instruct-2512-AWQ-4bit \
--served-model-name Devstral-2-123B-Instruct-2512-AWQ-4bit \
--enable-auto-tool-choice \
--tool-call-parser mistral \
--max-num-seqs 4 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
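Once the server is up, a minimal smoke test against vLLM's OpenAI-compatible endpoint (the prompt and max_tokens are just placeholders; adjust host/port if you changed them):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Devstral-2-123B-Instruct-2512-AWQ-4bit",
  "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
  "max_tokens": 128
}'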
zai-org/GLM-4.5-Air-FP8
zai-org/GLM-4.5-Air-FP8
vLLM version tested: 0.12.0
vllm serve \
/models/original/GLM-4.5-Air-FP8 \
--served-model-name GLM-4.5-Air-FP8 \
--max-num-seqs 10 \
--max-model-len 128000 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--host 0.0.0.0 \
--port 8000
zai-org/GLM-4.6V-FP8
zai-org/GLM-4.6V-FP8
vLLM version tested: 0.12.0
vllm serve \
/models/original/GLM-4.6V-FP8/ \
--served-model-name GLM-4.6V-FP8 \
--tensor-parallel-size 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--max-num-seqs 10 \
--max-model-len 131072 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--allowed-local-media-path / \
--host 0.0.0.0 \
--port 8000
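Because --allowed-local-media-path / is set, requests can reference images already on the server's filesystem via file:// URLs (as I understand vLLM's local-media handling). A sketch of such a request, with /path/to/image.png as a placeholder:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "GLM-4.6V-FP8",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
    {"type": "text", "text": "Describe this image."}
  ]}]
}'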
QuantTrio/MiniMax-M2-AWQ
QuantTrio/MiniMax-M2-AWQ
vLLM version tested: 0.12.0
vllm serve \
/models/awq/QuantTrio-MiniMax-M2-AWQ \
--served-model-name MiniMax-M2-AWQ \
--max-num-seqs 10 \
--max-model-len 128000 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--host 0.0.0.0 \
--port 8000
OpenAI gpt-oss-120b
openai/gpt-oss-120b
vLLM version tested: 0.12.0
Note: gpt-oss-120b fits on a single GPU, so instead of tensor parallelism we run two data-parallel replicas (one full copy of the model per GPU)
vllm serve \
/models/original/openai-gpt-oss-120b \
--served-model-name gpt-oss-120b \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 2 \
--max-num-seqs 20 \
--max-model-len 131072 \
--gpu-memory-utilization 0.85 \
--tool-call-parser openai \
--reasoning-parser openai_gptoss \
--enable-auto-tool-choice \
--host 0.0.0.0 \
--port 8000
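With --data-parallel-size 2 each GPU hosts its own full replica behind the same port; a simple way to confirm both cards loaded the weights is to watch memory usage while the server starts:
watch -n 1 nvidia-smi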
Qwen/Qwen3-235B-A22B
Qwen/Qwen3-235B-A22B-GPTQ-Int4
vLLM version tested: 0.12.0
vllm serve \
/models/gptq/Qwen-Qwen3-235B-A22B-GPTQ-Int4 \
--served-model-name Qwen3-235B-A22B-GPTQ-Int4 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--swap-space 16 \
--max-num-seqs 10 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
QuantTrio/Qwen3-235B-A22B-Thinking-2507-AWQ
QuantTrio/Qwen3-235B-A22B-Thinking-2507-AWQ
vLLM version tested: 0.12.0
vllm serve \
/models/awq/QuantTrio-Qwen3-235B-A22B-Thinking-2507-AWQ \
--served-model-name Qwen3-235B-A22B-Thinking-2507-AWQ \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--swap-space 16 \
--max-num-seqs 10 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
nvidia/Qwen3-235B-A22B-NVFP4
nvidia/Qwen3-235B-A22B-NVFP4
vLLM version tested: 0.12.0
Note: NVFP4 is currently slow in vLLM on the RTX Pro 6000 (sm120)
hf download nvidia/Qwen3-235B-A22B-NVFP4 \
--local-dir /models/nvfp4/nvidia/Qwen3-235B-A22B-NVFP4
vllm serve \
/models/nvfp4/nvidia/Qwen3-235B-A22B-NVFP4 \
--served-model-name Qwen3-235B-A22B-NVFP4 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--swap-space 16 \
--max-num-seqs 10 \
--max-model-len 40960 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ
QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ
vLLM version tested: 0.12.0
vllm serve \
/models/awq/QuantTrio-Qwen3-VL-235B-A22B-Thinking-AWQ \
--served-model-name Qwen3-VL-235B-A22B-Thinking-AWQ \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--swap-space 16 \
--max-num-seqs 1 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
Source: Guide on installing and running the best models on a dual RTX Pro 6000 rig with vLLM