Local model execution commands for vLLM on dual NVIDIA RTX 6000 Pro GPUs
This guide provides detailed instructions for setting up and running large language models (LLMs) locally with vLLM on a system equipped with dual NVIDIA RTX 6000 Pro GPUs. It covers installation of both the stable and nightly builds of vLLM, a fix for a common GPU-to-GPU communication problem, and the exact serve commands for several high-performance models, tuned for speed and memory utilization on this hardware.
Ground rules: we want speed (tens or hundreds of tokens per second) and everything fitting into the available VRAM (2 × 96 GB).
How to install vLLM stable
Prerequisites: Ubuntu 24.04 and the proper NVIDIA drivers
mkdir vllm
cd vllm
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
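To verify the install and that PyTorch sees both GPUs, a quick sanity check (run inside the activated venv; on this rig the device count should be 2):
python -c "import vllm, torch; print(vllm.__version__, torch.cuda.device_count())"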
How to install vLLM nightly
Prerequisites: Ubuntu 24.04 and the proper NVIDIA drivers
mkdir vllm-nightly
cd vllm-nightly
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
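To confirm you actually got a nightly wheel rather than the latest stable release, check the installed version (nightly builds typically carry a dev suffix):
uv pip show vllm | grep -i version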
How to download models
mkdir /models
cd /models
uv venv --python 3.12 --seed
source .venv/bin/activate
pip install huggingface_hub
# To download a model, after going to /models and running source .venv/bin/activate:
mkdir /models/awq
hf download cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit --local-dir /models/awq/cyankiwi-Devstral-2-123B-Instruct-2512-AWQ-4bit
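The serve commands below assume models live under per-format directories such as /models/awq, /models/original, /models/gptq, and /models/nvfp4. For example, to fetch the FP8 GLM-4.5-Air weights used later in this guide:
mkdir -p /models/original
hf download zai-org/GLM-4.5-Air-FP8 --local-dir /models/original/GLM-4.5-Air-FP8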
If setting tensor-parallel-size 2 fails in vLLM
I spent two months debugging why I could not start vLLM with tp 2 (--tensor-parallel-size 2). It kept hanging because the two GPUs could not communicate with each other, and the only output in the terminal was:
[shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
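A quick first diagnostic is the GPU topology matrix, which shows how the two cards are connected and how they can reach each other:
nvidia-smi topo -m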
Here is my hardware:
CPU: AMD Ryzen 9 7950X3D 16-Core Processor
Motherboard: ROG CROSSHAIR X670E HERO
GPU: Dual NVIDIA RTX Pro 6000 (each at 96 GB VRAM)
RAM: 192 GB DDR5 5200
And here was the solution:
sudo vi /etc/default/grub
At the end of GRUB_CMDLINE_LINUX_DEFAULT add amd_iommu=on iommu=pt like so:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=on iommu=pt"
sudo update-grub
Then reboot so the new kernel parameters take effect.
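After rebooting, confirm the kernel actually picked up the new parameters before retrying vLLM:
cat /proc/cmdline   # should now include amd_iommu=on iommu=pt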
Devstral 2 123B
cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit
vLLM version tested: vllm-nightly on December 25th, 2025
hf download cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit \
--local-dir /models/awq/cyankiwi-Devstral-2-123B-Instruct-2512-AWQ-4bit
vllm serve \
/models/awq/cyankiwi-Devstral-2-123B-Instruct-2512-AWQ-4bit \
--served-model-name Devstral-2-123B-Instruct-2512-AWQ-4bit \
--enable-auto-tool-choice \
--tool-call-parser mistral \
--max-num-seqs 4 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
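Once the server is up, a minimal smoke test against vLLM's OpenAI-compatible endpoint (the prompt and max_tokens are just placeholders; adjust host/port if you changed them):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Devstral-2-123B-Instruct-2512-AWQ-4bit",
  "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
  "max_tokens": 128
}'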
zai-org/GLM-4.5-Air-FP8
zai-org/GLM-4.5-Air-FP8
vLLM version tested: 0.12.0
vllm serve \
/models/original/GLM-4.5-Air-FP8 \
--served-model-name GLM-4.5-Air-FP8 \
--max-num-seqs 10 \
--max-model-len 128000 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--host 0.0.0.0 \
--port 8000
zai-org/GLM-4.6V-FP8
zai-org/GLM-4.6V-FP8
vLLM version tested: 0.12.0
vllm serve \
/models/original/GLM-4.6V-FP8/ \
--served-model-name GLM-4.6V-FP8 \
--tensor-parallel-size 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--max-num-seqs 10 \
--max-model-len 131072 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--allowed-local-media-path / \
--host 0.0.0.0 \
--port 8000
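Because --allowed-local-media-path / is set, requests can reference images already on the server's filesystem via file:// URLs (as I understand vLLM's local-media handling). A sketch of such a request, with /path/to/image.png as a placeholder:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "GLM-4.6V-FP8",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
    {"type": "text", "text": "Describe this image."}
  ]}]
}'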
QuantTrio/MiniMax-M2-AWQ
QuantTrio/MiniMax-M2-AWQ
vLLM version tested: 0.12.0
vllm serve \
/models/awq/QuantTrio-MiniMax-M2-AWQ \
--served-model-name MiniMax-M2-AWQ \
--max-num-seqs 10 \
--max-model-len 128000 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--host 0.0.0.0 \
--port 8000
OpenAI gpt-oss-120b
openai/gpt-oss-120b
vLLM version tested: 0.12.0
Note: gpt-oss-120b fits on a single GPU, so instead of tensor parallelism we run two data-parallel replicas (one full copy of the model per GPU)
vllm serve \
/models/original/openai-gpt-oss-120b \
--served-model-name gpt-oss-120b \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 2 \
--max-num-seqs 20 \
--max-model-len 131072 \
--gpu-memory-utilization 0.85 \
--tool-call-parser openai \
--reasoning-parser openai_gptoss \
--enable-auto-tool-choice \
--host 0.0.0.0 \
--port 8000
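With --data-parallel-size 2 each GPU hosts its own full replica behind the same port; a simple way to confirm both cards loaded the weights is to watch memory usage while the server starts:
watch -n 1 nvidia-smi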
Qwen/Qwen3-235B-A22B
Qwen/Qwen3-235B-A22B-GPTQ-Int4
vLLM version tested: 0.12.0
vllm serve \
/models/gptq/Qwen-Qwen3-235B-A22B-GPTQ-Int4 \
--served-model-name Qwen3-235B-A22B-GPTQ-Int4 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--swap-space 16 \
--max-num-seqs 10 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
QuantTrio/Qwen3-235B-A22B-Thinking-2507-AWQ
QuantTrio/Qwen3-235B-A22B-Thinking-2507-AWQ
vLLM version tested: 0.12.0
vllm serve \
/models/awq/QuantTrio-Qwen3-235B-A22B-Thinking-2507-AWQ \
--served-model-name Qwen3-235B-A22B-Thinking-2507-AWQ \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--swap-space 16 \
--max-num-seqs 10 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
nvidia/Qwen3-235B-A22B-NVFP4
nvidia/Qwen3-235B-A22B-NVFP4
vLLM version tested: 0.12.0
Note: NVFP4 is currently slow in vLLM on the RTX Pro 6000 (sm120)
hf download nvidia/Qwen3-235B-A22B-NVFP4 \
--local-dir /models/nvfp4/nvidia/Qwen3-235B-A22B-NVFP4
vllm serve \
/models/nvfp4/nvidia/Qwen3-235B-A22B-NVFP4 \
--served-model-name Qwen3-235B-A22B-NVFP4 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--swap-space 16 \
--max-num-seqs 10 \
--max-model-len 40960 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ
QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ
vLLM version tested: 0.12.0
vllm serve \
/models/awq/QuantTrio-Qwen3-VL-235B-A22B-Thinking-AWQ \
--served-model-name Qwen3-VL-235B-A22B-Thinking-AWQ \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--swap-space 16 \
--max-num-seqs 1 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
Source: Guide on installing and running the best models on a dual RTX Pro 6000 rig with vLLM