How to run local AI models on 64 GB of VRAM (2x RTX 5090)

This guide provides detailed instructions for setting up and running large language models (LLMs) locally with vLLM on a system equipped with dual NVIDIA RTX 5090 GPUs. It covers installation of both the stable and nightly builds of vLLM, model download procedures, and the specific commands for running high-performance models, with the goal of getting the most speed and memory utilization out of 64 GB of combined VRAM.

Ground rules: we want speed (tens or hundreds of tokens/sec), and everything must fit into the available VRAM.
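As a rough sanity check before downloading anything, the weight footprint of a quantized model can be estimated with simple arithmetic. This is only a sketch: the 30B parameter count and 4-bit quantization below are assumed example values, and the KV cache and activations need headroom on top of the weights.

```shell
# Rough VRAM estimate for model weights only (KV cache and activations
# need additional headroom on top of this).
PARAMS_B=30          # parameter count in billions (assumed example value)
BITS_PER_WEIGHT=4    # e.g. NVFP4-style 4-bit quantization
TOTAL_VRAM_GB=64     # 2x RTX 5090

WEIGHTS_GB=$(( PARAMS_B * BITS_PER_WEIGHT / 8 ))
echo "Weights alone: ~${WEIGHTS_GB} GB of ${TOTAL_VRAM_GB} GB VRAM"
```

A 30B model at 4 bits needs roughly 15 GB for weights, leaving plenty of the 64 GB budget for the KV cache, which is what makes long contexts possible.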

How to install vLLM stable

Prerequisite: Ubuntu 24.04 and the proper NVIDIA drivers

mkdir vllm
cd vllm
uv venv --python 3.12 --seed
source .venv/bin/activate

uv pip install vllm --torch-backend=auto
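After installation (stable or nightly), a quick smoke test inside the activated venv confirms that the wheel imports cleanly and that both GPUs are visible to the driver:

```shell
# Verify the install: should print the vLLM version without errors.
python -c "import vllm; print(vllm.__version__)"

# Both RTX 5090s should be listed with their memory totals.
nvidia-smi --query-gpu=name,memory.total --format=csv
```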

How to install vLLM nightly

Prerequisites: Ubuntu 24.04 and the proper NVIDIA drivers

mkdir vllm-nightly
cd vllm-nightly
uv venv --python 3.12 --seed
source .venv/bin/activate

uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly

How to download models

mkdir /models
cd /models
uv venv --python 3.12 --seed
source .venv/bin/activate

uv pip install huggingface_hub

# To download a model, first go to /models and run:
# source .venv/bin/activate
mkdir /models/nvfp4
hf download GadflyII/GLM-4.7-Flash-NVFP4 --local-dir /models/nvfp4/GadflyII-GLM-4.7-Flash-NVFP4
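The same download pattern generalizes to other models. A small helper (a hypothetical convenience wrapper, not part of the hf CLI) can derive the local directory from the repo name:

```shell
# Hypothetical helper: download a Hugging Face repo into
# /models/<subdir>/<owner>-<name>, mirroring the manual commands above.
download_model() {
  repo="$1"     # e.g. GadflyII/GLM-4.7-Flash-NVFP4
  subdir="$2"   # e.g. nvfp4
  target="/models/${subdir}/$(echo "$repo" | tr '/' '-')"
  mkdir -p "$target"
  hf download "$repo" --local-dir "$target"
}

# Usage: download_model GadflyII/GLM-4.7-Flash-NVFP4 nvfp4
```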

GadflyII/GLM-4.7-Flash-NVFP4

Using vLLM:

export PYTORCH_ALLOC_CONF=expandable_segments:True
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
    --download-dir /mnt/models/llm \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 2 \
    --max-model-len 44000 \
    --trust-remote-code \
    --max-num-seqs 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-flash \
    --host 0.0.0.0 \
    --port 8000
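Once the server is up, vLLM exposes an OpenAI-compatible API. A minimal request against the endpoint configured above (served model name glm-4.7-flash, port 8000) looks like:

```shell
# Chat completion against the local vLLM server started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```

Any OpenAI-compatible client library can also be pointed at http://localhost:8000/v1 with the same model name.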

Hint: The context size can go up to 80K tokens.
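That kind of context length is plausible from a back-of-envelope KV-cache calculation. The layer count, KV-head count, and head dimension below are assumed illustrative values, not the model's published config:

```shell
# Per-token KV-cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
LAYERS=47; KV_HEADS=4; HEAD_DIM=128   # assumed illustrative dimensions
BYTES_PER_ELEM=1                      # fp8 KV cache (--kv-cache-dtype fp8)

PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM ))
CTX=80000
echo "KV cache at ${CTX} tokens: $(( PER_TOKEN * CTX / 1024 / 1024 )) MiB"
```

With fp8 halving the per-element cost versus fp16, a cache of this shape stays in the single-digit-GiB range even at 80K tokens, which is why --kv-cache-dtype fp8 is worth setting.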

More information