How to run local AI models on 64 GB VRAM (2x RTX 5090)
This guide walks through setting up and running large language models (LLMs) locally with vLLM on a system equipped with dual NVIDIA RTX 5090 GPUs. It covers installation of both the stable and nightly builds of vLLM, model download procedures, and the exact commands for serving a high-performance model, with the goal of getting the most speed and memory utilization out of the 64 GB of combined VRAM.
Ground rules: we want speed (tens or hundreds of tokens/sec), and everything must fit into the available VRAM.
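To make "everything fits" concrete, it helps to do the arithmetic up front. The sketch below is illustrative only: the 30B parameter count is a hypothetical example, and the ~0.5 bytes/param figure is a rough approximation of 4-bit NVFP4 weights (real NVFP4 adds a small overhead for scale factors).

```python
# Rough VRAM budget sketch (all numbers are illustrative assumptions,
# not measured values for any specific model).

GPU_VRAM_GB = 32               # one RTX 5090
NUM_GPUS = 2
BYTES_PER_PARAM_NVFP4 = 0.5    # ~4 bits/param; real NVFP4 adds scale overhead

def weight_footprint_gb(num_params_b: float) -> float:
    """Approximate weight memory in GB for a model quantized to ~4 bits/param."""
    return num_params_b * 1e9 * BYTES_PER_PARAM_NVFP4 / 1024**3

total_vram = GPU_VRAM_GB * NUM_GPUS
weights = weight_footprint_gb(30.0)  # hypothetical 30B-parameter model
print(f"total VRAM: {total_vram} GB, ~4-bit weights: {weights:.1f} GB, "
      f"left for KV cache + activations: {total_vram - weights:.1f} GB")
```

Whatever is left after the weights goes to the KV cache and activations, which is why the serve command below caps context length and concurrent sequences.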
How to install vLLM stable
Prerequisite: Ubuntu 24.04 and the proper NVIDIA drivers
mkdir vllm
cd vllm
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
How to install vLLM nightly
Prerequisites: Ubuntu 24.04 and the proper NVIDIA drivers
mkdir vllm-nightly
cd vllm-nightly
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
How to download models
mkdir /models
cd /models
uv venv --python 3.12 --seed
source .venv/bin/activate
pip install huggingface_hub

# To download a model, go to /models, run `source .venv/bin/activate`, then:
mkdir /models/nvfp4
hf download GadflyII/GLM-4.7-Flash-NVFP4 --local-dir /models/nvfp4/GadflyII-GLM-4.7-Flash-NVFP4
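If you prefer scripting downloads over the `hf` CLI, the same thing can be sketched with `huggingface_hub`'s `snapshot_download`. The directory-naming helper simply mirrors the `repo-id` → `/models/nvfp4/<owner>-<name>` convention used in the command above.

```python
# Programmatic alternative to the `hf download` CLI above
# (requires `pip install huggingface_hub` for the actual download).
from pathlib import Path

MODELS_ROOT = Path("/models/nvfp4")

def local_dir_for(repo_id: str) -> Path:
    """Derive a flat local directory from a repo id, e.g.
    'GadflyII/GLM-4.7-Flash-NVFP4' -> /models/nvfp4/GadflyII-GLM-4.7-Flash-NVFP4."""
    return MODELS_ROOT / repo_id.replace("/", "-")

def download(repo_id: str) -> Path:
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download
    target = local_dir_for(repo_id)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target

if __name__ == "__main__":
    print(local_dir_for("GadflyII/GLM-4.7-Flash-NVFP4"))
```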
GadflyII/GLM-4.7-Flash-NVFP4
Using vLLM:
export PYTORCH_ALLOC_CONF=expandable_segments:True
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 44000 \
--trust-remote-code \
--max-num-seqs 4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 \
--port 8000
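Once the server is up, it speaks the OpenAI-compatible chat API on port 8000. Here is a minimal client sketch using only the Python standard library; the model name matches `--served-model-name` above, and the base URL assumes you are querying from the same machine.

```python
# Minimal client for the OpenAI-compatible endpoint that `vllm serve` exposes.
import json
import urllib.request

def chat_payload(prompt: str, model: str = "glm-4.7-flash") -> dict:
    """Build a /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def ask(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Say hello in one word."))
```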
Hint: the context size can go up to 80K tokens; raise --max-model-len accordingly.
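Longer context is feasible because `--kv-cache-dtype fp8` stores one byte per cache element and `--tensor-parallel-size 2` splits the cache across both GPUs. The calculation below shows how cache size scales with context; the layer, head, and head-dim numbers are illustrative assumptions, not the actual GLM-4.7-Flash architecture.

```python
# KV cache size per context length (architecture numbers are hypothetical).

def kv_cache_gb(tokens: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    """Total KV cache in GB: 2 (K and V) * layers * kv_heads * head_dim per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

for ctx in (44_000, 80_000):
    print(f"{ctx} tokens -> {kv_cache_gb(ctx):.1f} GB total, "
          f"{kv_cache_gb(ctx) / 2:.1f} GB per GPU with TP=2")
```

With fp16 cache (`bytes_per_elem=2`) the same context would cost twice as much, which is the main reason the fp8 KV cache is worth enabling here.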