How to run local AI models on 64 GB VRAM (2x RTX 5090)

A mixed-precision NVFP4 quantized version of the new GLM-4.7-Flash is available on Hugging Face:

https://huggingface.co/GadflyII/GLM-4.7-Flash-NVFP4
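You can optionally pre-fetch the weights before starting the server (vLLM will otherwise download them on first run). A minimal sketch using the Hugging Face CLI, assuming it is installed; the cache path mirrors the --download-dir passed to vllm serve below:

# Optional pre-download of the quantized weights (assumes huggingface_hub CLI is installed);
# --cache-dir matches the --download-dir used in the serve command below
huggingface-cli download GadflyII/GLM-4.7-Flash-NVFP4 --cache-dir /mnt/models/llm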

Serve it with vLLM:

# Reduce CUDA memory fragmentation (must be exported before launching vLLM)
export PYTORCH_ALLOC_CONF=expandable_segments:True

# Serve the model across both GPUs (tensor parallel) with an fp8 KV cache
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 44000 \
  --trust-remote-code \
  --max-num-seqs 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000
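Once the server is up, it exposes an OpenAI-compatible API on port 8000. A quick smoke test against the endpoint; the model name matches --served-model-name above, and the prompt is just an example:

# Query the OpenAI-compatible chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'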

Hint: the context length can go up to 80K on this setup; raise --max-model-len accordingly.
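For example, swap the value in the serve command above (untested beyond the hint itself; a longer context leaves less VRAM for the KV cache):

# Hypothetical: larger context window, replacing the 44000 value above
--max-model-len 80000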

More information