How to run local AI models on 64 GB VRAM (2x RTX 5090)
A mixed-precision NVFP4 quantized version of the new GLM-4.7-Flash is on HF:
https://huggingface.co/GadflyII/GLM-4.7-Flash-NVFP4
Using vLLM:
export PYTORCH_ALLOC_CONF=expandable_segments:True

uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 44000 \
  --trust-remote-code \
  --max-num-seqs 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000
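Once it's up, vLLM exposes an OpenAI-compatible API on the configured host and port. A quick sanity check with curl, using the model name and port set in the serve command above (the prompt is just an example):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'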
Hint: the context length can go up to 80K tokens; see the sketch below.
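A minimal sketch of the serve command with the context raised to the full 80K. The lowered --max-num-seqs and the --gpu-memory-utilization value are assumptions to leave room for the larger fp8 KV cache on 64 GB, not measured settings:

export PYTORCH_ALLOC_CONF=expandable_segments:True

uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 80000 \
  --trust-remote-code \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.95 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000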