How Much VRAM to Run an LLM? (7B–70B GPU Memory Guide)
A 70-billion-parameter LLM needs about 140GB of VRAM for the weights alone at FP16 (2 bytes per parameter), or roughly 35–40GB when quantized to 4-bit. Training needs 1.5–4x more than inference for optimizer states and gradients. As a quick rule, budget ~2GB of VRAM per 1B parameters at FP16 for inference, then add overhead for the KV …
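The rule of thumb above is plain arithmetic, and a minimal Python sketch makes it easy to try other model sizes. The function name weight_vram_gb, the model sizes in the loop, and the use of decimal gigabytes (1 GB = 1e9 bytes) are assumptions made for illustration, not something the article specifies. The estimate covers the weights only; runtime overhead is deliberately left out.

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Estimate VRAM for model weights only, in decimal GB (1 GB = 1e9 bytes).

    Hypothetical helper for illustration: it ignores activations, caches,
    and framework overhead, so real usage will be higher.
    """
    bytes_per_param = bits_per_param / 8
    # (params_billion * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billion * bytes_per_param


if __name__ == "__main__":
    # Illustrative model sizes in billions of parameters (assumed, not from the article).
    for size in (7, 13, 70):
        fp16 = weight_vram_gb(size, 16)  # FP16: 2 bytes per parameter
        int4 = weight_vram_gb(size, 4)   # 4-bit: 0.5 bytes per parameter
        print(f"{size}B: FP16 ~{fp16:.0f} GB, 4-bit ~{int4:.1f} GB (weights only)")
```

For a 70B model this gives 140 GB at FP16 and 35 GB at 4-bit, which is the low end of the 35–40GB range quoted above; the rest of that range is presumably quantization metadata and runtime overhead.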