How Much GPU Memory Does Your LLM Actually Need?

Source: DEV Community
GPU memory is the binding constraint for LLM deployment. The model's parameters must reside in VRAM alongside everything the runtime needs: the key-value cache, intermediate activations, and the serving framework's own buffers. Getting this budget wrong in either direction has real consequences. Underprovisioning leads to OOM crashes under load. Overprovisioning means paying for VRAM that sits idle, and the difference between a two-GPU and a four-GPU configuration is $2,000-4,000 per month.

The weight formula

Memory (GB) = Parameters (B) x Bytes per Parameter

These numbers cover weights only. In practice, you need an additional 20-40% for the KV cache, activations, and framework overhead.

The KV cache is where teams underestimate

Model weights are the predictable part. What makes GPU sizing deceptive is the key-value cache: for each concurrent request, the model stores key and value vectors for every token in the sequence, and this cache grows linearly with context length and batch size.
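The two formulas above can be sketched as a back-of-the-envelope calculator. This is an illustrative estimate, not a serving framework's accounting: the model geometry below (32 layers, 32 KV heads, head dimension 128) is an assumption resembling a Llama-style 7B model, and the 1.2x overhead factor is simply the low end of the 20-40% range mentioned above.

```python
def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Weight memory in GB: parameters (in billions) x bytes per parameter.
    FP16/BF16 = 2 bytes, INT8 = 1, INT4 = 0.5."""
    return params_b * bytes_per_param

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int,
                bytes_per_value: int = 2) -> float:
    """KV cache in GB: 2 (K and V) x layers x KV heads x head dim
    x tokens in the sequence x concurrent requests x bytes per value."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value) / 1e9

# Hypothetical 7B model in FP16 with Llama-7B-like geometry.
weights = weight_memory_gb(7, 2)            # 14.0 GB
kv = kv_cache_gb(32, 32, 128, 4096, 8)      # ~17.2 GB for 8 concurrent 4k-token requests
total = (weights + kv) * 1.2                # rough +20% for activations and framework buffers
print(f"weights={weights:.1f} GB, kv={kv:.1f} GB, budget~{total:.1f} GB")
```

Note how the KV cache dwarfs intuition here: at FP16, each token costs about 0.5 MB of cache for this geometry, so eight concurrent 4k-context requests need more VRAM for cache than the weights themselves.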