How to Actually Run an LLM on Almost No RAM

Someone on Reddit recently posted a photo of an LLM running on a 1998 iMac G3 with 32 MB of RAM. My first reaction was "no way." My second reaction was "okay, but how?" That question sent me down a rabbit hole of model quantization, tiny architectures, and just how far you can push inference on absurdly constrained hardware. Whether you're trying to run a model on a Raspberry Pi or an old laptop, or you just want to understand the actual floor for LLM inference, here's what I learned.

The Problem: LLMs Are Memory Hogs

The typical advice for running LLMs locally assumes you have a modern GPU with 8+ GB of VRAM, or at minimum a machine with 16 GB of system RAM. That's fine if you're running Llama 3 or Mistral on your M-series MacBook. But what if you're working with something far more constrained? Maybe you want to run inference on an edge device. Maybe you're building for embedded systems. Or maybe you just want to see how small you can go for the sheer fun of it. The blocker is always the same: memory.
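
To see why the usual advice looks like that, a quick back-of-envelope calculation helps: the memory needed for the weights alone is roughly parameter count times bytes per parameter. Here's a minimal sketch of that arithmetic (the model sizes and bit widths are illustrative, and the figures ignore the KV cache and activations, which add more on top):

```python
# Rough estimate: weight memory = parameter count x bytes per parameter.
# Illustrative numbers only, not measurements from any specific runtime.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone (no KV cache, no activations)."""
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("7B", 7e9), ("1B", 1e9), ("100M", 1e8)]:
    for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name} @ {label}: ~{weight_memory_gb(params, bits):.2f} GB")
```

Run that and the usual advice stops looking arbitrary: a 7B-parameter model at fp16 needs about 14 GB just for weights, which is exactly why 16 GB of RAM is the standard floor. Quantize it to 4 bits and you're still near 3.5 GB. To get anywhere near a machine with tens of megabytes, you need both aggressive quantization and a drastically smaller model.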