Cloudflare Builds High-Performance Infrastructure for Running LLMs

Announced

Cloudflare's new inference stack runs the compute-heavy prefill stage and the memory-bound decode stage on different machine classes, which reduces contention between the two and improves throughput. The custom Infire engine, first previewed at Birthday Week 2025, now supports multi-GPU workloads, and the new Unweight compression technique trims model weights by 15–22% with no measured accuracy loss. The combined system runs Kimi K2.5 (over 560 GB) on 8 H100 GPUs and Llama 4 Scout on just 2 H200s. For platform engineers evaluating where to host inference workloads, the results suggest that aggressive architectural specialization can meaningfully shrink GPU requirements for very large models. The Unweight result is particularly relevant for teams managing memory-constrained deployments or trying to reduce per-token costs at scale.
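
To put the memory figures in perspective, here is a rough back-of-the-envelope sketch. It assumes 80 GB of memory per H100 (the common SXM configuration, not stated in the summary), treats "over 560 GB" as exactly 560 GB, and applies the 15–22% Unweight range to Kimi K2.5 purely for illustration; the summary does not say whether that deployment uses Unweight. The function names are hypothetical.

```python
# Illustrative arithmetic only; not Cloudflare's sizing method.
H100_MEMORY_GB = 80.0  # assumption: 80 GB per H100, not given in the summary


def compressed_size_gb(weights_gb: float, reduction: float) -> float:
    """Weight footprint after removing a fraction `reduction` (0.15 means 15%)."""
    return weights_gb * (1.0 - reduction)


def headroom_gb(weights_gb: float, gpu_count: int, per_gpu_gb: float) -> float:
    """Pooled GPU memory left after weights, e.g. for KV cache and activations."""
    return gpu_count * per_gpu_gb - weights_gb


if __name__ == "__main__":
    kimi_weights_gb = 560.0  # "over 560 GB" in the summary, used as a lower bound
    for reduction in (0.0, 0.15, 0.22):
        size = compressed_size_gb(kimi_weights_gb, reduction)
        spare = headroom_gb(size, gpu_count=8, per_gpu_gb=H100_MEMORY_GB)
        print(f"reduction={reduction:.0%} weights={size:.1f} GB headroom={spare:.1f} GB")
```

Under these assumptions, 8 H100s pool 640 GB, which leaves about 80 GB beyond the uncompressed weights. A 15–22% reduction would bring the weights down to roughly 476–437 GB and widen that margin to about 164–203 GB. Real deployments also have to budget for KV cache, activations, and runtime overhead, so the usable margin is smaller than these numbers suggest.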

Source: InfoQ

May 3