tlder@dev — Cloudflare Reveals Disaggregated Prefill Architecture and Infire Engine for Edge LLM Serving


Platform/Cloud


Announced

Cloudflare's Infire inference engine splits the prefill and decode phases of LLM inference across separate node pools, so each phase can be scaled and matched to hardware independently: prefill is compute-bound, while decode is memory-bandwidth-bound. Cloudflare also developed Unweight, a compression approach that shrinks model weights by 15-22% without measurable accuracy regression. The production configuration runs Kimi K2.5 (over one trillion parameters) on nodes with eight H100 GPUs and Llama 4 Scout on nodes with two H200 GPUs, distributed across Cloudflare's global network. For platform engineers evaluating edge inference deployments, the architecture matters because it shows that disaggregated prefill is operationally viable at large scale without a centralized GPU cluster. If Unweight is made available externally, it could reduce both VRAM requirements and inter-node transfer costs. Teams designing multi-region inference pipelines should watch Cloudflare's developer documentation for SDK or Workers AI API updates that expose these capabilities.
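The compute-bound versus memory-bandwidth-bound split can be illustrated with a rough roofline estimate. The sketch below is not Cloudflare's method or any Infire API: the GPU figures are approximate public H100 SXM numbers used as assumptions, the intensity model counts only weight reads (it ignores KV-cache and activation traffic), and the 100 GB weight size is a hypothetical input chosen to show the 15-22% range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    peak_flops: float     # dense FLOP/s (assumed, approximate)
    mem_bandwidth: float  # bytes/s (assumed, approximate)

    @property
    def ridge_point(self) -> float:
        """FLOPs per byte above which a kernel becomes compute-bound."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Simplification: each parameter costs about 2 FLOPs per token
    (multiply + add) and is read from memory once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(gpu: Gpu, tokens_per_pass: int) -> str:
    """Classify a pass as compute-bound or memory-bandwidth-bound."""
    if arithmetic_intensity(tokens_per_pass) >= gpu.ridge_point:
        return "compute"
    return "memory-bandwidth"


def compressed_size_gb(size_gb: float, reduction: float) -> float:
    """Weight size after a fractional reduction (0.15 means 15% smaller)."""
    return size_gb * (1.0 - reduction)


# Assumed, approximate H100 SXM figures: ~989 TFLOP/s dense BF16, ~3.35 TB/s HBM.
H100 = Gpu(peak_flops=989e12, mem_bandwidth=3.35e12)

if __name__ == "__main__":
    # Prefill processes the whole prompt in one pass; decode emits one token per pass.
    print("prefill, 2048-token prompt:", bound(H100, 2048))
    print("decode, 1 token, batch 1:  ", bound(H100, 1))
    # Hypothetical 100 GB of weights at the reported 15% and 22% reductions.
    for r in (0.15, 0.22):
        print(f"{r:.0%} reduction: {compressed_size_gb(100.0, r):.1f} GB")
```

Under these assumptions a long prompt lands well above the ridge point and a single decoded token lands far below it, which is the asymmetry that motivates giving each phase its own node pool. Batching many decode requests raises the decode intensity, so real deployments sit somewhere between the two extremes.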

└─InfoQ

May 3