Design a GPU Inference Serving System
Asked by: Anthropic · NVIDIA
"Design a high-throughput GPU inference service" shows up directly in Anthropic and NVIDIA interviews and underlies most AI-lab infra questions. Unlike "Design ChatGPT" (which centers on the conversational product), this one is squarely an infrastructure problem: build the platform that serves any model to many callers efficiently. It maps cleanly onto the standard system-design interview framework, so we'll use it.
The crux (spend ~60% of your time here). The entire problem reduces to maximizing the useful work extracted from a fixed pool of expensive GPUs while honoring latency SLOs. That is the job of the serving engine (batching, KV-cache memory management, and admission control), not the surrounding distributed-systems plumbing. The gateway, registry, and autoscaler are table stakes. Name the engine as the heart of the design in the first five minutes, and size the problem in tokens/second rather than requests/second: two workloads with the same request rate can place very different loads on the GPUs depending on their prompt and output lengths, and that single reframing drives every decision that follows.
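To make the tokens/second reframing concrete, here is a minimal back-of-envelope sketch. Everything in it is an assumption for illustration: the `Workload` type, the `gpus_needed` helper, the per-GPU throughput figures (20,000 prefill tokens/s and 2,000 decode tokens/s), and the 70% utilization target are all hypothetical, not measurements from any real model or GPU. It also treats prefill and decode as independent, additive costs, which is a simplification of how a real engine interleaves them.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical traffic profile for one class of callers."""
    requests_per_s: float
    avg_prompt_tokens: int   # tokens processed during prefill
    avg_output_tokens: int   # tokens generated during decode


def token_rates(w: Workload) -> tuple[float, float]:
    """Convert a request rate into (prefill, decode) token rates per second."""
    prefill = w.requests_per_s * w.avg_prompt_tokens
    decode = w.requests_per_s * w.avg_output_tokens
    return prefill, decode


def gpus_needed(
    w: Workload,
    prefill_tok_per_gpu_s: float,
    decode_tok_per_gpu_s: float,
    target_utilization: float = 0.7,
) -> int:
    """Rough GPU count: GPU-seconds of work arriving per second,
    divided by the fraction of each GPU we are willing to keep busy."""
    prefill, decode = token_rates(w)
    gpu_load = prefill / prefill_tok_per_gpu_s + decode / decode_tok_per_gpu_s
    return math.ceil(gpu_load / target_utilization)


# Two workloads with the SAME request rate but different token shapes.
chat = Workload(requests_per_s=100, avg_prompt_tokens=200, avg_output_tokens=300)
summarize = Workload(requests_per_s=100, avg_prompt_tokens=4000, avg_output_tokens=200)

# Assumed (made-up) per-GPU throughput numbers.
PREFILL_TOK_PER_GPU_S = 20_000
DECODE_TOK_PER_GPU_S = 2_000

chat_gpus = gpus_needed(chat, PREFILL_TOK_PER_GPU_S, DECODE_TOK_PER_GPU_S)
summarize_gpus = gpus_needed(summarize, PREFILL_TOK_PER_GPU_S, DECODE_TOK_PER_GPU_S)

print(f"chat:      {token_rates(chat)} tok/s -> {chat_gpus} GPUs")
print(f"summarize: {token_rates(summarize)} tok/s -> {summarize_gpus} GPUs")
```

Under these assumed numbers, both workloads arrive at 100 requests/second, yet the summarization profile needs roughly twice the GPUs of the chat profile, and its cost is dominated by prefill rather than decode. A requests/second estimate would hide both facts, which is why the token-rate split is worth stating up front.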