Stop Treating LLMs Like REST APIs
Why do PoCs run smoothly while launch day implodes? Because LLM traffic is a streaming, state-heavy beast that breaks every REST assumption: requests aren’t stateless, payloads snowball with context, and GPU memory melts under token floods. We’ll map the three checkpoints where most projects stall (context explosion, batch backfires, cache chaos) and show how llm-d’s open-source sharding plus a hybrid NVIDIA/AMD node pool turns each choke point into a green light. You’ll see live before-and-after dashboards, get a YAML ladder you can drop into any cluster, and learn a back-of-the-napkin formula to keep cost per 1,000 tokens under control.
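As a taste of that back-of-the-napkin formula, here is a minimal sketch of the cost-per-1,000-tokens arithmetic. The GPU hourly price, sustained throughput, and utilization figures are illustrative assumptions for the example, not DigitalOcean pricing or measured benchmark numbers; the talk's own formula may differ in detail.

```python
# Back-of-the-napkin cost per 1,000 generated tokens.
# All numbers below are illustrative assumptions, not measured values.

gpu_hourly_usd = 2.50      # assumed on-demand price of one GPU node
tokens_per_second = 400    # assumed sustained decode throughput under batching
utilization = 0.6          # assumed fraction of each hour spent serving traffic

effective_tps = tokens_per_second * utilization
cost_per_token = (gpu_hourly_usd / 3600) / effective_tps
cost_per_1k_tokens = cost_per_token * 1000

print(f"~${cost_per_1k_tokens:.4f} per 1,000 tokens")
# ~$0.0029 per 1,000 tokens with these assumptions
```

The useful part is the shape of the formula, not the numbers: hourly node cost divided by effective tokens per hour tells you immediately whether batching harder or buying cheaper silicon moves the needle more.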
Speaker
- Jeff Fan, DigitalOcean
  Jeff Fan is a Solutions Architect at DigitalOcean who designs Kubernetes-based GPU stacks for LLM inference. He speaks on right-sizing LLM serving (vLLM/KServe/llm-d on DOKS), building memory-enabled support agents, and eval-first RAG (“evals, not vibes”). Formerly keeping mission-critical German systems online, he now turns cloud/AI complexity into copy-paste playbooks that help teams move from PoC to cost-efficient production.