Combining Kubernetes and vLLM to Deliver Scalable, Distributed Inference with llm-d

Effectively managing and scaling modern AI applications requires rethinking how we use Kubernetes, as traditional load balancing and scheduling fall short for diverse inference workloads. This session introduces llm-d, a community-driven, open-source, Kubernetes-native framework designed specifically for distributed LLM inference. We will explore the core challenges of scalability, reliability, and mapping AI workloads to hardware, and demonstrate how llm-d solves them through intelligent inference scheduling and a pluggable architecture.
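
To give a feel for what "intelligent inference scheduling" means compared to plain load balancing, here is a minimal Python sketch of the idea: a scorer ranks vLLM replicas by live engine state such as queue depth and cache affinity instead of round-robining across them. All names, fields, and weights below are illustrative assumptions, not llm-d's actual API.

```python
# Illustrative sketch only: a scorer-based endpoint picker in the spirit of
# inference-aware scheduling. Field names and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int             # requests waiting on this vLLM replica
    kv_cache_utilization: float  # fraction of KV-cache blocks in use (0.0-1.0)
    prefix_hit: bool             # replica already holds this prompt's prefix?

def score(ep: Endpoint) -> float:
    """Higher is better: prefer short queues, free KV-cache capacity,
    and a warm prefix cache."""
    s = 0.0
    s -= 1.0 * ep.queue_depth            # penalize queued work
    s -= 2.0 * ep.kv_cache_utilization   # penalize cache pressure
    s += 3.0 if ep.prefix_hit else 0.0   # reward cache affinity
    return s

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    # A plain Kubernetes Service would round-robin here; an inference-aware
    # scheduler instead ranks replicas by their current engine state.
    return max(endpoints, key=score)

if __name__ == "__main__":
    replicas = [
        Endpoint("vllm-0", queue_depth=4, kv_cache_utilization=0.9, prefix_hit=False),
        Endpoint("vllm-1", queue_depth=1, kv_cache_utilization=0.3, prefix_hit=True),
    ]
    print(pick_endpoint(replicas).name)  # -> vllm-1
```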

By moving beyond standard Kubernetes primitives, llm-d optimizes performance across a range of hardware accelerators while avoiding vendor lock-in. We will walk through the lifecycle of an inference request and explore advanced production techniques, including KV-cache-aware routing and Prefill-Decode disaggregation. Join us to learn how llm-d’s specialized worker topologies maximize GPU utilization and how this collaborative project is reshaping the future of high-performance, cost-effective AI deployments.
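
To preview the intuition behind KV-cache-aware routing, the sketch below steers each request to the worker holding the longest already-cached prefix of the prompt. The block size, chained hashing, and in-memory index are assumptions for illustration only and do not reflect llm-d internals.

```python
# Hypothetical sketch of KV-cache-aware routing: pick the worker that
# already caches the most leading prompt blocks. Not llm-d's real code.
import hashlib
from collections import defaultdict

BLOCK = 16  # tokens per KV-cache block (vLLM-style paged attention)

def block_hashes(tokens: list[int]) -> list[str]:
    """Chained hashes of successive full blocks, so each hash identifies
    the entire prefix up to that block."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(str(tokens[i:i + BLOCK]).encode("utf-8"))
        hashes.append(h.hexdigest())
    return hashes

class KVCacheRouter:
    def __init__(self) -> None:
        self.index: dict[str, set[str]] = defaultdict(set)  # block hash -> workers

    def record(self, worker: str, tokens: list[int]) -> None:
        """Remember which prompt blocks a worker has cached after serving."""
        for bh in block_hashes(tokens):
            self.index[bh].add(worker)

    def route(self, tokens: list[int], workers: list[str]) -> str:
        """Score each worker by consecutive cached blocks from the start."""
        best, best_len = workers[0], -1
        for w in workers:
            n = 0
            for bh in block_hashes(tokens):
                if w not in self.index[bh]:
                    break
                n += 1
            if n > best_len:
                best, best_len = w, n
        return best

if __name__ == "__main__":
    router = KVCacheRouter()
    shared = list(range(40))  # a shared system prompt spanning 2 full blocks
    router.record("worker-0", shared)
    # A new request reusing that prefix lands on the warm worker:
    print(router.route(shared + [7, 8, 9], ["worker-0", "worker-1"]))  # -> worker-0
```

The payoff of this kind of affinity is that the prefill phase recomputes KV-cache entries for every uncached token; steering a request to a replica that already holds the shared prefix skips that work entirely.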

Speaker

  • Antonio Cardace
    Red Hat

    Antonio Cardace (@acardace), Principal Machine Learning Engineer at Red Hat, is a dedicated technologist with a deep background in Kubernetes, virtualization, and embedded development. An active member of the open-source community, he currently contributes to the llm-d project and previously served as a maintainer of the KubeVirt project. Antonio is passionate about bridging the gap between infrastructure and AI to streamline complex environments. He is based in Reggio Emilia, Italy.