Combining Kubernetes and vLLM to Deliver Scalable, Distributed Inference with llm-d

Effectively managing and scaling modern AI applications requires rethinking how we use Kubernetes, as traditional load balancing and scheduling fall short for diverse inference workloads. This session introduces llm-d, a jointly developed, open source, Kubernetes-native framework designed specifically for distributed LLM inference. We will explore the core challenges of scalability, reliability, and hardware mapping in AI, and demonstrate how llm-d addresses them through intelligent inference scheduling and a pluggable architecture.
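To make "intelligent inference scheduling" concrete ahead of the session, here is a minimal, hypothetical sketch of the kind of scoring such a scheduler might perform, balancing KV-cache affinity against replica load. The `Worker`, `score_worker`, and `pick_worker` names and the weighting are illustrative assumptions, not llm-d's actual API.

```python
# Hypothetical sketch: all names and weights below are illustrative,
# not llm-d's actual API.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    queue_depth: int           # requests currently waiting on this replica
    cached_prefix_tokens: int  # prompt tokens already resident in its KV cache

def score_worker(w: Worker, prompt_tokens: int) -> float:
    # Favor replicas that already hold much of this prompt's KV cache,
    # and penalize replicas with deep request queues.
    cache_affinity = w.cached_prefix_tokens / max(prompt_tokens, 1)
    load_penalty = w.queue_depth / 10.0
    return cache_affinity - load_penalty

def pick_worker(workers: list[Worker], prompt_tokens: int) -> Worker:
    # Route the request to the highest-scoring replica.
    return max(workers, key=lambda w: score_worker(w, prompt_tokens))

workers = [
    Worker("pod-a", queue_depth=2, cached_prefix_tokens=900),
    Worker("pod-b", queue_depth=0, cached_prefix_tokens=0),
]
# pod-a wins: its cache affinity (0.9) outweighs its load penalty (0.2).
print(pick_worker(workers, prompt_tokens=1000).name)
```

The point of such scoring is that plain round-robin load balancing would have sent this request to the idle pod-b and paid the full prefill cost from scratch.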

By moving beyond standard Kubernetes primitives, llm-d optimizes performance across various hardware accelerators while avoiding vendor lock-in. We will dive into the lifecycle of an inference request and explore advanced production techniques, including precise KV-cache-aware routing and Prefill-Decode disaggregation. Join us to learn how llm-d’s specialized worker topologies maximize GPU utilization and how this collaborative project is reshaping the future of high-performance, cost-effective AI deployments.
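As a rough illustration of the KV-cache-aware routing half of that story, the sketch below shows one plausible mechanism: hashing fixed-size token blocks of the prompt so that prompts sharing a prefix produce identical leading hashes, which a router can match against a per-replica index. The block size, hash chaining, and function names are assumptions for illustration, not llm-d's implementation.

```python
# Hypothetical sketch: block size, hash chaining, and names are
# illustrative assumptions, not llm-d's implementation.
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (a common default)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chain-hash full token blocks so that prompts sharing a prefix
    produce identical leading hashes on every replica."""
    hashes: list[str] = []
    prev = ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(
            (prev + "," + ",".join(map(str, block))).encode()
        ).hexdigest()
        hashes.append(prev)
    return hashes

def cached_prefix_blocks(prompt: list[int], replica_index: set[str]) -> int:
    """Count how many leading blocks of this prompt a replica already
    holds, given an index of the block hashes it has cached."""
    count = 0
    for h in block_hashes(prompt):
        if h not in replica_index:
            break
        count += 1
    return count

prompt = list(range(64))               # 4 full blocks of 16 tokens
warm = set(block_hashes(prompt[:32]))  # replica cached the first 2 blocks
print(cached_prefix_blocks(prompt, warm))  # -> 2
```

A router can then combine a prefix-overlap count like this with load signals (as in the earlier sketch) to pick the replica whose cache best overlaps the incoming prompt.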

Speaker

  • Antonio Cardace
    Red Hat

    Antonio Cardace (@acardace), Principal Machine Learning Engineer at Red Hat, is a dedicated technologist with a deep background in Kubernetes, virtualization, and embedded development. An active member of the open-source community, he currently contributes to the llm-d project and previously served as a maintainer for the KubeVirt project. Antonio is passionate about bridging the gap between infrastructure and AI to streamline complex environments. He is based in Reggio Emilia, Italy.

Date

Apr 28 2026

Time

9:15 - 9:45