Cloud-Native Model Serving: vLLM’s Lifecycle in Kubernetes
Effectively deploying Large Language Models (LLMs) in Kubernetes is critical for modern AI workloads, and vLLM has emerged as a leading open-source project for LLM inference serving. This session will explore the features that set vLLM apart in maximizing throughput and minimizing resource usage. We'll then walk through the lifecycle of deploying AI/LLM workloads on Kubernetes, focusing on seamless containerization, efficient scaling with Kubernetes-native tools, and robust monitoring to ensure reliable operations.
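As a minimal sketch of what that containerized deployment can look like, the snippet below uses the official Kubernetes Python client to create a single-replica Deployment running the upstream vllm/vllm-openai image. It assumes a cluster with NVIDIA GPU support and an existing kubeconfig; the model name and the vllm-demo labels are illustrative, and the same result is commonly expressed as a plain YAML manifest.

```python
# Sketch: deploy a vLLM OpenAI-compatible server as a Kubernetes Deployment.
# Assumes `pip install kubernetes`, a reachable cluster, and GPU nodes with
# the NVIDIA device plugin installed.
from kubernetes import client, config

config.load_kube_config()  # use in-cluster config inside a pod instead

container = client.V1Container(
    name="vllm-server",
    image="vllm/vllm-openai:latest",          # upstream vLLM serving image
    args=["--model", "facebook/opt-125m"],    # illustrative small model
    ports=[client.V1ContainerPort(container_port=8000)],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="vllm-demo"),
    spec=client.V1DeploymentSpec(
        replicas=1,  # scale out later with an HPA or `kubectl scale`
        selector=client.V1LabelSelector(match_labels={"app": "vllm-demo"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "vllm-demo"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```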
By leveraging features such as continuous batching and distributed serving, vLLM simplifies complex workloads, optimizes performance, and makes advanced inference accessible for diverse and demanding use cases. Join us to learn why vLLM is shaping the future of LLM serving and how it integrates with Kubernetes to deliver reliable, cost-effective, and high-performance AI systems.
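For a sense of how those features surface to users, here is a minimal sketch of vLLM's offline Python API (assuming `pip install vllm` and a GPU host; the model name and prompts are illustrative). Multiple prompts are submitted together, and vLLM's scheduler interleaves them via continuous batching; raising tensor_parallel_size shards the model across GPUs for distributed serving.

```python
# Sketch: batched generation with vLLM's offline API.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize Kubernetes in one sentence.",
    "What is continuous batching?",
    "Explain tensor parallelism briefly.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# tensor_parallel_size > 1 would distribute the model across multiple GPUs.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)

# All prompts are handed to the engine at once; scheduling and batching
# happen inside vLLM rather than in fixed-size client-side batches.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```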
Speaker
Cedric Clyburn, Red Hat
Cedric Clyburn (@cedricclyburn), Senior Developer Advocate at Red Hat, is an enthusiastic software technologist with a background in Kubernetes, DevOps, and container tools. He has experience speaking at and organizing conferences including Devoxx, WeAreDevelopers, The Linux Foundation, KCD NYC, and more. Cedric loves all things open source and works to make developers' lives easier! He is based in New York.