Measuring Reliability in Production
Measuring Reliability in Production uses an example application to describe how to define SLIs (service level indicators) and SLOs (service level objectives). It includes an overview of the application's architecture, a how-to for developing SLOs, and suggestions for implementing SLOs in Cloud Operations. It also covers how to identify CUJs (Critical User Journeys) and recommends metrics to implement as SLIs and as the basis for SLO targets.
Thomas Voß is a Staff Software Engineer in Google’s SRE organization, with 15 years of experience designing, implementing, and operating large-scale infrastructure components across many verticals and industries. Prior to joining Google, Thomas helped define and implement standards for safely and efficiently integrating drones into the sky at scale, and strongly advocated for open-source principles and technologies as a way to enable a thriving ecosystem and robust large-scale deployments across many organizations.