Tuesday, June 25 • 17:30 - 18:05
No More Chaos: Audit and Inspect Kubernetes at Scale - 陈杰, Alibaba Cloud & 马金晶, Ant Financial (Hangzhou) Network Technology Co., Ltd.

Accurate fault detection and efficient issue analysis are important for the availability and stability of Kubernetes clusters, yet Kubernetes exposes a huge number of resources, events, and metrics. In our cluster, we noticed that Kubernetes generates thousands of metric data points per second, which makes it challenging to find the root cause in this ocean of data, not to mention analysis, data visualization, and alarms. In this talk, we will share our experience and practices of auditing and inspecting Kubernetes at web scale. We'll first talk about how we design metrics to reflect the stability of Kubernetes, and how we consume these metrics and set up streaming alarms. We will use real cases to demo how we aggregate and analyze this metrics data. Finally, we will share Alibaba's practices in building an automatic system for real-time data inspection and analysis for Kubernetes.

陈杰
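Technical Expert, Alibaba Cloud
Joined Alibaba in 2011. Early on, took part in building the unified operations platform for Alibaba's search engine and was responsible for operating the 一淘 (eTao) search engine; in 2013, took part in creating and building the search scheduling platform; in 2015, began driving the containerization of search and its move to pouch, 2016...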
马金晶
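Senior Development Engineer, Ant Financial
Currently works at Ant Financial, the world's most valuable unicorn company. Began working on the development of Alibaba's Sigma container scheduling platform in 2017, and took part in and witnessed the wave of Alibaba and Ant Financial migrating large-scale clusters from Sigma to Kubernetes. In the post-Kubernetes...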

