Shanghai, China
June 24–26, 2019

Simultaneous translation will be provided for all keynote and breakout sessions.


KC+CNC - Observability
Tuesday, June 25


Real World Architecture - Building a Global Cross-Cloud Monitoring Platform - Dominic Green, Improbable & Yifan Zhao, Improbable
Prometheus allows us to monitor applications and infrastructure running within a Kubernetes cluster with ease. When you start out with just a few servers, it is simple to configure and run. As you scale out, you adopt novel strategies such as federation and meta-monitoring to ensure you can get all the metrics you need. But what happens when you scale out past a single cluster? What happens when you scale out past a single cloud provider?

In this talk, you will learn how we at Improbable have scaled our metrics platform to a global level. Prometheus is a solid foundation for our platform. It is extended by Thanos, an OSS project that allows global querying and high availability of Prometheus scrapers. By adding Envoy, we unlock cross-cluster, cross-cloud communication, allowing our engineers to monitor our platform across the globe.
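As a rough illustration of what querying across highly available scrapers involves, the sketch below merges series returned by two replicas of the same scraper. It is a simplified stand-in, not Thanos's actual deduplication algorithm; the `replica` label name and the series layout are assumptions made for the example.

```python
from collections import defaultdict

REPLICA_LABEL = "replica"  # assumed label that distinguishes HA replicas


def deduplicate(series_list, replica_label=REPLICA_LABEL):
    """Merge series that differ only by the replica label.

    Each series is a dict of the (hypothetical) form:
    {"labels": {...}, "samples": [(timestamp, value), ...]}
    """
    merged = defaultdict(dict)
    for series in series_list:
        # A series is identified by its labels without the replica label.
        key = tuple(sorted(
            (k, v) for k, v in series["labels"].items() if k != replica_label
        ))
        for ts, value in series["samples"]:
            # Keep the first value seen per timestamp; other replicas fill gaps.
            merged[key].setdefault(ts, value)
    return [
        {"labels": dict(key), "samples": sorted(samples.items())}
        for key, samples in merged.items()
    ]
```

If one replica misses a scrape, the merged series still contains the sample from the other replica, which is the practical benefit of running scrapers in pairs.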


Dominic Green

Software Engineer, Improbable
Dom was the first cadet to outsmart the Kobayashi Maru, completed the Kessel Run in less than twelve parsecs, and beat Parzival to the First Gate. When not melting reality with fiction, Dom works as a Software Engineer at Improbable, a London-based startup creating virtual worlds with...

Yifan Zhao

Co-Founder, Improbable China
Yifan is the co-founder of Improbable China, looking to bring Improbable's vision of next-generation massively scalable online games to China. He is part of the founding team of Improbable and played a core role in the development of SpatialOS from the very early days. Yifan is an...

Tuesday June 25, 2019 11:00 - 11:35


High Available + Scalable Prometheus with Thanos in Alibaba - Guo'an Qin, Alibaba & Tao Li, Alibaba
Alibaba Group uses Kubernetes to support the world's largest e-commerce business. Providing reliable, fine-grained monitoring and alerting services with the required availability and scalability is a real challenge.

In this talk, we'll share our experience developing a fine-grained, highly available, and scalable monitoring system based on the open source projects Prometheus and Thanos. This system mainly supports Alibaba's cluster management system, which handles 4 million TPS and 10K requests per second.

We will discuss the following topics:
1. How to support large-scale scenarios using Prometheus;
2. How to use Thanos to query data spread across multiple Prometheus instances with low latency;
3. The lessons we learned from configuring Prometheus and Thanos, such as target discovery and the management of recording rules and alerting rules.


Guo'an Qin

Staff Engineer, Alibaba
Guo'an Qin is a staff engineer at Alibaba. He works in the sigma scheduler team. He previously worked in the Alibaba database team, where he developed a database scheduling system that supported the operation and maintenance of the Alibaba database.

Tao Li

Engineer, Alibaba
Tao Li is an engineer at Alibaba Cloud. He works in the sigma scheduler team, where he is mainly responsible for building monitoring systems that support large-scale clusters and multi-tenant scenarios.


Tuesday June 25, 2019 11:45 - 12:20


Effective Logging in Multi-Tenant Kubernetes Environment - Benjamin Huo & Dan Ma, Beijing Yunify Technology Co., Ltd.
The EFK stack is a popular choice for Kubernetes logging. But users in a multi-tenant Kubernetes environment should only be allowed to access their own application and auditing logs.

Fluentd is good at log aggregation, while Fluent Bit is more efficient at log collection.

Adjusting Fluent Bit options is burdensome and requires domain-specific knowledge, which is exactly the kind of problem the operator pattern is good at solving.

So we developed the FluentBit operator. Users can update the Fluent Bit config with a single command such as "kubectl edit fluentbit fluent-bit", and the FluentBit operator takes care of the rest: changing the Fluent Bit config, turning log collection on or off, and reloading the Fluent Bit config without recreating the entire Fluent Bit DaemonSet.
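The behavior described above follows the general operator pattern: compare the desired state with what is running and apply only the difference. The sketch below illustrates that idea in isolation. It is not the FluentBit operator's code; the spec fields, the state layout, and the action names are hypothetical.

```python
import hashlib
import json


def config_hash(config):
    """Stable fingerprint of a config dict, used to detect changes."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()


def reconcile(desired, state):
    """Return the actions needed to bring `state` in line with `desired`.

    desired: {"enabled": bool, "config": dict}            (hypothetical spec)
    state:   {"enabled": bool, "config_hash": str | None}  (what is running)
    `state` is updated in place to reflect the applied actions.
    """
    actions = []
    if desired["enabled"] != state["enabled"]:
        actions.append("start_collection" if desired["enabled"] else "stop_collection")
        state["enabled"] = desired["enabled"]
    new_hash = config_hash(desired["config"])
    if desired["enabled"] and new_hash != state["config_hash"]:
        # Reload only the config; the workload itself is not recreated.
        actions.append("reload_config")
        state["config_hash"] = new_hash
    return actions
```

Calling `reconcile` twice with an unchanged spec returns an empty list the second time, so repeated reconciliation is harmless.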

In this talk, engineers from the QingCloud KubeSphere team will talk about logging in multi-tenant Kubernetes environments and the FluentBit Operator.


Benjamin Huo

Lead of QingCloud Kubernetes observability team, Beijing Yunify Technology Co., Ltd.
Benjamin Huo is the lead of the QingCloud Kubernetes observability team, which is responsible for developing Kubernetes monitoring, alerting, logging, auditing, and event management products. He is interested and experienced in cloud native and data-related technologies like Kubernetes...

Dan Ma

Senior Software Engineer of QingCloud Kubernetes observability team, Beijing Yunify Technology Co., Ltd.
Dan Ma is a Senior Software Engineer on the QingCloud Kubernetes observability team, which is responsible for developing Kubernetes monitoring, alerting, logging, auditing, and event management products. He focuses on Kubernetes, Big Data, and AI technologies. He is interested in open...

Tuesday June 25, 2019 13:35 - 14:10


1-5-10: How to Fast Recover Container Failure at Large Scale - Huan Xiong, Alibaba
In the cloud era, container-based applications in enterprises are growing rapidly, and the likelihood of container failure is greatly amplified by manual operations, hardware failures, and so on. Guaranteeing the reliability of containers at scale without increasing resource investment is therefore a huge challenge for cloud platforms.

Alibaba has run millions of containers and has put forward the 1-5-10 theory for recovering from container-related failures: MTTD (mean time to detect) is 1 min, MTTI (mean time to identify) is 5 min, and MTTR (mean time to resolve) is 10 min.

In this session we'll discuss how to increase the reliability of containers at large scale with 1-5-10:
1. How to build an efficient local agent to detect problems within 1 min;
2. How to diagnose container problems intelligently using an expert knowledge base;
3. How to recover from container problems automatically in a failure-driven way.
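The 1-5-10 targets can be expressed as a simple check over incident records. The sketch below is illustrative only: the record layout and function names are hypothetical, and it assumes each incident records one duration per phase, measured however the team defines that phase.

```python
from datetime import timedelta

# Targets from the 1-5-10 theory: detect in 1 min, identify in 5, resolve in 10.
TARGETS = {
    "detect": timedelta(minutes=1),
    "identify": timedelta(minutes=5),
    "resolve": timedelta(minutes=10),
}


def mean(durations):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def check_1_5_10(incidents):
    """Map each phase to (mean duration, whether the target is met).

    incidents: list of dicts mapping phase name -> timedelta (hypothetical layout).
    """
    report = {}
    for phase, target in TARGETS.items():
        avg = mean([incident[phase] for incident in incidents])
        report[phase] = (avg, avg <= target)
    return report
```

A report of this shape makes it clear which of the three phases is the bottleneck when the overall recovery time misses its goal.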


Huan Xiong

Senior Engineer, Alibaba
Huan Xiong is a senior software engineer at Alibaba who focuses on the reliability of hosts, containers, and clusters.

Tuesday June 25, 2019 14:20 - 14:55


Observability in Service Mesh Powered by Envoy and Apache SkyWalking - Sheng Wu & Lizan Zhou, Tetrate
A service mesh offers a new angle on observability, regardless of architecture or language. Traditionally, we needed language agents or SDKs to observe application server status. Because a service mesh has full control of RPC traffic, observability is much easier to add without language-specific technology.

In this session, we will demonstrate an open source integration solution based on Envoy and SkyWalking. Without code injection technology or Istio Mixer, we can build telemetry from Envoy and analyze it in SkyWalking with good performance. Users get a service topology map, metrics graphs, request details, and error messages, all clearly visualized.
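To make the idea of deriving a topology map from proxy telemetry concrete, the sketch below aggregates per-request records into service-to-service edges with request and error counts. It does not reflect how Envoy or SkyWalking represent this data; the record fields (`source`, `destination`, `status`) are hypothetical.

```python
from collections import defaultdict


def build_topology(records):
    """Aggregate per-request records into service-to-service edges.

    records: iterable of dicts with hypothetical keys
             "source", "destination", and "status" (HTTP status code).
    Returns {(source, destination): {"requests": int, "errors": int}}.
    """
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    for record in records:
        edge = edges[(record["source"], record["destination"])]
        edge["requests"] += 1
        if record["status"] >= 500:
            # Count server-side failures as errors on this edge.
            edge["errors"] += 1
    return dict(edges)
```

Because the records come from the proxy rather than the application, the same aggregation works for services written in any language.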


Lizan Zhou

Software Engineer, Tetrate
Lizan Zhou is a Founding Engineer at Tetrate leading traffic management. He is a senior maintainer of Envoy and one of the core contributors to Istio. Previously he worked at Google Cloud, where he worked on security and networking for Istio and Cloud Endpoints...

Sheng Wu

Founding Engineer, Tetrate
I am a Founding Engineer at Tetrate. I lead SkyWalking, the Apache open source APM/observability analysis platform project, which is included in the CNCF cloud native landscape. I am a PMC member of Apache Incubator, and I take part in Apache Zipkin and ShardingSphere as a PMC member...

Tuesday June 25, 2019 15:05 - 15:40