Shanghai, China
June 24–26, 2019
Simultaneous translation will be provided for all keynote and breakout sessions.

Tuesday, June 25 • 14:20 - 14:55
1-5-10: How to Fast Recover Container Failure at Large Scale - XiongHuan, Alibaba

In the cloud era, container-based applications in enterprises are growing rapidly, and the likelihood of container failure grows with them because of manual operations, hardware failures, and other causes. Guaranteeing the reliability of containers at scale without increasing resource investment is therefore a major challenge for cloud platforms.

Alibaba has run millions of containers and put forward the 1-5-10 theory for recovering from container-related failures: MTTD (mean time to detect) is 1 min, MTTI (mean time to identify) is 5 min, and MTTR (mean time to resolve) is 10 min.

In this session we'll discuss how to increase the reliability of containers at large scale with 1-5-10:
1. How to build an efficient local agent that detects problems within 1 min;
2. How to diagnose container problems intelligently using an expert knowledge base;
3. How to recover from container problems automatically in a failure-driven way.
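
As a rough illustration of the 1-5-10 targets named in the abstract, the minimal Python sketch below checks one incident's timeline against them. It is not Alibaba's implementation: the Incident record, its field names, and the check_1_5_10 function are hypothetical, and it assumes each phase is measured as elapsed time since the failure began, which the abstract does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Targets taken from the abstract: detect in 1 min, identify in 5 min,
# resolve in 10 min.
TARGETS = {
    "detect": timedelta(minutes=1),
    "identify": timedelta(minutes=5),
    "resolve": timedelta(minutes=10),
}


@dataclass
class Incident:
    # Hypothetical record; the field names are illustrative only.
    started: datetime
    detected: datetime
    identified: datetime
    resolved: datetime


def check_1_5_10(incident: Incident) -> dict:
    """Return {phase: bool} saying whether each phase met its target.

    Assumption: every phase is measured as elapsed time since `started`.
    """
    elapsed = {
        "detect": incident.detected - incident.started,
        "identify": incident.identified - incident.started,
        "resolve": incident.resolved - incident.started,
    }
    return {phase: elapsed[phase] <= TARGETS[phase] for phase in TARGETS}


# Example timeline: detected after 45 s, identified after 6 min,
# resolved after 9 min.
t0 = datetime(2019, 6, 25, 14, 20, 0)
incident = Incident(
    started=t0,
    detected=t0 + timedelta(seconds=45),
    identified=t0 + timedelta(minutes=6),
    resolved=t0 + timedelta(minutes=9),
)
result = check_1_5_10(incident)
```

In this example only the identification phase misses its target, since 6 minutes exceeds the 5-minute budget.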


Huan Xiong

Senior Engineer, Alibaba
A senior software engineer at Alibaba who focuses on the reliability of hosts, containers, and clusters.

Tuesday June 25, 2019 14:20 - 14:55 CST
  KC+CNC - Observability