In the cloud era, container-based applications in enterprises are growing rapidly, and the likelihood of container failures grows with them, driven by manual operations, hardware failures, and other causes. Guaranteeing the reliability of containers at scale without increasing resource investment is therefore a major challenge for cloud platforms.
Alibaba has run millions of containers and put forward the 1-5-10 theory for recovering from container-related failures: an MTTD (mean time to detect) of 1 minute, an MTTI (mean time to identify) of 5 minutes, and an MTTR (mean time to resolve) of 10 minutes.
In this session we'll discuss how to increase the reliability of containers at large scale with 1-5-10: 1. how to build an efficient local agent that detects problems within 1 minute; 2. how to diagnose container problems intelligently using an expert knowledge base; 3. how to recover from container problems automatically in a failure-driven way.