RAIN is a cluster resource management system developed at LinkedIn. It manages resources for tens of thousands of hosts per cluster across multiple datacenters, including Azure, and supports scheduling of both long-running and batch jobs. It is integrated with the existing LinkedIn cluster management ecosystem.
The goal of our next-generation cluster management system is to support heterogeneous compute workloads quickly, improving developer productivity and server utilization. After evaluation, we decided to adopt the declarative API and extensible architecture of Kubernetes (K8s). Integrating with the existing ecosystem at LinkedIn scale raises quite a few adoption challenges.
We first give an overview of the LinkedIn cluster management ecosystem. We then discuss our evaluation process and adoption challenges, and share the lessons we learned during the production and integration process.
Abin Shahab is a Staff Engineer on LinkedIn’s Big Data Platform (BDP) team. He joined LinkedIn in 2017 and leads the Deep Learning infra team in BDP. He is a veteran KubeCon speaker.
Tengfei Mu is a Staff Engineering Manager on the Foundation team at LinkedIn, where he is responsible for leading and architecting the next-generation cluster management system. He is passionate about incrementally adopting the K8s ecosystem at LinkedIn. Before joining LinkedIn, he was Tech Lead...