Shanghai, China
June 24–26, 2019

Simultaneous translation will be provided for all keynote and breakout sessions.

Tuesday, June 25 • 17:30 - 18:05
Minimizing GPU Cost for Your Deep Learning on Kubernetes - Kai Zhang & Yang Che, Alibaba

More and more data scientists run their NVIDIA GPU-based deep learning tasks on Kubernetes. Meanwhile, over 40% of the cost has been found to be wasted on idle GPUs in the cluster. One important challenge, then, is how Kubernetes can help improve GPU usage efficiency.
In this talk we will introduce a GPU sharing solution on native Kubernetes. All design and implementation details will be discussed. Key topics include:
- How to define a GPU sharing API
- How to schedule shared GPUs in the Kubernetes cluster without changing the scheduler's core code
- How to integrate a GPU isolation solution with Kubernetes
A demo will show how TensorFlow users can run different jobs on the same GPU device in a Kubernetes cluster.
In practice, the solution remarkably improves overall GPU usage, especially for AI model development, debugging, and inference services.

Speakers
Kai Zhang

Staff Engineer, Alibaba
Kai Zhang is a staff engineer at Alibaba Cloud. He has worked on container service products and enterprise solution development for 3 years. Before that, he worked on deep learning platforms, cloud computing, distributed systems, and SOA for over 10 years. Recently, he is exploring...
Yang Che

Senior Engineer, Alibaba
Yang Che is a senior engineer at Alibaba Cloud. He works on the Alibaba Cloud container service team and focuses on Kubernetes and container-related product development. Yang also works on building an elastic machine learning platform on those technologies. He is an active contributor...



Tuesday June 25, 2019 17:30 - 18:05
620