Shanghai, China
June 24–26, 2019

Simultaneous translation will be provided for all keynote and breakout sessions.

Tuesday, June 25 • 13:35 - 14:10
Co-Location of CPU and GPU Workloads with High Resource Efficiency - Penghao Cen, Ant Financial & Jian He, Alibaba

Users run a variety of workloads in Kubernetes, including long-running services and AI batch jobs. GPU machines are normally dedicated to AI training, so their resource utilization is low at times.

Have you ever thought about co-locating different kinds of workloads on the same node so that you need fewer machines and spend less money?

In this talk, we will share our experience and practices in applying co-location mechanisms in a Kubernetes cluster.

In detail:
Why and how we created a new QoS class derived from BestEffort
Why and how we created a node-level cgroup for batch jobs
How we use a CRD named PodGroup to achieve gang scheduling
How we evaluate utilization
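
To make the gang scheduling item concrete, here is a minimal sketch of all-or-nothing placement. It is not the speakers' implementation: the abstract does not give the PodGroup schema, so the `min_member` field, the `Pod` type, and the `gang_schedule` helper are hypothetical names chosen for illustration, and the first-fit placement is an assumed strategy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pod:
    name: str
    gpus: int  # GPUs requested by this pod


@dataclass
class PodGroup:
    # Hypothetical fields; the real CRD schema is not given in the abstract.
    name: str
    min_member: int  # minimum number of pods that must start together
    pods: List[Pod]


def gang_schedule(group: PodGroup, free_gpus: Dict[str, int]) -> Optional[Dict[str, str]]:
    """Return a pod -> node assignment, or None if the gang cannot be placed.

    Placement is tried on a copy of the free-GPU map, so nothing is
    committed unless at least min_member pods fit.
    """
    trial = dict(free_gpus)
    assignment: Dict[str, str] = {}
    for pod in group.pods:
        # First fit over nodes in name order (an assumed, simple strategy).
        for node in sorted(trial):
            if trial[node] >= pod.gpus:
                trial[node] -= pod.gpus
                assignment[pod.name] = node
                break
    if len(assignment) < group.min_member:
        return None  # all-or-nothing: leave free_gpus untouched
    free_gpus.update(trial)  # commit the reservation
    return assignment
```

The point of the trial copy is that a partially placed training job never holds GPUs while waiting for the rest of its workers, which is the usual motivation for gang scheduling batch jobs.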

Over the past few months, we built a co-location cluster with more than 100 GPU nodes (NVIDIA Tesla P100) and more than 500 CPU nodes. We co-deployed long-running services and AI batch jobs on it and achieved a 10% increase in utilization.
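
The abstract does not say how utilization is measured. As one possible reading, the sketch below computes capacity-weighted cluster utilization and shows that an increase can be reported either in percentage points or relative to the baseline. The function name `cluster_utilization` and all numbers are illustrative assumptions, not figures from the talk.

```python
from typing import Iterable, Tuple


def cluster_utilization(nodes: Iterable[Tuple[float, float]]) -> float:
    """Capacity-weighted utilization for (used, capacity) pairs.

    Both values must be in the same unit, for example CPU cores.
    """
    pairs = list(nodes)
    total_capacity = sum(capacity for _, capacity in pairs)
    if total_capacity == 0:
        return 0.0
    return sum(used for used, _ in pairs) / total_capacity


# Illustrative numbers only: two 32-core nodes before and after co-location.
before = cluster_utilization([(8, 32), (4, 32)])
after = cluster_utilization([(12, 32), (8, 32)])

point_increase = after - before                 # absolute change in utilization
relative_increase = (after - before) / before   # change relative to the baseline
```

Summing usage and capacity across nodes, rather than averaging per-node ratios, keeps large nodes from being underweighted in the result.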

Jian He

Staff Engineer, Alibaba
Jian He is a Staff Engineer at Alibaba, where he works on container infrastructure to support Alibaba's massive workloads globally. Prior to that, he worked on the Hadoop team at Hortonworks, where he primarily contributed to the Hadoop open source community. He is also a Hadoop committer and...
Penghao Cen

Senior Engineer, Ant Financial
Penghao Cen is a Senior Engineer at Ant Financial (formerly known as Alipay). He is an active contributor and member of the Kubernetes and Kubeflow communities, focusing on resource management and scheduling. He primarily contributes to the kubeflow/tf-operator project (Tools for MachineLearning/Tensorflow...

Tuesday June 25, 2019 13:35 - 14:10 CST