Loading…
Shanghai, China
June 24–26, 2019
Click here for more information and registration

Simultaneous translation will be provided for all keynote and breakout sessions.
我们将为所有主题演讲和分组会议提供同声传译服务。

To view the Chinese version of this schedule please go here.
请点击此处查看中文版本。

Venue + Sponsor Showcase Map
场馆 + 赞助商展示区地图
Tuesday, June 25 • 16:00 - 16:35
Large Scale Distributed Deep Learning on Kubernetes Clusters - Yuan Tang, Ant Financial & Yong Tang, MobileIron

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Feedback form is now closed.
The focus of this talk is the deployments of large scale distributed deep learning with Kubernetes. The usage of operators to manage and automate training processes for machine learning are discussed. We share our experiences and compare two open source Kubernetes operators, tf-operator and mpi-operator in this talk. Both operators manage training jobs for TensorFlow but they have different distribution strategies, which lead to different performance results with respect to the utilization ratio among CPU, GPU, and network.

Deep learning tasks are both network and GPU intensive such that a proper optimization for orchestration is very important. There could easily be an imbalance leads to idle compute capacity which is too expensive for GPU nodes (compared with CPUs). We will share our experiences with the hope to provide helpful insight for better economics with machine learning tasks.

Speakers
avatar for Yong Tang

Yong Tang

Senior Director of Engineering, Ivanti
Yong Tang is Senior Director of Engineering at Ivanti. He is a core maintainer of CoreDNS and contributes to many container, cloud-native, and machine learning projects for the open source community. In addition to CoreDNS, he is a maintainer of Docker/Moby. He is also a maintainer... Read More →
avatar for Yuan Tang

Yuan Tang

Principal Software Engineer, Red Hat
Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he has led teams to build AI infrastructure and platforms at various companies, including Alibaba and Akuity. He's a project lead of Argo and Kubeflow, a maintainer of TensorFlow and XGBoost, and... Read More →



Tuesday June 25, 2019 16:00 - 16:35 CST
620