Seminar | Data Science and Learning

Enabling Efficient and Scalable Deep Learning on Supercomputers

DSL Seminar

Abstract: Recent years have seen the fusion of deep learning (DL) and high-performance computing. Domain scientists are exploring and exploiting DL techniques for classification, prediction, and reduction of simulation dimensionality. These DL applications are naturally supercomputing applications given their computation, communication, and I/O characteristics.

In this talk, I will present two works that enable highly scalable distributed DL training. The first focuses on the layer-wise adaptive rate scaling (LARS) algorithm and its application to ImageNet training on thousands of compute nodes with state-of-the-art validation accuracy. The second is FanStore, which enables efficient and scalable I/O for DL applications on supercomputers; with FanStore, we scale real-world applications to hundreds of nodes on CPU and GPU clusters with over 90% scaling efficiency.
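As background (this is the published LARS formulation, not necessarily the exact variant used in the work presented in the talk), LARS scales each layer's learning rate by a trust ratio between the layer's weight norm and gradient norm. A minimal sketch, where γ is the global learning rate, η the trust coefficient, and β the weight-decay coefficient:

```latex
% Sketch of the standard LARS update for layer l (background only).
% \gamma: global learning rate, \eta: trust coefficient, \beta: weight decay.
\[
  \lambda^{l} = \eta \,
    \frac{\lVert w^{l} \rVert}
         {\lVert \nabla L(w^{l}) \rVert + \beta \,\lVert w^{l} \rVert},
  \qquad
  w^{l} \leftarrow w^{l} - \gamma \,\lambda^{l}
    \bigl( \nabla L(w^{l}) + \beta \, w^{l} \bigr).
\]
```

Because λ^l adapts per layer, layers with small gradients relative to their weights are not starved of updates, which is what makes very large batch sizes (and hence thousands of nodes) usable for ImageNet training.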

Bio: Zhao Zhang is a computer scientist at the Texas Advanced Computing Center. His current research focuses on scalable deep learning on supercomputers. He received his Ph.D. in computer science from the University of Chicago.