Seminar | Argonne Leadership Computing Facility

Adaptive Parallelism Mapping in Dynamic Environments Using Machine Learning

ALCF Seminar

Abstract: Modern hardware platforms are parallel and diverse, ranging from mobile devices to data centers, and co-location of mainstream parallel applications is becoming increasingly common. The resulting resource contention can drastically degrade a program’s performance. In addition, the execution environment, composed of workloads and hardware resources, is dynamic and unpredictable. Efficiently matching program parallelism to machine parallelism under this uncertainty is hard. Mapping policies should anticipate these variations and make applications resilient to them.

This talk proposes solutions for mapping parallel programs in dynamic environments. The solutions use predictive modeling techniques to map programs adaptively by determining the best degree of parallelism. When evaluated on highly dynamic executions, they are shown to outperform default, state-of-the-art adaptive, and analytic approaches.

Next, I will introduce a transparent fault-tolerance approach for MPI that leverages the application checkpoint/restart mechanism used in scientific applications. I will then present a novel approach to optimizing applications running on heterogeneous systems. This work analyzes parallel codes and uses a machine learning model to decide the best data placement in the multilevel memory hierarchy of GPUs.

This seminar will be streamed.