The strength of CALCioM: reducing cross-application I/O contention

April 17, 2014

A major challenge in high-performance computing systems is reducing I/O interference between concurrent applications. Letting two concurrent applications share throughput equally may seem fair, but it can slow both of them down. Similarly, taking application size into consideration is not enough: interference also varies according to each application’s behavior.

For example, if a small application is running on 8 cores while a bigger one is running on 336, the throughput of the smaller application may be severely impacted. Even worse, however, is the possible effect on systemwide performance.

What is needed is a strategy that considers the behavior and constraints of each application—I/O rate, memory storage, and job duration, as well as size. To this end, a team of researchers from Argonne National Laboratory, Inria - Rennes Bretagne Atlantique Research Center, and ENS Rennes developed a common communication layer called CALCioM (Cross-Application Layer for Coordinated I/O Management), in the framework of the INRIA Rennes-Bretagne Atlantique-ANL-UIUC associate team project (http://www.irisa.fr/kerdata/data-at-exascale).

The team studied several popular I/O scheduling strategies, each of which may be optimal in some contexts but suboptimal in others. With serialization, accesses are handled on a first-come, first-served basis regardless of size; if the second application is considerably smaller, this strategy results in unnecessary wait time. With the interruption strategy, the second application interrupts the first; this strategy is effective when the second application is smaller, but it becomes ineffective and even counterproductive when the applications are of similar size. The interference strategy, in which both applications simply access the file system at the same time, can work well when the interference is low enough, for example, between two small applications, but can cause problems if any of the applications cannot afford the performance decrease.
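The trade-offs among the three strategies can be illustrated with a toy model. The sketch below is not CALCioM code: the names `IOPhase`, `serialize`, `interrupt`, and `interfere` are hypothetical, and the model assumes a fixed file-system bandwidth that is split evenly when two applications write at once, with no switching overhead. Real interference often costs more than an even split suggests, so the numbers are only indicative.

```python
from dataclasses import dataclass


@dataclass
class IOPhase:
    name: str
    arrival: float  # time (s) at which the application starts writing
    volume: float   # amount of data to write (GB)


# All three functions assume first.arrival <= second.arrival and return
# the time each application spends in its I/O phase (waiting + writing).

def serialize(first, second, bandwidth):
    """First come, first served: `second` waits until `first` is done."""
    end_first = first.arrival + first.volume / bandwidth
    end_second = max(second.arrival, end_first) + second.volume / bandwidth
    return {first.name: end_first - first.arrival,
            second.name: end_second - second.arrival}


def interrupt(first, second, bandwidth):
    """`second` preempts `first`, which resumes once `second` is done."""
    end_first = first.arrival + first.volume / bandwidth
    end_second = second.arrival + second.volume / bandwidth
    if second.arrival < end_first:  # the two phases overlap
        end_first += second.volume / bandwidth
    return {first.name: end_first - first.arrival,
            second.name: end_second - second.arrival}


def interfere(first, second, bandwidth):
    """Both write concurrently; bandwidth is split evenly while they overlap."""
    end_first = first.arrival + first.volume / bandwidth
    end_second = second.arrival + second.volume / bandwidth
    if second.arrival < end_first:  # the two phases overlap
        left_first = first.volume - (second.arrival - first.arrival) * bandwidth
        shared = min(left_first, second.volume)
        # Each writes `shared` GB at half speed, then the survivor runs alone.
        t = second.arrival + shared / (bandwidth / 2)
        end_first = t + (left_first - shared) / bandwidth
        end_second = t + (second.volume - shared) / bandwidth
    return {first.name: end_first - first.arrival,
            second.name: end_second - second.arrival}


big = IOPhase("big", arrival=0.0, volume=100.0)
small = IOPhase("small", arrival=1.0, volume=1.0)
for strategy in (serialize, interrupt, interfere):
    print(strategy.__name__, strategy(big, small, bandwidth=1.0))
```

With these made-up numbers (a 100 GB write already in progress, a 1 GB write arriving one second later, 1 GB/s of bandwidth), the small application spends 100 s in its I/O phase under serialization, 1 s under interruption, and 2 s under interference, while the large one goes from 100 s to 101 s in the latter two cases. Giving both applications similar volumes reverses the picture for interruption, which is the context dependence described above.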

“CALCioM selects a scheduling strategy based on a holistic view of the set of running applications and their respective I/O activities,” said Dries Kimpe, an assistant computer scientist in Argonne’s Mathematics and Computer Science Division. “The CALCioM framework allows applications running on a supercomputer to communicate and coordinate their I/O strategy, at run time, in order to avoid interfering with one another.”

CALCioM differs from traditional approaches in a number of ways. For example, it does not optimize each application individually, disregarding potential cross-application interference. It does not leave interference-avoiding strategies to the file system’s scheduler, which has no information about the constraints or freedom of each application and no way to differentiate I/O requests. It does not allow an application to force the interruption of another or to lock out other applications from accessing the file system at the same time. Rather, it provides the means by which applications can communicate. CALCioM can be transparently integrated into the I/O stack of the applications and uses the information they exchange to decide how they should behave.

“The key to choosing the best strategy is giving the applications a way to become aware of the I/O requirements of the other applications and to exchange this information,” said Kimpe.

For this purpose, the researchers implemented the various strategies in CALCioM and ran tests on both Argonne’s Blue Gene/P “Surveyor,” with 4,096 cores, and the French Grid’5000 testbed, with two clusters having a total of 960 cores. Benchmark applications were used for the tests; the study focused on collective write operations and write/write interference between two applications writing different amounts of data and running on different numbers of cores. The decision about the best strategy was made dynamically, based on information exchanged between the applications through CALCioM’s common communication layer.

“CALCioM always managed to make a decision that improved performance,” said Kimpe. For example, CALCioM was able to prevent a 14x slowdown of a small application competing with a larger one, at negligible cost to the latter, by allowing the interruption of its ongoing I/O operations. CALCioM thus opens a wide range of new possible scheduling optimizations through the sharing of I/O properties between applications.

Arguably, interference involving more than two applications raises far more potential complications. The next step, then, is to investigate the complex case of cross-coordinating more applications.

For further information, see

M. Dorier, G. Antoniu, R. Ross, D. Kimpe, and S. Ibrahim, “CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination,” accepted for presentation at the 28th IEEE International Parallel & Distributed Processing Symposium, Phoenix, AZ, May 2014.