
Formal Analysis of MPI-based Parallel Programs

Most parallel computing applications in high-performance computing use the Message Passing Interface (MPI) API. Given the fundamental importance of parallel computing to science and engineering research, application correctness is paramount. MPI was originally developed around 1993 by the MPI Forum, a group of vendors, parallel programming researchers, and computational scientists. However, the document defining the standard is not issued by an official standards organization but has become a de facto standard through near-universal adoption. MPI development continues, with MPI-2.2 released in 2009 and MPI 3.0 expected in 2012. The standard is published on the Web and as a book, along with several other books based on it; see, for example, Gropp et al. and Pacheco. Implementations are available in open source (MPICH2 and Open MPI), from software vendors, and from every vendor of HPC systems. MPI is widely cited; Google Scholar recently returned 39,600 hits for the term “+MPI +Message Passing Interface.”

MPI is designed to support highly scalable computing applications using more than 100,000 cores on, say, the IBM Blue Gene/P (see Figure 1) and Cray XT5. Many MPI programs represent dozens, if not hundreds, of person-years of development, including calibration for accuracy and performance tuning. Scientists and engineers worldwide use MPI in thousands of applications, including in investigations of alternate-energy sources and in weather simulation. For HPC computing, MPI is by far the dominant programming model; most (at some centers, all) applications running on supercomputers use MPI. Many application developers for exascale systems regard support for MPI as a requirement.

Still, the MPI debugging methods available to these developers are typically wasteful and ultimately unreliable. Existing MPI testing tools seldom provide coverage guarantees and often examine many essentially equivalent execution sequences, reducing testing efficiency. These methods fare even worse at large problem scales. Consider the costs of HPC bugs. A high-end HPC center costs hundreds of millions of dollars to commission, and the machines become obsolete within six years; in many centers, the annual electricity bill can run more than $3 million, and research teams apply for computer time through competitive proposals, spending years planning their experiments. To these costs, one must add the costs to society of relying on potentially defective software to inform decisions involving issues of great public importance (such as climate change).

Formal methods can play an important role in debugging and verifying MPI applications. Here, we describe existing techniques, including their pros and cons, and why they have value beyond MPI, addressing the general needs of future concurrency application developers who will inevitably use low-level concurrency APIs.

Historically, parallel systems have used either message passing or shared memory for communication. Compared to other message-passing systems noted for their parsimony, MPI supports a large number of cohesively engineered features essential for designing large-scale simulations; for example, MPI-2.2 specifies more than 300 functions, though most developers use only a few dozen in any given application.

MPI programs consist of one or more threads of execution with private memories (called "MPI processes") that communicate through message exchanges. The two most common kinds of exchange are point-to-point messages (such as sends and receives) and collective operations (such as broadcasts and reductions). MPI also supports non-blocking operations that help overlap computation and communication and persistent operations that make repeated sends/receives efficient. In addition, MPI allows processes and communication spaces to be structured using topologies and communicators. MPI's derived datatypes further enhance the portability and efficiency of MPI codes by enabling the architect to communicate noncontiguous data with a single MPI function call. MPI also supports a limited form of shared-memory communication based on one-sided communication. A majority of MPI programs are still written using the "two-sided" (message-passing-oriented) constructs we focus on through the rest of the article. Finally, MPI-IO addresses portable access to high-performance input/output systems.
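For readers unfamiliar with the API, the following minimal C sketch (our own illustration, not taken from any particular application; the token value and message counts are arbitrary) shows the two most common styles side by side: a point-to-point exchange and a collective reduction.

#include <mpi.h>
#include <stdio.h>

/* Run with at least two processes. */
int main(int argc, char **argv) {
    int rank, nprocs, token = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Point-to-point: rank 0 sends a token to rank 1. */
    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Collective: every process contributes its rank; rank 0 receives the sum. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d\n", sum);
    MPI_Finalize();
    return 0;
}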

MPI applications and libraries are written predominantly in C, C++, and/or Fortran. Languages that use garbage collection or managed runtimes (such as Java and C#) are rarely used in HPC; preexisting libraries, compiler support, and memory locality management drive these choices. Memory is a precious resource in large-scale systems; a rule of thumb is that an application cannot afford to consume more than one byte per FLOP. Computer memory is expensive and increases cluster energy consumption. Even when developing traditional shared-memory applications, system architects must work with low amounts of cache-coherent memory per core and manage data locality, something done routinely by MPI programmers. Computer scientists are also realizing that future uses of MPI will be in conjunction with shared-memory libraries (such as Pthreads) to reduce message-copy proliferation. While some MPI applications are written from scratch, many are built atop user libraries, including ParMETIS for parallel graph partitioning, ScaLAPACK for high-performance linear algebra, and PETSc for solving partial differential equations.

MPI processes execute in disjoint address spaces, interacting through communication commands involving deterministic, nondeterministic, collective, and non-blocking modes. Existing (shared-memory concurrent program) debugging techniques do not directly carry over to MPI, where operations typically match and complete out-of-program order according to an MPI-specific matches-before order. The overall behavior of an MPI program is also heavily influenced by how specific MPI library implementations take advantage of the latitude provided by the MPI standard.

An MPI program bug is often introduced when modeling the problem and approximating the numerical methods or while coding, including whole classes of floating-point challenges. While lower-level bugs (such as deadlocks and data races) are serious concerns, detecting them requires specialized techniques of the kind described here. Since many MPI programs are poorly parameterized, it is not easy for HPC developers to down-scale a program to a smaller instance and locate the bug. For these reasons, HPC developers need a variety of verification methods, each narrowly focused on subsets of correctness issues and making specific trade-offs. Our main focus here is formal analysis methods for smaller-scale MPI applications and semiformal analysis methods for the very large scale. For detecting MPI bugs in practice, formal analysis tools must be coupled with runtime instrumentation methods found in tools like Umpire, Marmot, and MUST, though much more research is needed in tool integration.

Dynamic analysis. MPI provides many nondeterministic constructs that free the runtime system to choose the most efficient way to carry out an operation but also mean a program can exhibit multiple behaviors when run on the same input, posing verification challenges; an example is a communication race arising from a "wildcard" receive, an operation that does not specify the source process of the message to be received, leaving the decision to the runtime system. Many subtle program defects are revealed only for a specific sequence of choices. Though random testing might happen upon one such sequence, it is hardly a reliable approach.
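A minimal illustration of such a race, written for this discussion rather than drawn from a real application: rank 0 posts a wildcard receive, and the runtime is free to match it against either sender first.

#include <mpi.h>
#include <stdio.h>

/* Run with 3 processes: ranks 1 and 2 race to match rank 0's wildcard receive. */
int main(int argc, char **argv) {
    int rank, msg;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Status st;
        /* Wildcard receive: the matching source is decided by the MPI runtime. */
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
        printf("first message came from rank %d\n", st.MPI_SOURCE);
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
    } else if (rank == 1 || rank == 2) {
        msg = rank;
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}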

In contrast, dynamic verification approaches control the exact choices made by the MPI runtime, using this control to methodically explore a carefully constructed subset of behaviors. For each such behavior, a number of properties may be verified, including absence of deadlocks, assertion violations, incompatible data payloads between senders and receivers, and MPI resource leaks. Using a formal model of the MPI semantics, a dynamic verifier can conclude that if no violations occur on the subset of executions, then no violation can occur on any execution. If even this reduced subset cannot be explored exhaustively, the developer can specify precise coverage criteria and obtain a lesser (but still quantifiable) degree of assurance. This approach was originally demonstrated in the VeriSoft tool and has the advantage of not requiring modifications to the program source code, compiler, or libraries.

Full-scale debugging. Traditional "step-by-step" debugging techniques are untenable for traces involving millions of threads. Later, in an expanded description of full-scale debugging, we describe a new debugging approach called Stack Trace Analysis that analyzes an execution trace and partitions the threads into equivalence classes based on their behavior. Experience on real large-scale systems shows that only a small number of classes typically emerge, and the information provided can help a developer isolate defects. While this approach is not comparable to the others covered here, in that the focus is on the analysis of one trace rather than reasoning about all executions, it provides a clear advantage in terms of scalability.

Symbolic analysis. The techniques discussed earlier are only as good as the set of inputs chosen during analysis. Defects revealed for very specific input or parameter values may be difficult to discover with these techniques alone. Symbolic execution is a well-known technique for identifying defects, described later in the section on symbolic analysis of MPI, including how it is applied to MPI programs. The TASS toolkit uses symbolic execution and state-enumeration techniques to verify properties of MPI programs, not only for all possible behaviors of the runtime system, but for all possible inputs as well. It can even be used to establish that two versions of a program are functionally equivalent, at least within specified bounds. On the other hand, implementing the symbolic execution technique requires sophisticated theorem-proving technology and a symbolic interpreter for all program constructs and library functions; for this reason, TASS supports only C and a subset of MPI. Moreover, it generally does not scale beyond a relatively small number of processes, though, as we show, defects that usually appear only in large configurations can often be detected in much smaller configurations through symbolic execution.

Static analysis. Compilers use static analyses to verify a variety of simple safety properties of sequential programs, working on a formal structure that abstractly represents some aspect of the program (such as a control-flow graph, or CFG). Extending these techniques to verify concurrency properties of MPI programs (such as deadlock freedom) requires new abstractions and techniques. Later, in the section on static analysis of MPI, we outline a new analysis framework targeting this problem that introduces the notion of a parallel CFG, or pCFG. The framework has the advantage that the pCFG is independent of the number of processes, essentially making it infinitely scalable. However, because automating these analyses is so difficult, they may require user-provided program annotations to guide them.

Dynamic Verification of MPI

Here, we explore two dynamic analysis approaches. The first, implemented by the tool ISP (see Figure 2), delivers a formal coverage guarantee with respect to deadlocks and local safety assertions; ISP has been demonstrated on MPI applications of up to 15,000 lines of code. Running on modern laptop computers, ISP can verify such applications for up to 32 MPI processes, provided the programs are mostly deterministic.

ISP's scheduler, as outlined in the figure, exerts centralized control over every MPI action. This centralized control limits ISP's scalability to at most a few dozen MPI processes and does not help programmers encountering difficulty at the higher ends of the scale, where user applications and library codes often use different algorithms. What if a designer has optimized an HPC computation to work efficiently on 1,000 processors and suddenly finds an inexplicable bug? Traditional HPC debugging support is severely lacking in terms of ensuring coverage goals. To address this limitation, some of the authors have built a tool called Distributed Analyzer of MPI, or DAMPI, which uses a distributed scheduler while still ensuring nondeterminism coverage. DAMPI scales demonstrably better than ISP.

Dynamic verification using ISP. For programs with nondeterministic MPI calls, simply modulating the absolute times at which MPI calls are issued (such as by inserting nondeterministic sleep durations, as performed by stress-testing tools) is ineffective, because most often it does not alter the way racing MPI sends match with nondeterministic MPI receives deep inside the MPI runtime. Also, such delays slow the entire testing process unnecessarily.

In ISP's active-testing approach (see Figure 3), if P2's MPI_Isend can match P1's MPI_Irecv, the test encounters a bug. But can such a match occur? Yes, and here's how: first, let P0 issue its non-blocking MPI_Isend call and P1 its non-blocking MPI_Irecv call; then allow the execution to cross the MPI_Barrier calls; after that, P2 can issue its MPI_Isend. The MPI runtime then faces a nondeterministic choice of matching either MPI_Isend. The system achieves this particular execution sequence only if the MPI_Barrier calls are allowed to match before the MPI_Irecv matches. Existing MPI testing tools cannot exert such fine control over MPI executions. By interposing a scheduler, as outlined in Figure 2, ISP can reorder, at runtime, MPI calls issued by the program. In the example, ISP's scheduler intercepts all MPI calls coming to it in program order and dynamically reorders the calls going into the MPI runtime (ISP's scheduler sends Barriers first, an order allowed by the MPI semantics), at which point it discovers the nondeterminism.
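The following C sketch is our reconstruction of the pattern just described (not the actual Figure 3 listing; tags, payloads, and the trailing deterministic receive are our assumptions).

#include <mpi.h>

/* Run with 3 processes.  P1's wildcard Irecv can be matched either by P0's
 * Isend (posted before the barrier) or by P2's Isend (posted after it).
 * If the runtime picks P2, the later deterministic receive from rank 2 can
 * never be satisfied and the run hangs; a tool like ISP forces both match
 * choices to be explored, so the bug cannot slip past testing. */
int main(int argc, char **argv) {
    int rank, buf = 0, data;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    data = rank;

    if (rank == 0) {                        /* P0 */
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {                 /* P1 */
        MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Recv(&buf, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 2) {                 /* P2 */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else {                                /* extra ranks just synchronize */
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}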

When ISP determines two matches could occur, it re-executes (replays from the beginning) the program in Figure 3 twice, once with the Isend from P0 matching the receive and once with the Isend from P2 matching it. To ensure these matches occur, ISP dynamically rewrites Irecv(from:*) into Irecv(from:0) and into Irecv(from:2) in these replays. If the algorithm did not do this but instead issued Irecv(from:*) into the MPI runtime, coverage of both sends would no longer be guaranteed. ISP discovers the maximal extent of nondeterminism through dynamic MPI call reordering and achieves scheduling control of relevant interleavings through dynamic API call rewriting. While pursuing relevant interleavings, ISP additionally detects three basic error conditions: deadlocks, resource leaks (such as MPI object leaks), and violations of C assertions in the code.

Developers should bear in mind that MPI programmers often use non-blocking MPI calls to enhance computation/communication overlap and nondeterministic MPI calls in master/worker patterns to detect which MPI process finishes first, so more work can be assigned to it. When these operations, together with "collective" operations (such as Barriers), are all employed in the same example, a developer can obtain situations like the one in Figure 3. The safety net provided by ISP and other such tools is therefore essential for efficiency-oriented MPI programming.

ISP guarantees MPI communication nondeterminism coverage under the given test harness and helps avoid exponential interleaving explosion primarily by avoiding redundantly examining equivalent behaviors (such as by not examining the n! different orders in which an MPI barrier call might be invoked); testing tools typically fall victim to this explosion. ISP also includes execution-space sampling options.

ISP has examined many large MPI programs, including those making millions of MPI calls. Some of the authors have also built the Graphical Explorer of Message passing (GEM) tool, which hosts the ISP verification engine. GEM is an official component of the Eclipse Parallel Tools Platform, or PTP, (PTP version 3.0 onward), making dynamic formal verification of MPI available seamlessly within a popular integrated development environment. GEM also serves as a formal-methods supplement to a popular MPI textbook by providing chapter examples as readily available MPI C projects.

Dynamic verification using DAMPI. A widely used complexity-reduction approach is to debug a given program after first suitably downscaling it. However, a practical difficulty in carrying out such debugging is that many programs are poorly parameterized. For them, if a problem parameter is reduced, it is often unclear whether another parameter should be reduced proportionally, logarithmically, or through some other relationship. A more serious difficulty is that some bugs are manifest only when a problem is run at scale. The algorithms employed by applications and/or the MPI library itself can change depending on problem scale. Also, resource bugs (such as buffer overflows) often show up only at scale.

While user-level dynamic verification supported by ISP resolves significant nondeterminism, testing at larger scales requires a decentralized approach where supercomputing power aids verification, an idea the authors implemented in their tool framework DAMPI (see Figure 4).

The key insight that allowed them to design the decentralized scheduling algorithm of DAMPI is that a nondeterministic operation, such as MPI_Irecv(MPI_ANY_SOURCE) or MPI_Iprobe(MPI_ANY_SOURCE), represents a point on the timeline of the issuing process when the operation commits to a match decision. It is natural for an HPC programmer to view each such event as starting an epoch, an interval stretching from the current nondeterministic event up to (but not including) the next nondeterministic event. All deterministic receives can be assigned the same epoch as the one in which they occur. Given that the epoch is defined by a nondeterministic receive matching some process's send, how can the tool determine all the other sends that could have matched it? The solution is to pick all the sends that are not causally after the nondeterministic receive (subject to MPI's "non-overtaking" rules). DAMPI determines these sends through an MPI-specific version of Lamport clocks, striking a good compromise between scalability and the risk of omissions.
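A heavily simplified sketch of the epoch idea follows; it is not DAMPI's implementation, the wrapper name is hypothetical, and the real tool tracks causality with an MPI-specific form of Lamport clocks rather than a plain local counter.

#include <mpi.h>

/* Per-process epoch counter (illustrative only). */
static int current_epoch = 0;

/* Hypothetical wrapper: a wildcard receive commits to a match decision, so it
 * starts a new epoch on the issuing process.  Deterministic receives are
 * simply attributed to the current epoch. */
int Epoch_Irecv(void *buf, int count, MPI_Datatype type, int src, int tag,
                MPI_Comm comm, MPI_Request *req)
{
    if (src == MPI_ANY_SOURCE)
        current_epoch++;          /* nondeterministic receive: new epoch */
    return MPI_Irecv(buf, count, type, src, tag, comm, req);
}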

Experimental results show DAMPI effectively tests realistic problems running on more than 1,000 CPUs by exploiting the parallelism and memory capacity of clusters. It has examined all benchmarks from the Fortran NAS Parallel Benchmark suite with instrumentation overhead of less than 10% compared to ordinary testing, while providing nondeterminism coverage that ordinary testing cannot.

Recent experiments by some of the authors found a surprising fact: None of the MPI programs in the NAS Parallel Benchmarks employing MPI_Irecv(MPI_ANY_SOURCE) calls actually exhibit nondeterminism under DAMPI. This means these benchmarks were "determinized," perhaps through additional MPI call arguments; the finding is further confirmation of the value of dynamic analysis in providing precise answers.

Symbolic Analysis of MPI

The basic idea of symbolic execution is to execute the program using symbolic expressions in place of the usual (concrete) values held by program variables. The inputs and initial values of the program are symbolic constants X0, X1, …, so called because they represent values that do not change during execution. Numerical operations are replaced by operations on symbolic expressions; for example, if program variables u and v hold values X0 and X1, respectively, then u+v will evaluate to the symbolic expression X0 + X1.

The situation is more complicated at a branch point. Suppose a branch is governed by the condition u+v > 0. Since the values are symbolic, it is not necessarily possible to say whether the condition evaluates to true or false; both possibilities must be explored. Symbolic execution handles this problem by introducing a hidden Boolean-valued symbolic variable, the path condition pc, to record the choices made at branch points. This variable is initialized to true. At a branch, a nondeterministic choice is made between the two branches, and pc is updated accordingly. To execute the branch on u+v > 0, pc would be assigned the symbolic value pc ∧ u+v > 0 if the true branch is selected; if this is the first branch encountered, pc will now hold the symbolic expression X0 + X1 > 0. If the false branch is chosen instead, pc will hold X0 + X1 ≤ 0. Hence the path condition records the condition the inputs must satisfy for a particular path to be followed. Model-checking techniques can then be used to explore all nondeterministic choices and verify a property holds on all executions or generate a test set. An automated theorem prover (such as CVC3) can be used to determine if pc becomes unsatisfiable, in which case the current path is infeasible and pruned from the search.
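As a concrete illustration (a generic example of ours, not tied to TASS), the comments below trace how the path condition evolves across the branch on u+v > 0 when u and v hold the symbolic constants X0 and X1.

/* Symbolically, u = X0 and v = X1; the path condition pc starts as true. */
double branch_example(double u, double v)
{
    double r;
    if (u + v > 0.0) {
        /* true branch taken:  pc becomes  true ∧ X0 + X1 > 0  */
        r = u + v;     /* r now holds the symbolic expression X0 + X1 */
    } else {
        /* false branch taken: pc becomes  true ∧ X0 + X1 <= 0 */
        r = 0.0;
    }
    return r;          /* each feasible path is explored under its own pc */
}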

One advantage of symbolic techniques is they map naturally to message-passing-based parallel programs. The Verified Software Lab's Toolkit for Accurate Scientific Software (TASS), based on CVC3, uses symbolic execution and state-exploration techniques to verify properties of such programs. The TASS verifier takes as input the MPI/C source program and a specified number of processes and instantiates a symbolic model of the program with that process count. TASS maintains a model of the state of the MPI implementation, including that of the message buffers. Like all other program variables, the buffered message data is represented as symbolic expressions. The TASS user may also specify bounds on input variables in order to make the model finite or sufficiently small. An MPI-specific partial-order-reduction scheme restricts the set of states explored while still guaranteeing that if a counterexample to one of the properties exists (within the specified bounds), a violation is reported. Examples are included in the TASS distribution, including cases where TASS reveals defects in MPI code (such as a diffusion-simulation code from the Functional Equivalence Verification Suite at http://vsl.cis.udel.edu/fevs/).

TASS can verify the standard safety properties, but its most important feature is the ability to verify that two programs are functionally equivalent; that is, given the same input, they always return the same output. This is especially useful in scientific computing, where developers often begin with a simple sequential version of an algorithm, then gradually add optimizations and parallelism. The production code is typically much more complex but intended to be functionally equivalent to the original. The symbolic technique used to compare two programs for functional equivalence is known as "comparative symbolic execution."

To illustrate the comparative symbolic technique, see Figure 6, where the sequential program reads n floating-point numbers from a file, sums the positive elements, and returns the result. A parallel version divides the file into approximately equal-size blocks. Each process reads one block into a local array and sums the positive elements in its block. On all processes other than process 0, this partial sum is sent to process 0, which receives the numbers, adds them to its partial sum, and outputs the final result.
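The sketch below approximates the two adders just described; Figure 6 itself is not reproduced here, file input is replaced by an in-memory array, adder_par follows the article's naming, and adder_seq is our label for the sequential version.

#include <mpi.h>

/* Sequential reference version: sum the positive elements of a[0..n-1]. */
double adder_seq(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        if (a[i] > 0.0)
            s += a[i];
    return s;
}

/* Parallel version: each rank sums the positive elements of its block, then
 * every rank other than 0 sends its partial sum to rank 0, which accumulates
 * the final result. */
double adder_par(const double *block, int blocklen, MPI_Comm comm) {
    int rank, nprocs;
    double s = 0.0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (int i = 0; i < blocklen; i++)
        if (block[i] > 0.0)
            s += block[i];

    if (rank != 0) {
        MPI_Send(&s, 1, MPI_DOUBLE, 0, 0, comm);
    } else {
        for (int p = 1; p < nprocs; p++) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, p, 0, comm, MPI_STATUS_IGNORE);
            s += partial;
        }
    }
    return s;   /* meaningful only on rank 0 */
}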

Ignoring round-off error, the two programs are functionally equivalent; given the same file, they output the same result. To see how the comparative symbolic technique establishes equivalence, consider the case n = nprocs = 2 and call the elements of the file X0 and X1. There are four paths through the sequential program, due to the two binary branches if a[i]>0.0. One of these paths, arising when both elements are positive, yields the path condition X0 > 0 ∧ X1 > 0 and output X0 + X1. The comparative technique now explores all possible executions of adder_par in which the initial path condition is X0 > 0 ∧ X1 > 0; there are many such executions due to the various ways the statements from the two processes can be interleaved. In each, the output is X0 + X1. A similar fact can be established for the other three paths through the sequential program. Taken together, these facts imply the programs will produce the same result on any input (for n = nprocs = 2).

The ability to uncover defects at small scales is an important advantage of symbolic approaches. Isolating and repairing a defect that manifests only in tests with thousands of processes and huge inputs is difficult. Several research projects have focused on making traditional debuggers scale to thousands of processes for just this reason. However, it would be more practical to force the same defect to manifest itself at smaller scales and then isolate the defect at those scales.

A real-life example illustrates this point: In 2008, a user reported a failure in the MPICH2 MPI implementation when calling the broadcast function MPI_Bcast with 256 processes and a message of just over count = 3,200 integers. Investigation revealed the defect was in a function used to implement broadcasts in specific situations (see Figure 7a). For certain inputs, the "size" argument (nbytes - recv_offset) to an MPI point-to-point operation, an argument that should always be nonnegative, could in fact be negative. For 256 processes and integer data (type_size = 4), this fault occurs if and only if 3,201 ≤ count ≤ 3,251.

The problematic function is guarded by the code in Figure 7b, referring to three compile-time constants (MPIR_BCAST_SHORT_MSG, MPIR_BCAST_LONG_MSG, and MPIR_BCAST_MIN_PROCS) defined elsewhere as 12,288, 524,288, and 8, respectively. Essentially, the function is called for "medium-size" messages only when the number of processes is a power of 2 and above a certain threshold. With these settings, the smallest configuration that would reveal the defect is 128 processes, with count = 3,073.

A symbolic execution technique that checks that the "size" arguments to MPI functions are always nonnegative would readily detect the defect. If the tool also treats the three compile-time constants as symbolic constants, the defect can manifest at the much smaller configuration of eight processes and count = 1 (in which case nbytes - recv_offset = −1). Such an approach would likely have detected this defect earlier and with much less effort.
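As a sketch of what such a check amounts to (this is not the actual MPICH2 source; the wrapper and the use of MPI_BYTE are our illustration, while nbytes and recv_offset follow the names above), the guard below states the property a symbolic tool would verify for all feasible inputs rather than for whichever concrete counts a test suite happens to try.

#include <assert.h>
#include <mpi.h>

/* Illustrative guard: the count handed to a point-to-point operation must be
 * nonnegative.  Symbolic execution discharges this assertion for every
 * feasible combination of inputs and (if made symbolic) tuning constants. */
void checked_recv(void *buf, long nbytes, long recv_offset,
                  int src, int tag, MPI_Comm comm)
{
    long size = nbytes - recv_offset;
    assert(size >= 0);                 /* the property violated in the 2008 report */
    MPI_Recv(buf, (int)size, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
}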

Arithmetic. In our analysis of the adder example, we interpreted the values manipulated by the program as mathematical real numbers and the numerical operations as the (infinite-precision) real operations. If instead these values are interpreted as (finite-precision) floating-point values and operations, the two programs are not functionally equivalent, since floating-point addition is not associative. Which is right? The answer is it depends on what the user is trying to verify. For functional equivalence, the specification and implementation are rarely expected to be "bit-level" equivalent (recall the adder example), so real equivalence is probably more useful for the task. TASS uses a number of techniques specialized for real arithmetic, such as putting all real-valued expressions into a canonical form (the quotient of two polynomials) to facilitate the matching of expressions. For other tasks (such as detecting the defect in Figure 7), bit-level reasoning is more appropriate. KLEE is another symbolic execution tool for (sequential) C programs that uses bit-precise reasoning. There is no reason these techniques could not be extended to parallel MPI-based programs.

Static Analysis of MPI

In the sequential arena, compiler techniques have been successful at analyzing programs and transforming them to improve performance. However, analyzing MPI applications is difficult for four main reasons: the number of MPI processes is both unknown at compile time and unbounded; since MPI processes are identified by numeric ranks, applications use complex arithmetic expressions to define the processes involved in communications; the meaning of ranks depends closely on the MPI communicators used by the MPI calls; and MPI provides several nondeterministic primitives (such as MPI_ANY_SOURCE and MPI_Waitsome). While some prior research (such as Strout et al.) explored analysis of MPI applications, none successfully addressed all these challenges.
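Even a routine nearest-neighbor exchange illustrates the second difficulty; the example below is a generic ring shift of our own, not code from the article.

#include <mpi.h>

/* Ring exchange: the communication partners depend on arithmetic over the
 * (statically unknown) rank and process count, which a static analysis must
 * model symbolically in order to match sends with receives. */
void ring_shift(int *out, int *in, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    int right = (rank + 1) % nprocs;
    int left  = (rank + nprocs - 1) % nprocs;
    MPI_Sendrecv(out, 1, MPI_INT, right, 0,
                 in,  1, MPI_INT, left,  0,
                 comm, MPI_STATUS_IGNORE);
}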

Some approaches treat MPI applications as sequential codes, making it possible to determine simple application behaviors (such as the relationship between writing to a buffer and sending the buffer). However, these approaches cannot represent or analyze the application’s communication topology. Other techniques require knowledge of the number of processes to be used at runtime, analyzing one copy of the application for each process. While this analysis can capture the application’s full parallel structure, it is inflexible and non-scalable.

Over the past few years, we have developed a novel compiler-analysis framework that extends traditional dataflow analyses to MPI applications, extracting the application's communication topology and matching the send and receive operations that may communicate at runtime. The framework requires no runtime bound on the number of processes and is formulated as a dataflow analysis over the Cartesian product of the control-flow graphs (CFGs) of all processes, which we refer to as a parallel CFG, or pCFG. The analysis procedure symbolically represents the execution of multiple sets of processes, keeping track of the sends and receives they issue. Process sets are represented through abstractions (such as lower and upper bounds on process ranks) and predicates (such as "ranks divisible by 4"). Sends and receives are periodically matched to each other, establishing the application's communication topology. Tool users can instantiate the analysis framework through a variety of "client analyses" that leverage the communication-structure information derived by the framework to propagate dataflow information, as they do with sequential applications. Analyses and transformations include optimizations, error detection and verification, and information-flow detection.

Finally, since topological information is key to a variety of compiler transformations and optimizations, our ongoing work focuses on source-code annotations that can be used to describe a given MPI application’s communication topology and other properties. The techniques will then exploit this information to implement novel scalable analyses and transformations to enable valuable optimizations in complex applications.
Conclusion

This article’s main objective is to highlight the fact that both formal and semi-formal methods are crucial for ensuring the reliability of message-passing programs across the vast scale of application sizes. Unfortunately, discussion of these techniques and approaches is rare in the literature. To address this lacuna, we presented the perspectives of academic researchers, as well as HPC researchers in U.S. national laboratories, engaged in cutting-edge HPC deployment. We propose a continuum of tools based on static analysis, dynamic analysis, symbolic analysis, and full-scale debugging, complemented by more traditional error-checking tools.

Unfortunately, we have only scratched the surface of a vast problem area. The shortage of formal-methods researchers interested in HPC problems is perhaps the result of the severe historical disconnect between "traditional computer scientists" and HPC researchers. This is especially unfortunate considering the disruptive technologies on the horizon, including many hybrid concurrency models for programming many-core systems. There are also emerging message-passing-based standards for embedded multicore systems (such as MCAPI), with designs and tool support that would benefit from lessons learned in the MPI arena.

We propose two approaches to accelerate use of formal methods in HPC: First and foremost, researchers in formal methods must develop verification techniques that are applicable to programs employing established APIs. This would help sway today's HPC practitioners toward being true believers and eventually promoters of formal methods. Moreover, funding agencies must begin tempering the hoopla around performance goals (such as "ExaFLOPs in this decade") by also setting formal correctness goals that lend essential credibility to the HPC applications on which science and engineering depend.

Acknowledgments

This work is supported in part by Microsoft, National Science Foundation grants CNS-0509379, CCF-0811429, CCF-0903408, CCF-0953210, and CCF-0733035, and Department of Energy grant ASCR DE-AC0206CH11357. Part of this work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.