Abstract communicators for task-based runtime systems
Nicolas Ducarton will rehearse his WAMTA'26 presentation during the TOPAL Working Group.
Speaker
Nicolas Ducarton
When
Thursday, February 12, 11:00
Where
room Alan Turing
Title
Abstract communicators for task-based runtime systems
Abstract
Massively parallel and heterogeneous architectures can be exploited efficiently using task-based runtime systems. These systems are used for ever longer-running problems on ever larger supercomputers, which heightens the risk of failure and makes fault-tolerance capabilities a desirable addition. To this end, StarPU has started including checkpoint-restart mechanisms directly inside the runtime system. But to tolerate actual failures, the runtime system must use a distributed-memory communication paradigm that enables the writing of fault-tolerant applications. The MPI Forum has recently started including the User-Level Failure Mitigation (ULFM) extension in the MPI Standard precisely to achieve this goal. ULFM defines the behavior of MPI when failures occur, along with a set of routines that support continued operation after failures. The communication engine and the task-based runtime system must thus be adapted to this specification. In particular, to recover full communication capabilities, it becomes necessary to dynamically replace MPI communicators that include failed processes with new communicators that exclude them. To this end, we introduce abstract communicators, a mechanism for replacing MPI communicators during execution, including communicators that have failed. This enables the application to continue submitting tasks without changing its communicators, even when a failure occurs. The runtime system re-orders processes without the application noticing. We then present the implementation challenges raised by this mechanism, notably handling requests posted on old communicators and switching to replacement communicators while still accepting new requests submitted by the application on its original communicators. We validate our approach with a working implementation in StarPU that incurs negligible overhead.