IPDPS rehearsal (Storm)
On Tuesday, May 20th, from 10:30 to 11:00 AM, we will have the pleasure of listening to three of our PhD students present their recent work as a rehearsal for IPDPS.
Speakers
Lana Scravaglieri, Diane Orhan, Albert d'Aviau de Piolant
When
Tuesday, May 20th, from 10:30 to 11:00 AM
Where
LaBRI's amphitheatre
Title Lana
Compiler, Runtime, and Hardware Parameters Design Space Exploration
Abstract Lana
HPC systems are increasingly complex, with many tunable parameters impacting application metrics—e.g., performance, energy consumption. The main challenges on these systems are finding the appropriate configuration for each application on a given system and understanding how configurations affect application metrics on that system. Both can be addressed with design space exploration (DSE). However, exploring all available configurations is costly due to the long execution time and the long setup time of these executions: it requires instrumenting the applications to collect data, compiling them with different options, and setting the parameters for each execution. DSE algorithms can greatly reduce exploration time by guiding which configuration to execute next, reaching the objective without evaluating all configurations. A DSE study thus requires implementing an exploration algorithm and automating parameter setting, application instrumentation and compilation, and metrics collection. This represents a huge overhead on top of the actual study, yet most DSE studies still do it from scratch.
To alleviate this setup cost, we propose a unified methodology for performing the exploration and implement it in the CORHPEX framework, which sets up configurations with compiler, runtime, and hardware parameters efficiently and flexibly. The framework lets users choose the exploration strategy, the design space to study, the applications to execute, and the metrics to collect independently, with little coding overhead. It is extensible with custom exploration algorithms and data readers.
We demonstrate the versatility and robustness of our framework on parallel codes, including the NAS, Rodinia, and LULESH benchmarks as well as real-world applications, on two systems exposing different parameters, with various DSE techniques and goals. We show that working with CORHPEX reduces code engineering overhead, allowing users to focus on the actual exploration, while exploration algorithms can speed up the execution by a factor of 10 while preserving 95% of the possible gains. Finally, we demonstrate the framework's potential for more advanced studies by training surrogate models of complex HPC applications, achieving over 93% accuracy.
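To give a feel for what a DSE algorithm does, here is a minimal random-search sketch over a synthetic design space. The parameter names, values, and cost model are all illustrative assumptions, not CORHPEX's actual interface; the point is only that a budgeted search evaluates far fewer configurations than the full space.

```python
import random

# Hypothetical design space (illustrative parameters, not CORHPEX's API).
SPACE = {
    "compiler_opt": ["-O1", "-O2", "-O3"],
    "omp_threads":  [1, 2, 4, 8],
    "cpu_freq_ghz": [1.2, 2.0, 2.8],
}

def evaluate(config):
    """Stand-in for running the instrumented application and collecting
    a metric (lower is better). Real studies would execute the code."""
    opt = {"-O1": 1.5, "-O2": 1.1, "-O3": 1.0}[config["compiler_opt"]]
    return opt * 8.0 / config["omp_threads"] + 0.5 * config["cpu_freq_ghz"]

def random_search(space, budget, seed=0):
    """Evaluate only `budget` configurations instead of the full space."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(budget):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        val = evaluate(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

best, score = random_search(SPACE, budget=10)
print(best, round(score, 2))  # 10 evaluations instead of all 36
```

Smarter strategies (Bayesian optimization, genetic algorithms, surrogate-guided search) follow the same loop but pick the next configuration from what they have learned so far.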
Title Diane
Scheduling Strategies for Partially-Replicable Task Chains on Two Types of Resources
Abstract Diane
The arrival of heterogeneous (or hybrid) multicore architectures on parallel platforms has brought new performance opportunities for applications and efficiency opportunities for systems. It has also increased the challenges of thread scheduling, as a task's execution time varies depending on whether it is placed on a big (performance) core or a little (efficient) one. In this paper, we focus on the challenges heterogeneous multicores bring to partially-replicable task chains, such as the ones that implement digital communication standards in Software-Defined Radio (SDR). Our objective is to maximize the throughput of these task chains while also minimizing their power consumption. We model this as a pipelined workflow scheduling problem using pipelined and replicated parallelism on two types of resources, whose objectives are to minimize the period and to use as many little cores as necessary. We propose two greedy heuristics (FERTAC and 2CATAC) and one optimal dynamic programming solution (HeRAD). We evaluate our solutions and compare the quality of their schedules (in period and resource utilization) and their execution times using synthetic task chains and an implementation of the DVB-S2 communication standard running on StreamPU. Our results demonstrate the benefits and drawbacks of the different proposed solutions. On average, FERTAC and 2CATAC achieve near-optimal solutions, with periods less than 10% worse than the optimum (HeRAD), while using fewer than 2 extra cores. These three scheduling strategies now enable programmers and users of StreamPU to transparently make use of heterogeneous multicore processors and achieve throughputs that differ from their theoretical maximums by less than 8% on average.
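To illustrate the objective being optimized, here is a toy sketch of period minimization for a replicated task chain on big/little cores. This is not FERTAC, 2CATAC, or HeRAD; the core speeds and workloads are assumed, and every stage is treated as fully replicable. The period of the chain is set by its slowest stage, so a naive greedy scheduler just hands the next available core to the current bottleneck.

```python
BIG_SPEED, LITTLE_SPEED = 2.0, 1.0  # relative core throughputs (assumed)

def throughput(big, little):
    return big * BIG_SPEED + little * LITTLE_SPEED

def period(works, alloc):
    """Period = execution time of the slowest replicated stage."""
    return max(w / throughput(b, l) for w, (b, l) in zip(works, alloc))

def greedy(works, n_big, n_little):
    """Seed every stage with one little core (assumes n_little >= #stages),
    then give each remaining core (big cores first) to the bottleneck."""
    alloc = [[0, 1] for _ in works]  # (big, little) cores per stage
    pool = [(1, 0)] * n_big + [(0, 1)] * (n_little - len(works))
    for d_big, d_little in pool:
        # find the stage currently defining the period
        i = max(range(len(works)),
                key=lambda j: works[j] / throughput(*alloc[j]))
        alloc[i][0] += d_big
        alloc[i][1] += d_little
    return alloc, period(works, alloc)

alloc, p = greedy(works=[8.0, 3.0, 5.0], n_big=2, n_little=5)
print(alloc, p)  # the heaviest stage ends up with the most capacity
```

The paper's heuristics additionally handle partial replicability and trade period against the number of big cores used, which this sketch ignores.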
Title Albert
Improving energy efficiency of HPC applications using unbalanced GPU power capping
Abstract Albert
Energy efficiency represents a significant challenge in the domain of high-performance computing (HPC). One potential key parameter for improving energy efficiency is power capping, a technique for controlling the power limit of a device such as a CPU or GPU. In this paper, we examine the impact of GPU power capping in the context of HPC applications using heterogeneous computing systems. To this end, we first conduct an extensive study of the impact of GPU power capping on a compute-intensive kernel, namely matrix multiplication (GEMM), on different Nvidia GPU architectures. Interestingly, such compute-intensive kernels are up to 30% more energy efficient when the GPU is capped at 55-70% of its Thermal Design Power (TDP). Using the best power-capping configuration provided by this study, we investigate how setting different power caps for the GPU devices of a heterogeneous computing node can improve the energy efficiency of the running application. We consider dense linear algebra task-based operations, namely matrix multiplication and Cholesky factorization. We show how the underlying runtime system scheduler can then automatically adapt its decisions to take advantage of the heterogeneous performance capability of each GPU. The results show that, for a given platform equipped with four GPU devices, applying a power cap on all GPUs improves the energy efficiency of matrix multiplication by up to 24.3% (resp. 33.78%) for double (resp. single) precision.
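The intuition behind the 55-70% sweet spot can be sketched with a toy model (assumed numbers, not the paper's measurements): below some cap fraction the GPU is power-bound and throughput falls slightly sub-linearly, while above it throughput saturates but power keeps growing, so GFLOPS-per-watt peaks below 100% TDP. In practice the cap itself is applied with tools such as `nvidia-smi -pl <watts>` or the NVML API.

```python
# Toy model of GEMM energy efficiency under a GPU power cap.
# TDP, the saturation knee, and the perf curve are all assumptions
# for illustration, not measured data from the paper.
TDP = 300.0   # watts, illustrative GPU
KNEE = 0.65   # assumed cap fraction where performance saturates

def perf(cap):
    """Assumed throughput (GFLOP/s) at a given cap fraction of TDP:
    slightly sub-linear below the knee, flat above it."""
    return 1000.0 * (cap / KNEE) ** 1.1 if cap < KNEE else 1000.0

def efficiency(cap):
    """Energy efficiency in GFLOPS per watt."""
    return perf(cap) / (cap * TDP)

caps = [0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8, 0.9, 1.0]
best = max(caps, key=efficiency)
gain = efficiency(best) / efficiency(1.0) - 1.0
print(f"best cap: {best:.0%}, efficiency gain vs. uncapped: {gain:.0%}")
```

With per-GPU (unbalanced) caps, the GPUs of one node run at different speeds, which is why a heterogeneity-aware runtime scheduler is needed to keep the faster devices busy.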