In this project, we propose to revisit the principles of existing resource and job management systems (RJMS) in light of the evolution of large-scale parallel systems, using a more malleable computational paradigm to make better use of resources while optimizing energy consumption.
The success of Energumen relies on tackling the following challenges, which are addressed in the different work packages.
Collect the most relevant data for energy.
We propose to instrument the processor units and switches with dedicated hardware (FPGAs and other dedicated cards) to obtain data, particularly data related to energy and performance. This will produce a deluge of data.
First, defining the appropriate sampling rate and extracting only the relevant data is a clear technical challenge.
Second, a challenge of another nature (at the software level) is to design sophisticated algorithms based on Machine Learning. These algorithms will enable self-tuned energy models that lead to better allocation decisions.
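As a minimal illustration of what a self-tuned energy model could look like, the sketch below fits a linear power model (idle power plus a term proportional to utilization) to measured samples by ordinary least squares. The linear form, the function names and the sample values are assumptions made for illustration only; they are not the models or data of the project.

```python
def fit_linear_power_model(utilization, power_watts):
    """Least-squares fit of power = idle + slope * utilization.

    Both arguments are equally long sequences of paired samples.
    Returns the tuple (idle, slope). The linear form is an assumption.
    """
    n = len(utilization)
    if n != len(power_watts) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_u = sum(utilization) / n
    mean_p = sum(power_watts) / n
    var_u = sum((u - mean_u) ** 2 for u in utilization)
    if var_u == 0:
        raise ValueError("utilization samples must not all be equal")
    cov = sum((u - mean_u) * (p - mean_p)
              for u, p in zip(utilization, power_watts))
    slope = cov / var_u
    idle = mean_p - slope * mean_u
    return idle, slope


def predict_power(model, utilization):
    """Predicted power draw for a given utilization in [0, 1]."""
    idle, slope = model
    return idle + slope * utilization


# Synthetic samples (made up): utilization fraction and measured watts.
samples_u = [0.0, 0.25, 0.5, 0.75, 1.0]
samples_p = [100.0, 125.0, 150.0, 175.0, 200.0]
model = fit_linear_power_model(samples_u, samples_p)
```

Refitting such a model as new samples arrive is one simple way to keep it tuned to the machine; a scheduler could then compare the predicted power of candidate allocations.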
Dynamic redimensioning of parallel jobs.
Malleable tasks have been widely studied in the scheduling community (mostly in idealized models targeting pure performance). We aim to extend this model to energy optimization, based on two mechanisms: speed scaling (malleability in time) and power-down (malleability in resources).
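To make the two mechanisms concrete, here is a small sketch under a textbook energy model, which is an assumption and not the model of the project: an active node draws a static power plus a dynamic power of speed**alpha, the job has perfect speedup, and unused nodes either keep drawing static power or are powered down. The function name and all parameter values are hypothetical.

```python
def cluster_energy(work, total_nodes, used_nodes, speed,
                   power_down_idle, alpha=3.0, static_power=1.0):
    """Energy and duration of one perfectly parallel job (idealized model).

    work            -- total amount of computation
    total_nodes     -- nodes in the cluster
    used_nodes      -- nodes allocated to the job (malleability in resources)
    speed           -- processing speed of each used node (malleability in time)
    power_down_idle -- if True, unused nodes are switched off and draw nothing
    Returns the tuple (energy, duration).
    """
    if not 0 < used_nodes <= total_nodes:
        raise ValueError("used_nodes must be in 1..total_nodes")
    if speed <= 0:
        raise ValueError("speed must be positive")
    duration = work / (used_nodes * speed)
    # Active nodes pay static plus dynamic power for the whole run.
    active = used_nodes * (static_power + speed ** alpha) * duration
    # Unused nodes pay static power unless they are powered down.
    idle_nodes = total_nodes - used_nodes
    idle = 0.0 if power_down_idle else idle_nodes * static_power * duration
    return active + idle, duration


# Power-down: the same job with and without switching off the unused nodes.
e_on, t_on = cluster_energy(8, 4, 2, 1.0, power_down_idle=False)
e_off, t_off = cluster_energy(8, 4, 2, 1.0, power_down_idle=True)

# Speed scaling: halving the speed doubles the duration.
e_slow, t_slow = cluster_energy(8, 4, 2, 0.5, power_down_idle=True)
```

In this toy setting, powering down the two unused nodes lowers the energy from 24 to 16 for the same duration. Halving the speed raises the energy to 18, because static power is paid over a run that is twice as long; with static_power=0 the slower run would be the cheaper one. This trade-off between static and dynamic power is why the two mechanisms have to be considered together.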
Reduce data movements to save energy.
Energy savings can also be obtained through smart allocations or reduced data movements. Data movements can be divided into communications inside nodes (corresponding mainly to memory management and heterogeneity) and communications between nodes (network, interconnect design, I/O).
However, to the best of our knowledge, no work has so far been devoted to designing energy-saving methods that combine enhanced allocations and reduced data movements.
Transferability from algorithm design to automatic code production in the actual OAR job and resource manager.
The first step is to integrate the two previous approaches and then develop the implementation. Simulation is the most widely used evaluation tool, but the transfer toward production must be streamlined. Transferability is necessary to evaluate the realism of the proposed algorithms.
This is French ANR project number ANR-18-CE25-0008.