Funded PhD position: Energy and performance monitoring and models towards sustainable Exascale computing

Context

High Performance Computing (HPC) usage is growing in fields ranging from climate science to chemistry research. The growing impact of these computations opens a field of research on how to manage and reduce their energy consumption. The NumPEx project aims to develop state-of-the-art skills and infrastructure in the field of exascale computing. One of the pillars of NumPEx focuses on making exascale computing sustainable.

To make informed cluster-level scheduling decisions and to provide feedback to users, information on the whole infrastructure is needed. At any time, several applications use cluster resources. Each of these applications uses the resources differently, leading to different patterns of power consumption. A high level of abstraction is needed to tackle the complexity of the large number of simultaneous applications. Several academic proofs of concept exist that replace time series of measurements with simplified, high-level representations of such applications (including resource and power consumption).

Objective

The objectives of the PhD are the following:

  • Monitoring of large-scale applications that have a stable behavior using limited data: detect the behavior of the whole application using only the monitoring data of a small number of servers; adapt the monitoring frequency to the needs.
  • Modeling and characterization of applications: detect when an application switches from one phase to another; determine properties of the phases (whether they are I/O-bound, memory-bound, CPU-bound…). Software will be developed to detect and characterize phases of HPC applications during their execution.
  • Modeling the impact of various leverages (DVFS, network and I/O reconfiguration…) on performance and energy.
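
To give a flavor of what detecting a phase switch can look like, here is a minimal Python sketch. It is not the method that will be developed during the PhD: it simply compares the mean of two adjacent sliding windows over a power time series and reports the point where the gap between them peaks. The function name, the window size, the threshold and the sample values are all assumptions made for illustration.

```python
from statistics import mean


def detect_phase_changes(samples, window=5, threshold=20.0):
    """Return the indices at which the signal appears to switch phase.

    Hypothetical helper: a change is reported where the mean of the
    `window` samples after an index differs from the mean of the
    `window` samples before it by more than `threshold`.
    """
    def gap(i):
        return abs(mean(samples[i:i + window]) - mean(samples[i - window:i]))

    changes = []
    last = len(samples) - window
    i = window
    while i <= last:
        if gap(i) > threshold:
            # Move forward to the index where the gap is largest.
            while i < last and gap(i + 1) > gap(i):
                i += 1
            changes.append(i)
            i += window  # skip ahead so the same change is not reported twice
        else:
            i += 1
    return changes


# Made-up power readings (W): 20 samples at 100 W, then 20 samples at 200 W.
power_w = [100.0] * 20 + [200.0] * 20
print(detect_phase_changes(power_w))
```

Real HPC traces are noisy and multi-dimensional (CPU, memory, I/O and network counters), so an actual detector would need to be more robust than this single-metric comparison.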

The PhD structure will be as follows:

  • State of the art on phase-based application models (such as https://theses.hal.science/tel-00946583)
  • Experiments to acquire data on actual HPC applications on multiple hardware configurations
  • Analysis of the data to build energy and performance models taking into account the hardware configuration
  • Analysis of the impact of reducing the amount of acquired data
  • A demonstrator using the phase detection system along with the model of leverages to drastically reduce the power consumption of HPC datacenters
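
As an illustration of the kind of question raised by reducing the amount of acquired data, the Python sketch below extrapolates the total power of an application from a subset of its nodes and measures the resulting error. It assumes that all nodes behave similarly, which is a simplification; the node names, the power values and both function names are made up for the example.

```python
def estimate_total_power(node_power_w, monitored_nodes):
    """Extrapolate total power (W) from the mean power of the monitored nodes."""
    sampled = [node_power_w[name] for name in monitored_nodes]
    return sum(sampled) / len(sampled) * len(node_power_w)


def relative_error(estimate, reference):
    """Relative error of an estimate with respect to a reference value."""
    return abs(estimate - reference) / reference


# Made-up instantaneous power readings (W), for illustration only.
node_power_w = {"node-1": 200.0, "node-2": 210.0, "node-3": 190.0, "node-4": 200.0}
reference = sum(node_power_w.values())

# Monitor the first k nodes only and see how the estimate degrades.
for k in range(1, len(node_power_w) + 1):
    monitored = list(node_power_w)[:k]
    estimate = estimate_total_power(node_power_w, monitored)
    print(k, estimate, relative_error(estimate, reference))
```

In practice the choice of monitored nodes and the monitoring frequency would both affect the error, which is part of what the analysis is meant to quantify.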

Monitoring software (such as MojitO/S) will be used during the PhD, and contributions to it may be made. A large-scale experimental platform (Grid'5000) will be used.

Expected skills and profile

  • Required: Master’s in computer science.
  • Strongly recommended: A taste for experimental approaches, C or Rust programming, Python or R data analysis.
  • Appreciated: Background in performance optimization, performance evaluation and modeling, usage of remote computing servers.

Practical details

The PhD will take place at IRIT, the largest computer science research institute in Toulouse, France. Our team SEPIA works on resource management in various distributed systems (cloud datacenters, HPC centers, edge architectures, IoT…) and is especially interested in the ecological transition, notably in reducing energy consumption and CO2 emissions and in using renewable energy.

The PhD will be supervised by Millian Poquet and Georges Da Costa in a convivial atmosphere :).

The PhD will be funded by the NumPEx collaborative project and will include several national meetings and visits to other French research institutes. The monthly gross salary will be 2100 €.

You can send us your application (cover letter + resume / short curriculum vitæ + transcript of records for the full master's degree) by email to millian.poquet@irit.fr and georges.da-costa@irit.fr.