Internship/project position: Real-time phase detection for large-scale HPC applications

Context

High-Performance Computing (HPC) is used in a growing range of fields, from climate science to chemical research. The increasing impact of these computations opens a field of research on how to manage and reduce their energy consumption. In the NumPEx project, we aim to develop state-of-the-art skills and infrastructures in the field of exascale computing. One of the pillars of NumPEx focuses on making exascale computing sustainable.

To make informed cluster-level scheduling decisions and to provide feedback to users, information on the whole infrastructure is needed. At any time, several applications use cluster resources. Each of these applications uses the resources differently, leading to different patterns of power consumption. A high level of abstraction is needed to tackle the complexity of the large number of simultaneous applications. Several academic proofs of concept exist that replace time series of measurements with simplified, high-level representations of such applications (including resource and power consumption).

Most existing tools produce raw time-series data. In our context, the MojitO/S (https://gitlab.irit.fr/sepia-pub/mojitos) monitoring tool monitors HPC applications on a single computer. It monitors operating system values, hardware performance counters, and the power consumption of CPUs and GPUs.

Objective

The objective of this internship is to aggregate the time-series data and convert it into phases during which the behavior of the application is constant. As an example, a classical HPC application would first be detected as IO-bound (reading the data), then as CPU-bound (computing the result), and finally as network-bound (sending the result to another application). The following publication https://inria.hal.science/hal-00925299/document already proposes an algorithm to detect the phases. This algorithm will be implemented and tested during the internship.
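
To give an idea of the expected output, here is a minimal Python sketch that turns a time series into phases. It is not the algorithm from the publication above: it simply labels each sample with its most used resource and merges consecutive samples that share a label. The sample format (a dictionary of normalized usages with the keys "io", "cpu" and "net") and the function names are assumptions made for illustration; they do not reflect the actual MojitO/S output format.

```python
from itertools import groupby


def dominant_resource(sample):
    """Return the name of the resource with the highest usage in one sample.

    `sample` is a hypothetical dict mapping a resource name to a usage
    normalized between 0 and 1 (not the real MojitO/S format).
    """
    return max(sample, key=sample.get)


def detect_phases(samples):
    """Merge consecutive samples with the same dominant resource.

    Returns a list of (label, first_index, last_index) tuples.
    """
    labels = [dominant_resource(s) for s in samples]
    phases = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        phases.append((label, start, start + length - 1))
        start += length
    return phases


# Illustrative, hand-written samples: reading, then computing, then sending.
samples = [
    {"io": 0.9, "cpu": 0.2, "net": 0.1},
    {"io": 0.8, "cpu": 0.3, "net": 0.1},
    {"io": 0.1, "cpu": 0.9, "net": 0.0},
    {"io": 0.1, "cpu": 0.95, "net": 0.1},
    {"io": 0.0, "cpu": 0.9, "net": 0.2},
    {"io": 0.0, "cpu": 0.1, "net": 0.7},
]

for label, first, last in detect_phases(samples):
    print(f"{label}-bound phase: samples {first} to {last}")
```

On real measurements, such a naive labeling would be sensitive to noise, which is one reason why a dedicated phase detection algorithm is needed.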

Expected skills and profile

  • Required: Currently enrolled in a master’s program in computer science.
  • Strongly recommended: A taste for experimental approaches, C or Rust programming.
  • Appreciated: Background in performance optimization, performance evaluation and modeling, usage of remote computing servers.

Practical details

The internship will take place at IRIT, the largest computer science research institute in Toulouse, France. Our team, SEPIA, works on resource management in various distributed systems (cloud datacenters, HPC centers, edge architectures, IoT…) and is especially interested in the ecological transition, notably by reducing energy consumption and CO2 emissions and by using renewable energy.

The internship will be supervised by Millian Poquet and Georges Da Costa in a convivial atmosphere :).

It will be funded by the NumPEx collaborative project. The monthly gross salary will be 591 €.

An open PhD position funded by the NumPEx collaborative project and closely related to this internship is expected to begin by September/October 2024.

You can send us your application (cover letter + resume / short curriculum vitæ + transcript of records for the full bachelor and current master) by email to millian.poquet@irit.fr and georges.da-costa@irit.fr.