Internship/project position: Real-time monitoring of distributed systems (hardware performance counters, RAPL, ...) for HPC

Context

The use of High Performance Computing (HPC) is growing in fields ranging from climate science to chemical research. The increasing impact of these computations opens a field of research on how to manage and reduce their energy consumption. In the NumPEx project, we aim to develop state-of-the-art skills and infrastructures in the field of exascale computing. One of the pillars of NumPEx focuses on making exascale computing sustainable.

To make informed cluster-level scheduling decisions and to provide feedback to users, information on the whole infrastructure is needed. At any time, several applications use cluster resources. Each of these applications uses the resources differently, leading to different patterns of power consumption. A high level of abstraction is needed to tackle the complexity of the large number of simultaneous applications. Several academic proofs of concept exist that simplify and use high-level representations (including resource and power consumption) of such applications instead of time series of measurements.

Most of these tools are single-node. In our context, the MojitO/S (https://gitlab.irit.fr/sepia-pub/mojitos) monitoring tool monitors HPC applications on a single computer. It monitors operating system values, hardware performance counters, and the power consumption of CPUs and GPUs.

Objective

The objective of this internship is to develop new distributed software that aggregates monitoring data from several nodes. The aggregation will be done in real time, as the data arrive on the monitoring server. One key feature of the developed software will be the management of the temporal heterogeneity of measurements. In other words, we expect the software to generate a consistent aggregation even when the reference time or the monitoring frequency differs among nodes or over time.
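
To illustrate what temporal heterogeneity means in practice, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the expected design: the names `Sample`, `aggregate`, `clock_offsets` and `window_s` are hypothetical, the clock offset of each node is assumed to be already known, and the samples are processed as a batch rather than as a real-time stream. Each sample is shifted to a common reference time, then averaged per node inside fixed windows so that a node sampling more often does not weigh more than the others.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One power reading (hypothetical format, not the MojitO/S output)."""
    node: str
    timestamp: float  # seconds, expressed in the node's own clock
    power_w: float    # instantaneous power in watts


def aggregate(samples, clock_offsets, window_s=1.0):
    """Return {window_start: summed per-node mean power in watts}.

    clock_offsets maps a node name to (node clock - reference clock),
    in seconds. Nodes missing from the mapping are assumed synchronized.
    """
    # window start -> node -> readings that fall into that window
    per_window = defaultdict(lambda: defaultdict(list))
    for s in samples:
        # Bring the timestamp back to the common reference time.
        t = s.timestamp - clock_offsets.get(s.node, 0.0)
        window = math.floor(t / window_s) * window_s
        per_window[window][s.node].append(s.power_w)

    result = {}
    for window in sorted(per_window):
        nodes = per_window[window]
        # Average per node first, so the sampling frequency of a node
        # does not change its weight, then sum over nodes.
        result[window] = sum(sum(v) / len(v) for v in nodes.values())
    return result


if __name__ == "__main__":
    samples = [
        # Node "a": 1 Hz, clock aligned with the reference.
        Sample("a", 0.1, 100.0), Sample("a", 1.1, 110.0),
        # Node "b": 2 Hz, clock running 100 s ahead of the reference.
        Sample("b", 100.2, 50.0), Sample("b", 100.7, 70.0),
        Sample("b", 101.2, 60.0), Sample("b", 101.7, 60.0),
    ]
    print(aggregate(samples, {"b": 100.0}))
```

A real-time version would additionally have to decide when a window can be closed, how to handle late or missing samples, and how to estimate clock offsets that drift over time. These questions are part of the internship.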

Expected skills and profile

  • Required: Currently enrolled in a master’s program in computer science.
  • Strongly recommended: A taste for experimental approaches, C or Rust programming.
  • Appreciated: Background in performance optimization, performance evaluation and modeling, use of remote computing servers.

Practical details

The internship will take place at IRIT, the largest computer science research institute in Toulouse, France. Our team, SEPIA, works on resource management in various distributed systems (cloud datacenters, HPC centers, edge architectures, IoT…) and is especially interested in the ecological transition, notably by reducing energy consumption and CO2 emissions and by using renewable energy.

The internship will be supervised by Millian Poquet and Georges Da Costa in a convivial atmosphere :).

The internship will be funded by the NumPEx collaborative project. The monthly gross salary will be 591 €.

An open PhD position funded by the NumPEx collaborative project and closely related to this internship is expected to begin by September/October 2024.

You can send us your application (cover letter + resume / short curriculum vitæ + transcript of records for the full bachelor’s and current master’s degrees) by email to millian.poquet@irit.fr and georges.da-costa@irit.fr.