Internship/project position: Sustainable monitoring of large-scale HPC applications: Reducing data amount to save energy

Context

High-performance computing (HPC) usage is growing in fields ranging from climate science to chemical research. The increasing impact of these computations opens a field of research on how to manage and reduce their energy consumption. In the NumPEx project, we aim to develop state-of-the-art skills and infrastructures in the field of exascale computing. One of the pillars of NumPEx focuses on making exascale computing sustainable.

To make informed cluster-level scheduling decisions and to provide feedback to users, information on the whole infrastructure is needed. At any time, several applications use cluster resources. Each of these applications uses the resources differently, leading to different patterns of power consumption. A high level of abstraction is needed to tackle the complexity of the large number of simultaneous applications. Several academic proofs of concept exist that simplify such applications and use high-level representations of them (including resource and power consumption) instead of time series of measurements.

Most existing tools produce raw time-series data on all nodes of a cluster. If an application spans several thousand cores, the monitoring data becomes enormous while containing little information. Indeed, in such large-scale applications, most computers run the same tasks, so monitoring all of them yields largely redundant data.

Objective

The objective of this internship is to propose and develop a monitoring framework that can reconfigure the monitoring system in real time depending on the observed data. Its two main capabilities will be to:

  • Adapt the number of monitored computers according to the diversity of behavior among these computers.
  • Adapt the frequency of measurements on each computer depending on the dynamics of the application.
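
To make these two capabilities more concrete, here is a minimal Python sketch of what they could look like. It is an illustration only, not the framework to be developed: the function names, the distance measure (mean absolute difference between power traces), the variability measure (standard deviation) and all thresholds are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def select_representatives(profiles, tolerance):
    """Pick the nodes to keep monitoring (hypothetical helper).

    profiles: dict mapping a node name to its recent power samples (watts),
              all lists having the same length.
    tolerance: maximum mean absolute difference (watts) for two nodes to be
               considered as behaving the same way (assumed criterion).
    """
    representatives = []
    for node, samples in sorted(profiles.items()):
        for rep in representatives:
            distance = mean(abs(a - b) for a, b in zip(samples, profiles[rep]))
            if distance <= tolerance:
                break  # this node behaves like an already monitored node
        else:
            # no similar representative was found: keep monitoring this node
            representatives.append(node)
    return representatives


def next_period(samples, period, threshold=5.0, min_period=0.1, max_period=10.0):
    """Return the next sampling period in seconds (hypothetical helper).

    The period is halved when the recent samples vary a lot and doubled when
    they are stable, then clamped to [min_period, max_period].
    """
    if pstdev(samples) > threshold:
        period /= 2  # dynamic phase: measure more often
    else:
        period *= 2  # stable phase: measure less often
    return min(max(period, min_period), max_period)


if __name__ == "__main__":
    profiles = {
        "n1": [100, 101, 100],
        "n2": [100, 100, 101],  # nearly identical to n1
        "n3": [200, 180, 220],  # different and more variable behavior
    }
    print(select_representatives(profiles, tolerance=2.0))
    for node, samples in profiles.items():
        print(node, next_period(samples, period=1.0))
```

In this toy setting, n2 is dropped because it behaves like n1, and n3 is sampled more often than n1 because its power consumption varies more. A real system would have to take such decisions online, at scale, and with low overhead, which is part of the problem to study.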

Expected skills and profile

  • Required: Currently enrolled in a master’s program in computer science.
  • Strongly recommended: A taste for experimental approaches, C or Rust programming, Python or R data analysis.
  • Appreciated: Background in performance optimization, performance evaluation and modeling, usage of remote computing servers.

Practical details

The internship will take place at IRIT, the largest computer science research institute in Toulouse, France. Our team, SEPIA, works on resource management in various distributed systems (cloud datacenters, HPC centers, edge architectures, IoT…) and is especially interested in the ecological transition, notably by reducing energy consumption and CO2 emissions and by using renewable energy.

The internship will be supervised by Millian Poquet and Georges Da Costa in a convivial atmosphere :).

It will be funded by the NumPEx collaborative project. The gross monthly salary will be 591 €.

An open PhD position funded by the NumPEx collaborative project and closely related to this internship is expected to begin by September/October 2024.

You can send us your application (cover letter + resume / short curriculum vitæ + transcript of records for the full bachelor and current master) by email to millian.poquet@irit.fr and georges.da-costa@irit.fr.