Replaying with feedback: towards more realistic HPC simulations

Topic

Researchers use simulations to compare the performance (execution time, energy efficiency, …) of different scheduling algorithms on High-Performance Computing (HPC) platforms. The most common method is to replay historical workloads recorded on real HPC infrastructures (such as those available in the Parallel Workloads Archive): jobs are submitted to the simulation at the same timestamps as in the original log.

A major drawback of this method is that it does not preserve the submission behavior of the platform's users. In reality, when the scheduling algorithm performs better and jobs finish earlier, users tend to submit their next jobs earlier as well. We propose to tackle this problem with a replay with feedback [1]. There are different ways to do so. For example, instead of preserving the original submission dates in the simulation, one can preserve the thinking time between jobs (i.e. the time elapsed between the end of a job and the submission of the next one). Alternatively, one can deduce “working sessions” for each user from the log and replay the jobs accordingly.
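To make the thinking-time idea concrete, here is a minimal Python sketch for a single user whose jobs are submitted one after another. It is an illustration under stated assumptions, not the method of [1] nor Batsim or Batmen code: the `Job` class, the `think_times` and `replay_with_feedback` functions and the `turnaround` callback (standing in for whatever the simulated scheduler would produce) are all hypothetical names, and a job submitted before the previous one ended is simply given a thinking time of zero.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    job_id: int
    submit: float  # submission time in the original log (seconds)
    finish: float  # completion time in the original log (seconds)


def think_times(jobs: List[Job]) -> List[float]:
    """Thinking time before each job of one user, measured on the original log."""
    times = [0.0]  # the first job has no predecessor
    for prev, cur in zip(jobs, jobs[1:]):
        # Simplifying assumption: overlapping jobs get a thinking time of 0.
        times.append(max(0.0, cur.submit - prev.finish))
    return times


def replay_with_feedback(
    jobs: List[Job], turnaround: Callable[[Job], float]
) -> List[float]:
    """Return new submission times that preserve thinking times, not dates.

    `turnaround(job)` is a hypothetical stand-in for the simulator: it gives
    the simulated delay between the submission and the completion of `job`.
    """
    new_submits: List[float] = []
    prev_finish = None
    for job, think in zip(jobs, think_times(jobs)):
        if prev_finish is None:
            submit = job.submit  # the first job keeps its original date
        else:
            submit = prev_finish + think  # feedback from the simulated finish
        new_submits.append(submit)
        prev_finish = submit + turnaround(job)
    return new_submits


# Toy log for one user (made-up values, not taken from a real trace).
log = [Job(1, 0.0, 100.0), Job(2, 160.0, 300.0), Job(3, 330.0, 400.0)]

# Suppose the simulated scheduler halves every turnaround time.
print(replay_with_feedback(log, lambda j: (j.finish - j.submit) / 2))
```

In this toy log the thinking times are 60 s and 30 s. When the simulated scheduler is twice as fast, the second and third jobs are submitted at 110 s and 210 s instead of 160 s and 330 s, whereas a classic replay would keep the original dates regardless of the scheduler.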

Objective of the internship

  • Review the literature on replay with feedback for HPC simulations
  • Propose different replay models and implement them in the datacenter simulator Batsim, using the Batmen layer, which enables the simulation of users
  • Conduct an experimental campaign to highlight the characteristics of each model

Expected ability of the student

  • Programming skills (C++ or Python)
  • Taste for experimental methods
  • Knowledge of distributed systems and scheduling is a plus

Practical details

The internship will take place at IRIT, the largest computer science research institute in Toulouse. Our team SEPIA works on resource management in distributed systems under environmental constraints (energy consumption, CO2 emissions, …).

The student will be supervised by Maël Madon (PhD student, mael.madon@irit.fr), Millian Poquet (MCF, millian.poquet@irit.fr) and Georges Da Costa (MCF HdR, georges.da-costa@irit.fr) in a friendly atmosphere :). A computer and an office will be provided, as well as a monthly internship stipend of ~600€.

Our team also offers other internship topics: check them here, and do not hesitate to contact us.

Bibliography

[1] N. Zakay and D. G. Feitelson, “Preserving user behavior characteristics in trace-based simulation of parallel job scheduling,” in Proceedings of the 8th ACM International Systems and Storage Conference, Haifa, Israel, May 2015, pp. 1–1. doi: 10.1145/2757667.2778191