Parallelism and interference 101

Tcarle / July 26, 2022 / Uncategorized

The MeSCAliNe project deals with so-called interference in parallel architectures. In this blog post we present, in an informal way, what is meant by interference, as well as some problems interference causes with respect to timing predictability.

Modern applications consist of a set of functions that must be executed under certain constraints. For example, in a feed-forward neural network, the function corresponding to the computations of the first layer must be executed before the function corresponding to the second layer, for the same input. This is an example of a functional constraint: failing to uphold such constraints means computing a wrong output. Another example, in a real-time system, is the necessity to compute a function before a certain deadline: even if the computed output is functionally correct, failing to provide it before a certain date may lead to catastrophic system failure (e.g. a car crash if the braking system does not respond in time).

Modern processor architectures make it possible to compute these functions in parallel with one another, or even to break the functions themselves into smaller parallel logical computation units. On a single-core CPU, this can be achieved by using separate threads of computation for the different functions/computation units, and by having a scheduler choose which thread executes when. Although the threads execute in sequence (not really in parallel), this mechanism masks the long-latency memory accesses of the threads, thus increasing the average performance of the system. Moreover, it allows setting priorities on the different threads, which is crucial in real-time systems and more generally in operating systems. This contention between threads for access to a sequential computing resource (the processor) was identified long ago in the real-time community, and has been characterized as interference in the Worst-Case Response Time (WCRT) analysis of task (thread) systems. A classical fixpoint method is used to compute the influence of this kind of interference [1].
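To give a flavor of this classical fixpoint method, here is a minimal sketch of the standard response-time recurrence for fixed-priority preemptive scheduling, R = C_i + Σ_j ⌈R/T_j⌉·C_j over the higher-priority tasks j. The task set and parameter names are illustrative, not taken from the project:

```python
import math

def wcrt(task_index, tasks, deadline_bound=10**6):
    """Worst-case response time of tasks[task_index] under fixed-priority
    preemptive scheduling. Tasks are (C, T) pairs (WCET, period), sorted
    by decreasing priority (index 0 = highest priority)."""
    C_i, _ = tasks[task_index]
    higher = tasks[:task_index]  # tasks that can preempt task i
    R = C_i
    while True:
        # Interference: each higher-priority task j releases ceil(R / T_j)
        # jobs within a window of length R, each costing C_j.
        interference = sum(math.ceil(R / T_j) * C_j for (C_j, T_j) in higher)
        R_next = C_i + interference
        if R_next == R:
            return R          # fixpoint reached
        if R_next > deadline_bound:
            return None       # divergence: task set is unschedulable
        R = R_next

# Example task set (hypothetical): (WCET, period)
tasks = [(1, 4), (2, 6), (3, 12)]
print([wcrt(i, tasks) for i in range(3)])  # → [1, 3, 10]
```

The iteration converges because the right-hand side is monotone in R; it either reaches a fixpoint (the WCRT) or exceeds the bound.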

In a perfect world, this blog post would finish here. However, if the processor is equipped with caches (which is now pretty standard), a thread that preempts another thread in order to execute may evict from the caches some memory blocks that the preempted thread still needs. As a consequence, when its execution resumes, the preempted thread must reload these blocks into the caches from main memory, thus lengthening its execution. This performance loss is a problem in itself, but in a hard real-time context it becomes even worse: the worst-case performance loss (known as the Cache-Related Preemption Delay – CRPD) must be computed offline and added to the worst-case execution time (WCET) of the preempted thread when performing the schedulability analysis or WCRT analysis of the application. CRPD computation [2] leads to overestimations by construction, and if no assumption/restriction is made as to which thread can preempt which, the resulting system-wide overestimation may grow dramatically.
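As an illustration, one classical way to bound the CRPD, in the spirit of the survey cited above [2], intersects the useful cache blocks (UCBs) of the preempted task with the evicting cache blocks (ECBs) of the preempting task. The block reload time and the block sets below are made-up values, not measurements:

```python
def crpd_bound(ucbs_preempted, ecbs_preempting, block_reload_time):
    """Upper bound on the cache-related preemption delay: every useful
    cache block (UCB) of the preempted task that the preempting task may
    evict (one of its ECBs) may have to be reloaded from main memory."""
    evicted_useful = set(ucbs_preempted) & set(ecbs_preempting)
    return block_reload_time * len(evicted_useful)

# Hypothetical cache-set indices and a 10-cycle reload penalty:
ucbs = {1, 2, 3, 4}   # blocks the preempted task still needs
ecbs = {2, 4, 5}      # blocks the preempting task may touch
print(crpd_bound(ucbs, ecbs, block_reload_time=10))  # → 20
```

This per-preemption bound is then folded into the interference term of the response-time analysis, which is exactly where the overestimation compounds when any thread may preempt any other.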

In multicore CPUs, threads may run in parallel on separate cores. However, separate cores share some hardware components: L2/L3 caches, memory buses, memory controllers, etc. Shared caches suffer from the same shortcomings we described in the previous paragraph, except that threads no longer have to preempt each other to generate cache-related delays. The memory buses and controllers are sequential components: when multiple cores try to access them in parallel, they serialize the requests, thus creating a timing penalty for one of the threads. This is another source of interference [3]. These contentions no longer appear at the granularity of preemptions (i.e. at the task-system level), which generated overestimation but remained manageable in terms of analysis. They appear at the granularity of memory accesses (i.e. at the instruction level): tracking and analyzing them with precision is not possible [4]. At the same time, the gap between the abstraction level of the classical real-time task representation (1 WCET + 1 number of memory accesses) and the actual granularity of memory accesses is so wide that it leads to huge overestimations in practice.
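A coarse contention bound under that classical representation (one WCET plus one access count per task) can be sketched as follows. It assumes a fair, round-robin-like bus arbiter where each access may wait behind at most one access from every other core; all numbers are illustrative:

```python
def wcet_with_contention(wcet_isolation, n_accesses, n_cores, access_latency):
    """Coarse upper bound on a task's WCET when co-running on a shared bus:
    each of its memory accesses may be delayed by one access from each of
    the other cores (round-robin-style arbitration assumed)."""
    per_access_delay = (n_cores - 1) * access_latency
    return wcet_isolation + n_accesses * per_access_delay

# Hypothetical task: 1000 cycles in isolation, 50 accesses, 4 cores,
# 10-cycle bus access latency.
print(wcet_with_contention(1000, 50, 4, 10))  # → 2500
```

The example makes the overestimation problem concrete: because the analysis cannot know *when* the 50 accesses occur, it must assume every single one collides with every other core, here more than doubling the isolation WCET.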

This project is based on an alternative, more precise representation of tasks called multi-phase, in which memory accesses are grouped within temporal windows in order to reduce their impact on the WCRT analysis and the overestimation of the amount of interference. This will be the topic of another blog post.

References:

[1] A. Burns and A. Wellings. Real-Time Systems and Programming Languages: Ada, Real-Time Java and Real-Time POSIX. Addison Wesley.

[2] S. Altmeyer and C. Maiza. Cache-related preemption delay via useful cache blocks: Survey and redefinition. J. Syst. Archit. 57 (2011): 707-719.

[3] C. Maiza, H. Rihani, J. M. Rivas, J. Goossens, S. Altmeyer, and R. I. Davis. A Survey of Timing Verification Techniques for Multi-Core Real-Time Systems. ACM Comput. Surv. 52, 3, Article 56 (May 2020), 38 pages. https://doi.org/10.1145/3323212

[4] R. I. Davis, S. Altmeyer, L. S. Indrusiak, et al. An extensible framework for multicore response time analysis. Real-Time Syst. 54, 607-661 (2018). https://doi.org/10.1007/s11241-017-9285-4
