# Timing predictability of GPUs: challenges and advances

# The Nvidia Pascal case

Thomas Carle, Univ. Toulouse 3, IRIT TRACES group



#### Introduction

- Embedded Real-Time systems
  - Need for strong **timing guarantees** (WCET)
  - Measurements/probabilities vs Static Analysis
    - Microarchitectural details (HW model) + execution model (Program semantics)
    - Mature techniques for CPU targets
- Models for static WCET analysis
  - Hardware model
    - Pipeline depth, number and latency of functional units, instruction queues, etc.
    - Caches size and configuration
    - Memory hierarchy
    - Predictors, prefetchers
    - Traditionally obtained from the documentation
  - Software model
    - Control Flow Graph of the disassembled binary program

# Introduction

- Graphical Processing Units: highly parallel accelerators
  - Major differences w.r.t. CPUs
    - Built for average throughput maximization (vs latency minimization)
      - Exploit parallelism to hide latencies
        - Instruction Level Parallelism
        - Data parallelism
      - Many functional units (aka CUDA Cores CCs)
      - Feed the beast: complex **memory hierarchy**
    - Handle thousands of threads:
      - Hierarchical scheduler (kernels, blocks, warps)
      - **SIMT** execution model
  - Very efficient for e.g. matrix/tensor computations
    - Embedded AI applications (e.g. autonomous vehicles)

# Introduction

- Graphical Processing Units: highly parallel accelerators
  - Major differences w.r.t. CPUs
    - Built for average throughput maximization (vs latency minimization)
      - Exploit parallelism to hide latencies
        - Instruction Level Parallelism
        - Data parallelism
      - Many functional units (aka CUDA Cores CCs)
      - Feed the beast: complex **memory hierarchy**
    - Handle thousands of threads:
      - Hierarchical scheduler (kernels, blocks, warps)
      - **SIMT** execution model
  - Very efficient for e.g. matrix/tensor computations
    - Embedded AI applications (e.g. autonomous vehicles)

#### Can we build an accurate model to predict the timing behavior of a program running on a GPU ?

• Coarse grain vision



• Coarse grain vision

Main program written and compiled for the CPU target

- Sequential or parallel
- Makes calls to GPU functions a.k.a. kernels



• Coarse grain vision

Prog

- Main program written and compiled for the CPU target
  - Sequential or parallel
  - Makes calls to GPU functions a.k.a. kernels



CPU

Kernels written and compiled for the GPU

- Inherently parallel
- Written in C/C++ with extensions (e.g. CUDA, OpenCL, Vulkan)
- Multiple threads execute the **same code** on **different data** 
  - Each thread has a unique identifier



• Entering the GPU



• Entering the GPU



10

• Entering the GPU



• Entering the GPU



• Inside a SM



• Inside a SM



• Inside a SM



• Inside a SMP



• Inside a SMP



• Inside a SMP



# Nvidia Jetson TX2

- Embedded System-on-Chip
  - 6 ARM cores
  - 1 Pascal GPU
    - 2 SMs, 8 SMPs, 256 Cuda Cores
    - "Embedded GPU"
- Launched in 2016
  - Branded as go-to chip for embedded applications, including autonomous driving
  - Research results now available mostly based on reverse engineering
  - Newer models introduce new features, but built on the same base mechanisms

#### • Software-defined scoreboard: exploit ILP

- Upon long latency instructions, continue execution until the produced data is actually needed
- SCHI instructions + DEPBAR instructions
  - Compiler optimizations

#### • Warp scheduler: exploit data parallelism

- $\circ$   $\,$  Swap the active warp when stalled
- $\circ$  1 per SMP
- $\circ$  Looks for ready warps
  - Instruction buffer for each active warp
    - Contains the next instruction(s) for each warp
  - Software-defined scoreboard
- Scheduling policy is unknown
  - Allegedly Loose Round-Robin or Greedy-Then-Oldest
  - Potentially Two-level scheduler
  - Very hard to assess experimentally
  - Influences capacity to hide latency + cache contents / locality

| • | Sof  | tware-defined                  | accrobacrd: avalait II D           |                           |
|---|------|--------------------------------|------------------------------------|---------------------------|
|   | 0    | Upon long latend               | SCHI % each instr increments SB1   | I data is actually needed |
|   | 0    | SCHI instructions              | LD R1, [R2] % SB1 = 1              |                           |
|   |      | <ul> <li>Compiler o</li> </ul> | LD R3, [R4] % SB1 = 2              |                           |
|   | \\/a | rn scheduler: <b>d</b>         | LD R5, [R6] % SB1 = 3              |                           |
|   | vva  | Swap the active                | SCHI % not relevant                |                           |
|   | 0    | Swap the active                | ADD R7. R8. R9 % not dependent     |                           |
|   | 0    | 1 per SMP                      | DEPBAR SB1 2                       |                           |
|   | 0    | Looks for ready v              | EMUL R1 R1 R10 % dependent on R1   |                           |
|   |      | Instruction                    |                                    |                           |
|   |      | <ul> <li>Conta</li> </ul>      |                                    |                           |
|   |      | <ul> <li>Software-d</li> </ul> | DEPBAR SBI, I                      |                           |
|   | 0    | Scheduling polic               | FMUL R3, R3, R11 % dependent on R3 |                           |
|   |      | Allegedly L                    | DEPBAR SB1, 0                      |                           |
|   |      | Potentially                    | SCHI % not relevant                |                           |
|   |      | Very hard t                    | FMUL R5, R5, R12 % dependent on R5 |                           |
|   |      | Influences                     |                                    | itv                       |
|   |      |                                |                                    |                           |

#### • Software-defined scoreboard: exploit ILP

- Upon long latency instructions, continue execution until the produced data is actually needed
- SCHI instructions + DEPBAR instructions
  - Compiler optimizations

#### • Warp scheduler: exploit data parallelism

- $\circ$   $\,$  Swap the active warp when stalled
- $\circ$  1 per SMP
- $\circ$  Looks for ready warps
  - Instruction buffer for each active warp
    - Contains the next instruction(s) for each warp
  - Software-defined scoreboard
- Scheduling policy is unknown
  - Allegedly Loose Round-Robin or Greedy-Then-Oldest
  - Potentially Two-level scheduler
  - Very hard to assess experimentally
  - Influences capacity to hide latency + cache contents / locality

• Software-defined scoreboard: exploit ILP



Influences capacity to hide latency + cache contents / locality

- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge

- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge



- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge





- Lockstep execution inside warps
- Conditional branch => divergence within warp
  - Mask inactive threads when executing each branch in sequence, then reconverge
  - Two kinds of divergence/convergence
    - SSY/SYNC, PBK/BRK
  - SIMT stack
    - Token based (NIL, SSY, PBK)
    - Software managed => compiler



ence

#### n warp

branch in sequence, then reconverge

- Warp CFG requires additional edges
  - Using the SIMT semantics
  - Some may be avoided using static analyses
    - Warp1 and warp2 may not execute the same path, but all threads in each warp execute the same path

# Low-level scheduling

- Still some unknowns
  - e.g. warp scheduler policy
- Complex, but can be modelled
  - Provided NVIDIA provides the implementation details
  - Hardest part: memory hierarchy and interference between warps
    - Cache contents
    - Interconnect/memory bank contentions
    - Same problems as multi-core CPUs, with larger scale

# Conclusion

- Can we build an accurate model to predict the timing behavior of a program running on a GPU ?
  - Yes ! And we're working on it
    - Timing behavior of a single warp
      - Static analyses to build accurate CFG
      - Architectural model
    - Compositional approach for multiple warps
    - Same approach as for **multi-core CPUs**
    - Missing details => make hypotheses
  - What if some mechanisms are too dynamic ?
    - Build our own predictable GPU

#### Thanks for your attention

CAPITAL Workshop: sCalable And Precise Timing AnaLysis for multicore platforms

IRIT, Toulouse, June 13th

In-person or remote attendance

Free registration : https://www.irit.fr/TRACES/site/capital-workshop-2023/























![](_page_51_Figure_0.jpeg)

![](_page_52_Figure_0.jpeg)

![](_page_53_Figure_0.jpeg)

![](_page_54_Figure_0.jpeg)

![](_page_55_Figure_0.jpeg)