Experimental Design, Profiling, and Performance/Energy Optimization

Plot Example - Intro

In the following slides, you will be shown a series of plots, mainly taken from the PPN course reports of previous students.

For each plot:

  • Try to understand what is represented
  • Explain what you observe
  • Give a definitive conclusion from the data shown

Raise your hand when you are ready to propose an explanation.

Plot Example (1)

**PPN Example** - (No Caption)

Plot Example (2)

**PPN Example** - (No Caption)

Plot Example (3)

**PPN Example** - (No Caption)

Plot Example (4)

**PPN Example** - "Récapitulatif des optimisations faites"

PPN Example - "Récapitulatif des optimisations faites"

Plot Example (5)

**PPN Example** - "Nouveau tracé de la latence cache"

PPN Example - "Nouveau tracé de la latence cache"

Plot Example (6)

**Prof Example** - (KNM): (a) Speedup map of GA-Adaptive (7k samples) over the Intel MKL hand-tuning for dgetrf (LU), higher is better. (b) Analysis of the slowdown region (performance regression). (c) Analysis of the high speedup region. \(3,000\) random solutions were evaluated for each distribution.

Plot Example (7)

**Prof Example** - (SPR): Geometric mean speedup (higher is better) against the MKL reference configuration on dgetrf (LU), depending on the sampling algorithm. 46x46 validation grid. 7k/15k/30k denotes the sample count. GA-Adaptive outperforms all other sampling strategies for auto-tuning. With 30k samples it achieves a mean speedup of \(\times 1.3\) over the MKL dgetrf kernel.

Plot Example - What makes a good plot

Ask yourself:

  • What do I want to communicate?
  • What data do I need?
  • Is my plot understandable in ~10 seconds?
  • Is my plot self-contained?
  • Are the context, environment, and methodology clear?

Plot Example - Summary

HPC is a scientific endeavour; data analysis and plotting are essential.

  • Plots drive decisions
  • Plots make results trustworthy
  • Plots explain complex behaviors

Datasets are large, multi-disciplinary, and often hard to reproduce.

Experimental Methodology

Experimental Methodology - Workflow

Typical experimental workflow

Statistical significance - Introduction

Computers are noisy, complex systems:

  • Thread scheduling is non-deterministic -> runtime varies between runs
  • Dynamic CPU frequency scaling (Turbo/Boost)
  • Systems are heterogeneous (CPU/GPU, dual socket, NUMA effects, E/P cores)
  • Temperature/thermal throttling can alter runtime

How can we make sure our experimental measurements are reliable and conclusive?

Statistical significance - Warm-up effects

Systems need time to reach steady-state:

On a laptop: \(\mathrm{Mean} = 0.315\ \mathrm{ms},\ \mathrm{CV} = 13.55\%\)

We need "warm-up" iterations to measure stable performance and skip cold caches, page faults, frequency scaling.

Statistical significance - Noise mitigation

Noise cannot be eliminated, only mitigated:

  • Stop all other background processes (and other users)
  • Stabilize the CPU frequency (sudo cpupower frequency-set -g performance)
    • Make sure laptops are plugged in to avoid power-saving policies
  • Pin threads via taskset, OMP_PLACES and OMP_PROC_BIND (see the sketch below)
  • Consider hyperthreading
  • Use stable compute nodes

Meta-repetitions are essential to mitigate noisy measurements.
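
For example, thread pinning can be set up from a Python driver script. A minimal sketch, assuming a machine with 8 cores and a hypothetical ./my_benchmark executable:

import os
import subprocess

env = dict(os.environ)
env["OMP_NUM_THREADS"] = "8"
env["OMP_PLACES"] = "cores"     # One OpenMP place per physical core
env["OMP_PROC_BIND"] = "close"  # Bind threads to consecutive places

# taskset restricts the process to cores 0-7 (assumed core numbering)
subprocess.run(["taskset", "-c", "0-7", "./my_benchmark"], env=env, check=True)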

Statistical significance - Example

Same experiment on a stabilized benchmarking server:

On a laptop: \(\mathrm{Mean} = 0.315\ \mathrm{ms},\ \mathrm{CV} = 13.55\%\)
Stabilized node: \(\mathrm{Mean} = 0.582\ \mathrm{ms},\ \mathrm{CV} = 1.14\%\)

Note

Timing on a laptop is always subpar

Statistical significance - Mean, Median, Variance

Single-run measurements are misleading; we need statistics.

  • Mean runtime \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\)
  • Median: less sensitive to outliers than the mean
  • Variance/standard deviation: Measure of uncertainty
  • Relative metrics are useful: Coefficient of variation (\(CV = \frac{\sigma}{\bar{x}} \times 100 \%\))

We usually report both the mean and the standard deviation of performance results. Plots often show \(\bar{x} \pm 1 \sigma\) as a shaded region around the mean to represent uncertainty (see the sketch below).

Note

Distribution plots can be useful: stable measurements are often close to Gaussian, even if systematic noise may lead to skewed or heavy-tailed distributions.
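
A minimal matplotlib sketch of the usual \(\bar{x} \pm 1 \sigma\) plot; the runtimes below are made-up numbers:

import numpy as np
import matplotlib.pyplot as plt

sizes = np.array([128, 256, 512, 1024])   # Hypothetical problem sizes
mean = np.array([0.9, 2.1, 4.8, 10.5])    # Made-up mean runtimes (ms)
std = np.array([0.05, 0.10, 0.30, 0.90])  # Made-up standard deviations

fig, ax = plt.subplots(figsize=(8, 4), layout="constrained")
ax.plot(sizes, mean, color="red", label="My Algorithm")
# Shaded region representing +/- 1 standard deviation around the mean
ax.fill_between(sizes, mean - std, mean + std, color="red", alpha=0.2,
                label=r"$\pm 1\sigma$")
ax.set_xlabel("Problem size")
ax.set_ylabel("Time (ms)")
ax.legend(loc="upper left")
fig.savefig("mean_std.png", dpi=300)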

Statistical significance - Confidence Intervals

How do we decide how many repetitions to perform?

  • Usually, the costlier the kernel, the fewer meta-repetitions are expected
  • Short or very short kernels need more meta-repetitions to reduce the influence of noise

Remember that:

\[CI_{0.95} \approx \bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}\]

More repetitions increase confidence, but returns diminish:
CI width \(\propto \tfrac{1}{\sqrt{n}}\)

Note

Confidence intervals are a bit less common in plots than \(\pm 1 \sigma\) but can also be used!
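
A minimal sketch of the normal-approximation interval above, on synthetic measurements (the numbers are made up):

import numpy as np

rng = np.random.default_rng(42)
times = rng.normal(0.582, 0.01, size=30)  # Synthetic runtimes (ms)

mean = times.mean()
sem = times.std(ddof=1) / np.sqrt(len(times))  # Standard error of the mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"Mean = {mean:.3f} ms, 95% CI = [{low:.3f}, {high:.3f}]")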

Statistical significance - p-score & Hypothesis testing

In HPC, mean/median and variance often suffice, but hypothesis testing can come in handy in some contexts.

  • Null hypothesis (\(H_0\)): GPU and CPU have the same performance for small matrices
    • Differences in measurements are only due to noise
  • Alternative hypothesis: CPU is faster for small matrices

  • The p-value is the probability of observing our measurements (or more extreme ones) if \(H_0\) were true.

  • If \(p < 0.05\), we can safely reject \(H_0\) (statistically significant difference)

Example: \(\bar{x}_{GPU} = 5.0\ \mathrm{s}\), \(\sigma_{GPU} = 0.20\), \(\bar{x}_{CPU} = 4.8\ \mathrm{s}\), \(\sigma_{CPU} = 0.4\); a two-sample t-test with 10 samples gives \(p = 0.02\).

The measured differences between CPU and GPU execution time are statistically significant.
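
Such a test is a one-liner with scipy. A sketch on synthetic samples matching the example above (the exact p-value depends on the random draw):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gpu = rng.normal(5.0, 0.20, size=10)  # Synthetic GPU timings (s)
cpu = rng.normal(4.8, 0.40, size=10)  # Synthetic CPU timings (s)

# Welch's two-sample t-test (does not assume equal variances)
t, p = stats.ttest_ind(gpu, cpu, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # p < 0.05 -> statistically significant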

Experimental Methodology – Reproducibility

Reproducibility is a very hot topic (Reproducibility crisis in science):

  • Data and protocols are first-class citizens: as important as the plots themselves
  • Transparency matters: make data, scripts, and parameters accessible
  • Enables others to verify, build on, and trust your results

Note

Beware of your mindset: your results should be credible and honest before being "good".

"Our results are unstable, we have yet to understand why, this is what we tried" is a completely valid answer

Plotting Tools

Plotting tools - Cheatsheet

| Name | Use |
|------|-----|
| pandas | Storing and saving tabular data |
| numpy | Numerical arrays, manipulating data |
| matplotlib | Basic 2D plots, full control |
| seaborn | Statistical plots, higher-level API |
| logging | Logging experiment progress/results |
| OpenCV | Image processing, animations/videos |
| ffmpeg | Generating and encoding videos |

Look up the quick-reference plotting gallery in the annex!
Both matplotlib and seaborn provide extensive online galleries.

[Live Example of the matplotlib gallery https://matplotlib.org/stable/gallery/index.html]

Plotting tools - Matplotlib

Matplotlib is one of the most widely used plotting libraries.
A figure is built hierarchically from nested elements:

- Figure (The canvas)
  - (Subfigures)
    - Axes (One or more subplots)
      - Axis (x/y/z scales, ticks, labels)
      - Artists (Lines, markers, text, patches, etc.)
  • Data is plotted using Axes-level functions like ax.plot or ax.hist
  • Customization occurs at both the Figure and Axes levels
  • Complex multi-plot layouts are handled at the Figure level

Plotting tools - Matplotlib

import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [2.8, 5.7, 12.5, 14]

# Create a new figure, single axis
# Size is 8 inches by 8 inches, and constrained layout
fig, ax = plt.subplots(figsize=(8, 8), layout="constrained")

# Plot a simple line
ax.plot(x, y, color="red", label="My Algorithm")

# Customize the axes
ax.set_xlabel("Iteration") # Name of the X axis
ax.set_ylabel("Time (s)") # Name of the y axis
# Title of the plot
ax.set_title("Evolution of Time with the number of iteration")

ax.margins(0, 0) # Remove white spaces around the figure
ax.legend(loc="upper right") # Draw the legend in the upper right corner

fig.savefig("my_plot.png", dpi=300) # Higher DPI -> bigger image
plt.close() # End the plot and release resources

Plotting tools - Matplotlib (Multi axis)

We can easily have multiple plots on the same figure:

nrows, ncols = 5, 1
fig, axs = plt.subplots(nrows, ncols, figsize=(8 * ncols, 3 * nrows))

ax = axs[0]
ax.plot()
...

ax = axs[1]
ax.plot()
...

fig.tight_layout() # Alternative to constrained layout
fig.savefig("my_multiplot.png", dpi=300)

Each Axes is its own plot, with its own legend and artists.

Note

Use the reference (https://matplotlib.org/stable/api/index.html) and gallery (https://matplotlib.org/stable/gallery/index.html) extensively!

Plotting tools - Seaborn

Seaborn is an extension of Matplotlib dedicated to statistical visualization:

<https://seaborn.pydata.org/examples/index.html>

It's useful for histograms, bar charts, kdeplots, scatterplots, and is overall a very good companion library.

Plotting tools - Seaborn

import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns
import pandas as pd
import numpy as np

df = pd.read_csv(...) # Read the dataframe from somewhere

fig, ax = plt.subplots(figsize=(8, 8), layout="constrained")

# We must pass the axis to plot on as an argument
sns.kdeplot(data=df, x="Time", label="Algorithm", color="red", fill=True, ax=ax)

ax.set_title("Distribution of execution time for the algorithm")
ax.margins(0, 0)
ax.set_xlabel("Time (s)", fontweight="bold")
ax.set_ylabel("Density", fontweight="bold")

ax.set_xticks(np.linspace(df["Time"].min(), df["Time"].max(), 10))
# Format the x axis ticks: `3.25s`
ax.xaxis.set_major_formatter(StrMethodFormatter("{x:.2f}s"))

fig.savefig("my_distribution.png")

https://matplotlib.org/stable/gallery/ticks/tick-formatters.html

Profiling

Profiling - Motivation

  • HPC codes are massive, complex and heterogeneous
  • Humans are bad at predicting bottlenecks
  • Don’t blindly optimize everything
  • Profiling guides optimization

Remember: Always profile first.

Profiling - Amdahl's law

\[ \mathrm{Speedup} = \frac{1}{1 - f + \frac{f}{S}} \]

Where \(f\) is the fraction of the program improved, and \(S\) is the speedup on that fraction.

Example:

  • I have optimized 80% of my application, with a speedup of x10
  • In total, my application is now \(\frac{1}{0.2 + (0.8 / 10)} = 3.57 \times\) faster

The remaining 20% become the bottleneck!
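
A quick sanity check of the formula in Python:

def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the runtime is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s)

print(amdahl_speedup(0.8, 10))    # ~3.57
print(amdahl_speedup(0.8, 1e12))  # ~5.00: the serial 20% caps the total speedup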

Profiling - Steps

  1. Where (hotspots)?
    • Which functions are we spending time/energy in?
    • Which call trees are we spending time/energy in?
  2. Why?
    • Arithmetic intensity, memory access patterns
    • Cache misses, branch mispredictions, vectorization efficiency (hardware counters)
  3. What goal?
    • Should I optimize for speed? For energy? For memory footprint?
      • What about cold storage size/compression?
    • Do I have constraints (e.g. limited memory)?
    • Should I optimize or switch algorithms?

Profiling - Time

It's rather easy to benchmark a single function using a (high-resolution monotonic) clock:

import time

begin = time.perf_counter()  # High-resolution monotonic clock
my_function()
elapsed = time.perf_counter() - begin

A very simple way to evaluate a function's cost.

Profiling - Time (Stability)

But we have to account for noise:

import time
import numpy as np

N_WARMUP, N_META = 10, 30

# Warm-up runs: cold caches, page faults, frequency ramp-up
for _ in range(N_WARMUP):
    my_function()

# Meta-repetitions to estimate the noise
times = []
for _ in range(N_META):
    begin = time.perf_counter()
    my_function()
    times.append(time.perf_counter() - begin)

median = np.median(times)
std = np.std(times)
print(f"Time: {median} +/- {std}")

We must check that our measurements are valid!

Profilers - Introduction

Full application -> thousands of functions to measure!

  • Profilers are tools to automate this
  • Two main types:
    • Sampling: periodically pause the program and log where it is (costly functions -> more samples!)
    • Instrumentation: modify the program to automatically add timers
Profilers can also check for thread usage, vectorization, memory access, etc.
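
For Python code, the standard library ships cProfile, a simple instrumentation-style profiler. A minimal sketch, with my_function as a placeholder workload:

import cProfile
import pstats

def my_function():  # Placeholder workload
    sum(i * i for i in range(10**6))

cProfile.run("my_function()", "profile.out")    # Profile and save the stats
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)  # Print the 10 costliest entries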

Perf - Record

Linux Perf is a powerful and versatile profiler:

perf record -g -- python3 ./scripts/run_bls.py kepler-8
[ perf record: Woken up 255 times to write data ]
[ perf record: Wrote 85.474 MB perf.data (1220354 samples) ]

perf report -i perf.data

It's a great tool to quickly get a call tree without any dependencies.

Profiling - Hardware counters

In reality, perf is not really a profiler!

  • The Linux Perf API can be used to access many hardware counters
  • Perf record is just one usage of perf

Most CPUs/GPUs have hardware counters that monitor different events:

  • Number of cycles
  • Number of instructions
  • Number of memory accesses
  • RAPL energy counters

Profiling - Perf for Hardware counters

perf stat -e cycles,instructions python3 ./scripts/run_bls.py kepler-8
{'period': 3.520136229965055, 'duration': 0.11662673569985234, 'phase': 0.43, 'depth': 0.004983530583444806, 'power': 0.028861678651344452}

 Performance counter stats for 'python3 ./scripts/run_bls.py kepler-8':

   962,187,248,452      cpu_atom/cycles/                                                        (43.66%)
 1,119,319,677,606      cpu_core/cycles/                                                        (56.34%)
 3,547,146,665,075      cpu_atom/instructions/           #    3.69  insn per cycle              (43.66%)
 2,837,633,772,530      cpu_core/instructions/           #    2.54  insn per cycle              (56.34%)

      12.507192456 seconds time elapsed

VTune

The Intel VTune profiler is more complex but more self-contained than perf:

VTune - CPU Usage

VTune - HPC Performance

VTune has multiple collection modes:

Other profilers

  • MAQAO is a profiler developed by the LIPARAD
  • AMD, NVIDIA and ARM have their own profilers for their platforms
  • And many, many others (likwid, gprof, etc.)

Usually, we combine a "quick" profiler like gprof/perf record with a more in-depth one when needed.

Profiling - Energy

Energy is a growing concern:

  • A single HPC cluster consumes millions of dollars of electricity yearly
  • ChatGPT and other LLMs are computationally intensive:
    • NVIDIA GPUs consume lots of energy

On the flip side, measuring energy is harder than measuring time.

Many actors still focus on execution time only -> energy is perceived as "second rank"

Profiling - RAPL

Running Average Power Limit (RAPL) is an x86 hardware counter that monitors energy consumption:

  • Energy is tracked at different levels:
    • Core, RAM, Package, GPU, etc.
  • It does not account for secondary power consumers (fans, water cooling, etc.)
  • RAPL is not event-based: the entire machine is measured! (background processes, etc.)

It requires sudo permissions to access (unlike a simple clock).

perf stat -a -j -e power/energy-pkg,power/energy-cores <app>
{"counter-value" : "88.445740", "unit" : "Joules", "event" : "power/energy-pkg/", "event-runtime" : 10002168423, "pcnt-running" : 100.00}
{"counter-value" : "10.848633", "unit" : "Joules", "event" : "power/energy-cores/", "event-runtime" : 10002166697, "pcnt-running" : 100.00}

Profiling - Watt-Meter

Yokogawa

Hardware solutions are also available to monitor energy consumption. They typically have a slow sampling rate (\(\approx 1\ \mathrm{s}\)) and are harder to scale to entire clusters.

In exchange, they give accurate power measurements compared to RAPL.

Profiling - RAPL accuracy

Calibration of RAPL

In practice, RAPL underestimates power consumption, but trends are correctly matched.

Live Demo

Experiment example

Annex/run_experiment.sh

Annex/model_convergence.py