Experimental Design, Profiling, and Performance/Energy Optimization¶
Plot Example - Intro¶
In the following slides, you will be shown a series of plots, mainly taken from PPN course reports written by previous students.
For each plot:
- Try to understand what is represented
- Explain what you observe
- Draw a conclusion from the data shown
Raise your hand when you are ready to propose an explanation.
Plot Example (1)¶

PPN Example - (No Caption)
Plot Example (2)¶

PPN Example - (No Caption)
Plot Example (3)¶

PPN Example - (No Caption)
Plot Example (4)¶

PPN Example - "Récapitulatif des optimisations faites"
Plot Example (5)¶

PPN Example - "Nouveau tracé de la latence cache"
Plot Example (6)¶

Prof Example - (KNM): (a) Speedup map of GA-Adaptive (7k samples) over the Intel MKL hand-tuning for dgetrf (LU), higher is better. (b) Analysis of the slowdown region (performance regression). (c) Analysis of the high speedup region. \(3,000\) random solutions were evaluated for each distribution.
Plot Example (7)¶

Prof Example - (SPR): Geometric mean Speedup (higher is better) against the MKL reference configuration on dgetrf (LU), depending on the sampling algorithm. 46x46 validation grid. 7k/15k/30k denotes the samples count. GA-Adaptive outperforms all other sampling strategies for auto-tuning. With 30k samples it achieves a mean speedup of \(\times 1.3\) of the MKL dgetrf kernel.
Plot Example - Summary¶
HPC is a scientific endeavour; data analysis and plotting are first-class citizens.
- Plots drive decisions
- Plots make results trustworthy
- Plots explain complex behaviors
Datasets are large, multi-disciplinary, and often hard to reproduce.
Plot Example - What makes a good plot¶
Ask yourself:
- Who am I speaking to?
- What's my narrative?
- Is my plot understandable in ~10 seconds?
- Is my plot self-contained?
- Is the context, environment, and methodology clear?
Experimental Methodology¶
Experimental Methodology - Workflow¶

Statistical significance - Introduction¶
Computers are noisy, complex systems:
- Thread scheduling is non-deterministic -> runtime varies between runs.
- Dynamic CPU frequency scaling (Turbo/Boost)
- Systems are heterogeneous (CPU/GPU, dual socket, NUMA effects, E/P cores)
- Temperature/thermal throttling can alter runtime
How can we make sure our experimental measurements are reliable and conclusive?
Statistical significance - Warm-up effects¶
Systems need time to reach steady-state:

On a laptop: \(\mathrm{Mean} = 0.315\ \mathrm{ms},\ \mathrm{CV} = 13.55\%\)
We need "warm-up" iterations to measure stable performance and skip cold caches, page faults, frequency scaling.
Statistical significance - Noise mitigation¶
Noise can only be mitigated:
- Stop all other background processes (other users)
- Stabilize the CPU frequency (sudo cpupower frequency-set -g performance)
- Make sure laptops are plugged in to avoid power-saving policies
- Pin threads via taskset, OMP_PLACES and OMP_PROC_BIND (see the sketch below)
- Consider hyperthreading
- Use stable compute nodes
Meta-repetitions are essential to mitigate noisy measurements.
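As a minimal illustration, threads can also be pinned from within Python (a sketch assuming Linux, where os.sched_setaffinity is available; the core IDs are arbitrary):
import os
# Pin the current process (PID 0 = self) to cores 0-3,
# similar in spirit to `taskset -c 0-3 python3 script.py`
os.sched_setaffinity(0, {0, 1, 2, 3})
print(os.sched_getaffinity(0))  # Verify the affinity mask actually applied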
Statistical significance - Example¶
Same experiment on a stabilized benchmarking server:

On a laptop: \(\mathrm{Mean} = 0.315\ \mathrm{ms},\ \mathrm{CV} = 13.55\%\)
Stabilized node: \(\mathrm{Mean} = 0.582\ \mathrm{ms},\ \mathrm{CV} = 1.14\%\)
Note¶
Timing on a laptop is always subpar
Statistical significance - Mean, Median, Variance¶
Single-run measurements are misleading; we need statistics.
- Mean runtime \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\)
- Median: less sensitive to outliers than the mean
- Variance/standard deviation: Measure of uncertainty
- Relative metrics are useful: Coefficient of variation (\(CV = \frac{\sigma}{\bar{x}} \times 100 \%\))
We usually give both the mean and standard deviation when giving performance results. Plots usually show \(\bar{x} \pm 1 \sigma\) as a shaded region around the mean to represent uncertainty.
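As a minimal sketch, these statistics can be computed with numpy (the samples array is an illustrative set of measured runtimes):
import numpy as np
samples = np.array([0.31, 0.29, 0.35, 0.30, 0.33])  # Measured runtimes (s)
mean = samples.mean()
median = np.median(samples)
std = samples.std(ddof=1)  # Sample standard deviation
cv = std / mean * 100  # Coefficient of variation (%)
print(f"{mean:.3f}s +/- {std:.3f}s (median {median:.3f}s, CV {cv:.2f}%)")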
Note¶
Distribution plots can be useful: stable measurements are often close to Gaussian, even if systematic noise may lead to skewed or heavy-tailed distributions.
Statistical significance - Confidence Intervals¶

Plotting variance/uncertainty through confidence intervals can change interpretation.
Statistical significance - Confidence Intervals¶
How do we decide how many repetitions to perform?
- Usually, the costlier the kernel, the fewer meta-repetitions are expected
- Short or very short kernels need more meta-repetitions to reduce the influence of noise
Remember that:
More repetitions increase confidence, but returns diminish:
CI width \(\propto \tfrac{1}{\sqrt{n}}\)
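For instance, a 95% confidence interval on the mean can be computed with scipy (a sketch; samples is again an illustrative array of measured runtimes):
import numpy as np
import scipy.stats as st
samples = np.array([0.31, 0.29, 0.35, 0.30, 0.33])  # Measured runtimes (s)
# 95% CI for the mean, using the Student t distribution
low, high = st.t.interval(0.95, df=len(samples) - 1, loc=samples.mean(), scale=st.sem(samples))
print(f"mean = {samples.mean():.3f}s, 95% CI = [{low:.3f}, {high:.3f}]")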
Note¶
Confidence intervals are a bit less common in plots than \(\pm 1 \sigma\) but can also be used !
Statistical significance - p-score & Hypothesis testing¶
In HPC, the mean/median and variance often suffice, but hypothesis testing can come in handy in some contexts.
- Null hypothesis (\(H_0\)): GPU and CPU have the same performance for small matrices
  - Differences in measurements are only due to noise
- Alternative hypothesis: CPU is faster for small matrices
- The p-value is the probability of observing differences at least this large if \(H_0\) were true
- If \(p < 0.05\), we can safely reject \(H_0\) (statistically significant difference)
Example: \(\bar{x}_{GPU} = 5.0 \mathrm{s}\), \(\sigma_{GPU} = 0.20\), \(\bar{x}_{CPU} = 4.8 \mathrm{s}\), \(\sigma_{CPU} = 0.4\), Two-sample t-test with 10 samples \(p = 0.02\).
The measured differences between CPU and GPU execution time are statistically significant.
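A sketch of such a test with scipy, using synthetic samples in place of real measurements (Welch's t-test, which does not assume equal variances):
import numpy as np
from scipy.stats import ttest_ind
rng = np.random.default_rng(0)
gpu_times = rng.normal(5.0, 0.20, size=10)  # Synthetic GPU runtimes (s)
cpu_times = rng.normal(4.8, 0.40, size=10)  # Synthetic CPU runtimes (s)
# Welch's two-sample t-test (equal_var=False)
t_stat, p_value = ttest_ind(gpu_times, cpu_times, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")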
Experimental Methodology – Reproducibility¶
Reproducibility is a very hot topic (Reproducibility crisis in science):
- Data and protocols are first-class citizens: as important as the plots themselves
- Transparency matters: make data, scripts, and parameters accessible
- Enables others to verify, build on, and trust your results
Note¶
Beware of your mindset: your results should be credible and honest before being "good".
"Our results are unstable, we have yet to understand why, this is what we tried" is a completely valid answer
Plotting Tools¶
Plotting tools - Cheatsheet¶
| Name | Use |
|---|---|
| pandas | Storing and saving tabular data |
| numpy | Numerical arrays, manipulating data |
| matplotlib | Basic 2D plots, full control |
| seaborn | Statistical plots, higher-level API |
| logging | Logging experiment progress/results |
| OpenCV | Image processing, animations/videos |
| ffmpeg | Generating and encoding videos |
Look up the quick-reference plotting gallery in the annex!
Both matplotlib and seaborn provide extensive online galleries.
Live example: the matplotlib gallery (https://matplotlib.org/stable/gallery/index.html)
Plotting tools - Matplotlib¶
Matplotlib is one of the most widely used plotting libraries.
A figure is built hierarchically from nested elements:
- Figure (The canvas)
- (Subfigures)
- Axes (One or more subplots)
- Axis (x/y/z scales, ticks, labels)
- Artists (Lines, markers, text, patches, etc.)
- Data is plotted using axes-level functions like ax.plot or ax.hist
- Customization occurs at both the Figure and Axes levels
- Complex multi-plot layouts are handled at the Figure level
Plotting tools - Matplotlib¶

Plotting tools - Matplotlib¶
import matplotlib.pyplot as plt
x = [0, 1, 2, 3]
y = [2.8, 5.7, 12.5, 14]
# Create a new figure, single axis
# Size is 8 inches by 8 inches, and constrained layout
fig, ax = plt.subplots(figsize=(8, 8), layout="constrained")
# Plot a simple line
ax.plot(x, y, color="red", label="My Algorithm")
# Customize the axes
ax.set_xlabel("Iteration") # Name of the X axis
ax.set_ylabel("Time (s)") # Name of the y axis
# Title of the plot
ax.set_title("Evolution of Time with the number of iteration")
ax.margins(0, 0) # Remove white spaces around the figure
ax.legend(loc="upper right") # Draw the legend in the upper right corner
fig.savefig("my_plot.png", dpi=300) # Higher DPI -> bigger image
plt.close() # End the plot and release resources
Plotting tools - Matplotlib (Multi axis)¶
We can easily have multiple plots on the same figure:
nrows, ncols = 5, 1
fig, axs = plt.subplots(nrows, ncols, figsize=(8 * ncols, 3 * nrows))
ax = axs[0]
ax.plot()
...
ax = axs[1]
ax.plot()
...
fig.tight_layout() # Alternative to constrained layout
fig.savefig("my_multiplot.png", dpi=300)
Each axis is its own plot, with its own legend and artists.
Note¶
Use the reference (https://matplotlib.org/stable/api/index.html) and gallery (https://matplotlib.org/stable/gallery/index.html) extensively !
Plotting tools - Seaborn¶
Seaborn is an extension of Matplotlib dedicated to statistical visualization:

It's useful for histograms, bar charts, kdeplots, scatterplots, and is overall a very good companion library.
Plotting tools - Seaborn¶
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.read_csv(...) # Read the dataframe from somewhere
fig, ax = plt.subplots(figsize=(8, 8), layout="constrained")
# We must pass the axis to plot on as an argument
sns.kdeplot(data=df, x="Time", label="Algorithm", color="red", fill=True, ax=ax)
ax.set_title("Distribution of Execution time for the algorithm")
ax.margins(0, 0)
ax.set_xlabel("Time (s)", fontweight="bold")
ax.set_ylabel("Density", fontweight="bold")
ax.set_xticks(np.linspace(df["Time"].min(), df["Time"].max(), 10))
# Format the x axis ticks: `3.25s`
ax.xaxis.set_major_formatter(StrMethodFormatter("{x:.2f}s"))
fig.savefig("my_distribution.png")
https://matplotlib.org/stable/gallery/ticks/tick-formatters.html
Profiling¶
Profiling - Motivation¶
- HPC codes are massive, complex and heterogeneous
- Humans are bad at predicting bottlenecks
- Don’t blindly optimize everything
- Profiling guides optimization
Remember: Always profile first.
Profiling - Amdahl's law¶
\[\mathrm{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}}\]
where \(f\) is the fraction of the program improved, and \(S\) is the speedup on that fraction.
Example:
- I have optimized 80% of my application, with a speedup of x10
- In total, my application is now \(\frac{1}{0.2 + (0.8 / 10)} = 3.57 \times\) faster
The remaining 20% become the bottleneck!
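The same computation as a small helper (a sketch; f and S follow the definitions above):
def amdahl_speedup(f: float, s: float) -> float:
    """Total speedup when a fraction f of the program is accelerated by s."""
    return 1.0 / ((1.0 - f) + f / s)
print(amdahl_speedup(0.8, 10))  # ~3.57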
Profiling - Steps¶
- Where (hotspots)?
  - What functions are we spending time/energy in?
  - What call-trees are we spending time/energy in?
- Why?
  - Arithmetic density, memory access patterns
  - Cache misses, branch mispredictions, vectorization efficiency (hardware counters)
- What goal?
  - Should I optimize for speed? For energy? For memory footprint?
  - What about cold storage size/compression?
  - Do I have constraints (e.g. limited memory)?
  - Should I optimize or switch algorithms?
Profiling - Time¶
It's rather easy to benchmark a single function using a (high-resolution, monotonic) clock:
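A minimal sketch with Python's time.perf_counter (my_function is a placeholder):
import time
begin = time.perf_counter()
my_function()
elapsed = time.perf_counter() - begin
print(f"my_function took {elapsed:.6f}s")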
Quick-and-dirty way to profile part of my program.
Profiling - Time (Stability)¶
But we have to account for noise:
import time
import numpy as np
import pandas as pd
import seaborn as sns
# Warm-up iterations to reach steady state
for _ in range(NWarmup):
    my_function()
# Meta-repetitions
times = []
for i in range(NMeta):
    begin = time.perf_counter()
    my_function()
    times.append((i, time.perf_counter() - begin))
df = pd.DataFrame(times, columns=["Iteration", "Time"])
median = np.median(df["Time"])
std = np.std(df["Time"])
print(f"Time: {median} +/- {std}")
...
# Plot through seaborn!
sns.lineplot(data=df, x="Iteration", y="Time", ax=ax)
...
We must check that our measurements are acceptable!
Profilers - Introduction¶
Full application -> Thousands of functions to measure!
- Profilers are tools to automate this
- Two main types:
  - Sampling: Periodically pause the program and log where it is (costly functions -> more samples!)
  - Instrumentation: Modify the program to automatically add timers (illustrated below)
Profilers can also check for thread usage, vectorization, memory access, etc.
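As a toy illustration of the instrumentation approach, a Python decorator can inject timers around functions (a sketch; real instrumenting profilers rewrite binaries or rely on compiler hooks):
import functools
import time
def instrument(func):
    """Wrap a function with a timer, mimicking what instrumentation does."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        begin = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__}: {time.perf_counter() - begin:.6f}s")
        return result
    return wrapper
@instrument
def compute(n):
    return sum(i * i for i in range(n))
compute(1_000_000)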
Perf - Record¶
Linux Perf is a powerful and versatile profiler:
gcc ... -g -fno-omit-frame-pointer
perf record -g -- ./mytransform ./pipelines/big.pipeline
Loaded image: images/image1.bmp (3660x4875, 3 channels)
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.484 MB perf.data (2941 samples) ]
perf report

It's a great tool to quickly get call stacks with few dependencies.
Profiling - Hardware counters¶
In reality, perf is not just a profiler!
- The Linux Perf API can be used to access many hardware counters
- Perf record is just one usage of perf
Most CPUs/GPUs have hardware counters that monitor different events:
- Number of cycles
- Number of retired instructions
- Number of memory accesses
- RAPL energy counters
Profiling - Perf for Hardware counters¶
perf stat -e cycles,instructions python3 ./scripts/run_bls.py ...
...
Performance counter stats:
749,352,412,722 cpu_core/cycles/
3,142,707,494,308 cpu_core/instructions/ # 4.19 insn per cycle
32.363472139 seconds time elapsed
225.351168000 seconds user
0.111367000 seconds sys
- 4.19 instructions per cycle (IPC) -> Very good instruction throughput
- time elapsed -> "Wall-clock time"
- seconds user -> CPU time in user space -> \(225 / 32 \approx 7\) threads!
Profiling - Perf for Hardware counters¶
perf stat -e cache-references,cache-misses python3 ./scripts/run_bls.py ...
...
Performance counter stats:
394,258,269 cpu_core/cache-references/
36,823,151 cpu_core/cache-misses/ # 9.34% of all cache refs
32.363472139 seconds time elapsed
225.351168000 seconds user
0.111367000 seconds sys
- 394,258,269 references to the LLC (on Intel)
- 36,823,151 LLC misses -> 9.3% miss rate
Profiling - Perf for Hardware counters¶
perf stat -e branches,branch-misses python3 ./scripts/run_bls.py ...
...
Performance counter stats:
761,974,570,065 cpu_core/branches/
248,674,718 cpu_core/branch-misses/ # 0.03% of all branches
32.363472139 seconds time elapsed
225.351168000 seconds user
0.111367000 seconds sys
- 761,974,570,065 execution flow breaks (ifs, returns, loops, etc.)
- 248,674,718 branch misses -> Good branch prediction! (\(0.03\%\) miss rate)
Perf - Record with other events¶
We can also use perf record with other events!
perf record -e "cache-references,cache-misses,branches,branch-misses" -g -- ./mytransform ./pipelines/big.pipeline
Loaded image: images/image1.bmp (3660x4875, 3 channels)
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 1.709 MB perf.data (10260 samples) ]
perf report

Intel VTune¶
Perf is a bit "barebone": many profilers build upon perf like Intel VTune

VTune - CPU Usage¶

VTune - Collections Mode¶
VTune has multiple collection modes:

VTune - HPC Performance¶
Example of the HPC Performance collection mode:

Other profilers¶
- MAQAO is a profiler developed by the LIPARAD
- AMD, NVIDIA and ARM have their own profilers for their platforms
- And many, many others (likwid, gprof, etc.)
Usually, we combine a "quick" profiler like gprof/perf record with a more in-depth one when needed.
Profiling - Energy¶
Energy is a growing concern:
- A single HPC cluster consumes millions of dollars in electricity yearly
- ChatGPT and other LLMs are computationally intensive:
  - Nvidia GPUs consume lots of energy
On the flip side, measuring energy is harder than measuring time.
Many actors still focus on execution time only -> Energy is perceived as "second rank"
Profiling - RAPL¶
Running Average Power Limit (RAPL) is an x86 hardware counter interface that monitors energy consumption:
- Energy is tracked at different levels:
  - Core, RAM, Package, GPU, etc.
- It does not account for secondary power consumers (fans, water cooling, etc.)
- RAPL is not event-based: the entire machine is measured! (background processes, etc.)
It requires sudo permissions to access (unlike a simple clock)
perf stat -a -j -e power/energy-pkg/,power/energy-cores/ <app>
{"counter-value" : "88.445740", "unit" : "Joules", "event" : "power/energy-pkg/", "event-runtime" : 10002168423, "pcnt-running" : 100.00}
{"counter-value" : "10.848633", "unit" : "Joules", "event" : "power/energy-cores/", "event-runtime" : 10002166697, "pcnt-running" : 100.00}
Profiling - Watt-Meter (Yokogawa)¶

Hardware solutions are also available to monitor energy consumption. They typically have a coarse sampling resolution (\(\approx 1\ \mathrm{s}\)) and are harder to scale to entire clusters.
In exchange, they give precise power measurements compared to RAPL.
Profiling - RAPL accuracy¶

In practice, RAPL underestimates power consumption, but trends are correctly matched.