C for High Performance¶
Why C, C++, Python?¶
Programming occurs at several levels of abstraction above the hardware.
- Layers close to the metal are harder to program...
- But they offer maximum control and performance
- High-level abstractions maximize productivity...
- But have significant overhead and less control over performance

In practice, we often combine multiple languages:
- C for performance-critical sections, Python for higher-level APIs
C Programming - Operations and Typing¶
C is a statically typed imperative language:
```c
int main() {
    int a = 5;
    int b = 10;
    int c = a + b;          // Integer addition: c = 15
    float d = c / a;        // Integer division, result converted to float
    float e = (float)c / a; // The cast forces a floating-point division
    int f = a * a * a * a;  // a to the fourth power
    return 0;
}
```
`main` is the program entry point.
C Programming - Functions¶
```c
#include <stdio.h> // For printf(...)

int sum_and_square(int a, int n) {
    int tmp = a + n;
    return tmp * tmp;
}

int main() {
    int a = 5;
    int b = 4;
    int c = sum_and_square(a, b);
    int d = sum_and_square(3, 9);
    // Print the results to the console
    printf("(5+4)**2: %d\n", c);
    printf("(3+9)**2: %d\n", d);
    return 0;
}
```
C Programming - Loops¶
C implementation of \(\sum_{i=1}^{100} i\)
```c
#include <stdio.h> // For printf(...)

int sum_range(const int start, const int end) {
    int sum = 0;
    // Consider start = 1, end = 100:
    // for i starting at 1; while i <= 100; increment i by one
    for (int i = start; i <= end; i = i + 1) {
        sum += i;
    }
    return sum;
}

int main() {
    printf("Result: %d\n", sum_range(1, 100));
    return 0;
}
```
A `const`-qualified variable cannot be modified. This may enable optimizations during compilation.
C Programming - Conditions¶
Number of multiples of 3 in \([0, 99]\) (i.e. \(i \mod 3 = 0\))
```c
void count_multiples_of_three() {
    unsigned int count = 0;
    // For i starting at 0; while i < 100; increment i by one
    for (unsigned int i = 0; i < 100; i++) {
        // If i % 3 (the remainder of the integer division) is equal to 0
        if (i % 3 == 0) {
            count++;
        }
    }
    printf("Result: %u\n", count);
}
```
Note¶
Here we could also do `for (unsigned int i = 0; i < 100; i += 3)`
C Programming - Basic Pointers¶
`c` contains the address of `a`; so `*c = *c + b` writes into `a` the sum of `a` and `b` (see the sketch after the table).
Address | Value | Variable |
---|---|---|
0x004 | 0 | a |
0x008 | 5 | b |
0x00c | 0x004 | c |
... | ... | ... |
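A minimal sketch of this example, assuming the variable names and values from the table above:

```c
#include <stdio.h>

int main() {
    int a = 0;
    int b = 5;
    int* c = &a;   // c holds the address of a (0x004 in the table)
    *c = *c + b;   // Writes a + b into a through the pointer
    printf("a = %d\n", a); // Prints: a = 5
    return 0;
}
```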
C Programming - Arrays¶
```c
int main() {
    // "morpion" is French for tic-tac-toe; '\0' marks an empty cell
    char morpion[9] = {'X', 'O', '\0',
                       'O', 'X', '\0',
                       'O', '\0', '\0'};
    morpion[8] = 'X'; // The player clicked on the bottom-right cell!
    return 0;
}
```
C Programming - Structures¶
Structures are user-defined composite types:
```c
typedef struct {
    char* first_name;
    char* last_name;
    int age;
    float mean_grade;
    char gender;
} Student;

Student e1 = {"Pierre", "Dupont", 22, 13, 'm'};
Student e2 = {"Major", "Major", 22, 13.5, 'a'};
Student e3 = {"Evelynne", "Martin", 24, 14, 'f'};

if (e1.mean_grade > 10) {
    printf("(%s %s) is a pretty good student!\n",
           e1.first_name, e1.last_name);
}
```
C has no concept of *class*, *object*, or *method*!
C Programming - Structures 2¶
```c
void display_student(Student* s) {
    // s->age is equivalent to (*s).age
    printf("%s %s (%i): %f\n", s->first_name,
           s->last_name, s->age,
           s->mean_grade);
}

// We can have arrays of any type!
Student students[3] = {{"Pierre", "Dupont", 22, 13, 'm'}, ...};

for (int i = 0; i < 3; i++)
    display_student(&students[i]);
```
C Programming - Trading Abstraction for Performance¶
In C, we must manually take care of very low-level concepts¶
- We care about data layout, memory addresses, pointers, etc.
- The language doesn't provide linked lists, dynamic arrays, dictionaries, etc.
- No basic algorithms like sorting
On the flip side, we can¶
- Manually lay out data to maximize efficiency
- Remove any abstractions and overhead to maximize performance
- Generate code that runs as close to the metal as possible
- Optimize our program for the hardware
C Programming - Trading Abstraction for Performance (Example)¶
Consider the following Python and C code:
```c
unsigned long long sum = 0;
for (unsigned int i = 0; i < ub; i++) {
    sum += i;
}
printf("Sum of first %u integers is: %llu\n", ub, sum);
```
Here `ub` is a very large number (100 million in this example). Which one is faster, and by how much?
C Programming - Trading Abstraction for Performance (Example)¶
Results:
- C version: 0.024s
- Python version: 5.650s
That's a speedup of \(\times 235\).
We will see later in this course how this is possible.
Numpy and other libraries¶
Note that we could use `numpy` or the Python `sum` built-in function: but those are actually implemented in C!
Managing Memory¶
Managing Memory - Concept¶
In High-level languages¶
- We operate on abstracted data structures (lists, dictionaries, etc.)
- Memory is managed automatically (allocation, resizing, deallocation)
- We don't care about memory alignment, stack vs. heap, page size, NUMA effects, etc.
In C¶
- We work directly with primitive data and raw memory
- We are responsible for allocation, layout, and cleanup
- We can only request chunks of raw memory, and fill them however we choose
- This is critical for performance
This low-level control is critical for performance, hence we must understand how memory works under the hood!
Managing Memory - Memory Types¶
We can distinguish two types of memory:
- Memory automatically allocated by the compiler on the stack
  - Stores local variables, function arguments, etc.
  - Fast but limited in size
- Memory that is (manually) dynamically allocated on the heap
  - Must be allocated and freed by the developer!
The kernel (Linux / Windows) allocates memory pages and operates at a coarse-grain level.
The standard library (`libc`) manipulates pages at a finer scale and provides memory to the user.
Managing Memory - Allocation¶
```c
#include <time.h>   // For time
#include <stdlib.h> // For malloc, srand, rand

int do_the_thing(int n) {
    // We allocate n floats
    float* numbers = malloc(sizeof(float) * n);
    // We seed the random number generator
    srand(time(NULL));
    // We generate n random numbers
    for (int i = 0; i < n; i++) {
        numbers[i] = (float) rand() / RAND_MAX; // Generate a number in [0, 1]
    }
    ... // Do something complicated here
    free(numbers); // Release the memory
    return 0;
}
```
Managing Memory - Allocation¶
`malloc` returns a pointer to the beginning of the allocated memory range.
Managing Memory - Deallocation¶
Memory is not infinite!
In Python (and Java, C#, etc.), memory is managed by the garbage collector (GC):
- The runtime tracks all memory allocations and all references
- When a memory block is no longer referenced by the program, the GC releases the memory back to the kernel.

In C/C++, the user must deallocate memory using `free(ptr)`.
Memory leak¶
If memory is not freed (memory leak), the machine can run out of memory (see the sketch after this list):
- The kernel can kill the program
- The OS can crash
- Other applications requesting memory can crash or fail
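A minimal sketch of such a leak (hypothetical `leaky` function; each iteration allocates a buffer that is never freed):

```c
#include <stdlib.h>

void leaky(void) {
    for (int i = 0; i < 1000000; i++) {
        float* data = malloc(sizeof(float) * 1024);
        // ... use data ...
        // Missing free(data): each iteration leaks 4 KB
    }
}
```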
Virtual And Physical Memory - Problem¶
- How can the kernel guarantee that memory is always contiguous?
- Can I access memory from another program and steal its data?
- How can multiple applications share the same memory?
- Some variables have hard-coded addresses!
- How to handle (internal/external) fragmentation (empty slots)?
Virtual And Physical Memory - Concept¶
We separate Physical Addresses (locations in physical memory) from Virtual Addresses (logical locations) seen by each program!
- Physical memory is divided into small fixed-size blocks called pages (typically ~4KB).
- The CPU includes a Memory Management Unit (MMU) that translates virtual addresses into physical addresses.
- Each program is given its own isolated virtual address space.
- The kernel maintains a page table for each program that tells the MMU how to translate addresses.
The Illusion of contiguity¶
Each process believes it has access to a large, contiguous block of memory, while it can be physically fragmented or shared.
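As a rough illustration of the translation (assuming 4 KB pages, i.e. a 12-bit offset; the address value is made up):

```c
// Sketch: how a virtual address splits into a page number and an offset.
unsigned long vaddr = 0x12345;
unsigned long page_number = vaddr >> 12;    // 0x12: translated via the page table
unsigned long offset      = vaddr & 0xFFF;  // 0x345: unchanged within the page
```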
Virtual And Physical Memory - Diagram¶
*Note that this is a simplified representation.*
Memory Hierarchy¶
Which memory are we talking about?
Note that GPU(s) also have their own separate memory!
Memory Hierarchy¶
- CPU computations are extremely fast, and memory access can be a bottleneck
- Registers have the lowest latency
- CPU caches (L1, L2, L3) act as fast buffers for memory
- DRAM (main memory) is much slower, but cheaper and larger
- Accessing DRAM causes significant delays compared to cache
To achieve high performance, we must maximize data reuse in registers or caches, and minimize DRAM access.
CPU Caches¶
Most CPUs have three levels of cache:
- L1d - First-Level Cache (very fast)
- L2 - Second-Level Cache (fast)
- L3 - Last-Level Cache (LLC) (larger but slower than L1/L2)

Some cache levels are per-core (L1, often L2) whereas others are shared between multiple cores (L3).
Instruction Cache¶
The assembly instructions are stored in a separate (L1i) instruction cache
CPU Caches¶
We speak of a Heterogeneous Memory Hierarchy: the same memory access can have different latency depending on where the data resides!
[Live example: LSTOPO]
CPU Caches - In practice¶
- The cache controller looks up the data in the CPU caches (L1 -> L2 -> L3)
- If available, the data is sent to a register for the ALU
- Otherwise, a memory request is emitted
  - This introduces latency and a bubble in the CPU pipeline
- When the memory request is resolved, execution resumes
- The result of \(a*b\) is written to the cache, and eventually written back to main memory later on.
CPU Caches - In practice¶
In practice:
- The CPU fetches an entire cache line (often 64 bytes) at once (for floats: \(64B / 4B = 16\) values at once)
- The CPU can prefetch data: it learns data access patterns and anticipates future memory accesses.
- The CPU can execute out-of-order: independent instructions are executed while the memory request is in flight.
CPU Caches - Strided Access¶
Consider two NBody 3D implementations:
Array of Structures (AoS) vs. Structure of Arrays (SoA). The SoA version:

```c
// We allocate a separate array for each component
float* x = malloc(sizeof(float) * N);
float* y = malloc(sizeof(float) * N);
float* z = malloc(sizeof(float) * N);
```
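The AoS variant is not reproduced above; a minimal sketch of what it could look like (hypothetical `Particle` struct, mirroring the three components):

```c
// One struct per particle: x, y, z are interleaved in memory
typedef struct {
    float x, y, z;
} Particle;

Particle* particles = malloc(sizeof(Particle) * N);
```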
CPU Caches - Strided Access¶
We want to record the number of particles with \(x \leq 0.5\)
Array of Structures (AoS) vs. Structure of Arrays (SoA):
Which one is faster, and why?
Which access pattern makes better use of cache lines?
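The two counting loops are not shown; they presumably look something like this (a sketch reusing the allocations above):

```c
// AoS: consecutive x values are 12 bytes apart (strided access)
unsigned int count_aos = 0;
for (int i = 0; i < N; i++)
    if (particles[i].x <= 0.5f) count_aos++;

// SoA: x values are contiguous, so every byte of a fetched cache line is useful
unsigned int count_soa = 0;
for (int i = 0; i < N; i++)
    if (x[i] <= 0.5f) count_soa++;
```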
Caches CPU - Strided Access¶
`perf` results summed across 100 runs:

| | Time | # Instr | # L1 Loads | # L1 Miss | # LLC Loads | # LLC Miss |
|---|---|---|---|---|---|---|
| AoS | ~1.93s | ~14 Billion | ~3.5 Billion | ~1 Million | ~400k | ~382k |
| SoA | ~1.75s | ~14 Billion | ~3.5 Billion | ~300k | ~24k | ~15k |

| | # Cache references (LLC) | # Cache miss |
|---|---|---|
| AoS | ~158 Million | ~151 Million |
| SoA | ~52 Million | ~35 Million |
With AoS, more loads miss in L1, leading to LLC accesses.
Most LLC loads still result in misses, leading to DRAM accesses.
Compilation & Assembly¶
Compilation & Assembly - Introduction¶
C is a compiled language: we must translate the source code to assembly for the CPU
gcc ./main.c -o main (<flags>)
- Python is interpreted
  - More flexible but significantly slower
- C# and Java are compiled to an intermediate bytecode and then executed via a virtual machine (or JIT-ed)
  - Balances performance and productivity
- C/C++/Rust are compiled to assembly code
  - Poor portability, but no intermediary.
Compilation & Assembly - Simple Loop¶
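The matching C source is not reproduced on this slide; judging from the assembly below, it is presumably a loop like this (a sketch; `sum` is made unsigned here so the wrap-around stays well-defined):

```c
int main() {
    unsigned int sum = 0;              // Lives at -4(%rbp) below
    for (int i = 0; i < 100000; i++) { // i lives at -8(%rbp)
        sum += i;
    }
    return 0;
}
```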
```asm
main:
.LFB6:
    pushq   %rbp              // Save the base pointer
    movq    %rsp, %rbp
    movl    $0, -4(%rbp)      // Initialize sum
    movl    $0, -8(%rbp)      // Initialize i
    jmp     .L2
.L3:
    movl    -8(%rbp), %eax    // Load i into a register
    addl    %eax, -4(%rbp)    // Add i to sum (in memory)
    addl    $1, -8(%rbp)      // Add 1 to i (in memory)
.L2:
    cmpl    $99999, -8(%rbp)  // Check if i <= 99999 (i.e. i < 100 000)
    jle     .L3               // Jump if Less or Equal
    movl    $0, %eax          // Set the return value of main
    popq    %rbp
    ret                       // Return from main
```

Compiled with `gcc ./main.c -o main -O0` (optimizations disabled).
Compilation & Assembly¶
Assembly is as close to the metal as we usually get, and is architecture-dependent:
- Intel and AMD use the x86 instruction set
  - x86 has multiple extensions (FMA, SSE, AVX, AVX-512, etc.)
- To maximize performance, we should compile our applications on each platform
  - Our binaries are not portable
  - But we can use dedicated instructions
- Other instruction sets exist (ARM, RISC-V, etc.)
Compilation & Assembly - Optimization passes¶
The compiler is not just a translator:
- The compiler can generate optimized instructions from our program
- Constant values can be propagated, unused values/code removed
- Operations can be reordered, inlined, vectorized using SIMD, etc.
- Many, many more optimizations
Those optimizations are enabled through flags such as `-O1`, `-O2`, `-O3`, which are predefined sets of optimization passes.
The flag `-march=native` allows the compiler to target the current machine for compilation and use all the available ASM extensions.
Compilation & Assembly - Compiler Pipeline¶
There are several compilers with varying performance and features:
- GCC and Clang-LLVM (The classics)
- MSVC (Microsoft), mingw-LLVM, arm-clang (For ARM) and many, many others.
Makefile Basics - Introduction¶
Make is a scripting tool to automate complex compilation workflows. It works by defining rules inside Makefiles.
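The Makefile itself is not reproduced here; a minimal sketch of the rule described below (the `CC`/`CFLAGS` variables are conventional, and the sources are spelled out in the recipe so the header is not passed to the compiler):

```makefile
CC = gcc
CFLAGS = -O2

# Rebuild `main` whenever a source or header changes
# (recipe lines must start with a tab)
main: main.c my_library.c my_library.h
	$(CC) -o $@ main.c my_library.c $(CFLAGS)
```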
- `main` is the target (what we want to build)
- `main.c my_library.c my_library.h` are the dependencies: the rule reruns if any of them change
- `$(CC) -o $@ $< $(CFLAGS)` is the recipe
- `$@` expands to the target name
- `$<` expands to the first dependency
- `$^` expands to all dependencies
Makefile Basics - Phony rules¶
Make expects that a rule named `main` produces a file called `main`. However, not all rules produce files:
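A sketch of what such rules could look like:

```makefile
# `all` and `clean` do not produce files with those names,
# so we declare them as phony targets
.PHONY: all clean

all: main

clean:
	rm -f main *.o
```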
Here, `make all` will be an alias to build everything, while `make clean` is a custom rule to clean all build artifacts.
Makefile has many, many other functionalities, outside the scope of this course.
Makefile Basics - Usage¶
A typical project looks something like:
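For instance (a hypothetical layout, reusing the file names from the Makefile example):

```
project/
├── Makefile
├── main.c
├── my_library.c
└── my_library.h
```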
`make` will look for a file in the cwd named `Makefile` or `makefile`.
.
You can directly call `make all`, `make clean`, etc.
Parallelism Basics¶
Parallelism Basics - Introduction¶
Compiler optimization is only one side of high-performance computing.
If you remember, we saw with LSTOPO that our CPU has many cores:
- Every core can perform computations independently of the others
- Multiple processes (Chrome, vscode, firefox, excel) can run simultaneously on different cores.
- The kernel manages execution through thread scheduling and time-slicing
Main Thread¶
Every process has at least one "thread of execution", which is an ordered sequence of instructions executed by the CPU.
Parallelism Basics - Introduction¶
What if we could split our programs into multiple threads?
- With 1 thread, only one computation happens at a time
- With 2 threads, we can potentially double the throughput!

In practice, there is some overhead, and we must handle dependencies between instructions, etc.
Parallelism Basics - Types of parallelism¶
We consider three main types of parallelism:
- Single Instruction Multiple Data (SIMD): also called Vectorization
  - A single instruction operates simultaneously on multiple data elements.
- Shared Memory: multiple threads inside the same memory space
  - Threads share a memory space, enabling fast communication and synchronization.
- Distributed Memory: multiple processes
  - Communication is slower, but this model enables scaling across multiple machines.
For this course, we will only focus on SIMD and Shared Memory parallelism.
Parallelism Basics - Shared Memory¶
Consider the following loop:
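The loop is not reproduced on this slide; presumably it is the summation loop from earlier (a sketch; `ub` is defined elsewhere):

```c
unsigned long long sum = 0;
for (unsigned int i = 0; i < ub; i++) {
    sum += i;
}
```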
We can slice the iteration space into multiple chunks:
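Conceptually, with two threads the slicing could look like this (a serial sketch; in reality the two chunks would run concurrently on different cores):

```c
unsigned long long partial[2] = {0, 0};
// Thread 0 would sum the first half of the iteration space...
for (unsigned int i = 0; i < ub / 2; i++) partial[0] += i;
// ...while thread 1 sums the second half
for (unsigned int i = ub / 2; i < ub; i++) partial[1] += i;
// Global reduction: combine the partial sums
unsigned long long sum = partial[0] + partial[1];
```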
Parallelism Basics - Shared Memory¶
We split the program into multiple instruction sequences running in parallel.
- Every thread computes a partial sum on a subset of the data
- We synchronize the threads and combine the partial sums via a global reduction.
`OpenMP` is an HPC tool designed for scenarios like this!
It's a simple-to-use library/compiler pass to parallelize trivial loops.
Parallelism Basics - OpenMP¶
gcc ./main.c -fopenmp -O3 -march=native
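The annotated loop itself is not reproduced on this slide; presumably it is the earlier sum loop with a single directive added (a sketch):

```c
unsigned long long sum = 0;
// Distribute iterations across threads; each keeps a private partial
// sum, combined at the end via the reduction clause
#pragma omp parallel for reduction(+:sum)
for (unsigned int i = 0; i < ub; i++) {
    sum += i;
}
```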
This directive automatically distributes the loop iterations across all available CPU cores, performing a thread-safe reduction on sum.
Parallelism Basics - OpenMP Details¶
`OpenMP` defines a set of clauses, which are operations followed by a set of modifiers.
- `#pragma omp` is the start of all OpenMP clauses
- `parallel` enables the creation of multiple threads
- `for` toggles the automatic slicing of the following loop
- `reduction(+:sum)` toggles a reduction clause for `sum` using the `+` operation.
This code will be enough for most cases, but `OpenMP` allows for significantly more complex operations.
Parallelism Basics - Advanced OpenMP Example¶
```c
#include <float.h> // For FLT_MAX

float global_min = FLT_MAX;
int global_min_index = -1;

#pragma omp parallel
{
    // Each thread tracks its own local minimum
    float min_value = FLT_MAX;
    int min_index = -1;

    // Distribute iterations dynamically; nowait skips the implicit barrier
    #pragma omp for nowait schedule(dynamic)
    for (int i = 0; i < N; i++) {
        if (T[i] < min_value) {
            min_value = T[i];
            min_index = i;
        }
    }

    // One thread at a time merges its local result into the global one
    #pragma omp critical
    {
        if (min_value < global_min) {
            global_min = min_value;
            global_min_index = min_index;
        }
    }
}
```
Naive NBody 3D Strong Scaling - Setup¶
We increase the number of threads while keeping the work size constant.
OMP_PLACES={0,2,4,6,8,10,12,14} OMP_PROC_BIND=True OMP_NUM_THREADS=8 ./nbody 10000
sudo cpupower frequency-set -g performance
5 meta-repetitions per run; 13th Gen Intel(R) Core(TM) i7-13850HX @ 5.30 GHz; L1/L2/L3: 32 KB / 2 MB / 30 MB; 15 GB DDR5.
Naive NBody 3D Strong Scaling - Results¶
Speedup is limited by runtime overhead, concurrency, memory bandwidth, data size, etc.