Lab 2: Performance Aware C Computing¶
Objective¶
You are tasked with implementing an efficient image processing pipeline. A pipeline looks like this:
0 -1 load images/image1.png
1 0 quantize 8
2 1 invert
3 2 save output/image0_transformed.png
The line 1 0 quantize 8
declares the node 1, whose parent is the node 0, and performs a quantization transformation with 8 levels.
Note
This structure is a Directed Acyclic Graph (DAG), a type of graph often used in data processing and scheduling.

Provided Files¶
The parsing and pipeline execution logic have already been implemented so you can focus on managing memory and kernel implementations. The starter codebase contains the following directories and files:
Path | Description |
---|---|
images/ |
Image resources for the lab. |
pipelines/ |
Sample transformation pipelines to test your implementation. |
src/ |
C source code for this lab. Contains the files listed below: |
src/main.c |
Parses the transformation graph and executes it. |
src/parser.(c|h) |
DAG representation for the image processing pipeline. |
src/image.(c|h) |
Image structure and utilities for memory allocation and management. |
src/stb_image.h |
Public domain header-only image I/O library (from GitHub). |
src/transformation.c/h |
Implementation of image processing kernels. Most functions are missing and must be implemented by you. |
1 - Compiling the program¶
a) Write a Makefile
that:¶
- Compiles
main.c
,image.c
,parser.c
andtransformation.c
usinggcc
- Produces a binary named
mystransform
at the root of the project - Exposes a
clean
target that removes compiled artifacts- This includes any
.o
,.so
,.out
, as well as themytransform
binary
- This includes any
- You should link to the math standard library by using the
-lm
compilation flag
Danger
The output binary must be named mystransform
, or later parts of the lab may not work correctly.
b) Define a CFLAGS
variable inside the Makefile
¶
To get started, you should use -Og -g -Wall -Wextra
as compilation flags. We will update this later. Ensure the flags are in effect.
c) Run make
then try running mytransform
¶
Execute the following:
create_image - Not implemented yet
Could not allocate image for load
convert_to_grayscale - Not implemented yet
invert_image - Not implemented yet
Node 3 requested to save an image from node 2, which did not produce an image
quantize_image_naive - Not implemented yet
Node 5 requested to save an image from node 4, which did not produce an image
invert_image - Not implemented yet
Node 7 requested to save an image from node 6, which did not produce an image
This indicates that your build works correctly and you can continue the lab. If needed, fix your Makefile
until you obtain the same results.
2 - Allocating & Manipulating memory¶
Look at image.c
and image.h
, and try to understand the provided structure.
- Pixels are stored as
unsigned char* pixels[3]
: what does that mean in practice ? How many arrays do we have to allocate to store an RGB Image ? What's the size of each array ? What happens if we have a grayscale image (black & white). - How do we distinguish between grayscale and RGB images ?
1. Implement Memory Allocation¶
a) Implement image allocation¶
You must implement the function image.c:create_image(...)
.
This function should allocate memory buffers big enough to hold an image of size \(\text{width} \cdot \text{height}\). Note for this lab, channels can either be 1 (grayscale) or 3 (RGB).
b) Implement image deallocation¶
You must implement image.c:free_image(...)
.
Make sure you can free both grayscale and RGB images. Note: you must also free the Image
structure itself.
c) Execute the memory implementation test¶
Compile your program and fix any errors until none are left. Run the following:
You should get the following:
2. Image Copying¶
a) Implement image copy¶
You must implement Image* copy_image(const Image* image)
inside src/image.c
This function receives an Image*
, and produces a deepcopy, meaning that we are not copying the pointers, but allocating new pixels buffers and duplicating the image in memory.
You should not use memcpy
or strcpy
for this exercise: perform a manual copy.
b) Execute the memory implementation test¶
Compile again and run:
You should get the following:
Memory Allocations tests completed in 5 seconds
Copy Test: 3000x3000x3 image -> 3524.74 MIOPS, 3.52 GB/s
c) Implement 2d copy loops¶
When copying, we have to deal with three dimensions: the channels, the width and the height.
In which order does your current implementation traverse the pixel buffer ?
Implement the following loop structure in C:
Tip
You should make a copy of your current implementation before implementing this structure, so you can compare the different versions.
Optionally, make multiple versions of copy_image
with different names, and have copy_image
call the version you want to test.
Now, try swapping out the loop. First iterate on y
, then x
, thenc
. Try all possible combinations to find which one is faster. Can you explain why ?
d) Implement the linear versions¶
While images are 3D structures (When taking the channels into account), you may have noticed that we have stored them as 2D arrays (One linear array per channel). Each channel is stored in row-major order.
- Implement two nested loops: the upper levels iterating on the channels, the inner loop iterating over each pixel.
- Implement one loop: iterate only over the pixel, copying RGB in a single loop.
Compare the performance by running the --memory-check
option. Which loops gives you the best performance ? Do you understand why ?
3 - Executing a transformation pipeline¶
Look at src/parser.h
and try to understand the different structures of the parser. Do the same for src/transformation.c
1. Implementing Grayscale¶
The grayscale transform receives an RGB image and converts it into a black-and-white, single channel image. The formula for the transformation is:
Where C is the grayscale channel, and R,G,B are the corresponding components of the RGB image. Note: If the grayscale transform receives a grayscale image as input, it should output a (deep) copy of the input.
a) Implement this transformation¶
You should read data from node.input
, and allocate and write data to node.output
.
You do not need to free the input
and output
buffer of any of the nodes: it will be handled for you.
Test using
At this stage, output/test_grayscale.png
should contain the grayscale of images/test.png
2. Implementing Inversion¶
The inversion transform is simple:
Where C is one of the input image channels (RGB or Grayscale). Implement this kernel in transformation.c:invert_image(...)
.
3. Implementing Quantization¶
We will implement a uniform quantization transform, which uniformly reduces the number of possible values in each component of the image.
Where \(\text{levels}\) is the target number of discrete values per component. This maps each input color channel value \(C_{in} \in [0, 255]\) to a quantized output \(C_{out}\) in the same range.
a) Implement this transformation using the formula provided above.¶
You must implement this function in transformation.c:quantize_image_naive(...)
b) Optimize using a lookup table¶
Division and floating-point operations can be costly. Since \(C_{in} \in [0, 255]\), we can easily precompute all possible outputs in a lookup table (LUT) of size 256, then convert each pixel using:
We are trading memory overhead for performance, which is a very common pattern.
Implement this function in transformation.c:quantize_image_lut(...)
, and be sure to correctly allocate and free the LUT.
c) Compare performance¶
Modify transformation.c:quantize_image(...)
to either use the LUT or the naive version, and run the following code:
Which version is faster ? What's the speedup of the LUT algorithm over the naive version ?
4) Validation¶
At this stage, your code should be able to execute all pipelines in the pipelines/
directory.
Validate your implementation by checking that output images are generated and appear visually correct.
1. Improving performance¶
Once your implementation is functionally correct, your next goal is to optimize performance.
Use the provided script:
Tip
You should run the previous script with both the LUT and naive implementation of quantization to compare.
Check the resulting plots in ./results/first_version/
. Take a look at ./results/first_version/pipeline.pipeline
to understand which operations are being evaluated.
Re-run the benchmark after every meaningful optimization, and check ./results/comparisons.png
to see if you improved performance or not. Tip: play around with compilation flags.
To remove a version, remove the corresponding folder in ./results
Important
You need to have python installed for run_all.sh
to work. On Fedora:
5 - Summary¶
Upon completing this second lab, you should be able to:
- Create a makefile, link to a dynamic library.
- Explore compilation flags for performance
- Allocate and free memory in C.
- Explore the effect of memory layout and loop order on performance.
- Implement loop-based algorithms.
- Rearrange your algorithm to avoid costly operations.
- Understand the basics of the optimization loop.