Lab 2: Performance Aware C Computing¶

Objective¶

You are tasked with implementing an efficient image processing pipeline. A pipeline looks like this:

Example transformation pipeline

0 -1 load images/image1.png
1 0 quantize 8
2 1 invert
3 2 save output/image0_transformed.png

The line 1 0 quantize 8 declares the node 1, whose parent is the node 0, and performs a quantization transformation with 8 levels.

Note

This structure is a Directed Acyclic Graph (DAG), a type of graph often used in data processing and scheduling.

Processing Example — Example Processing pipeline you will have to implement

Provided Files¶

The parsing and pipeline execution logic have already been implemented so you can focus on managing memory and kernel implementations. The starter codebase contains the following directories and files:

Path	Description
`images/`	Image resources for the lab.
`pipelines/`	Sample transformation pipelines to test your implementation.
`src/`	C source code for this lab. Contains the files listed below:
`src/main.c`	Parses the transformation graph and executes it.
`src/parser.(c\|h)`	DAG representation for the image processing pipeline.
`src/image.(c\|h)`	Image structure and utilities for memory allocation and management.
`src/stb_image.h`	Public domain header-only image I/O library (from GitHub).
`src/transformation.c/h`	Implementation of image processing kernels. Most functions are missing and must be implemented by you.

1 - Compiling the program¶

a) Write a `Makefile` that:¶

Compiles main.c, image.c, parser.c and transformation.c using gcc
Produces a binary named mytransform at the root of the project
Exposes a clean target that removes compiled artifacts
- This includes any .o, .so, .out, as well as the mytransform binary
You should link to the math standard library by using the -lm compilation flag

Danger

The output binary must be named mytransform, or later parts of the lab may not work correctly.

b) Define a `CFLAGS` variable inside the `Makefile`¶

To get started, you should use -Og -g -Wall -Wextra as compilation flags. We will update this later. Ensure the flags are in effect.

c) Run `make` then try running `mytransform`¶

Execute the following:

Running a simple pipeline

./mytransform ./pipelines/test.pipeline

Expected Output

create_image - Not implemented yet
Could not allocate image for load
convert_to_grayscale - Not implemented yet
invert_image - Not implemented yet
Node 3 requested to save an image from node 2, which did not produce an image
quantize_image_naive - Not implemented yet
Node 5 requested to save an image from node 4, which did not produce an image
invert_image - Not implemented yet
Node 7 requested to save an image from node 6, which did not produce an image

This indicates that your build works correctly and you can continue the lab. If needed, fix your Makefile until you obtain the same results.

2 - Allocating & Manipulating memory¶

Look at image.c and image.h, and try to understand the provided structure.

Pixels are stored as unsigned char* pixels[3]: what does that mean in practice ? How many arrays do we have to allocate to store an RGB Image ? What's the size of each array ? What happens if we have a grayscale image (black & white).
How do we distinguish between grayscale and RGB images ?

1. Implement Memory Allocation¶

a) Implement image allocation¶

You must implement the function image.c:create_image(...).

This function should allocate memory buffers big enough to hold an image of size \(\text{width} \cdot \text{height}\). Note for this lab, channels can either be 1 (grayscale) or 3 (RGB).

b) Implement image deallocation¶

You must implement image.c:free_image(...).

Make sure you can free both grayscale and RGB images. Note: you must also free the Image structure itself.

c) Execute the memory implementation test¶

Compile your program and fix any errors until none are left. Run the following:

Memory Test

./mytransform --memory-check

You should get the following:

Expected Output

Memory Allocations tests completed successfully
<Lots of error about copy failure>

2. Image Copying¶

a) Implement image copy¶

You must implement Image* copy_image(const Image* image) inside src/image.c

This function receives an Image*, and produces a deepcopy, meaning that we are not copying the pointers, but allocating new pixels buffers and duplicating the image in memory.

You should not use memcpy or strcpy for this exercise: perform a manual copy.

b) Execute the memory implementation test¶

Compile again and run:

Memory Test

./mytransform --memory-check

You should get the following (Exact results may vary depending on your machine):

Memory Allocations tests completed in 5 seconds
Copy Test: 3000x3000x3 image -> 3524.74 MIOPS, 3.52 GB/s

c) Implement 2d copy loops¶

When copying, we have to deal with three dimensions: the channels, the width and the height.

In which order does your current implementation traverse the pixel buffer ?

Implement the following loop structure in C:

3D loop traversal

for c in channels
  for x in width
    for y in height
      # Do the copy here

Tip

You should make a copy of your current implementation before implementing this structure, so you can compare the different versions. Optionally, make multiple versions of copy_image with different names (e.g. copy_image_cxy, copy_image_xyc), and have copy_image call the version you want to test.

Now, try swapping out the loop. First iterate on y, then x, thenc. Try all possible combinations to find which one is faster. Can you explain why ?

d) Implement the linear versions¶

While images are 3D structures (When taking the channels into account), you may have noticed that we have stored them as 2D arrays (One linear array per channel). Each channel is stored in row-major order.

Implement two nested loops: the upper levels iterating on the channels, the inner loop iterating over each pixel.
Implement one loop: iterate only over the pixel, copying RGB in a single loop.

Compare the performance by running the --memory-check option. Which loops gives you the best performance ? Do you understand why ?

3 - Executing a transformation pipeline¶

Look at src/parser.h and try to understand the different structures of the parser. Do the same for src/transformation.c

1. Implementing Grayscale¶

The grayscale transform receives an RGB image and converts it into a black-and-white, single channel image. The formula for the transformation is:

\[ C_{out} = 0.299 \cdot R_{in} + 0.587 \cdot G_{in} + 0.114 \cdot B_{in} \]

Where C is the grayscale channel, and R,G,B are the corresponding components of the RGB image. Note: If the grayscale transform receives a grayscale image as input, it should output a (deep) copy of the input.

a) Implement this transformation¶

You should read data from node.input, and allocate and write data to node.output. You do not need to free the input and output buffer of any of the nodes: it will be handled for you.

Test using

Test grayscale

mkdir -p ./output
./mytransform ./pipelines/test.pipeline

At this stage, output/test_grayscale.png should contain the grayscale of images/test.png

2. Implementing Inversion¶

The inversion transform is simple:

\[ C_{out} = 255 - C_{in} \]

Where C is one of the input image channels (RGB or Grayscale). Implement this kernel in transformation.c:invert_image(...).

3. Implementing Quantization¶

We will implement a uniform quantization transform, which uniformly reduces the number of possible values in each component of the image.

\[\begin{aligned} \text{step} &= \frac{255}{\text{levels} - 1} \\\\ C_{out} &= \text{round}\left(\frac{C_{in}}{\text{step}}\right) \cdot \text{step} \end{aligned}\]

Where \(\text{levels}\) is the target number of discrete values per component. This maps each input color channel value \(C_{in} \in [0, 255]\) to a quantized output \(C_{out}\) in the same range.

a) Implement this transformation using the formula provided above.¶

You must implement this function in transformation.c:quantize_image_naive(...)

b) Optimize using a lookup table¶

Division and floating-point operations can be costly. Since \(C_{in} \in [0, 255]\), we can easily precompute all possible outputs in a lookup table (LUT) of size 256, then convert each pixel using:

\[ C_{out} = LUT[C_{in}] \]

We are trading memory overhead for performance, which is a very common pattern.

Implement this function in transformation.c:quantize_image_lut(...), and be sure to correctly allocate and free the LUT.

c) Compare performance¶

Modify transformation.c:quantize_image(...) to either use the LUT or the naive version, and run the following code:

Benchmarking

time ./mytransform ./pipelines/quantize_benchmark.pipeline

Which version is faster ? What's the speedup of the LUT algorithm over the naive version ?

4) Validation¶

At this stage, your code should be able to execute all pipelines in the pipelines/ directory. Validate your implementation by checking that output images are generated and appear visually correct.

1. Improving performance¶

Once your implementation is functionally correct, your next goal is to optimize performance.

Use the provided script:

# run_all.sh <run_label>
./run_all.sh first_version

Tip

You should run the previous script with both the LUT and naive implementation of quantization to compare.

Check the resulting plots in ./results/first_version/. Take a look at ./results/first_version/pipeline.pipeline to understand which operations are being evaluated. Re-run the benchmark after every meaningful optimization, and check ./results/comparisons.png to see if you improved performance or not. Tip: play around with compilation flags.

To remove a version, remove the corresponding folder in ./results

Important

You need to have python installed for run_all.sh to work. On Fedora:

Installing python on fedora

sudo dnf install python3 python3-pip python3-virtualenv

5 - Summary¶

Upon completing this second lab, you should be able to:

Create a makefile, link to a dynamic library.
Explore compilation flags for performance
Allocate and free memory in C.
Explore the effect of memory layout and loop order on performance.
Implement loop-based algorithms.
Rearrange your algorithm to avoid costly operations.
Understand the basics of the optimization loop.