Image processing speedups in Halide
Hello! I wanted to explore this programming language, Halide, after hearing about it from my last internship. It’s an older language embedded with C++, but it offers some neat ways to think about optimization, which is especially relevant in today’s “compute-bound vs memory-bound” discussion.
Halide gives the flexibility of a C++ library with the performance of a standalone compiler. You can write your code in C++, but Halide JIT-compiles them into optimized, C ABI-compatible code at runtime. I liked using this language because of what it teaches you: chaining image processing functions kills performance (each function reads from memory, computes, and writes back to memory). Your CPU spends more time moving data than processing it.
Halide lets you write the algorithm once, then experiment with different execution strategies. I.e. if you want to process 64x64 tiles for cache usage, want to vectorize across 8 pixels instead of 4… you can specify
The primary use case for Halide is speeding up image processing. The way we achieve speedups is quite different from existing library optimizations like PyTorch’s JIT Compiler or compiler flags like -o3. Instead we separate the algorithm from the schedule (or what to compute from how to optimize it).
Basic Image Operations Halide can speed up:
- Blur, sharpen, edge detection
- Color space conversions (RGB ↔ YUV, HDR tone mapping)
- Resize, rotate, geometric transforms
- Histogram equalization, gamma correction
Theis post are my rough draft notes and a lot of the info is from the CPPCon’s Halide Demo ! At the bottom is code showing how to apply a blur in Halide.
3 typical optimizations
- Parallelism - split work across cores
- Divide independent work across independent CPU cores
- Recompute redundant work instead of loading from distant memory
- Loading a value from memory is sometimes slower than recomputing it.
- This is especially true when vectorizing
- Locality keep data close together in cache
- If you just wrote a value to memory, you should try to use it quickly, while it’s still in cache
How the tradeoff manifests in practice
We try to trade off between these 3 factors.
- Parallelism vs locality
- Dependency chains
- High parallelism
- If operation B depends on operation A’s output, high parallelism.
- Compute all of A then all of B. poor locality - As results get cold in cache
- High locality
- Compute A then immediately compute B for the same data region (limits parallelism)
- Redundant work
- Have multiple cores each compute the A values they need for their B work
Halide’s solution: you explicitly choose the optimization strategy - tile sizes (how big should each chunk of data be to fit nicely in cache), loop ordering (which dimensions should be processed first), which loops to parallelize. The compiler doesn’t make assumptions about what’s best for the use case - you tell it to compute this much of A, then immediately do the corresponding B work, or fully parallelize A, accept the cache misses, then do B.
Difference between chaining multiple optimized library calls and Halide
When you chain together optimized library calls, each step reads data from DRAM, does its optimized computation, writes results back to DRAM (slow)
Example: Input → [Read] → Func1 → [Write] → [Read] → Func2 → [Write] → [Read] → Func3 → [Write] → Output
You’re executing 6 DRAM operations instead of 2. Each intermediate result materializes in slow memory (200+ cycle latency), immediately gets read back for the next stage. The “optimized” kernels become pessimal when chained due to memory wall constraints.
Halide enables: Input → [Read tile] → Func1→Func2→Func3 → [Write tile] → Next tile
Flow: Read a small tile of input data. Compute step 1,2,3 on that tile while it’s hot in cache. Write the final result. Move to the next tile. This is called “fusion” - instead of materializing intermediate results in slow DRAM, you keep data flowing through the fast cache hierarchy. The optimized kernels are actually pessimal when chained together due to memory bandwidth bottlenecks
Rather than library_func1() -> library_funct2() -> library func3(), Halide lets you fuse the entire pipeline so data flows through the cache
Why not use -O4 flag?
O4 refers to compiler optimization levels, aggressive optimization setting beyond -O3.
The compiler can’t restructure your entire computation pipeline - it optimizes functions individually but can’t fuse separate library calls or reorganize data flow across your whole program
Example: Making a blur faster in Halide
Example: Two-Pass Blur Implementation Setup: 3×3 blur implemented as separable horizontal + vertical passes
Stage 1: Horizontal blur (1×3 kernel) Stage 2: Vertical blur (3×1 kernel) Boundary handling: Extend edges to preserve output dimensions
Traditional implementation: Each pass reads the entire image, writes the entire intermediate result. For a 4K image, that’s ~50MB written and immediately re-read.
Halide fusion: Process 64×64 tiles through both passes while data remains cache-resident. The intermediate horizontal blur result never touches DRAM—it flows directly into the vertical pass. This architectural shift transforms a memory-bound algorithm into a compute-bound one, often delivering 3-10× speedups on bandwidth-constrained hardware.
How to make a blur faster
Tiling: breaks the image into smaller tiles (32x8) pixels that fit better in cache
Parallelization: Uses parallel() to process different tiles on multiple CPU cores
Vectorization: uses vectorize() to process multiple pixels simultaneously using SIMD instructions
Compute scheduling: uses compute_at() to keep intermediate results (horizontal blur) in cache rather than writing to/reading from memory
#include "Halide.h"
using namespace Halide;
Func blur_example() {
// Input image
Buffer input = load_image("input.png");
Func input_func("input");
input_func(x, y) = input(x, y);
// Blur edges
Func clamped("clamped");
clamped(x, y) = input_func(clamp(x, 0, input.width()-1),
clamp(y, 0, input.height()-1));
// Horizontal blur
Func blur_x("blur_x");
blur_x(x, y) = (clamped(x-1, y) + clamped(x, y) + clamped(x+1, y)) / 3;
// Vertical blur
Func blur_y("blur_y");
blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1)) / 3;
return blur_y;
}