about a week ago Wojtek sent me a link to “an interesting article about CPU caches”, called What Every Programmer Should Know About Memory1). this is how it all started…
note that there is also a very short summary of all this, available from the original author as a slide show.
one more interesting article regarding the CPU cache and its (side) effects is the gallery of processor cache effects.
i already knew something about this, but i thought it would be good to deepen this knowledge and settle it right in my mind. i was, to say the least, surprised to see a 114-page-long article written in a standard paper font (i.e.: small). even before getting through it i felt a deep need to test the presented results in practice, on my own computer(s). i wrote a few short programs and measured their execution time with /usr/bin/time. they were of course not very exact, but they gave a clue about what is important and how much it matters.
the code i wrote, along with a proper Makefile, is available in the cache_usage_effects.tar.bz2 file. inside it there is a number of C++ files that i'll briefly describe in a moment.
in general each file is compiled twice – with and without the MAKE_OPT_VERSION flag. when this flag is not set, the usual “non-optimal” solution is taken. when the flag is set, the better-performing version is used instead2).
important notes to keep in mind when running these tests:
in general: you can compile the code using the (default) 'all' target:
make all
to run the examples just type:
make run
notice that the 'run' target depends on 'all' as well.
shows the influence of code-flow interruptions caused by conditional jumps. the case here is the common discussion about using only one exit point from a function vs. exiting just after an error has been encountered.
when long jumps have to be taken, the code flow is broken and the L1i cache has to be loaded with new instructions, which costs time. good branch prediction amortizes this cost though.
both cases usually execute in a similar time, though on some machines exiting immediately yields slightly better results (about 2%).
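just to illustrate the two styles being compared, here is a minimal sketch (not the code from the archive – the function and variable names are made up):

// single exit point - errors are handled at the bottom, one return for all cases
int processSingleExit(const int *data, int size)
{
    int result = -1;
    if (data != 0 && size > 0)
    {
        result = 0;
        for (int i = 0; i < size; ++i)
            result += data[i];
    }
    return result;
}

// early exit - bail out as soon as an error is detected
int processEarlyExit(const int *data, int size)
{
    if (data == 0 || size <= 0)
        return -1;
    int result = 0;
    for (int i = 0; i < size; ++i)
        result += data[i];
    return result;
}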
shows the influence of data layout in memory vs. prefetching.
when a data structure is big, it is prefetched by the CPU as a whole, even though only a small part of it is needed (here: the index). this pollutes the caches with data not used even once. it can be prevented by keeping unrelated data apart from the data that is likely to be used.
in this case separating the data from the index speeds the program up about 5 times (500%) on a task similar to searching a structure by index.
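the idea, sketched very roughly (the record layout and names below are made up, not taken from the archive):

// "non-opt": every key comparison drags a whole fat record into the cache
struct BigRecord
{
    int  key;
    char payload[252];   // data that is almost never touched during the search
};

int findInRecords(const BigRecord *recs, int count, int key)
{
    for (int i = 0; i < count; ++i)
        if (recs[i].key == key)
            return i;
    return -1;
}

// "opt": keys live in their own dense array, payloads are stored elsewhere;
// a single cache line now holds many keys instead of a fraction of one record
int findInKeys(const int *keys, int count, int key)
{
    for (int i = 0; i < count; ++i)
        if (keys[i] == key)
            return i;
    return -1;
}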
this example shows the influence of loop expansion on execution time.
longer code without jumps takes more space in the cache, causing more frequent cache misses. since the CPU is much faster than memory, the loop iteration is, from some point on, faster than overly expanded code.
the loop is in this case about 5% faster than its expansion. also noticeable is the fact that the expanded-loop binary is 25 times (2500%) bigger than the code with a simple loop.
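a tiny sketch of what “expansion” means here (the real test expands the body far more aggressively, which is what makes the binary 25 times bigger):

// plain loop - small code footprint, fits easily in the L1i cache
int sumWithLoop(const int *table)
{
    int sum = 0;
    for (int i = 0; i < 1000; ++i)
        sum += table[i];
    return sum;
}

// expanded version - no loop overhead, but far more instructions to fetch and decode
int sumExpanded(const int *table)
{
    int sum = 0;
    sum += table[0];
    sum += table[1];
    sum += table[2];
    // ...one such line per index, up to table[999]
    return sum;
}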
a classic example showing how cache misses influence processing.
when a 2-dimensional array is accessed by iterating over rows in the inner loop, for big arrays all cache reads are misses. when the inner loop iterates over columns, on the other hand, most reads are hits (thanks to prefetching), and so the long memory access is amortized.
code with the proper indexing is usually about 10 times (1000%) faster than with the reversed indexing.
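in code the two orders look like this (a minimal sketch; the array size is arbitrary):

const int N = 2048;
static double a[N][N];

// cache-friendly: the inner loop walks over columns, i.e. over consecutive memory
double sumByRows()
{
    double s = 0.0;
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col)
            s += a[row][col];
    return s;
}

// cache-hostile: the inner loop walks over rows, jumping N*sizeof(double) bytes each step
double sumByColumns()
{
    double s = 0.0;
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < N; ++row)
            s += a[row][col];
    return s;
}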
one more, less trivial example of data-caching effects.
when computing matrix multiplication using the equation taken directly from the definition, we do reversed indexing on the second matrix. to prevent this the code has been slightly reorganized, to ensure that as many accesses as possible are “local” (i.e. elements lying nearby in memory are used in most of the cases).
this code is around 5 times (500%) faster than the naive implementation.
the interesting thing is that these implementations are nearly identical. both use 3 nested loops with a multiplication inside. the only real difference is that the “optimized” version accesses memory in sequential order, rather than in a “random” one (as seen from the cache prefetcher's point of view).
#if MAKE_OPT_VERSION    // optimized
for(int y=0; y<n; ++y)
  for(int i=0; i<n; ++i)
    for(int j=0; j<n; ++j)
      out[y][j] += i1[y][i] * i2[i][j];
#else                   // normal
for(int y=0; y<n; ++y)
  for(int x=0; x<n; ++x)
    for(int i=0; i<n; ++i)
      out[y][x] += i1[y][i] * i2[i][x];
#endif
writing code this way in real life would surely require a comment, since it's not obvious and a bit less readable than the “normal” implementation.
one more thing to notice: during the tests of this particular piece of code, the “normal” version compiled with intel's icc compiler and run on an itanium CPU behaved exactly the same as the “optimized” one! with gcc and on other CPUs the optimized version was always around 5 times faster, but icc must have noticed that this is a typical matrix multiplication and optimized it in the same (or a similar) way.
example showing how padding speeds up memory access.
when creating a structure, one may force the compiler to discard any padding. it makes the structure smaller (here – on x86-64 and gcc-4.3.4 – by almost 40%, though one must note that the structure is crafted in a specific way, to ensure a lot of padding) but it requires additional computational time to process its values. the key point of this part is to show that these extra instructions cost more than the additional cache operations needed for the aligned (padded) structures.
the code is around 30% slower when using the non-padded structures.
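with gcc the padding can be discarded with the packed attribute – a minimal sketch of the idea (the structure used in the test is different and crafted to generate much more padding):

#include <stdint.h>

// normal layout - the compiler inserts padding so that members stay naturally aligned
struct Padded
{
    uint8_t  flag;
    uint64_t value;   // starts at an 8-byte boundary
    uint8_t  tag;
};                    // sizeof(Padded) == 24 on x86-64/gcc

// gcc-specific attribute removes all padding - the structure shrinks,
// but 'value' now starts at offset 1 and accessing it needs extra work
struct Packed
{
    uint8_t  flag;
    uint64_t value;
    uint8_t  tag;
} __attribute__((packed));   // sizeof(Packed) == 10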
example showing the influence of polarized branching (i.e. in most of the cases only one path is chosen) and of optimizing for it on code execution.
when a branch is known to go one way most of the time (i.e. an actual error is expected to occur much less often than the check for it is performed), this knowledge can be used to bias the compiler to generate code optimized for the common case. longer code (an inlined doSth() call) is placed between the branches to make the effect more visible.
the optimized code is around 1-2% faster than the non-optimized one.
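with gcc such a bias can be expressed with the __builtin_expect() builtin – a minimal sketch (whether the archived test uses exactly this mechanism, i leave to the sources):

#include <cstdio>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int process(int value)
{
    if (UNLIKELY(value < 0))   // error path - expected to be almost never taken
    {
        std::fprintf(stderr, "bad value\n");
        return -1;
    }
    // hot path - the compiler lays it out as the fall-through branch
    return value * 2;
}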
here is the output of one of the test runs on my home laptop:
######### RUNNING TESTS #########
exit status is only a sort of check sum, to see if both implementations give the same results
+++++++++++++++
condition_checks_vs_L1i_cache.out-nonopt
Command exited with non-zero status 52
2.97user 0.00system 0:02.97elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+254minor)pagefaults 0swaps
+++++++
condition_checks_vs_L1i_cache.out-opt
Command exited with non-zero status 52
2.93user 0.00system 0:02.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+254minor)pagefaults 0swaps
+++++++
data_structures_vs_prefetching.out-nonopt
Command exited with non-zero status 105
0.67user 0.04system 0:00.72elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+12813minor)pagefaults 0swaps
+++++++
data_structures_vs_prefetching.out-opt
Command exited with non-zero status 105
0.07user 0.00system 0:00.07elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+460minor)pagefaults 0swaps
+++++++
loop_expansion_vs_code_size.out-nonopt
Command exited with non-zero status 127
5.34user 0.00system 0:05.35elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+308minor)pagefaults 0swaps
+++++++
loop_expansion_vs_code_size.out-opt
Command exited with non-zero status 127
5.48user 0.00system 0:05.48elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+270minor)pagefaults 0swaps
+++++++
loop_iteration_index.out-nonopt
Command exited with non-zero status 1
1.23user 0.03system 0:01.26elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+16682minor)pagefaults 0swaps
+++++++
loop_iteration_index.out-opt
Command exited with non-zero status 1
0.11user 0.03system 0:00.16elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+16682minor)pagefaults 0swaps
+++++++
matrix_multiply.out-nonopt
Command exited with non-zero status 129
1.74user 0.00system 0:01.74elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1788minor)pagefaults 0swaps
+++++++
matrix_multiply.out-opt
Command exited with non-zero status 129
0.28user 0.00system 0:00.28elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1788minor)pagefaults 0swaps
+++++++
padding_vs_speed.out-nonopt
structure size is 15[B]
Command exited with non-zero status 245
6.64user 0.00system 0:06.65elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+307minor)pagefaults 0swaps
+++++++
padding_vs_speed.out-opt
structure size is 24[B]
Command exited with non-zero status 245
3.76user 0.00system 0:03.76elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+330minor)pagefaults 0swaps
+++++++
param_check_optimization.out-nonopt
Command exited with non-zero status 26
4.72user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+254minor)pagefaults 0swaps
+++++++
param_check_optimization.out-opt
Command exited with non-zero status 26
4.72user 0.00system 0:04.72elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+254minor)pagefaults 0swaps
+++++++
as a comparison, here are the test results from my laptop (Core 2 Duo) and a few other, stationary computers that i currently have access to. notice that these times come from a single run, just to give an overview of the whole case – to make a real comparison the tests should be run multiple times, the standard deviation should be computed and so on.
no | example | version | Core2 2.0GHz [s] | Core2 2.5GHz [s] | Xeon 3.0GHz [s] |
---|---|---|---|---|---|
1a | condition_checks_vs_L1i_cache | non-opt | 2.97 | 2.33 | 3.28 |
1b | condition_checks_vs_L1i_cache | opt | 2.93 | 2.33 | 2.75 |
2a | data_structures_vs_prefetching | non-opt | 0.67 | 0.44 | 0.18 |
2b | data_structures_vs_prefetching | opt | 0.07 | 0.05 | 0.04 |
3a | loop_expansion_vs_code_size | non-opt | 5.34 | 4.28 | 4.76 |
3b | loop_expansion_vs_code_size | opt | 5.48 | 4.38 | 4.38 |
4a | loop_iteration_index | non-opt | 1.23 | 0.82 | 4.41 |
4b | loop_iteration_index | opt | 0.11 | 0.08 | 0.05 |
5a | matrix_multiply | non-opt | 1.74 | 0.72 | 0.70 |
5b | matrix_multiply | opt | 0.28 | 0.18 | 0.21 |
6a | padding_vs_speed | non-opt | 6.64 | 5.28 | 7.37 |
6b | padding_vs_speed | opt | 3.76 | 3.11 | 4.94 |
7a | param_check_optimization | non-opt | 4.72 | 3.75 | 5.64 |
7b | param_check_optimization | opt | 4.72 | 3.75 | 5.46 |
after going through the previous results one can see that, among the optimizations presented, there were a few that made a great impact on execution time in most of the cases. these are the ones that should be kept in mind during everyday work:
among the many tools presented in the article to help profile cache usage, the one i liked the most is cachegrind – a valgrind plug-in. its usage is extremely simple and intuitive. all you have to do is run:
valgrind --tool=cachegrind ./myBinary.out
cg_annotate cachegrind.out.1234 | less -S
where 1234 is the pid of the last-run program (here: myBinary.out). as output you get cache miss/hit ratios along with the absolute numbers. cg_annotate also displays these data split per function.
by default cachegrind takes the cache parameters of your PC, but you can explicitly force it to use other values too.
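for example, to simulate 32kB, 8-way L1 caches with 64-byte lines (the sizes here are just illustrative – the exact option syntax is described in the cachegrind manual):

valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 ./myBinary.out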