cache usage effects on modern CPUs

about a week ago Wojtek sent me a link to “an interesting article about CPU caches”, called What Every Programmer Should Know About Memory1). this is how it all started…

note that there is also a very short summary of all this, available from the original author as a slide show.

one more interesting article regarding the CPU cache and its (side) effects is the gallery of processor cache effects.

first steps

i already knew something about this, but i thought it would be good to deepen this knowledge and settle it right in my mind. i was more than a little surprised to see a 114-page-long article written in a standard paper font (i.e. small). even before getting through it i felt a deep need to test the presented results in practice, on my own computer(s). i wrote a few short programs and measured their execution time with /usr/bin/time. they were of course not very exact, but they gave a clue as to what is important and how much.

code examples

the code i wrote, along with a proper Makefile, is available in the cache_usage_effects.tar.bz2 file. inside it there are a number of C++ files that i'll briefly describe in a moment.

in general each file is compiled twice – with and without the MAKE_OPT_VERSION flag. when this flag is not set, the usual “non-optimal” solution is taken. when the flag is set, the better-performing version is used instead2).

important notices to be kept in mind when running these tests:

  1. results are coarse and imprecise – they only give you an idea of what speedup/slowdown magnitudes we're talking about.
  2. all tests return a non-zero exit status – this is used to produce some output from the code, so that the code optimizer does not remove “unused” code.
  3. compilation might show warnings about conversions from ints/floats/doubles – this does not matter here. i just had to put in some extra, meaningless computations to make other things more visible (e.g. output code size).
  4. one of the files uses a meta-program to generate a huge binary – depending on the computer's speed and the compiler, it might take around a minute to compile it – just be patient… :)
  5. results differ A LOT when run on different hardware or compiled with different compilers. it's just that some CPUs handle specific situations better, and some compilers do better optimizations in certain situations and worse in others, etc…

in general: you can compile the code using the (default) 'all' target:

make all

to run the examples, just type:

make run

notice that the 'run' target depends on 'all' as well.


condition_checks_vs_L1i_cache

shows the influence of code-flow interruptions caused by conditional jumps. the case here is the common discussion about using only one exit point from a function vs. exiting right after an error has been encountered.

when long jumps have to be taken, the code flow is broken and the L1i cache has to be loaded with new instructions, which costs time. good branch prediction amortizes this cost though.

both cases usually execute in a similar time, though on some machines exiting immediately yields better results (about 2%).


data_structures_vs_prefetching

shows the influence of data layout in memory vs. prefetching.

when a data structure is big, it is prefetched by the CPU, even though only a small part of it is needed (here: the index). this pollutes the caches with data that is not used even once. it can be prevented by keeping unrelated data apart from the data that is likely to be used.

in this case separating the data from the index speeds the program up about 5 times (500%) on a task similar to searching a structure by index.


loop_expansion_vs_code_size

this example shows the influence of loop expansion on execution time.

longer code, without jumps, takes more space in the cache, causing more frequent cache misses. since the CPU is much faster than memory, loop iteration is, from some point on, faster than overly expanded code.

the loop in this case is about 5% faster than its expansion. also noticeable is the fact that the expanded-loop binary is 25 times (2500%) bigger than the code with the simple loop.


loop_iteration_index

a classic example showing how cache misses influence processing.

when a 2-dimensional array is accessed by iterating over rows in the inner loop, for big arrays all cache reads are misses. when the inner loop iterates over the columns of a row, on the other hand, most reads are hits (thanks to prefetching), and so the long memory accesses are amortized.

code with proper indexing is usually about 10 times (1000%) faster than the reversed indexing.


matrix_multiply

one more, less trivial example of data-caching effects.

when computing a matrix multiplication using the equation straight from the definition, we end up doing reversed indexing on the second matrix. to prevent this, the code has been slightly reorganized to ensure that as many accesses as possible are “local” (i.e. elements lying nearby in memory are used in most of the cases).

this code is around 5 times (500%) faster than the naive implementation.

the interesting thing is that these implementations are nearly identical. both use 3 nested loops with a multiplication inside. the only real difference is that the “optimized” version accesses memory in sequential order, rather than in “random” order (as seen from the cache prefetcher's point of view).

  // optimized
  for(int y=0; y<n; ++y)
    for(int i=0; i<n; ++i)
      for(int j=0; j<n; ++j)
        out[y][j] += i1[y][i] * i2[i][j];
  // normal
  for(int y=0; y<n; ++y)
    for(int x=0; x<n; ++x)
      for(int i=0; i<n; ++i)
        out[y][x] += i1[y][i] * i2[i][x];

writing code this way in real life would surely require a comment, since it's not obvious and a bit less readable than the “normal” implementation.

one more thing to notice: during the tests of this particular piece of code, when compiled with intel's icc compiler and run on an itanium CPU, the “normal” version behaved exactly the same as the “optimized” one! with gcc and on other CPUs it was always around 5 times faster, but icc must have noticed that this is a typical matrix multiplication and optimized it in the same (or a similar) way.


padding_vs_speed

an example showing how padding speeds up memory access.

when creating a structure, one may force the compiler to discard any padding. this makes the structure smaller (here, on x86-64 with gcc-4.3.4, by almost 40%, though one must note that the structure was crafted specifically to ensure a lot of padding), but it requires additional computational time to process its values. the key point of this part is to show that these extra instructions cost more than the additional cache operations for padded (aligned) structures.

this code is around 30% slower when using non-padded structures.


param_check_optimization

an example showing the influence of polarized branching (i.e. in most cases only one path is taken) and of optimizing for it on code execution.

when branches are known to pass/fail most of the time (i.e. errors are expected to happen far less often than the successful case), this knowledge can be used to bias the compiler to generate code optimized for the common condition. longer code (an inlined doSth() call) is used between the branches to make the effect more visible.

the optimized code is around 1–2% faster than the non-optimized one.

test runs

here is the output of one of the test runs on my home laptop:

######### RUNNING TESTS #########                     
exit status is only a sort of check sum, to see if both implementations give the same results
Command exited with non-zero status 52                                                       
2.97user 0.00system 0:02.97elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k                  
0inputs+0outputs (0major+254minor)pagefaults 0swaps                                          
Command exited with non-zero status 52                                                       
2.93user 0.00system 0:02.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k                  
0inputs+0outputs (0major+254minor)pagefaults 0swaps                                          
Command exited with non-zero status 105                                                      
0.67user 0.04system 0:00.72elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k                  
0inputs+0outputs (0major+12813minor)pagefaults 0swaps                                        
Command exited with non-zero status 105                                                      
0.07user 0.00system 0:00.07elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k                  
0inputs+0outputs (0major+460minor)pagefaults 0swaps                                          
Command exited with non-zero status 127                                                      
5.34user 0.00system 0:05.35elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k                  
0inputs+0outputs (0major+308minor)pagefaults 0swaps                                          
Command exited with non-zero status 127                                                      
5.48user 0.00system 0:05.48elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k                 
0inputs+0outputs (0major+270minor)pagefaults 0swaps                                          
Command exited with non-zero status 1                                                        
1.23user 0.03system 0:01.26elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k                 
0inputs+0outputs (0major+16682minor)pagefaults 0swaps                                        
Command exited with non-zero status 1                                                        
0.11user 0.03system 0:00.16elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k                  
0inputs+0outputs (0major+16682minor)pagefaults 0swaps                                        
Command exited with non-zero status 129
1.74user 0.00system 0:01.74elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1788minor)pagefaults 0swaps
Command exited with non-zero status 129
0.28user 0.00system 0:00.28elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1788minor)pagefaults 0swaps
structure size is 15[B]
Command exited with non-zero status 245
6.64user 0.00system 0:06.65elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+307minor)pagefaults 0swaps
structure size is 24[B]
Command exited with non-zero status 245
3.76user 0.00system 0:03.76elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+330minor)pagefaults 0swaps
Command exited with non-zero status 26
4.72user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+254minor)pagefaults 0swaps
Command exited with non-zero status 26
4.72user 0.00system 0:04.72elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+254minor)pagefaults 0swaps

as a comparison, here are the test results from my laptop (Core 2 Duo) and a few other, stationary computers that i currently have access to. all times are in seconds (user time). notice that these are one-time runs, just meant to give an overview of the whole case – to make a real comparison the tests should be run multiple times, the standard deviation should be computed, and so on.

no  example                          version  Core2 2.0GHz  Core2 2.5GHz  Xeon 3.0GHz
1a  condition_checks_vs_L1i_cache    non-opt  2.97          2.33          3.28
1b  condition_checks_vs_L1i_cache    opt      2.93          2.33          2.75
2a  data_structures_vs_prefetching   non-opt  0.67          0.44          0.18
2b  data_structures_vs_prefetching   opt      0.07          0.05          0.04
3a  loop_expansion_vs_code_size      non-opt  5.34          4.28          4.76
3b  loop_expansion_vs_code_size      opt      5.48          4.38          4.38
4a  loop_iteration_index             non-opt  1.23          0.82          4.41
4b  loop_iteration_index             opt      0.11          0.08          0.05
5a  matrix_multiply                  non-opt  1.74          0.72          0.70
5b  matrix_multiply                  opt      0.28          0.18          0.21
6a  padding_vs_speed                 non-opt  6.64          5.28          7.37
6b  padding_vs_speed                 opt      3.76          3.11          4.94
7a  param_check_optimization         non-opt  4.72          3.75          5.64
7b  param_check_optimization         opt      4.72          3.75          5.46

things to remember

after going through the previous results one can see that, among the optimizations presented, a few made a great impact on execution time in most of the cases. these are the ones that should be kept in mind during everyday work:

  1. always use memory that is as close (local) as possible (nos. 2, 4, 5). this means iterating over tables in a forward/backward manner, minimizing the required “memory jumps”, etc…
  2. an expanded loop is not always faster than the loop itself, since a loop does not require that many instruction-memory accesses. do not manually expand loops – leave it to the compiler (no. 3).
  3. use arrays as much as possible. it is better to keep the indexes inside an array, and the data they correspond to outside of it. once an element is found, it can be accessed with minimal cost (it is just one element now) and the search can be as efficient as possible (no. 2).
  4. in general padded structures are faster than non-padded ones (no. 6). they require some extra memory accesses, but reads/writes on the structure's members are way faster anyhow.


cachegrind

among the many tools presented in the article to help profile cache usage, the one i liked the most is cachegrind – a valgrind plug-in. its usage is extremely simple and intuitive. all you have to do is run:

valgrind --tool=cachegrind ./myBinary.out
cg_annotate cachegrind.out.1234 | less -S

where 1234 is the pid of the last-run program (here: myBinary.out). as output you get cache miss/hit ratios along with the raw numbers. cg_annotate also displays this data split by function.

by default cachegrind uses the parameters of the cache on your PC, but you can explicitly force it to use other values too.

1) the article you're reading now is just a short overview – you're strongly advised to read the original article anyway.
2) yes – i hate #ifdefs too, but this is short example code and it fits perfectly fine here.
docs/cache_usage_effects_on_modern_cpus/cache_usage_effects_on_modern_cpus.txt · Last modified: 2021/06/15 20:09 by