2012.02.26 - overhead, footprint and stuff

we often tend to put too little focus on keeping our knowledge up to date. moreover, programmers tend to do optimizations based solely on a “strong feeling” rather than on experiments and measurements. plus, we all know that premature optimization is the root of all evil, right? ;)

there is an excellent publication on the topic of C++ performance, namely the Technical report on C++ performance. it is a 2006 publication, written by many authors, including Bjarne Stroustrup and Dave Abrahams. i strongly encourage you to read it. this whole entry is based on that text and on simple experiments i've performed to see the theory in action.

so let's move to the actual cases.

exceptions

footprint - image taken from http://www.flash-screen.com/free-wallpaper/uploads/201112/thus/2012-animal-calendar-4.jpg

during one of the recent meetings an issue regarding exceptions was raised. some legacy code appeared not to work correctly when an exception was thrown. these things happen - we all love legacy code, after all, right? anyway, the topic quickly drifted away from the particular code-related problem to a more general one – the overhead of exceptions in C++. i argued that, AFAIK, there is no extra overhead when using exceptions, compared to the alternative of using error codes and pointers to error codes passed as function arguments. other people argued otherwise: that there is extra runtime overhead – the stack grows bigger, some extra memory locations are occupied by exception-related structures, etc… as i hadn't heard about this before, i decided to read some more on the topic.

this is how i discovered the Technical report on C++ performance. from it i learned that, classically, there are 2 basic approaches to implementing exception handling code.

  1. “dynamic” – idea published in 1992 (yes - cfront times!).
  2. “table” – published in 1994 and commonly used since then (with some improvements).

so what has changed in exception handling over the last 20 years? i strongly encourage you to read the original report, but for the sake of this entry i'll explain it shortly. the “dynamic” approach is based on putting on the stack, at runtime, pieces of information about which d-tors are to be called upon stack unwinding, so that in case an exception is thrown the proper objects can be destroyed, as required by the standard. it does introduce runtime overhead, both in memory usage (the stack grows faster) and in execution time (extra code to push/pop these entries every time such a point is passed during normal execution).

because of this, it was soon replaced with the “table” approach, where the compiler generates tables describing how to unwind the stack in case an exception occurs. what does this mean? statically generated exception handling code and data that are never touched unless an exception is thrown. no extra stack overhead. no extra execution time on the non-exceptional path.
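to see what both approaches actually have to accomplish, here is a tiny, hypothetical example (not taken from the report) – the d-tor of the local object must run while the stack unwinds towards the handler:

  #include <cstdio>
  #include <stdexcept>

  struct Resource
  {
      ~Resource() { std::puts("resource released"); } // must run during stack unwinding
  };

  void risky()
  {
      Resource r;                       // local object with a non-trivial d-tor
      throw std::runtime_error("boom"); // unwinding starts here, r gets destroyed
  }

  int main()
  {
      try
      {
          risky();
      }
      catch (const std::exception &e)
      {
          std::printf("caught: %s\n", e.what()); // printed after "resource released"
      }
  }

with the “dynamic” approach, the information needed to destroy r is pushed and popped at runtime as execution enters and leaves risky(); with the “table” approach, the same information sits in compiler-generated static tables and is consulted only when a throw actually happens.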

experimental results

this is a good place to start experimenting. i've written a simple Fibonacci sequence computation, using recursion. the program checks stack size and execution time with and without exceptions.
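just to give an idea of the shape of such a comparison, below is a minimal sketch of the two variants, plus a crude stack-usage probe. note this is not the actual test code – the names (fib_ex, fib_ec, probe) and the measuring trick are mine, for illustration only; for the real thing see the (no-)exceptions test examples mentioned below.

  #include <cstdint>
  #include <cstdio>
  #include <stdexcept>

  // crude stack-usage probe: remember the lowest address of a local variable
  // reached during the recursion (the stack grows downwards on both tested boxes)
  static std::uintptr_t g_lowest = ~std::uintptr_t(0);

  static void probe()
  {
      char marker;
      const std::uintptr_t here = reinterpret_cast<std::uintptr_t>(&marker);
      if (here < g_lowest)
          g_lowest = here;
  }

  // variant 1: errors reported via exceptions
  long fib_ex(int n)
  {
      probe();
      if (n < 0)
          throw std::invalid_argument("negative argument");
      return n < 2 ? n : fib_ex(n - 1) + fib_ex(n - 2);
  }

  // variant 2: errors reported via an error code passed through a pointer argument
  long fib_ec(int n, int *err)
  {
      probe();
      if (n < 0)
      {
          *err = 1;
          return 0;
      }
      if (n < 2)
          return n;
      const long a = fib_ec(n - 1, err);
      if (*err)
          return 0;
      const long b = fib_ec(n - 2, err);
      if (*err)
          return 0;
      return a + b;
  }

  int main()
  {
      char top;
      const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(&top);

      std::printf("fib_ex(44) = %ld, stack used: %lu bytes\n",
                  fib_ex(44), static_cast<unsigned long>(start - g_lowest));

      int err = 0;
      g_lowest = ~std::uintptr_t(0);
      std::printf("fib_ec(44) = %ld, stack used: %lu bytes\n",
                  fib_ec(44, &err), static_cast<unsigned long>(start - g_lowest));
  }

the point is simply that the error-code variant has to pass an extra pointer argument and check it after every call, while the exception variant keeps the non-exceptional path free of any error-handling code.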

running fib(44) from the (no-)exceptions test examples gave the following results:

stack size on amd64 with gcc-4.6:

  exceptions: 1116 bytes
  error code: 1112 bytes

stack size on ia32 with gcc-4.4:

  exceptions: 1104 bytes
  error code: 1468 bytes

execution time (avg. from 3 runs) on amd64 with gcc-4.6:

  exceptions: 5.13 s
  error code: 5.48 s

execution time (avg. from 3 runs) on ia32 with gcc-4.4:

  exceptions: 6.20 s
  error code: 6.45 s

conclusions

and so, ironically, using error codes makes the footprint larger than exceptions (!), and the difference grows with the stack depth. conclusion: let the compiler do the micro-optimizations while you focus on the architecture and the design. as Richard Hamming said: “Machines should work. People should think.”

templates vs. switches: the enums case

at more or less the same time as the “exceptions discussion” took place, i found a smelly piece of code that, due to a lack of flexible architecture, translated one enumeration type into another. there were a few problems with that. first of all, it was done with a switch(){} statement, which i personally do not like. second, it was used in a few places. third, the translation must be done in both directions, so with switch(){} you need to maintain two pieces of code that are highly coupled.
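just to illustrate the problem (with made-up Color/LegacyColor enums, not the ones from the real code), such coupled switch()-based translation typically looks like this:

  #include <stdexcept>

  enum Color { red, green, blue };                                            // made-up enum 1
  enum LegacyColor { legacy_red = 10, legacy_green = 20, legacy_blue = 30 };  // made-up enum 2

  // two switch()-based functions that must always be kept in sync by hand
  LegacyColor toLegacy(Color c)
  {
      switch (c)
      {
          case red:   return legacy_red;
          case green: return legacy_green;
          case blue:  return legacy_blue;
      }
      throw std::logic_error("unknown Color value");
  }

  Color fromLegacy(LegacyColor c)
  {
      switch (c)
      {
          case legacy_red:   return red;
          case legacy_green: return green;
          case legacy_blue:  return blue;
      }
      throw std::logic_error("unknown LegacyColor value");
  }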

what i decided to do was to write a simple, template-based wrapper that, given a single structure explaining which value from enum1 is which in enum2, can be used to do bidirectional mapping. this means keeping only one translation table that works both ways. the main charges i faced for this solution were that it is complicated and less efficient. frankly, i think this piece of code is fairly simple, but even if i'm wrong, it still does not matter much, since, as opposed to the 2-functions-with-switch()es solution, this code has to be written just once. the next time a bidirectional mapping is needed, introducing a new translation table to the generic template is enough – it is that simple! when it comes to efficiency, gcc-4.6 on an amd64 machine produced binaries of the same size in both cases.
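below is a minimal sketch of the idea, again with the made-up Color/LegacyColor enums; the actual wrapper from the enum test examples differs in the details:

  #include <cstddef>
  #include <stdexcept>

  enum Color { red, green, blue };                                            // made-up enum 1
  enum LegacyColor { legacy_red = 10, legacy_green = 20, legacy_blue = 30 };  // made-up enum 2

  // one entry of the translation table: a pair of corresponding values
  template<typename E1, typename E2>
  struct Mapping
  {
      E1 e1;
      E2 e2;
  };

  // generic bidirectional mapper: a single table, lookups both ways
  template<typename E1, typename E2, std::size_t N>
  class EnumMapper
  {
  public:
      explicit EnumMapper(const Mapping<E1, E2> (&table)[N]) : table_(table) {}

      E2 convert(E1 v) const
      {
          for (std::size_t i = 0; i < N; ++i)
              if (table_[i].e1 == v)
                  return table_[i].e2;
          throw std::logic_error("no mapping for given enum1 value");
      }

      E1 convert(E2 v) const
      {
          for (std::size_t i = 0; i < N; ++i)
              if (table_[i].e2 == v)
                  return table_[i].e1;
          throw std::logic_error("no mapping for given enum2 value");
      }

  private:
      const Mapping<E1, E2> (&table_)[N];
  };

  // the only thing to write for a new pair of enums: one table
  const Mapping<Color, LegacyColor> colorTable[] = {
      { red,   legacy_red   },
      { green, legacy_green },
      { blue,  legacy_blue  },
  };

  int main()
  {
      const EnumMapper<Color, LegacyColor, 3> mapper(colorTable);
      return mapper.convert(green) == legacy_green &&
             mapper.convert(legacy_blue) == blue ? 0 : 1;
  }

adding a mapping for another pair of enums then boils down to writing one more table – the lookup code itself is never duplicated.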

experimental results

here are my results, tested on two machines with different systems and architectures. in case of doubt, compile the enum test examples and check for yourself.

executable size on amd64 with gcc-4.6, optimized for size:

  mpl:    6984 bytes
  manual: 6984 bytes

executable size on amd64 with gcc-4.6, optimized for speed:

  mpl:    6984 bytes
  manual: 6984 bytes

executable size on ia32 with gcc-4.4, optimized for size:

  mpl:    4312 bytes
  manual: 4312 bytes

executable size on ia32 with gcc-4.4, optimized for speed:

  mpl:    4976 bytes
  manual: 5144 bytes

conclusions

as one can see, gcc-4.6 on amd64 gave the same results for the template-based code as for the hand-written code. the optimizations work fine, so why bother repeatedly writing such code by hand? on the other hand, the older compiler (gcc-4.4) produced noticeably bigger code when optimizing for speed. this is usually not a good sign, since, due to processor caches, the same machine code often runs faster when it is compact. notice however that this does not apply to the extra code added for exception handling, since it is never executed under regular conditions and (depending on the compiler's code layout) probably won't even be prefetched.