recently i was preparing a training on RAII and on using it to manage resources (most commonly, memory). as part of it, C++11's smart pointers were to be presented. i wanted to show why std::make_shared<T>() is better than std::shared_ptr<T>{new T}. STL gave a nice explanation of the make_shared memory optimization. how to easily show that it takes fewer allocations and that the total allocated size is smaller? i got lazy (surprise! surprise! nobody expected the spanish inquisition!) and decided to just use valgrind for that. one could easily see that with make_shared there is 1 allocation, compared to 2 when the shared_ptr is created directly – clearly a gain.
unfortunately this approach did not manage to show that memory is, in fact, saved as well. the problem (?) was that valgrind reports exactly how many bytes the application requests – not how many the actual allocator hands out. but still, the total size should always be the same, right? and here comes the funny part – no, it may not be. for performance reasons shared_ptr uses atomics for both of its reference counters. on most architectures this means that the memory for them needs to be properly aligned, and so the single allocation needs to include proper padding. having 2 64-bit atomic counters gives 16B of must-be-aligned storage. but what if make_shared allocates user data and counters at once? well – just put the counters first and we're done (an allocation is always properly aligned). well – not really… if you allocate an array of such structures, they need to be contiguous and each of them aligned. this is where trailing padding kicks in. assuming we need 8B alignment, 2*8B=16B for counters and, say, 20B for user data, we get a 36B data structure. now we need 4B of padding so that the next one in an array is aligned as well. the final size of such a structure is 40B.
of course, doing separate allocations, and assuming the user data has no alignment requirement, we can have two allocations: 16B for counters and 20B for user data. with a perfect allocator that is exactly what we would get. in real life, for small allocations, we usually get way more than we asked for. this means that asking for 16B is usually no different (from the system PoV) than asking for, say, 64B. using make_shared we always save time (half as many allocations) and most of the time also memory, since we pay this overhead just once and the padding disappears in this gap.
finally, for the purpose of the training, i measured the exact memory usage with a modified program that keeps all allocated memory in a container and measures the process memory usage before exiting. for my sample data, the memory overhead of creating shared_ptr directly, compared to using make_shared, was about 20%. the time overhead was nearly 100%, as one could expect, since the program spent most of its time in allocators and deallocators. be careful what is being measured and how – there are things going on behind the scenes… ;)