i've spent most of the last working day chasing an issue on our CI servers. here is the story…
after adding jobs to rebuild docker images with each revision, i've added an internal docker registry, to make sure we have a common place to store generated images (easy, convenient, reproducible, no need to build anything manually, etc…). since we have multiple agents, i quickly noticed that even though i explicitly do docker pull on the images, they still get rebuilt over and over again, even though neither the Dockerfile nor its dependencies have changed.
time went by (the images took some time to build) and there was no apparent reason why caching did not work. starting the docker daemon with debug logs enabled indicated that the cache is not used (“thank you captain obvious”). so what's going on?
querying the internet revealed an interesting fact: in docker 1.10.0 caching has changed drastically. TL;DR: the problem is that if you allow remote images to add stuff to your local cache (as docker used to do, until some time ago), you effectively allow anyone to mislead your cache, leading to a potential security breach. the example provided was something along these lines:
FROM debian
RUN apt-get update
let's say that this will generate cache entries #1 (“FROM”) and #2 (“RUN”). now a malicious image could pollute your cache, putting anything behind “RUN” and just pretending it was produced by the declared “RUN” command. the fix is simple: do not trust the remote cache.
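to make the attack a bit more tangible, here is a toy model in python. it is only a sketch: the cache layout (a dict keyed by parent layer and instruction), the layer names and the build function are my own simplifications, not docker's real data structures.

```python
# toy model of a docker-style build cache; NOT docker's real data structures.
# a cache entry maps (parent layer id, instruction) -> resulting layer id.

def build(instructions, cache, layers):
    """replay instructions, reusing cached layers whenever a key matches."""
    parent = None
    for instruction in instructions:
        key = (parent, instruction)
        if key not in cache:
            # cache miss: "execute" the instruction honestly
            layer_id = f"layer{len(layers)}"
            layers[layer_id] = f"honest result of {instruction!r}"
            cache[key] = layer_id
        parent = cache[key]
    return parent


dockerfile = ["FROM debian", "RUN apt-get update"]

# local-only cache: every entry was produced on this machine
local_cache, local_layers = {}, {}
top = build(dockerfile, local_cache, local_layers)

# cache that trusts metadata shipped with a pulled image: the attacker claims
# that their layer is what "RUN apt-get update" produced on top of debian
trusting_cache = {
    (None, "FROM debian"): "layer0",
    ("layer0", "RUN apt-get update"): "evil",
}
trusting_layers = {
    "layer0": "honest result of 'FROM debian'",
    "evil": "backdoored filesystem",
}
poisoned_top = build(dockerfile, trusting_cache, trusting_layers)

print(local_layers[top])
print(trusting_layers[poisoned_top])
```

the second build never executes anything: both instructions hit the cache, so the final image is whatever the attacker attached to the “RUN” entry.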
there is however a problem. you can no longer rely on builds on different machines generating the same images. in fact it's the direct opposite: each machine will always generate a new image! this means that each build host must (sooner or later) rebuild the image to populate its own cache. it also means that pushing to the docker registry will take N times as much space (where N is the number of build machines)! so you lose both in terms of time and size.
there are different ways to overcome this issue. one idea is to share caches between machines and/or propagate caches via docker save + docker load (shared storage is still needed). this however requires another infrastructure element and adds extra complexity.
another workaround is to explicitly allow cache entries from a given source, with the --cache-from "image:tag" option. it landed in docker 1.13. this looks the most promising at the moment.
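a CI job using this could look roughly like the sketch below. the registry address and image name are made up for illustration, and the helper functions are hypothetical; only the docker pull, docker build --cache-from and docker push invocations themselves are real docker CLI.

```python
import subprocess


def cached_build_commands(image, context="."):
    """return the docker invocations for a build seeded from the registry copy."""
    return [
        # the pull may fail on the very first build, when nothing was pushed yet
        ["docker", "pull", image],
        # explicitly allow layers of the pulled image to be used as cache
        ["docker", "build", "--cache-from", image, "-t", image, context],
        # publish the result, so that other agents can reuse it
        ["docker", "push", image],
    ]


def run(commands):
    """execute the commands in order; only the initial pull is allowed to fail."""
    for index, argv in enumerate(commands):
        subprocess.run(argv, check=(index != 0))


# hypothetical image name on a hypothetical internal registry
commands = cached_build_commands("registry.example.local:5000/ci/base:latest")
for argv in commands:
    print(" ".join(argv))
```

the snippet only prints the commands; calling run(commands) on a build agent would actually execute them. the key point is that the image is pulled first, so that its layers exist locally when the build names it as a cache source.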
one more alternative i was thinking about was to use the cache info on what has already been built locally and store it for extended periods of time. this way it would be enough to keep one hash per layer locally. even though each host still needs to rebuild the image at least once, it does not need to keep it forever just to have its caches populated. instead it could push the image to a (local) instance of the docker registry and remove it locally. next time the image is needed, it's enough to pull it back from the registry and re-validate that the layer hashes and content match the expected ones.
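the bookkeeping part of that idea can be sketched in a few lines of python. this is just an illustration under my own assumptions: layers are modelled as plain byte strings, the "manifest" is a JSON list of sha256 digests, and the function names are invented. it does not show how docker itself would be told to trust the re-validated layers.

```python
import hashlib
import json


def digest(blob: bytes) -> str:
    """content hash of a single layer."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def record(layers):
    """after a local build: keep one digest per layer, in build order."""
    return json.dumps([digest(blob) for blob in layers])


def revalidate(manifest, pulled_layers):
    """after pulling back: accept only if every layer matches what we built."""
    expected = json.loads(manifest)
    actual = [digest(blob) for blob in pulled_layers]
    return actual == expected


# stand-ins for the real layer contents
built = [b"debian base layer", b"result of apt-get update"]
manifest = record(built)  # small, so it can be kept for a long time

# ... the image is pushed to the registry and removed locally ...

print(revalidate(manifest, built))
print(revalidate(manifest, [built[0], b"tampered layer"]))
```

the host only keeps the small list of digests, while the layers themselves live in the registry; a pulled image whose layers do not hash to the recorded values is rejected.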