Optimizing for the CPU cache is not a trivial task. You need to understand how the cache works in some detail to get close to its full benefit.
As a start, though, you can try to make all the data structures you are working on at any given moment fit in the L1 cache.
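For example, if every image is divided into blocks of 64x64 pixels (i.e. array[0..64*64-1] of word), each block takes 64*64*2 bytes = 8 KB, so two such blocks need 16 KB and can both sit in L1 while you compare them (an L1 data cache is commonly 32 KB per core, but check your CPU).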
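A minimal sketch of such a block and a block-to-block comparison, written in C rather than Pascal. The names `Block` and `block_sad` are made up for illustration, and the sum of absolute differences is only an assumed comparison metric, since the right one depends on what "compare" means for your images:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_DIM = 64, BLOCK_PIXELS = BLOCK_DIM * BLOCK_DIM };

/* One 64x64 tile of 16-bit pixels, the C counterpart of
   array[0..64*64-1] of word. */
typedef struct {
    uint16_t px[BLOCK_PIXELS];
} Block;

/* 4096 pixels * 2 bytes = 8 KiB per block, so two blocks take 16 KiB. */
static_assert(sizeof(Block) == 8 * 1024, "a block should be exactly 8 KiB");

/* Sum of absolute differences between two blocks.
   Assumed metric, chosen only to have something concrete to measure. */
uint64_t block_sad(const Block *a, const Block *b)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < BLOCK_PIXELS; ++i) {
        int diff = (int)a->px[i] - (int)b->px[i];
        sum += (uint64_t)(diff < 0 ? -diff : diff);
    }
    return sum;
}
```

Both arguments are read sequentially and exactly once per call, so the whole working set of one comparison is the two 8 KiB blocks.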
The code will become more complex, so it's reasonable to run a simple test first to find out how much of a performance boost can be achieved, e.g. compare one 64x64 image to several other 64x64 images.
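One way such a test could look, again as a C sketch under assumptions: `find_best_match` and `block_sad` are hypothetical names, the metric is an assumed sum of absolute differences, and `clock()` is used only because it is in the standard library (it measures CPU time with coarse resolution, so a real test should repeat the call many times):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

enum { BLOCK_PIXELS = 64 * 64 };

/* One 64x64 block of 16-bit pixels (8 KiB). */
typedef struct {
    uint16_t px[BLOCK_PIXELS];
} Block;

/* Assumed comparison metric: sum of absolute differences. */
static uint64_t block_sad(const Block *a, const Block *b)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < BLOCK_PIXELS; ++i) {
        int diff = (int)a->px[i] - (int)b->px[i];
        sum += (uint64_t)(diff < 0 ? -diff : diff);
    }
    return sum;
}

/* Compares one reference block with `count` candidate blocks.
   Returns the index of the closest candidate and stores the CPU
   time spent, in seconds, in *seconds. */
size_t find_best_match(const Block *ref, const Block *cands,
                       size_t count, double *seconds)
{
    assert(ref != NULL && cands != NULL && seconds != NULL && count > 0);

    clock_t start = clock();
    size_t best = 0;
    uint64_t best_sad = UINT64_MAX;
    for (size_t i = 0; i < count; ++i) {
        uint64_t sad = block_sad(ref, &cands[i]);
        if (sad < best_sad) {
            best_sad = sad;
            best = i;
        }
    }
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return best;
}
```

Running the same routine over 64x64 blocks and over your current, unblocked image layout gives a rough idea of whether the extra complexity pays off before you restructure the real code.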