Quote Originally Posted by Mirage
For example, if all images are divided into 64x64-pixel blocks (i.e. array[0..64*64-1] of word), you can compare two such blocks and both will fit in L1.
The code will become more complex, so it's reasonable to do some simple testing to find out what performance boost can actually be achieved, e.g. comparing one 64x64 block against several others.
The operation being performed is a simple sum and difference. Cache optimization pays off when complex operations are performed repeatedly on the same data; in this case, given how trivial the per-pixel operation is and how much data has to be processed, the cache is useless.
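To make that concrete, here is a minimal sketch of the per-block work (the 64x64 layout of Word pixels is taken from the quoted post; the names are mine): each pixel is loaded once, used in one subtraction and one addition, and never reused, so the loop ends up waiting on memory rather than on the ALU.

type
  TBlock = array[0..64*64-1] of Word;

// Sum of absolute differences over one block; one trivial operation
// per pixel, no data reuse.
function BlockSAD(const A, B: TBlock): Cardinal;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(A) do
    Result := Result + Cardinal(Abs(Integer(A[i]) - Integer(B[i])));
end;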

In fact, you could build dedicated hardware that does the necessary calculations and accesses RAM through DMA with no CPU involvement at all. Memory bandwidth should therefore be the main concern here, which is why I suggested going for GPGPU: high-end video cards have a vastly superior memory interface.
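As a purely illustrative back-of-envelope (the image count, resolution and bandwidth figures are all assumed, not taken from your setup):

  1000 images x 1024x768 pixels x 2 bytes/pixel  =  ~1.5 GB per full comparison pass
  dual-channel DDR3 at roughly 20 GB/s           -> ~80 ms just to stream that data once
  GDDR5 on a high-end card at roughly 200 GB/s   -> ~8 ms for the same traffic

Whatever the real numbers turn out to be, the time is dominated by moving the pixels, not by adding and subtracting them.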

I would, however, support your other suggestion: optimize the actual approach instead of trying to optimize the inner loop. That is, improve the technique itself so that more of the work is done on the CPU with less stress on memory bandwidth.

For one, I would suggest researching more advanced techniques that reduce the problem set instead of performing a brute-force pixel comparison. Not to mention that the RGB color space itself is inaccurate and inadequate for any task where visual/optical quality is concerned; CIELAB, CIELUV, DIN99, CIECAM, or even a mix of our own would be better suited for that purpose.
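As an illustration of the color-space point, here is a rough sketch of a per-pixel comparison in CIELAB using the simple CIE76 delta-E (sRGB input, D65 white point; the constants are the standard published ones, but treat the routine as a starting point, not a drop-in replacement):

uses
  Math;

type
  TLab = record
    L, a, b: Double;
  end;

// sRGB component (0..255) to linear light.
function SrgbToLinear(c: Byte): Double;
var
  v: Double;
begin
  v := c / 255.0;
  if v <= 0.04045 then
    Result := v / 12.92
  else
    Result := Power((v + 0.055) / 1.055, 2.4);
end;

// Nonlinearity used by the XYZ -> Lab conversion.
function LabF(t: Double): Double;
begin
  if t > 0.008856 then                   // (6/29)^3
    Result := Power(t, 1.0 / 3.0)
  else
    Result := 7.787 * t + 16.0 / 116.0;
end;

function RgbToLab(r, g, b: Byte): TLab;
var
  rl, gl, bl, x, y, z: Double;
begin
  rl := SrgbToLinear(r);
  gl := SrgbToLinear(g);
  bl := SrgbToLinear(b);
  // linear RGB -> XYZ (D65), normalised by the D65 white point
  x := (0.4124 * rl + 0.3576 * gl + 0.1805 * bl) / 0.95047;
  y := (0.2126 * rl + 0.7152 * gl + 0.0722 * bl) / 1.00000;
  z := (0.0193 * rl + 0.1192 * gl + 0.9505 * bl) / 1.08883;
  Result.L := 116.0 * LabF(y) - 16.0;
  Result.a := 500.0 * (LabF(x) - LabF(y));
  Result.b := 200.0 * (LabF(y) - LabF(z));
end;

// Plain Euclidean distance in Lab (CIE76 delta-E).
function DeltaE76(const c1, c2: TLab): Double;
begin
  Result := Sqrt(Sqr(c1.L - c2.L) + Sqr(c1.a - c2.a) + Sqr(c1.b - c2.b));
end;

Even this plain Euclidean distance in Lab tracks perceived difference far better than a distance in raw RGB; the more elaborate formulas (CIEDE2000, DIN99, CIECAM02) refine it further.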