Quote, originally posted by Emil:
Imagine this function being called about 40 million times during runtime (4000 frames, 200 rectangles, offsets from -3 to +3 in both x and y = 49 positions around the rectangle). I'm afraid I can't really go back from 16-bit images to 8-bit images, and I don't see how I can improve on this code any more.

I can suggest several ways to improve performance. If you tell us more about how the function's result is used, we can probably suggest more options:

1. Use more cache-friendly data structures. For example, store and process images as blocks of 32x32 or 64x64 pixels, so that the pixels a rectangle touches lie close together in memory. This may greatly improve performance. The optimal block size depends on the CPU cache size and is best determined empirically.
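2. Use one-dimensional arrays and pointer arithmetic. The code
Code:
// Assumes currentPixels points to the first pixel (^Word) and ImageLineSize is
// the row length in pixels, not bytes. Delphi needs {$POINTERMATH ON} for this.
var ImgPtr: ^Word;
for y := rect.Top to rect.Bottom do begin
  // Row start is computed once per row; yo/xo are the same offsets as in the loop below
  ImgPtr := currentPixels + (y + yo) * ImageLineSize + rect.Left + xo;
  for x := rect.Left to rect.Right do begin
    intensityCurLW := intensityCurLW + (ImgPtr^ shr 6);
    Inc(ImgPtr); // advance by one pixel
  end;
end;
performs fewer operations in the inner loop and is also more cache-friendly, because it reads each row sequentially, whereas the following code jumps a whole image row between consecutive reads:
Code:
for x := rect.Left to rect.Right do
  for y := rect.Top to rect.Bottom do
    intensityCurLW := intensityCurLW + (currentPixels[y + yo, x + xo] shr 6);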
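To illustrate the blocked layout from point 1, here is a minimal sketch of how a pixel coordinate could be mapped into tile-by-tile storage. The names TTiledImage, InitTiled and TiledIndex, as well as the tile size of 32, are assumptions made for this example and are not taken from Emil's code.

```pascal
program TiledImageDemo;
{$IFDEF FPC}{$mode objfpc}{$ENDIF}

const
  TileSize = 32; // assumed block edge in pixels; tune empirically

type
  // Hypothetical container: all tiles are stored back to back in one array
  TTiledImage = record
    Width, Height: Integer; // image size in pixels
    TilesPerRow: Integer;   // number of tiles across the image
    Data: array of Word;    // 16-bit pixels, tile after tile
  end;

procedure InitTiled(out Img: TTiledImage; AWidth, AHeight: Integer);
var
  TilesPerCol: Integer;
begin
  Img.Width := AWidth;
  Img.Height := AHeight;
  // Round up so partially filled edge tiles still get storage
  Img.TilesPerRow := (AWidth + TileSize - 1) div TileSize;
  TilesPerCol := (AHeight + TileSize - 1) div TileSize;
  SetLength(Img.Data, Img.TilesPerRow * TilesPerCol * TileSize * TileSize);
end;

// Index of pixel (x, y): first pick the tile, then the position inside it
function TiledIndex(const Img: TTiledImage; x, y: Integer): Integer;
begin
  Result := ((y div TileSize) * Img.TilesPerRow + (x div TileSize))
            * (TileSize * TileSize)
          + (y mod TileSize) * TileSize + (x mod TileSize);
end;

var
  Img: TTiledImage;
  x, y: Integer;
  Sum: LongWord;
begin
  InitTiled(Img, 100, 80);
  // Fill with a simple test pattern: pixel value = x + y
  for y := 0 to Img.Height - 1 do
    for x := 0 to Img.Width - 1 do
      Img.Data[TiledIndex(Img, x, y)] := Word(x + y);

  // Sum a 3x3 rectangle; all nine pixels live inside one tile
  Sum := 0;
  for y := 10 to 12 do
    for x := 20 to 22 do
      Sum := Sum + Img.Data[TiledIndex(Img, x, y)];
  WriteLn(Sum); // prints 288
end.
```

Calling TiledIndex for every pixel costs a few divisions, so a real inner loop would more likely walk tile by tile and use plain pointer increments inside each tile. With a power-of-two tile size the compiler can usually turn div and mod into shifts and masks.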
Quote, originally posted by Emil:
Would it be possible to speed things up a bit using, for example, SSE1/2 instructions, and if so, how?

It depends on where the bottleneck is. If the code is limited by memory bandwidth, SIMD may not make a big difference.
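One rough way to get a feel for how much memory access matters, before investing in SSE, is to time the same sum with two traversal orders. The sketch below assumes a hypothetical 2048x2048 16-bit image in a one-dimensional array; the sizes and names are made up for the test and the timer resolution is coarse, so treat the numbers as indicative only.

```pascal
program AccessOrderTiming;
{$IFDEF FPC}{$mode objfpc}{$ENDIF}

uses
  SysUtils, DateUtils;

const
  W = 2048;      // hypothetical image width in pixels
  H = 2048;      // hypothetical image height in pixels
  Repeats = 20;  // repeat to get above the timer resolution

var
  Pixels: array of Word;
  i, x, y, r: Integer;
  SumRows, SumCols: UInt64;
  T0: TDateTime;
begin
  SetLength(Pixels, W * H);
  for i := 0 to High(Pixels) do
    Pixels[i] := Word(i and $FFFF); // arbitrary test data

  // Row by row: consecutive reads are adjacent in memory
  SumRows := 0;
  T0 := Now;
  for r := 1 to Repeats do
    for y := 0 to H - 1 do
      for x := 0 to W - 1 do
        SumRows := SumRows + (Pixels[y * W + x] shr 6);
  WriteLn('row by row:       ', MilliSecondsBetween(Now, T0), ' ms');

  // Column by column: consecutive reads are W pixels apart
  SumCols := 0;
  T0 := Now;
  for r := 1 to Repeats do
    for x := 0 to W - 1 do
      for y := 0 to H - 1 do
        SumCols := SumCols + (Pixels[y * W + x] shr 6);
  WriteLn('column by column: ', MilliSecondsBetween(Now, T0), ' ms');

  // Both orders must give the same result; printing also keeps the
  // compiler from discarding the loops
  WriteLn('sums equal: ', SumRows = SumCols);
end.
```

If the column-by-column version turns out much slower, memory access is a large part of the cost, and fixing the data layout and loop order is likely to pay off before SIMD does.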