
Thread: [Delphi 7] Code optimisation, comparing 2D matrices of words [SSE?]

  1. #1

    [Delphi 7] Code optimisation, comparing 2D matrices of words [SSE?]

    Hi All,

    This is my first post here, so please bear with me. I'm developing a project which compares many portions (say 200 regions of 100x100 pixels each) of a large number (> 3000) of medium-resolution (say 640x480) 16-bit images to portions within a single 16-bit reference image. I was wondering if it is possible to speed things up a bit. The following piece of code is used to compare one portion of an image to another, and it appears to be the biggest bottleneck in my code.


    Code:
    // globally known variables are:
    // referencePixels (2d array of words, contains the reference image)
    // currentPixels (2d array of words, contains the current image)
    // avgRef (double, mean intensity value of the reference image within a specific rectangle; always > 0)
    // rect (the specific rectangle within which the images are compared)
    // xo, yo (offsets from the rectangle in the current image; GetImgDifference is usually called with xo and yo ranging from -4 to +4)
    
    
    function GetImgDifference( rect : TRect; xo, yo : integer): single;
    var
      x, y : integer;
      avgCur : double;
      intensityCurLW,multLW : LongWord;
      c : integer;
      dLW: LongWord;
    begin
      
      // calculate the mean intensity value of a rectangle within an image
      intensityCurLW := 0;
      for x := rect.Left to rect.Right do
        for y := rect.Top to rect.Bottom do
          intensityCurLW := intensityCurLW + (currentPixels[y + yo,x + xo] shr 6); 
    
    
      // the size of the rectangle
      c := ((rect.Right - rect.Left)+1) * ((rect.bottom - rect.top)+1);
      
      // if the current intensity is larger than 0
      if (intensityCurLW > 0) then begin
        // store average intensity of current image in avgCur
        avgCur := intensityCurLW / c; 
        
        // calculate factor to correct for mean difference in current and reference image
        // let's store this in a long word (accurate enough, and fast)
        multLW := Round(16384 * avgRef / avgCur); // 2^14
    
    
        // calculate the absolute difference between the reference image and the current image
        dLW := 0;
        for x := rect.Left to rect.Right do
          for y := rect.Top to rect.Bottom do
            dLW := dLW + abs(  ((currentPixels[y + yo,x + xo]  * multLW) shr 14) - referencePixels[y,x] ) shr 6;
    
    
        // the result will be the average pixel difference
        result := dLW / c;
      end else begin
        result := PrettyLargeSingle; // return a huge single value when the current image had 0 intensity
      end;
    end;
    Imagine this function being called about 40 million times during a run (4000 frames, 200 rectangles, and offsets from -3 to +3 in both directions = 49 positions around each rectangle). I'm afraid I can't go back from 16-bit images to 8-bit images, and I don't see how I can improve this code any further. Would it be possible to speed things up using, for example, SSE1/2 instructions, and if so, how?

    If you have any questions, please do let me know!

    Kind regards,

    Emil

  2. #2
    You could take advantage of SIMD instructions to speed up the calculations, but you would have to write that code in assembly, which is not a trivial task.
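
    To give you an idea of what such a routine could look like, here is a rough sketch of just the sum-of-absolute-differences core in inline assembly (the brightness-correction multiply and the shr 6 from your function are left out, and all names are made up). Note that Delphi 7's built-in assembler may not accept SSE2 mnemonics, so this may need FreePascal, a newer Delphi, or DB-encoded opcodes:
    Code:
    // Hypothetical helper: sums |cur[i] - ref[i]| over count 16-bit pixels.
    // Assumes count is a multiple of 8 and that the total fits in a LongWord.
    function SumAbsDiff16(cur, ref: PWord; count: Integer): LongWord;
    asm
      // Delphi register convention: EAX = cur, EDX = ref, ECX = count
      pxor      xmm0, xmm0           // running sum (four dwords)
      pxor      xmm5, xmm5           // constant zero, used to widen words to dwords
      shr       ecx, 3               // 8 pixels per iteration
    @@loop:
      movdqu    xmm1, [eax]          // 8 current pixels
      movdqu    xmm2, [edx]          // 8 reference pixels
      movdqa    xmm3, xmm1
      psubusw   xmm3, xmm2           // max(cur - ref, 0), unsigned saturation
      psubusw   xmm2, xmm1           // max(ref - cur, 0)
      por       xmm3, xmm2           // = |cur - ref| per word
      movdqa    xmm2, xmm3
      punpcklwd xmm3, xmm5           // widen low four differences to dwords
      punpckhwd xmm2, xmm5           // widen high four differences to dwords
      paddd     xmm0, xmm3
      paddd     xmm0, xmm2
      add       eax, 16
      add       edx, 16
      dec       ecx
      jnz       @@loop
      // horizontal sum of the four dwords in xmm0
      movdqa    xmm1, xmm0
      psrldq    xmm1, 8
      paddd     xmm0, xmm1
      movdqa    xmm1, xmm0
      psrldq    xmm1, 4
      paddd     xmm0, xmm1
      movd      eax, xmm0            // function result is returned in EAX
    end;
    The brightness correction (multLW) could be folded in as well, for example with pmulhuw and a pre-scaled factor, but I would first measure whether a plain difference loop like this beats your Pascal version at all.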

    You would immediately benefit from compiling the above code for a 64-bit platform (which requires Delphi XE2 or FreePascal), as there are more registers the compiler can use to optimize the generated code.

    I would suggest instead trying a multi-threaded approach first, taking advantage of the CPU's multiple cores.

    Further optimizations would involve addressing the memory bandwidth bottleneck, where you would do better using GPGPU techniques with tools like CUDA.

  3. #3
    Thanks for the quick answer. I forgot to mention that the code is already multi-threaded (and scales very well; people use it on 8-core machines and it can use all of the processing power, in a good way that is: it runs about 8 times as fast). I was thinking about using GPGPU, but I think it would be too difficult for me at the moment. I do realize that this problem seems very well suited to the high number of ALUs available on modern graphics cards.

    Some time ago I briefly played around with Delphi XE2, and after a lot of struggling I compiled a 64-bit executable. But I wasn't able to speed things up; instead everything slowed down by about a factor of two. Perhaps I overlooked some stupid things. I will try again, but to be honest, I actually quite like developing in Delphi 7.

    But that's why I was hoping it was possible to use SIMD instructions. Perhaps someone is willing to spend some time with me on this problem? I'm always willing to learn.

  4. #4
    Quote Originally Posted by Emil View Post
    Imagine this function being called about 40 million times during a run (4000 frames, 200 rectangles, and offsets from -3 to +3 in both directions = 49 positions around each rectangle). I'm afraid I can't go back from 16-bit images to 8-bit images, and I don't see how I can improve this code any further.

    I can suggest several ways to improve performance, but if you tell us more details on how the function result is used we'll probably be able to suggest more options:

    1. Use more cache-friendly data structures. For example, store and handle images as blocks of 32x32 or 64x64 pixels. It may greatly improve performance. The optimal block size depends on CPU cache size and can be determined empirically.
    2. Use one-dimensional arrays and pointer arithmetic. The code
    Code:
    var ImgPtr: ^Word;
    for y := rect.Top to rect.Bottom do begin
      // pointer arithmetic shown conceptually; ImageLineSize is the image width in pixels
      ImgPtr := currentPixels + (y + yo) * ImageLineSize + rect.Left + xo;
      for x := rect.Left to rect.Right do begin
        intensityCurLW := intensityCurLW + (ImgPtr^ shr 6); 
        Inc(ImgPtr);   // step to the next pixel in the row
      end;
    end;
    contains fewer operations within the inner loop and is also more cache-friendly than this:
    Code:
    for x := rect.Left to rect.Right do
        for y := rect.Top to rect.Bottom do
          intensityCurLW := intensityCurLW + (currentPixels[y + yo,x + xo] shr 6);
    Quote Originally Posted by Emil View Post
    Would it be possible to speed things up a bit using for example SSE1/2 instructions, and if so, how?
    It depends on where the bottleneck is. If it is memory bandwidth, SIMD may not make a big difference.

  5. #5
    Interesting suggestion, thank you!

    Some more background: GetImgDifference is used to find the best way to align two sections of two different images. The content of both images is similar, but not exactly the same.

    So for each of >100 rectangles, and for each of >1000 different images, I search in a grid(*) around the center of the rectangle for the offset in X and Y that gives the smallest difference between the rectangle in the current image and the one in the reference image. This particular function returns the result for one of those offsets.

    * The grid has a maximum size of about 9x9 pixels (the center being an offset of 0,0, so -4 to +4 in both directions is searched). But I don't actually search this grid exhaustively; often I only need to search the center 5x5 area to say with high confidence that I have found the minimal pixel difference between the two image sections.
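
    To make it concrete, the search around one rectangle looks roughly like this (a simplified sketch with a made-up function name, not the actual code; in reality I often stop after the inner 5x5 area):
    Code:
    // simplified: find the offset with the smallest difference for one rectangle
    function FindBestOffset(const rect: TRect): TPoint;
    var
      xo, yo : integer;
      diff, bestDiff : single;
    begin
      bestDiff := PrettyLargeSingle;
      Result := Point(0, 0);
      for yo := -4 to 4 do
        for xo := -4 to 4 do begin
          diff := GetImgDifference(rect, xo, yo);
          if diff < bestDiff then begin
            bestDiff := diff;
            Result := Point(xo, yo);   // remember the best offset so far
          end;
        end;
    end;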

    If something is not clear, please let me know.

  6. #6
    From what you have written so far I gather that you are doing lucky imaging: http://en.wikipedia.org/wiki/Lucky_imaging
    You divide every image into a grid of rectangles and then compare the same rectangles between different images, right? I don't know what your images actually look like, but I assume it is possible for some of these rectangles to be completely black (no star in them). If that is the case, I think you could gain a lot of performance by excluding them from processing (it is useless to compare two completely black squares).

    If you ask me, I would take a completely different approach. Instead of dividing images into grids, I would use an object detection algorithm that can detect objects (stars) in the image.
    This could be done by simply searching pixel by pixel until you reach one with at least a certain brightness, and then using a tree-like pathfinding algorithm to find nearby pixels that are also at least that bright.
    This gives you the positions of the different stars in your images. In the end you would only be processing the areas around the star positions (the areas that matter) and you wouldn't be comparing the areas that don't.
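
    Something like this is what I mean (just a rough sketch, all names and types made up): once you hit a pixel above a brightness threshold, grow a region from it with a simple stack-based fill, so every star is visited only once.
    Code:
    type
      TPointArray = array of TPoint;  // TPoint and Point() come from the Types/Classes unit

    // Grows a region of connected bright pixels starting at (startX, startY).
    // pixels is a one-dimensional width*height buffer; visited has the same size.
    function GrowRegion(const pixels: array of Word; width, height: Integer;
                        startX, startY: Integer; threshold: Word;
                        var visited: array of Boolean): TPointArray;
    var
      stack: array of TPoint;
      top: Integer;
      p: TPoint;

      procedure Push(x, y: Integer);
      begin
        if (x < 0) or (y < 0) or (x >= width) or (y >= height) then Exit;
        if visited[y * width + x] or (pixels[y * width + x] < threshold) then Exit;
        visited[y * width + x] := True;   // mark before pushing so it is queued only once
        stack[top] := Point(x, y);
        Inc(top);
      end;

    begin
      SetLength(stack, width * height);
      SetLength(Result, 0);
      top := 0;
      Push(startX, startY);
      while top > 0 do begin
        Dec(top);
        p := stack[top];
        SetLength(Result, Length(Result) + 1);
        Result[High(Result)] := p;              // collect this star pixel
        Push(p.X + 1, p.Y); Push(p.X - 1, p.Y); // 4-connected neighbours
        Push(p.X, p.Y + 1); Push(p.X, p.Y - 1);
      end;
    end;
    The outer scan then just walks the image and calls GrowRegion for every unvisited pixel above the threshold; the returned points give you the star's position (its centroid, for example).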

  7. #7
    Yes, I'm doing lucky imaging.

    Where possible, I'm actually already doing what you suggest; I just wanted to sketch the problem here as simply as possible. The program already incorporates many tricks to speed things up. The software is actually meant for lucky imaging on wider targets: planets, or even the sun or moon. See my website (http://www.astrokraai.nl) for some example images if you are interested. It could also work on star fields. The placement of the alignment points (or rectangles, in my story here) is an entirely different problem of its own; among other things, it basically makes use of what you suggest. Areas without relevant information are ignored (but sometimes it is really difficult to tell which areas lack relevant information, and often the entire field of view is relevant).

    The problem I sketched here comes from doing lucky imaging on a larger surface, like the sun at a particular wavelength, where the target does not fit in the FOV. So every portion of the image is relevant; there are no black rectangles. Every portion of the image contains (distorted) views of the sun, hence the rectangles all over the place (sometimes >1000!). If the FOV contains a smaller target, like the planet Mars for example, there will be far fewer rectangles (usually < 10). But the basic problem remains the same: comparing one image to another.

    To give you an idea of the size of the project at the moment: it contains well over 12,000 lines of self-written Pascal source code. I started the project back in 2009, and right now it is being used by many amateur astronomers taking pictures (video sequences, actually) of the planets, sun and moon.

    Actually comparing one image to the other takes up most of the processing time, and I thought it would be very interesting to see if that could be made faster.
    Last edited by Emil; 26-08-2012 at 05:17 PM.

  8. #8
    Quote Originally Posted by Mirage View Post
    2. Use one-dimensional arrays and pointer arithmetic.
    I had no idea this would make such a big difference. At the moment I'm rewriting some of the main loops, trying to get as many calculations and comparisons as possible out of each time-consuming loop and changing to pointer arithmetic. I like it. In the provided example I can get the code to run almost 80% faster using this simple trick. It requires some changes in thinking about the code, but I think that is good, as it forces me to think about cleaner solutions (the inner loop really does not need all that arithmetic when you think about it).

    Delphi requires a slightly different syntax with pointers. This actually works just fine for me:
    Code:
    var ImgPtr : PWord; // this is just a ^Word
    for y := ... do begin
      ImgPtr := @currentPixels[ystart, xstart]; 
      for x := ... do begin
        // Do something with ImgPtr^ 
        Inc(ImgPtr);
      end;
    end;
    Basically, both for loops now do nothing but act as counters; x and y are not actually used at all.

    I'm not yet sure about your first suggestion though; could you elaborate on that? From what I understand, the CPU cache works like this: each time the CPU needs to read from RAM, it first checks whether the value is already in the CPU cache; if not, it retrieves that memory location AND, automatically, several other items following that location in memory (depending on the size of the cache line). This means that if your data is ordered [1,2,3,4] in main memory and the CPU wants to do something with the value 1, then 2, 3 and 4 will likely also be placed in the cache. That is why it is important to make sure the data you want to read is laid out in the proper order; otherwise you'll get a lot of cache misses, where the CPU is forced to read from the much slower main memory instead of the cache.
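
    A quick way to see this effect (just a toy test, nothing from the project) is to sum a big array once row by row and once column by column; as I understand it, only the row-by-row order walks memory in cache-line order:
    Code:
    program CacheDemo;
    {$APPTYPE CONSOLE}
    uses Windows; // GetTickCount

    const
      N = 4096; // 4096 x 4096 words = 32 MB, far larger than any CPU cache
    var
      data : array of Word;  // one-dimensional, indexed as [y * N + x]
      x, y : Integer;
      sum  : LongWord;
      t    : Cardinal;
    begin
      SetLength(data, N * N);
      sum := 0;

      t := GetTickCount;
      for y := 0 to N - 1 do        // row by row: consecutive addresses, cache friendly
        for x := 0 to N - 1 do
          sum := sum + data[y * N + x];
      WriteLn('row by row:       ', GetTickCount - t, ' ms');

      t := GetTickCount;
      for x := 0 to N - 1 do        // column by column: jumps 8 KB per step, mostly cache misses
        for y := 0 to N - 1 do
          sum := sum + data[y * N + x];
      WriteLn('column by column: ', GetTickCount - t, ' ms');

      WriteLn(sum); // keep the compiler from optimizing the loops away
    end.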

  9. #9
    Optimizing for the CPU cache is not a trivial task. You'll need to learn a lot about how it works to utilize it at anywhere near 100%.
    But for a start you can try to fit all the data structures in use at a given moment into the L1 cache.
    For example, if all images are divided into blocks of 64x64 pixels (i.e. array[0..64*64-1] of Word), you can compare two such blocks and both will fit in L1.
    The code will become more complex, so it is reasonable to do some simple testing first to find out what performance boost can be achieved, e.g. compare one 64x64 image to several 64x64 images.
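
    For example (just a sketch, names made up): two 64x64 blocks of Word take 2 * 8 KB = 16 KB, which fits comfortably in a typical 32 KB L1 data cache, so a comparison like this runs entirely from L1 once the blocks have been loaded:
    Code:
    type
      TBlock = array[0..64*64-1] of Word; // one 64x64 tile, 8 KB

    // sum of absolute differences between two tiles; both stay cache-resident
    function CompareBlocks(const a, b: TBlock): LongWord;
    var
      i: Integer;
    begin
      Result := 0;
      for i := 0 to High(a) do
        Result := Result + Abs(a[i] - b[i]);
    end;
    The test would then simply be to time comparing one such block against many others and see how it scales.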

  10. #10
    Quote Originally Posted by Mirage View Post
    For example, if all images are divided into blocks of 64x64 pixels (i.e. array[0..64*64-1] of Word), you can compare two such blocks and both will fit in L1.
    The code will become more complex, so it is reasonable to do some simple testing first to find out what performance boost can be achieved, e.g. compare one 64x64 image to several 64x64 images.
    The operation being performed is a simple sum and difference. Cache optimization would be useful if very complex operations were performed on the data set. In this case, due to the triviality of the operation and the amount of data to be processed, the cache is useless.

    In fact, you could build dedicated hardware that does the necessary calculations and accesses RAM through DMA, without needing the CPU at all. So memory bandwidth should be the main concern here. This is why I suggested going for GPGPU, as high-end video cards have a vastly superior memory interface.

    I would, however, support the other suggestion: optimize the actual approach instead of trying to optimize the inner loop. That is, make the necessary optimizations to the technique itself, so that more work can be put on the actual CPU with less stress on bandwidth.

    For one, I would suggest researching more advanced techniques that could somehow reduce the problem set instead of performing a brute-force pixel comparison. Not to mention that the RGB color space itself is inaccurate and inadequate for any task where visual/optical quality is concerned; CIELAB, CIELUV, DIN99, CIECAM or even a mix of your own would be better suited for that purpose.
