View Full Version : [Delphi 7] code optimisation, comparing 2D matrices of words [SSE?]



Emil
20-08-2012, 08:23 PM
Hi All,

This is my first post here, so please bear with me. I'm developing a project which compares many portions (say 200 regions of 100x100 pixels each) of huge numbers (> 3000) of medium-resolution (say 640x480) 16-bit images to portions within a single specific 16-bit reference image. I was wondering if it was possible to speed things up a little bit. The following piece of code is used to compare one portion of an image to another, and appears to be the biggest bottleneck in my code.



// globally known variables are:
// referencePixels (2d array of words, contains the reference image)
// currentPixels (2d array of words, contains the current image)
// avgRef (double, mean intensity value of the reference image within a specific rectangle; this value is always > 0)
// rect (the specific rectangle within which images are compared)
// xo, yo: offset from a rectangle in the current image (GetImgDifference is usually called for xo and yo from -4 to +4)


function GetImgDifference( rect : TRect; xo, yo : integer): single;
var
  x, y : integer;
  avgCur : double;
  intensityCurLW, multLW : LongWord;
  c : integer;
  dLW : LongWord;
begin
  // calculate the mean intensity value of a rectangle within an image
  intensityCurLW := 0;
  for x := rect.Left to rect.Right do
    for y := rect.Top to rect.Bottom do
      intensityCurLW := intensityCurLW + (currentPixels[y + yo, x + xo] shr 6);

  // the size of the rectangle
  c := ((rect.Right - rect.Left) + 1) * ((rect.Bottom - rect.Top) + 1);

  // if the current intensity is larger than 0
  if (intensityCurLW > 0) then begin
    // store average intensity of current image in avgCur
    avgCur := intensityCurLW / c;

    // calculate factor to correct for mean difference in current and reference image
    // let's store this in a long word (accurate enough, and fast)
    multLW := Round(16384 * avgRef / avgCur); // 2^14

    // calculate the absolute difference between the reference image and the current image
    dLW := 0;
    for x := rect.Left to rect.Right do
      for y := rect.Top to rect.Bottom do
        dLW := dLW + abs( ((currentPixels[y + yo, x + xo] * multLW) shr 14) - referencePixels[y, x] ) shr 6;

    // the result will be the average pixel difference
    result := dLW / c;
  end else begin
    result := PrettyLargeSingle; // return a huge single value when the current image had 0 intensity
  end;
end;
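To make the 2^14 fixed-point trick concrete, here is a worked example with made-up values (not taken from the thread):

```pascal
// avgRef = 120.0, avgCur = 100.0 (hypothetical values)
//   multLW = Round(16384 * 120.0 / 100.0) = Round(19660.8) = 19661
// For a pixel value of 500:
//   (500 * 19661) shr 14 = 9830500 div 16384 = 600   (= 500 * 1.2)
// So the multiply-and-shift reproduces the floating-point ratio using only
// integer operations in the inner loop, at a cost of at most one unit of
// rounding error per pixel.
```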




Imagine this function being called about 40 million times during runtime (4000 frames, 200 rectangles, offsets from -3 to +3 in both directions = 49 positions around each rectangle). I'm afraid I can't really go from 16-bit images back to 8-bit images, and I don't see how I can improve on this code any more. Would it be possible to speed things up a bit using, for example, SSE1/2 instructions, and if so, how?

If you have any questions, please do let me know!

Kind regards,

Emil

LP
20-08-2012, 08:39 PM
You could take advantage of SIMD instructions to speed up the calculations, but you will have to write your code in assembly, which is not a trivial task.

You would immediately benefit from speed improvements by compiling the above code for a 64-bit platform (which would require Delphi XE 2 or FreePascal), as there are more registers that the compiler can use to optimize the generated code.

I would suggest instead trying a multi-threaded approach first, taking advantage of the CPU's multiple cores.

Further optimizations would involve solving the memory-bandwidth bottleneck, where you would do better by using GPGPU techniques with tools like CUDA.

Emil
20-08-2012, 09:03 PM
Thanks for the quick answer. I forgot to mention that the code is already multi-threaded (and scales very well; people use it on 8-core machines and it can use up all the processing power. In a good way, that is: it will be about 8 times as fast ;) ). I was thinking about using GPGPU, but I think it would be too difficult for me at the moment. I realize, though, that this problem seems to be very well suited to the high number of ALUs available on modern-day graphics cards.

Some time ago I briefly played around with Delphi XE2, and after a lot of struggling I compiled a 64-bit executable. But I wasn't able to speed things up; instead, everything slowed down by about a factor of two. Perhaps there were some stupid things I overlooked. I will try again, but to be honest, I actually kind of like developing in Delphi 7.

That's why I was hoping it was possible to use SIMD instructions. Perhaps someone is willing to spend some time with me on this problem? I'm always willing to learn...

Mirage
25-08-2012, 12:17 PM
Imagine this function being called about 40 million times during runtime (4000 frames, 200 rectangles, offsets from -3 to +3 in both directions = 49 positions around each rectangle). I'm afraid I can't really go from 16-bit images back to 8-bit images, and I don't see how I can improve on this code any more.


I can suggest several ways to improve performance, but if you tell us more details on how the function result is used, we'll probably be able to suggest more options:

1. Use more cache-friendly data structures. For example, store and handle images as blocks of 32x32 or 64x64 pixels. It may greatly improve performance. The optimal block size depends on the CPU cache size and can be determined empirically.
2. Use one-dimensional arrays and pointer arithmetic. The following code


var
  ImgPtr: PWord;
begin
  for y := rect.Top to rect.Bottom do begin
    // assuming currentPixels is a flat buffer with ImageLineSize words per row
    ImgPtr := @currentPixels[(y + yo) * ImageLineSize + rect.Left + xo];
    for x := rect.Left to rect.Right do begin
      intensityCurLW := intensityCurLW + (ImgPtr^ shr 6);
      Inc(ImgPtr);
    end;
  end;
end;

contains fewer operations in the inner loop and is also more cache-friendly than this:


for x := rect.Left to rect.Right do
  for y := rect.Top to rect.Bottom do
    intensityCurLW := intensityCurLW + (currentPixels[y + yo, x + xo] shr 6);



Would it be possible to speed things up a bit using, for example, SSE1/2 instructions, and if so, how?

It depends where the bottleneck is. If it is in memory bandwidth, SIMD may not make a big difference.
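For what it's worth, here is a rough, untested sketch of what the first summation loop might look like with SSE2 in Delphi's built-in assembler. The function name and the assumption that Count is a positive multiple of 8 are made up here, and Delphi 7's assembler may not accept every SSE2 mnemonic; a newer compiler or hand-encoded opcodes might be needed:

```pascal
// Hypothetical sketch: sum (pixel shr 6) over Count 16-bit pixels with SSE2.
// Assumes Count > 0 and Count mod 8 = 0; real code would need a scalar tail
// loop for the remaining pixels.
function SumShr6Words(P: PWord; Count: Integer): LongWord;
asm
  // Delphi register convention: P in EAX, Count in EDX, result in EAX
  PXOR      XMM0, XMM0        // four 32-bit partial sums
  PXOR      XMM7, XMM7        // constant zero, used for widening
@loop:
  MOVDQU    XMM1, [EAX]       // load 8 pixels (unaligned load)
  PSRLW     XMM1, 6           // the 'shr 6' from the original loop
  MOVDQA    XMM2, XMM1
  PUNPCKLWD XMM1, XMM7        // widen the low 4 words to dwords
  PUNPCKHWD XMM2, XMM7        // widen the high 4 words to dwords
  PADDD     XMM0, XMM1
  PADDD     XMM0, XMM2
  ADD       EAX, 16
  SUB       EDX, 8
  JNZ       @loop
  // horizontal sum of the four dwords in XMM0
  PSHUFD    XMM1, XMM0, $4E   // swap the two 64-bit halves
  PADDD     XMM0, XMM1
  PSHUFD    XMM1, XMM0, $B1   // swap dwords within each half
  PADDD     XMM0, XMM1
  MOVD      EAX, XMM0
end;
```

Even then it is worth timing this against the pointer-arithmetic version first: if the loop is memory-bound, the SIMD version will spend most of its time waiting on RAM just like the scalar one.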

Emil
26-08-2012, 01:33 PM
Interesting suggestion, thank you!

Some more background. GetImgDifference is used to find the best way to align two sections within two different images. The content of both images is similar, but not exactly the same.

So for each of >100 rectangles, and for each of >1000 different images, I search in a grid(*) around the center of the rectangle to find the offset in X and Y that gives the smallest difference between the rectangle in the current image and the one in the reference image. This particular function returns the result for one of those offsets.

* The grid has a maximum size of about 9x9 pixels (the center being an offset of 0,0; so offsets from -4 to +4 in both directions are searched). But I don't actually search this grid exhaustively; often I only need to search the center 5x5 area to say with high confidence that I have found the minimal pixel difference between two image sections.

If something is not clear, please let me know.

SilverWarior
26-08-2012, 03:59 PM
From what you have written so far, I see that you are doing lucky imaging: http://en.wikipedia.org/wiki/Lucky_imaging
You divide every image into a grid of rectangles and then compare the same rectangles between different images. Right? Now I don't know what your images actually look like, but I assume it is possible for some of these rectangles to be completely black (no star in them). If so, I think you could gain a lot of performance by excluding them from processing (it is useless to compare two completely black squares).

If you ask me, I would take a completely different approach. Instead of dividing images into grids, I would rather use some object-detection algorithm which would be able to detect objects (stars) in the image.
This could be done by simply searching pixel by pixel until you reach one which has at least a certain brightness, and then using a tree-like pathfinding algorithm to find other nearby pixels which also have at least that brightness.
This will give you the positions of the different stars in your images. So in the end you will only be processing the areas around star positions (the areas which matter) and you won't be comparing the other areas which don't matter.

Emil
26-08-2012, 05:09 PM
Yes, I'm doing lucky imaging.

Where possible, I'm actually already doing what you suggested. I just wanted to sketch the problem here as simply as possible. But the program already incorporates many tricks to speed things up. The software is actually meant for lucky imaging on wider targets: planets, or even the sun or moon. See my website for some example images, http://www.astrokraai.nl, if you are interested. But it could also work on starfields. The placement of the alignment points (or rectangles in my story here) is an entirely different problem of its own. Among other things, it basically makes use of what you suggested: areas without relevant information will be ignored (but sometimes it is really difficult to tell which areas lack relevant information, and often the entire field of view is relevant information).

The problem I sketched here comes from doing lucky imaging on a larger surface, like the sun in a particular wavelength, where the target does not fit in the FOV. So every portion of the image is actually relevant. There are no black rectangles; every portion of the image contains (distorted) views of the sun. Hence the rectangles all over the place (sometimes >1000!). If the FOV contains smaller targets, like the planet Mars for example, there will be far fewer rectangles (usually < 10). But the basic problem remains the same: comparing one image to another.

To give you an idea about the size of the project at the moment: it contains well over 12,000 lines of self-written Pascal source code. I started the project back in 2009, and right now it is being used by many amateur astronomers taking pictures (video sequences, actually) of the planets, sun and moon.

Actually comparing one image to the other takes up most of the processing time, and I thought it would be very interesting to see if that could be made faster.

Emil
27-08-2012, 12:23 PM
2. Use one-dimensional arrays and pointer arithmetic.

I had no idea this would make such a big difference. At the moment I'm rewriting some of the main loops, trying to move as many calculations and comparisons as possible outside each time-consuming loop, and changing to pointer arithmetic. I like it. In the provided example I can get the code to run almost 80% faster using this simple trick. It requires some change in thinking about the code, but I think that is good, as it forces me to think about cleaner solutions (the inner loop really does not need all that arithmetic when you think about it).

Delphi requires a slightly different syntax with pointers. This actually works just fine for me:


var
  ImgPtr: PWord; // this is just a ^Word
  x, y: Integer;
begin
  for y := rect.Top to rect.Bottom do begin
    ImgPtr := @currentPixels[y + yo, rect.Left + xo];
    for x := rect.Left to rect.Right do begin
      // Do something with ImgPtr^
      Inc(ImgPtr);
    end;
  end;
end;


Basically, both for loops now do nothing but act as counters; x and y themselves are not used at all.

I'm not yet sure about your first suggestion though; could you elaborate on that? From what I understand, the CPU cache works like this: each time the CPU needs to read from RAM, it first checks whether the value is already in the CPU cache; if not, it retrieves that memory location AND, automatically, several other items following that location in memory (depending on the size of the cache line). This means that if your data is ordered [1,2,3,4] in main memory and the CPU wants to do something with the value 1, then 2, 3 and 4 will likely also be placed in the cache. That is why it is important to make sure the data you want to read is laid out in the proper order; otherwise you'll have a lot of cache misses, where the CPU is forced to read from the much slower main memory instead of the cache.

Mirage
27-08-2012, 08:20 PM
Optimizing for the CPU cache is not a trivial task. You'll need to learn a lot about how it works to utilize it at near 100%.
But for a start you can try to fit all the data structures in use at any given moment into the L1 cache.
For example, if all images are divided into 64x64-pixel blocks (i.e. array[0..64*64-1] of Word), you can compare two such blocks and both will be in L1.
The code will become more complex, so it's reasonable to do some simple testing first to find out what performance boost can be achieved, e.g. compare one 64x64 image to several 64x64 images.
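A minimal sketch of that idea (the type and function names here are made up for illustration): one 64x64 tile of Word pixels is 8 KB, so two tiles together occupy 16 KB and fit comfortably in a typical 32 KB L1 data cache:

```pascal
type
  TTile = array[0..64 * 64 - 1] of Word; // one 64x64 tile, 8 KB, contiguous

// Sum of absolute differences between two tiles, scanned linearly;
// both tiles stay resident in L1 for the whole comparison.
function TileSAD(const A, B: TTile): LongWord;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(A) do
    Result := Result + LongWord(Abs(Integer(A[i]) - Integer(B[i])));
end;
```

The outer code would then walk tile by tile instead of pixel row by pixel row, so each tile is fetched from RAM once and reused from cache.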

LP
27-08-2012, 09:39 PM
For example, if all images are divided into 64x64-pixel blocks (i.e. array[0..64*64-1] of Word), you can compare two such blocks and both will be in L1.
The code will become more complex, so it's reasonable to do some simple testing first to find out what performance boost can be achieved, e.g. compare one 64x64 image to several 64x64 images.

The operation being performed is a simple sum and difference. Cache optimization would be useful if very complex operations were performed on the data set. In this case, due to the triviality of the operation and the amount of data to be processed, the cache is useless.

In fact, you could build dedicated hardware that does the necessary calculations and accesses RAM through DMA (http://en.wikipedia.org/wiki/Direct_memory_access) without the need for a CPU. Therefore, memory bandwidth should be the main concern here. This is why I suggested going for GPGPU, as high-end video cards have a vastly superior memory interface.

I would, however, support your other suggestion to optimize the actual approach instead of trying to optimize the inner loop. That is, make the necessary optimizations to the technique itself, so that more work can be put on the actual CPU with less stress on bandwidth.

For one, I would suggest researching more advanced techniques that could somehow reduce the problem set instead of performing brute-force pixel comparison. Not to mention that the RGB color space itself is inaccurate and inadequate for any task where visual/optical quality is concerned. CIELAB (http://en.wikipedia.org/wiki/CIELAB), CIELUV (http://en.wikipedia.org/wiki/CIELUV), DIN99 (http://www.engl.dfwg.de/doc/dfwg%20homepage%20engl-499.htm), CIECAM (http://en.wikipedia.org/wiki/CIECAM02) and even a mix of our own (http://adsabs.harvard.edu/abs/2011SPIE.8011E.316K) would be better suited for that purpose.

Emil
27-08-2012, 10:28 PM
Thanks again for the suggestions.


... make the necessary optimizations for the technique itself so more work can be put on actual CPU with less stress on bandwidth...
I am doing this; it is actually what I have been doing over the past two years. Not full-time, but still, I do consider myself a bit of an expert on these lucky-imaging techniques. I have been doing astrophotography since 2006, and I have a fairly good understanding of how all of this works. The underlying algorithms are already optimized. There is a lot of specialised code surrounding the few lines I posted here that makes sure the brute-force pixel comparisons are not done unnecessarily. But one way or another, at some point you have to start looking at the actual image data.

I'm not trying to say I know everything there is to know (that is one of the reasons I'm here, actually: I want to learn new things), but I do know a lot more than nothing about lucky imaging and the relevant image-processing techniques. Just not at this low level.

I'm never actually working in RGB color space to begin with; all the actual time-consuming processing is done on grayscale images. Most input actually is 16- or 8-bit grayscale images. I'm familiar with some of the color spaces you mention, but they don't apply to this project.

To give you an idea of the kinds of processing speeds I get right now for this case on a pretty old system (dual core, 4 GB RAM):
- Processing a 600 MB file containing 8-bit grayscale data of a white-light image of the sun, with 1638 frames of 640x480. The entire FOV is full.
- Using 910 alignment areas with a size of 25x25 pixels
- Stacking only the best portions out of 200 frames

1 thread / 2 threads
9.4 / 4.9 seconds to align the data (to compensate for global image movements).
14.7 / 7.7 seconds to buffer the data to memory and calculate the quality of the sections of the frames
3.7 / 2.2 seconds to calculate a reference frame
16.8 / 8.8 seconds to align all of the 910 alignment points over the best 200 frames.
1.7 / 1.4 seconds to stack the best 200 frames for each AP given the previously retrieved alignment information.

This is what a single raw frame in that recording looks like. (You can see many distortions in this image, but believe it or not, this is actually a pretty good frame. My software won't try to align every single pixel in this image; some portions are clearly not detailed enough and will simply be ignored. If you look carefully, you'll see that certain parts are actually kind of sharp. But even those are still warped a little bit.)
http://www.astrokraai.nl/dump/RawFrame.jpg
and this is the resulting image:
http://www.astrokraai.nl/viewimages.php?id=201&cd=11

GPGPU would be very interesting, but at the moment it is way too complex for me. It will be the future though, so I'll try to learn as much about it as I can.

LP
28-08-2012, 01:52 AM
By the way, I've been thinking... You could also try an unorthodox approach using simple GPU tricks.

Taking this code:


for x := rect.Left to rect.Right do
  for y := rect.Top to rect.Bottom do
    dLW := dLW + abs( ((currentPixels[y + yo, x + xo] * multLW) shr 14) - referencePixels[y, x] ) shr 6;


Use the following approach:
1) Load image_1 and image_2 as A16B16G16R16.
2) Draw image_1 on an A16B16G16R16 render target.
3) Draw image_2 on the same render target using the subtract blending operation.
4) Take the render target and generate the full set of mipmaps, down to 1x1, on the GPU.
5) The smallest mipmap (1x1, one pixel) then contains the resulting average difference, in 16-bit.

You could even improve the resolution to 32-bit, but generating the mipmaps will be more complicated: you will have to make them sequentially, by drawing the image at 50% size on a render target and repeating the process until the last render target is 1x1 pixel. This is most likely how the GPU does it anyway, though.

Using the above approach you take advantage of the GPU's full parallel processing power and memory bandwidth, but without using complex GPGPU techniques.

P.S. Using shaders, calculating the differences and generating the mipmaps can even be combined into a single step to reduce the number of iterations (e.g. reduce the image by 4x while calculating the differences for the reduced segment).
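The mipmap chain in steps 4 and 5 is just repeated 2x2 averaging. As a CPU-side illustration of the same idea (a hypothetical sketch, assuming a square power-of-two image stored row by row in a flat array):

```pascal
// Repeatedly average 2x2 neighbourhoods in place until one value remains;
// the final value is (up to integer rounding) the mean of the whole image,
// which is exactly what the 1x1 mipmap level holds.
function ReduceToMean(var Img: array of LongWord; Size: Integer): LongWord;
var
  x, y, Half: Integer;
begin
  while Size > 1 do begin
    Half := Size div 2;
    for y := 0 to Half - 1 do
      for x := 0 to Half - 1 do
        // read a 2x2 block at the old stride, write compacted at the new one
        Img[y * Half + x] :=
          (Img[(2 * y) * Size + 2 * x]     + Img[(2 * y) * Size + 2 * x + 1] +
           Img[(2 * y + 1) * Size + 2 * x] + Img[(2 * y + 1) * Size + 2 * x + 1]) div 4;
    Size := Half;
  end;
  Result := Img[0];
end;
```

On the GPU each of these halving passes is one draw call (or free, via automatic mipmap generation), which is where the speed comes from.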

Emil
28-08-2012, 02:48 PM
Interesting. I'm all new to GPU graphics programming, so I'll try to read some more about it.
(Any suggestions for a good tutorial on how to implement something like this?)