SSE2 tutorials [Archive] - Pascal Game Development

marcov

25-08-2008, 10:03 AM

I don't know if this belongs here, but my current solution to the problem below is too slow, and I want to speed it up using SSE2 (or SSE3; the machines are recent Core 2s).

So now I'm looking for MMX/SSE/SSE2 tutorials with commented examples using assembler (not Intel intrinsics). Most of what I've been able to find is heaps of sites that list the instructions, and a very few sites that use Intel intrinsics.

Note that suggestions to do this on the GPU are not constructive. Transferring each image to and from the GPU is more expensive than even the straight calculation using floating point, even before I introduce SSE. (8600GT, 20ms to card, probably the same to get it back)

-----------------
For the interested people, here is my problem: (from image processing, but getting decent FPS is a problem there too :-)

- I've a 28 Mpixel image (4096x7000), but 8bpp gray.
- I need to correct each pixel for lighting circumstances, which means multiplying with a float.
- However, the lighting circumstances are not an image but a vector that maps to the row (the correction is the same for each row).

So I have
- 7000 vectors of 4096 bytes.
- And I need to multiply each vector with another 4096-item vector with corrections.

The correction is currently an array[0..4095] of single, but it could be an unsigned 16-bit fixed point value.

So for each byte one would do something like

var
  b: byte;
  w, w2: word;

for x := 0 to 4095 do
begin
  b := imagebyte[x];
  w := correctionword[x];
  w2 := saturate16(b * w); // saturated 16bit.
  imagebyte[x] := saturate8(w2 shr fixedpointbits); // scale down and fit into 8 bits again using saturation
end;
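
For the 16-bit fixed point option mentioned above, converting the existing array of single could look roughly like this. It is only a sketch: SHIFTBITS = 9 (the number of fractional bits) and the helper name ToFixed are assumptions made for illustration, not something from the post.

```pascal
program FixedPointCorr;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifndef FPC}{$apptype console}{$endif}

const
  SHIFTBITS = 9;  // assumed number of fractional bits

// Convert a floating-point gain to unsigned 16-bit fixed point.
// Negative gains become 0, gains too large for a word saturate at 65535.
function ToFixed(gain: Single): Word;
const
  MaxGain = 65535 / (1 shl SHIFTBITS);
begin
  if gain <= 0 then
    Result := 0
  else if gain >= MaxGain then
    Result := 65535
  else
    Result := Word(Round(gain * (1 shl SHIFTBITS)));
end;

var
  pixel: Byte;
  corrected: LongWord;
begin
  WriteLn(ToFixed(1.0));    // 512
  WriteLn(ToFixed(1.5));    // 768
  WriteLn(ToFixed(200.0));  // 65535 (saturated)

  // Applying a gain of 1.5 to a pixel value of 100
  pixel := 100;
  corrected := (LongWord(pixel) * ToFixed(1.5)) shr SHIFTBITS;
  WriteLn(corrected);       // 150
end.
```

With 9 fractional bits the largest representable gain is just under 128, which is a trade-off between range and precision that depends on the actual lighting data.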

imcold

25-08-2008, 08:31 PM

In reverse order:
You want something like this: correction = unsigned 16-bit value, calculation =
saturate8( src[x] * corr[x] shr SHIFTBITS ) ? I would rather avoid the int -> floating point -> int conversion, and the intermediate saturation is unnecessary imho - I assume that w2 can be bigger than a word, because it's saturated to 8 bits later anyway, so it shouldn't matter.
The width must be divisible by 8 for the SSE (SSE2 actually) version.

const
  SHIFTBITS = 9; //just an example

procedure correction_pas(src, dest: pbyte; width, height: integer; corr: pword);

  function clip(i: longword): byte;
  begin
    if i > 255 then result := 255 else result := i;
  end;

  function clip16(i: longword): word;
  begin
    if i > 65535 then result := 65535 else result := i;
  end;

var
  x, y: integer;
begin
  for y := 0 to height - 1 do
    for x := 0 to width - 1 do
      // dest[y * width + x] := clip(clip16( src[y * width + x] * corr[x] ) shr SHIFTBITS);
      dest[y * width + x] := clip( src[y * width + x] * corr[x] shr SHIFTBITS ); //imho this is more desirable
end;
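
A quick check suggests that the two variants in the loop above (with and without clip16) give different results once SHIFTBITS is 8 or more: after clamping to 65535, the shifted value can never reach 255 again. A minimal sketch, with the clip helpers copied under the hypothetical names Clip8 and Clip16 and assuming SHIFTBITS = 9:

```pascal
program SaturationCheck;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifndef FPC}{$apptype console}{$endif}

const
  SHIFTBITS = 9;  // same example value as in the post

function Clip8(i: LongWord): Byte;
begin
  if i > 255 then Result := 255 else Result := Byte(i);
end;

function Clip16(i: LongWord): Word;
begin
  if i > 65535 then Result := 65535 else Result := Word(i);
end;

var
  pixel: Byte;
  corr: Word;
  product: LongWord;
begin
  pixel := 255;
  corr := 1024;                        // gain 2.0 with 9 fractional bits
  product := LongWord(pixel) * corr;   // 261120, does not fit in a word

  // Intermediate saturation: 65535 shr 9 = 127, so a bright pixel gets darker
  WriteLn('with clip16:    ', Clip8(Clip16(product) shr SHIFTBITS));  // 127

  // No intermediate saturation: 261120 shr 9 = 510, clamped to 255
  WriteLn('without clip16: ', Clip8(product shr SHIFTBITS));          // 255
end.
```

So dropping the intermediate saturation is not only cheaper, it also seems to be the variant that behaves as expected for gains above 1.0.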

//unpack, no intermed. saturation
procedure correction_sse(src, dest: pbyte; width, height: integer; corr: pword);
var
  i: integer;
begin
  for i := 0 to height - 1 do begin
    asm
      mov eax, src
      mov edx, dest
      mov ebx, corr
      mov ecx, width
      shr ecx, 3 // width / 8, since we work on 8 pixels at once
      pxor xmm7, xmm7 // 0

      @loop_x:
      movq xmm0, [eax] // load 8 bytes = pixels
      punpcklbw xmm0, xmm7 // unpack to 8 words
      movdqa xmm1, xmm0 // duplicate

      movdqu xmm6, [ebx] // load 8 words = light/correction
      pmullw xmm0, xmm6 // multiply 8 words to 8 low word results
      pmulhuw xmm1, xmm6 // unsigned multiply to 8 high word results (signed pmulhw breaks for corr >= 32768)
      movdqa xmm2, xmm0 // duplicate
      movdqa xmm3, xmm1

      punpcklwd xmm0, xmm1 // merge 8 low + 8 high words into 8 doublewords (2x4)
      punpckhwd xmm2, xmm3

      psrld xmm0, SHIFTBITS // right shift 4 dw
      psrld xmm2, SHIFTBITS // same

      packssdw xmm0, xmm2 // 8 dwords to 8 words (signed, but we don't have to care)
      packuswb xmm0, xmm7 // pack 8 words to 8 bytes

      movq [edx], xmm0 // store 8 bytes = pixels
      add eax, 8
      add edx, 8
      add ebx, 16

      dec ecx
      jnz @loop_x
    end['eax', 'ebx', 'ecx', 'edx'];
    src += width;
    dest += width;
  end;
end;
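
To follow the data flow of one of the eight lanes without a debugger, the instruction sequence can be modelled in plain Pascal. This is a sketch, not the code from the post: the names Reference and SimdLane are made up, SHIFTBITS = 9 is assumed, and the high-word multiply is modelled as unsigned (which is what pmulhuw computes; the signed pmulhw only agrees with it for correction values below 32768).

```pascal
program LaneModel;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifndef FPC}{$apptype console}{$endif}

const
  SHIFTBITS = 9;

// Scalar reference: multiply, shift, clamp to 0..255
function Reference(pixel: Byte; corr: Word): Byte;
var
  p: LongWord;
begin
  p := (LongWord(pixel) * corr) shr SHIFTBITS;
  if p > 255 then Result := 255 else Result := Byte(p);
end;

// One lane of the SIMD sequence, step by step
function SimdLane(pixel: Byte; corr: Word): Byte;
var
  product, lo, hi, dw: LongWord;
  w: LongInt;
begin
  product := LongWord(pixel) * corr;
  lo := product and $FFFF;   // pmullw: low 16 bits of the product
  hi := product shr 16;      // unsigned high multiply: high 16 bits
  dw := (hi shl 16) or lo;   // punpcklwd/punpckhwd: rebuild the 32-bit product
  dw := dw shr SHIFTBITS;    // psrld: logical shift, result is never negative
  // packssdw: signed saturation of a dword to a word (only the upper bound matters here)
  if dw > 32767 then w := 32767 else w := LongInt(dw);
  // packuswb: saturate a signed word to an unsigned byte
  if w > 255 then Result := 255
  else if w < 0 then Result := 0
  else Result := Byte(w);
end;

var
  c, p, mismatches: Integer;
begin
  mismatches := 0;
  for c := 0 to 65535 do
    for p := 0 to 255 do
      if SimdLane(Byte(p), Word(c)) <> Reference(Byte(p), Word(c)) then
        Inc(mismatches);
  WriteLn('mismatches: ', mismatches);  // 0
end.
```

The exhaustive loop covers every pixel/correction pair, which is why the signed pack in the middle really does not need special care: the shifted product is always a small non-negative number.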

For tutorials: the Intel developer manuals should be helpful, such as the "Intel 64 and IA-32 Architectures Optimization Reference Manual", or try this webpage: http://webster.cs.ucr.edu/AoA/Windows/HTML/TheMMXInstructionSeta2.html#1004358 . Drawing the computations and data flow on a sheet of paper is often very useful too. I learned most things from experimenting. Documents listing the MMX/SSE instructions and their descriptions are very handy as well, for example "AMD64 Architecture Programmer's Manual Volume 4: 128-Bit Media Instructions" or the nasm 0.99.x manual with its x86 instruction listing (up to SSE2).

chronozphere

25-08-2008, 08:38 PM

src += width;
dest += width;

Huh.... Is += really a valid Operator in Delphi?? :?
If so, since which delphi version..

imcold

25-08-2008, 08:48 PM

It is in Free Pascal. I don't care much about Delphi, and I don't really remember how it handles pointer arithmetic either. But the outer loop should be rewritten in asm too, to avoid the useless repeated loading of pointers into registers, so the pointer arithmetic can be avoided completely.
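
As far as I know Delphi has no += operator, but Inc on a typed pointer should do the same job in both compilers, stepping by the size of the element type. A small sketch with a made-up 3x4 "image" (the type name PPixel is hypothetical and only there so the example does not depend on where PByte is declared):

```pascal
program PointerStep;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifndef FPC}{$apptype console}{$endif}

type
  PPixel = ^Byte;  // hypothetical name, equivalent to PByte

var
  image: array[0..11] of Byte;  // 3 rows of 4 pixels
  row: PPixel;
  y, i, width: Integer;
begin
  width := 4;
  for i := 0 to High(image) do
    image[i] := i;

  row := @image[0];
  for y := 0 to 2 do
  begin
    WriteLn('row ', y, ' starts with ', row^);  // 0, then 4, then 8
    Inc(row, width);  // same effect as "row += width" in Free Pascal
  end;
end.
```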

marcov

26-08-2008, 07:07 AM

Thanks a lot really. I'll study it in the coming week. (I expected some urls to start reading, not the ready code :-).

Thanks again.

P.S. While this is for Delphi, I have some FPC experience (:-)), so that is no problem for me. Actually, I'll probably be testing it in FPC anyway.

imcold

26-08-2008, 04:02 PM

No problem, working with MMX/SSE stuff is (usually) fun ;) I don't know if you will find any useful, commented SIMD examples, so this should serve as one. Feel free to ask any questions. The code can still be made a bit faster, so there's still some work left on it, too.

Oh, and I believe you have a *lot* of fpc experience ;) Inline asm is cool, you can output the register contents to console very easily, so it's easy to follow the operations on data.

marcov

29-10-2008, 09:49 AM

A follow up:

Due to busy work (and the relevant projects being postponed a few months by the clients), I only got to real testing today.

The code crashed at first, but that was because the Pascal code uses register EBX for the loop counter, and it is not saved. For now I quickly added a push/pop around it, but I will do the outer loop in asm in the near future too.

I haven't really validated the data yet (whether the image is processed correctly), since I don't have images to test with, but the speed is very promising: exactly 10 times faster!

So thanks again

imcold

03-11-2008, 07:48 PM

Ebx should be saved and restored by the compiler if it's listed in the register list after an asm block - Ref. guide, 10.3 Assembler statements. If it's not, it's a bug in FPC, I believe.
Two tips for some extra speed:
- align src/dest/corr to addresses that are multiples of 16 (and replace movdqu with movdqa, it should help mainly on Intel CPUs)
- try to prefetch the next row (or couple of rows) of pixels (it helps in some cases, sometimes it doesn't).
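
One common way to get a 16-byte aligned buffer without touching the heap manager is to over-allocate and round the pointer up. The sketch below assumes that approach is acceptable here; the function name GetMemAligned16 is made up, and on Delphi the PtrUInt alias assumes a 32-bit target.

```pascal
program AlignedAlloc;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifndef FPC}{$apptype console}{$endif}

{$ifndef FPC}
type
  PtrUInt = Cardinal;  // assumption: 32-bit Delphi, where a pointer fits in a Cardinal
{$endif}

// Allocate at least "size" usable bytes starting at a multiple of 16.
// "raw" receives the pointer that must later be passed to FreeMem.
function GetMemAligned16(size: Integer; out raw: Pointer): Pointer;
begin
  GetMem(raw, size + 15);  // 15 spare bytes are enough to reach the next boundary
  Result := Pointer((PtrUInt(raw) + 15) and not PtrUInt(15));
end;

var
  raw, corr: Pointer;
begin
  // Room for 4096 correction words, as in the problem description
  corr := GetMemAligned16(4096 * SizeOf(Word), raw);
  WriteLn('aligned to 16: ', (PtrUInt(corr) and 15) = 0);  // TRUE
  FreeMem(raw);  // free the original pointer, not the aligned one
end.
```

Only the 16-byte loads (the correction words) would benefit from this; the 8-byte movq loads of src and dest have no such alignment requirement.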

marcov

04-11-2008, 10:42 AM

src and dest are loaded in 8-byte quantities, and are aligned to 4 bytes on D7 and to 8 bytes on D2006 (FastMM aligns to 8 bytes). I can't change that easily, except by replacing the heap manager, and if you load 8-byte values you won't stay 16-byte aligned for long anyway. I use D2006 for the speed-dependent projects for this reason (and also for its SSE3 support for LDDQU).

I aligned the corr array to 32 bytes and have an ifdef to load it using lddqu, but saw no improvement. Probably hitting memory bandwidth limits.

I doubt prefetch will do much, since I simply walk through a 4MB memory block from 0 to 4MB-1. If the predictor in the CPU can't predict that, there is no point in having prefetch in the first place ;_)

I'm currently trying this in Delphi btw, so no registerlist. BTW: afaik registerlist works for blocks, but has no effect on assembler procedures.

noeska

04-11-2008, 06:09 PM

Some links on optimizing code generated with delphi:

http://fastcode.sourceforge.net/

:!: http://www.yks.ne.jp/~hori/MMXasm-e.html

:!: http://www.tommesani.com/Features.html

www.optimalcode.com //hmm this one does not exist anymore ...

imcold

05-11-2008, 06:17 PM

Ah, I thought it crashed when you tried the code in FPC. Lddqu is useful only for the P4 Prescott, where it solves the cacheline split issue (some interesting reading about this: http://x264dev.multimedia.cx/?p=8), and if you're sure the data is aligned, it doesn't matter anyway.
Btw. doesn't Delphi have an Align() function, or is this an FPC-only feature?

marcov

19-11-2008, 09:49 AM

Afaik that is FPC only till now. Delphi has no other archs than 32-bit x86 till now.