
Thread: SSE2 tutorials

  1. #1

    SSE2 tutorials

    I don't know if this belongs here, but my current solution to the problem below is too slow, and I want to speed it up using SSE2 (or SSE3; the machines are recent Core 2s).

    So I'm looking for MMX/SSE/SSE2 tutorials with commented examples in assembler (not Intel intrinsics). Most of what I've been able to find is heaps of sites that only list the instructions, and a very few sites that use Intel intrinsics.

    Note that suggestions to do this on the GPU are not constructive. Transferring each image to and from the GPU is more expensive than even the straight floating-point calculation, even before I introduce SSE. (8600GT: 20 ms to the card, probably the same to get it back.)

    -----------------
    For those interested, here is my problem (from image processing, where getting decent FPS is a problem too :-)):

    - I have a 28 Mpixel image (4096x7000), 8 bpp gray.
    - I need to correct each pixel for lighting circumstances, which means multiplying it by a float.
    - However, the lighting correction is not an image but a vector that is applied to every row (the correction is the same for each row).

    So I have
    - 7000 vectors of 4096 bytes.
    - And I need to multiply each vector element-wise by another 4096-item vector of corrections.

    The correction is currently an array[0..4095] of single, but it could be an unsigned 16-bit fixed-point value.

    So for each byte one would do something like

    Code:
    var b : byte;
        w, w2 : word;

    for x := 0 to 4095 do
    begin
      b := imagebyte[x];
      w := correctionword[x];
      w2 := saturate16(b * w); // saturated 16bit.
      imagebyte[x] := saturate8(w2 shr fixedpointbits);  // scale down and fit into 8 bits, again using saturation
    end;
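
    A minimal sketch of how the float corrections could be converted to the unsigned 16-bit fixed-point form mentioned above. The number of fractional bits (9) and the names FIXED_BITS, ToFixed and CorrectPixel are assumptions made up for this illustration, not part of the original code:

    ```pascal
    program fixedpoint_demo;
    {$mode objfpc}

    const
      FIXED_BITS = 9; // assumed number of fractional bits; factor 1.0 = 512

    // Convert a float correction factor to unsigned 16-bit fixed point,
    // clamping to the representable range 0 .. 65535/512 (just under 128).
    function ToFixed(f: single): word;
    var
      v: longint;
    begin
      v := Round(f * (1 shl FIXED_BITS));
      if v < 0 then v := 0;
      if v > 65535 then v := 65535;
      Result := v;
    end;

    // Apply one fixed-point correction to one pixel, saturating to 8 bits.
    function CorrectPixel(b: byte; corr: word): byte;
    var
      p: longword;
    begin
      p := (longword(b) * corr) shr FIXED_BITS;
      if p > 255 then p := 255;
      Result := p;
    end;

    begin
      WriteLn(ToFixed(1.0));                    // 512
      WriteLn(CorrectPixel(100, ToFixed(1.5))); // 150
      WriteLn(CorrectPixel(200, ToFixed(2.0))); // 255 (saturated)
    end.
    ```

    With 9 fractional bits the largest representable factor is just under 128 and the step size is 1/512; fewer fractional bits trade precision for range.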

  2. #2

    SSE2 tutorials

    In reverse order:
    You want something like this: correction = unsigned 16-bit value, calculation =
    saturate8( src[x] * corr[x] shr SHIFTBITS )? I would rather avoid the int -> floating point -> int conversion, and the intermediate saturation is unnecessary imho. I assume that w2 can be bigger than a word, because it's saturated to 8 bits later anyway, so it shouldn't matter.
    The width must be divisible by 8 for the SSE (SSE2, actually) version.

    [pascal]
    const
      SHIFTBITS = 9; // just an example

    procedure correction_pas(src, dest: pbyte; width, height: integer; corr: pword);

      function clip(i: longword): byte;
      begin
        if i > 255 then result := 255 else result := i;
      end;

      function clip16(i: longword): word;
      begin
        if i > 65535 then result := 65535 else result := i;
      end;

    var
      x, y: integer;
    begin
      for y := 0 to height - 1 do
        for x := 0 to width - 1 do
          // dest[y * width + x] := clip(clip16( src[y * width + x] * corr[x] ) shr SHIFTBITS);
          dest[y * width + x] := clip( src[y * width + x] * corr[x] shr SHIFTBITS ); // imho this is more desirable
    end;


    // unpack, no intermediate saturation
    procedure correction_sse(src, dest: pbyte; width, height: integer; corr: pword);
    var
      i: integer;
    begin
      for i := 0 to height - 1 do begin
        asm
          mov eax, src
          mov edx, dest
          mov ebx, corr
          mov ecx, width
          shr ecx, 3 // width / 8, since we work on 8 pixels at once
          pxor xmm7, xmm7 // 0

        @loop_x:
          movq xmm0, [eax] // load 8 bytes = pixels
          punpcklbw xmm0, xmm7 // unpack to 8 words
          movdqa xmm1, xmm0 // duplicate

          movdqu xmm6, [ebx] // load 8 words = light/correction
          pmullw xmm0, xmm6 // multiply 8 words to 8 low word results
          pmulhuw xmm1, xmm6 // multiply 8 words to 8 high word results (unsigned: pmulhw would treat corr >= 32768 as negative)
          movdqa xmm2, xmm0 // duplicate
          movdqa xmm3, xmm1

          punpcklwd xmm0, xmm1 // merge 8 low + 8 high words into 8 doublewords (2x4)
          punpckhwd xmm2, xmm3

          psrld xmm0, SHIFTBITS // right shift 4 dw
          psrld xmm2, SHIFTBITS // same

          packssdw xmm0, xmm2 // 8 dwords to 8 words (signed saturation, but the values are non-negative after the shift, so we don't have to care)
          packuswb xmm0, xmm7 // pack 8 words to 8 bytes

          movq [edx], xmm0 // store 8 bytes = pixels
          add eax, 8
          add edx, 8
          add ebx, 16

          dec ecx
          jnz @loop_x
        end['eax', 'ebx', 'ecx', 'edx'];
        src += width;
        dest += width;
      end;
    end;
    [/pascal]
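
    A small worked comparison of the two scalar variants (with and without the intermediate 16-bit saturation), as a sketch only. It reuses SHIFTBITS = 9 from the code above; the function names WithSat16 and Without16 are made up for this illustration:

    ```pascal
    program saturation_demo;
    {$mode objfpc}

    const
      SHIFTBITS = 9; // same example value as in the code above

    function Clip8(i: longword): byte;
    begin
      if i > 255 then Result := 255 else Result := i;
    end;

    function Clip16(i: longword): word;
    begin
      if i > 65535 then Result := 65535 else Result := i;
    end;

    // Variant with the intermediate 16-bit saturation (as in the first post).
    function WithSat16(b: byte; c: word): byte;
    begin
      Result := Clip8(Clip16(longword(b) * c) shr SHIFTBITS);
    end;

    // Variant that keeps the full 32-bit product until the final clip.
    function Without16(b: byte; c: word): byte;
    begin
      Result := Clip8((longword(b) * c) shr SHIFTBITS);
    end;

    begin
      // 200 * 1024 = 204800 > 65535, so the 16-bit clamp kicks in:
      WriteLn(WithSat16(200, 1024));  // 65535 shr 9 = 127
      WriteLn(Without16(200, 1024));  // 204800 shr 9 = 400, clipped to 255
      // Small products are unaffected:
      WriteLn(WithSat16(100, 512), ' ', Without16(100, 512)); // 100 100
    end.
    ```

    So the two variants only agree for every input when the shift is 8 bits or less; with a larger shift, the intermediate clamp caps the output at 65535 shr SHIFTBITS (127 here), which is why keeping the full product is the more desirable behaviour.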

    For tutorials: the Intel developer manuals should be helpful, such as the "Intel 64 and IA-32 Architectures Optimization Reference Manual", or try this webpage: http://webster.cs.ucr.edu/AoA/Window...2.html#1004358 . Drawing the computations and data flow on a sheet of paper is often very useful too. I learned most things from experimenting. Documents that list the MMX/SSE instructions with their descriptions are very handy as well, for example "AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit Media Instructions" or the NASM 0.99.x manual with its x86 instruction listing (up to SSE2).

  3. #3

    SSE2 tutorials

    [offtopic]
    [pascal]
    src += width;
    dest += width;
    [/pascal]

    Huh... Is += really a valid operator in Delphi? :?
    If so, since which Delphi version?

    [/offtopic]

  4. #4

    SSE2 tutorials

    It is in Free Pascal (FPC accepts += when C-style operators are enabled, e.g. via {$COPERATORS ON} or -Sc). I don't care much about Delphi, and I don't really remember how it handles pointer arithmetic either. But the outer loop should be rewritten in asm too, to avoid uselessly reloading the pointers into registers for every row; then the pointer arithmetic can be avoided completely.

  5. #5

    SSE2 tutorials

    Quote Originally Posted by imcold
    It is in Free Pascal (FPC accepts += when C-style operators are enabled, e.g. via {$COPERATORS ON} or -Sc). I don't care much about Delphi, and I don't really remember how it handles pointer arithmetic either. But the outer loop should be rewritten in asm too, to avoid uselessly reloading the pointers into registers for every row; then the pointer arithmetic can be avoided completely.
    Thanks a lot, really. I'll study it in the coming week. (I expected some URLs to start reading, not ready-made code :-).)

    Thanks again.

    P.S. While this is for Delphi, I have some FPC experience (:-)), so that is no problem for me. Actually, I'll probably be testing it in FPC anyway.

  6. #6

    SSE2 tutorials

    No problem, working with MMX/SSE stuff is (usually) fun. I don't know if you will find any useful, commented SIMD examples, so this should be one. Feel free to ask any questions. The code can still be made a bit faster, so there's still some work left on it, too.

    Oh, and I believe you have a *lot* of FPC experience. Inline asm is cool: you can output the register contents to the console very easily, so it's easy to follow the operations on the data.

  7. #7

    SSE2 tutorials

    Quote Originally Posted by imcold
    No problem, working with MMX/SSE stuff is (usually) fun. I don't know if you will find any useful, commented SIMD examples, so this should be one. Feel free to ask any questions. The code can still be made a bit faster, so there's still some work left on it, too.

    Oh, and I believe you have a *lot* of FPC experience. Inline asm is cool: you can output the register contents to the console very easily, so it's easy to follow the operations on the data.
    A follow-up:

    Because work was busy (and the relevant projects were postponed a few months by the clients), I only got around to real testing today.

    The code crashed at first, but that was because the Pascal code uses register EBX for the loop counter, and the asm block does not preserve it. For now I quickly added a push/pop around it, but I will do the outer loop in asm too in the near future.

    I haven't really validated the data (i.e. whether the image is processed correctly), since I don't have images to test with yet, but the speed is very promising: exactly 10 times faster!
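
    One way to check the data flow without test images is a scalar model of a single SIMD lane, compared against the plain reference formula for every pixel/correction pair. This is only a sketch: SHIFTBITS = 9 is the example value used earlier in the thread, the names Reference and Lane are made up, and the model mirrors the documented instruction semantics rather than running the actual asm. It also compares an unsigned high multiply (pmulhuw) with a signed one (pmulhw):

    ```pascal
    program lane_model;
    {$mode objfpc}

    const
      SHIFTBITS = 9; // assumed, same example value as earlier in the thread

    // Reference result: full product, shift, clip to 8 bits.
    function Reference(b: byte; c: word): byte;
    var
      p: longword;
    begin
      p := (longword(b) * c) shr SHIFTBITS;
      if p > 255 then p := 255;
      Result := p;
    end;

    // One lane of the SIMD data flow. UnsignedHigh selects how the high word
    // of the product is formed: unsigned (pmulhuw) or signed (pmulhw).
    function Lane(b: byte; c: word; UnsignedHigh: boolean): byte;
    var
      lo, hi: word;
      dw: longword;
    begin
      lo := word((longword(b) * c) and $FFFF);                        // pmullw
      if UnsignedHigh then
        hi := word((longword(b) * c) shr 16)                          // pmulhuw
      else
        hi := word((longword(longint(b) * smallint(c)) shr 16) and $FFFF); // pmulhw
      dw := (longword(hi) shl 16) or lo;   // punpcklwd / punpckhwd
      dw := dw shr SHIFTBITS;              // psrld (result is below 2^31)
      if dw > 32767 then dw := 32767;      // packssdw on a non-negative value
      if dw > 255 then dw := 255;          // packuswb
      Result := dw;
    end;

    var
      b, c, badU, badS: longint;
    begin
      badU := 0;
      badS := 0;
      for b := 0 to 255 do
        for c := 0 to 65535 do
        begin
          if Lane(b, c, true) <> Reference(b, c) then Inc(badU);
          if Lane(b, c, false) <> Reference(b, c) then Inc(badS);
        end;
      WriteLn('mismatches with unsigned high multiply: ', badU); // 0
      WriteLn('mismatches with signed high multiply:   ', badS); // non-zero
    end.
    ```

    In this model the signed variant only goes wrong for correction words of 32768 and above combined with very dark pixels (values 1 to 3); for brighter pixels both variants saturate to 255 anyway, so such an error would be easy to miss by eye.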

    So thanks again

  8. #8

    SSE2 tutorials

    EBX should be saved and restored by the compiler if it's listed in the register list after an asm block (Ref. guide, 10.3 Assembler statements). If it's not, it's a bug in FPC, I believe.
    Two tips for some extra speed:
    - align src/dest/corr to addresses that are multiples of 16 (and replace movdqu with movdqa; it should help mainly on Intel CPUs)
    - try to prefetch the next row (or couple of rows) of pixels (it helps in some cases, sometimes it doesn't).
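
    A minimal sketch of the alignment tip, assuming FPC (PtrUInt is an FPC type; Delphi would need a different pointer-sized integer type). The helper name AllocAligned is made up for this illustration; it over-allocates by 15 bytes and rounds the pointer up to the next multiple of 16:

    ```pascal
    program align_demo;
    {$mode objfpc}

    const
      ALIGNMENT = 16; // movdqa requires 16-byte aligned addresses

    // Returns a 16-byte aligned pointer inside a slightly larger raw block.
    // The caller must keep Raw and pass it (not the aligned pointer) to FreeMem.
    function AllocAligned(Size: PtrUInt; out Raw: Pointer): Pointer;
    begin
      GetMem(Raw, Size + ALIGNMENT - 1);
      Result := Pointer((PtrUInt(Raw) + (ALIGNMENT - 1)) and not PtrUInt(ALIGNMENT - 1));
    end;

    var
      raw, row: Pointer;
    begin
      row := AllocAligned(4096, raw);
      WriteLn('aligned: ', PtrUInt(row) mod ALIGNMENT = 0); // aligned: TRUE
      FreeMem(raw);
    end.
    ```

    Since a row is 4096 bytes, a multiple of 16, every row of an image starts on a 16-byte boundary once the start of the buffer does.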

  9. #9

    SSE2 tutorials

    Quote Originally Posted by imcold
    EBX should be saved and restored by the compiler if it's listed in the register list after an asm block (Ref. guide, 10.3 Assembler statements). If it's not, it's a bug in FPC, I believe.
    Two tips for some extra speed:
    - align src/dest/corr to addresses that are multiples of 16 (and replace movdqu with movdqa; it should help mainly on Intel CPUs)
    - try to prefetch the next row (or couple of rows) of pixels (it helps in some cases, sometimes it doesn't).
    src and dest are loaded in 8-byte quantities, and are aligned to 4 bytes on D7 and to 8 bytes on D2006 (FastMM aligns to 8 bytes). I can't change that easily, except by replacing the heap manager, and if you load 8 bytes at a time, you won't stay 16-byte aligned for long anyway. This is why I use D2006 for the speed-dependent projects (and also for its SSE3 support, for LDDQU).

    I aligned the corr array to 32 bytes, and have an ifdef to load it using lddqu, but saw no improvement. I'm probably hitting memory bandwidth limits.

    I doubt prefetch will do much, since I simply walk through a 4MB memory block from 0 to 4MB-1. If the predictor in the CPU can't predict that, there is no point in having prefetch in the first place ;_)

    I'm currently trying this in Delphi btw, so there is no register list. Also, afaik the register list works for asm blocks, but has no effect on assembler procedures.

  10. #10

    SSE2 tutorials

    Some links on optimizing code generated with Delphi:

    http://fastcode.sourceforge.net/

    http://www.yks.ne.jp/~hori/MMXasm-e.html

    http://www.tommesani.com/Features.html

    www.optimalcode.com //hmm this one does not exist anymore ...
