Results 1 to 10 of 10

Thread: Premature optimization time: Cross product

  1. #1

    Premature optimization time: Cross product

    Ok it seems like it's that time again. It's time for all use nanosecond shavers to get together and figure out to shave a few cycles of the common operations. Last time, the dot product was squashed by sse

    This time it's time to optimize the crossproduct
    The rules are simple: You can reuse your opponents code as long as you can add a fair bit of your own thought to it. You may use any simd tech up to and with SSE2. Assembler has to be Intel-style

    Here's the function to optimize:

    [pascal]type
    TVector4f = record
    x,y,z,w: single;
    end;

    function Cross(A, B: TVector4f): TVector4f;
    begin
    cross.x := A.y * B.z - B.y * A.z;
    cross.y := A.z * B.x - B.z * A.x;
    cross.z := A.x * B.y - B.x * A.y;
    cross.w := 0;
    end;[/pascal]
    Peregrinus, expectavi pedes meos in cymbalis
    Nullus norvegicorum sole urinat

  2. #2

    Re: Premature optimization time: Cross product

    Single cycle optimizations is a thing of the past tbh,

    Let's pass the vectors by reference and be done with it.
    [pascal]
    function Cross(const A, B: TVector4f): TVector4f;
    begin
    cross.x := A.y * B.z - B.y * A.z;
    cross.y := A.z * B.x - B.z * A.x;
    cross.z := A.x * B.y - B.x * A.y;
    cross.w := 0;
    end;[/pascal]

    I did some profiling on the above, using the original function, the const version and inlined version

    [pascal] for i:=0 to NUM_TESTS do begin
    C:=Cross(A,B);
    C:=Cross(A,B);
    C:=Cross(A,B);
    C:=Cross(A,B);
    C:=Cross(A,B);

    C:=Cross(A,B);
    C:=Cross(A,B);
    C:=Cross(A,B);
    C:=Cross(A,B);
    C:=Cross(A,B);
    end;
    [/pascal]

    Code:
    Testing 1000000 iterations, 10 loop unroll
    First test, standard: 4,59062534 s
    Second test, const parameter: 4,47177979 s
    Third test, inlined: 4,53005200 s
    
    Testing 1000000 iterations, 10 loop unroll
    First test, standard: 4,58233044 s
    Second test, const parameter: 4,46661824 s
    Third test, inlined: 4,53930653 s
    
    Testing 1000000 iterations, 10 loop unroll
    First test, standard: 4,58337498 s
    Second test, const parameter: 4,45365933 s
    Third test, inlined: 4,53174216 s
    And this is using my somewhat bugged XP3200+.

    Interesting to see here is that the inline version actually is slower (inline as in copy + paste) then the other ones.

    As you see in this test you can punch up quite a few cross products per frame without affecting the framerate, are you shure that the cross product is you'r bottleneck ?
    Amnoxx

    Oh, and this code appears to be an approximate replacement for return(random() & 0x01);

    Phoenix Wiki
    http://www.phoenixlib.net/

    Phoenix Forum
    http://www.pascalgamedevelopment.com/viewforum.php?f=71

  3. #3

    Premature optimization time: Cross product

    Oh, there's no bottlenecks ofcourse, but I just think it's fun optimizing functions

    But yesterday evening I was beaten by the GNU C++ compiler, which generated an obscure sequence of sse calls(most of them not even using parallel operations) which was extraordinarily fast. I will post it here and then probably consider the competition over
    Peregrinus, expectavi pedes meos in cymbalis
    Nullus norvegicorum sole urinat

  4. #4

    Re: Premature optimization time: Cross product

    Quote Originally Posted by Andreaz
    Let's pass the vectors by reference and be done with it.
    [pascal]function Cross(const A, B: TVector4f): TVector4f;
    begin
    cross.x := A.y * B.z - B.y * A.z;
    cross.y := A.z * B.x - B.z * A.x;
    cross.z := A.x * B.y - B.x * A.y;
    cross.w := 0;
    end;[/pascal]
    I don't know details about parameter passing, is using const faster than if passed vectors as just pointers, or is there even difference?

  5. #5

    Premature optimization time: Cross product

    If you are using this method for 3D rendering, then you'd assign 1 to w instead of 0.

  6. #6

    Re: Premature optimization time: Cross product

    Quote Originally Posted by User137
    I don't know details about parameter passing, is using const faster than if passed vectors as just pointers, or is there even difference?
    Same thing, but without the messy escaping, and the compiler checks that you dont change the parameters when using constant, to change use var instead.
    Amnoxx

    Oh, and this code appears to be an approximate replacement for return(random() & 0x01);

    Phoenix Wiki
    http://www.phoenixlib.net/

    Phoenix Forum
    http://www.pascalgamedevelopment.com/viewforum.php?f=71

  7. #7

    Premature optimization time: Cross product

    Most effective form will be procedure:

    Code:
    procedure Cross(out Result: TVector4f; const A, B: TVector4f): TVector4f;
    with function compiler will copy result from stack when function returns.
    Unfortunately, Delphi doesn't support inlining normally (not to inline assembler routines is most stupid restriction) so not possible to write result directly to destination variable.

    Concerning SSE optimization - I think it's a compiler's task. And I heard that modern FPC versions can do SSE optmizations.

  8. #8

    Premature optimization time: Cross product

    This got my attention. I've done some benchmarks with FPC 2.2.0 myself. As mirage said, FPC can generate code that uses scalar SSE/SSE2 instructions to do floating point calculations, just like GCC (fpc -CfSSE2 vs gcc -msse -mfpmath=sse).
    Code:
    procedure Cross(out Result: TVector4f; const A, B: TVector4f);
    is the fastest indeed - though you can't have the destination vector appearing in one of parameters (eg. cross(c, a, c)). Results:
    Code:
    fpc -O3 cross.pas
    FPU : 14,5s
    
    fpc -O3 -CfSSE2 cross.pas
    SSE2: 13s                       -a bit faster than fpu
    SSE2 + const param: 6s          -avoids copying the param vectors to procedure stack
    SSE2 + const param + out param: 2s  -avoids copying the return value from stac
    Tested code:
    [pascal]{$mode objfpc}
    type
    TVector4f = record
    x,y,z,w: single;
    end;

    //function Cross(A, B: TVector4f): TVector4f;
    //function Cross(const A, B: TVector4f): TVector4f;
    procedure Cross(out Result: TVector4f; const A, B: TVector4f);
    begin
    Result.x := A.y * B.z - B.y * A.z;
    Result.y := A.z * B.x - B.z * A.x;
    Result.z := A.x * B.y - B.x * A.y;
    Result.w := 0;
    end;


    const
    a: TVector4f = (x:1.2; y:1.4; z:1.5; w:2.0);
    b: TVector4f = (x:2.2; y:2.4; z:2.5; w:4.0);

    var
    c: TVector4f;
    i: integer;

    begin
    for i := 0 to 100000000 do begin
    //c := cross(a, b); c := cross(b, a);
    cross(c, a, b); cross(c, b, a);
    end;
    end.[/pascal]
    Cpu: amd k8 1,6GHz

  9. #9

    Premature optimization time: Cross product

    Very interesting, out is new for me

    This runs a little faster here

    [pascal]type
    TVector4f = record
    x,y,z,w: single;
    end;

    procedure Cross(out Result: TVector4f; const A, B: TVector4f); inline;
    begin
    Result.x := A.y * B.z - B.y * A.z;
    Result.y := A.z * B.x - B.z * A.x;
    Result.z := A.x * B.y - B.x * A.y;
    Result.w := 0;
    end;


    const
    a: TVector4f = (x:1.2; y:1.4; z:1.5; w:2.0);
    b: TVector4f = (x:2.2; y:2.4; z:2.5; w:4.0);

    var
    c: TVector4f;
    i: integer;

    begin
    for i := 0 to 100000000 do begin
    cross(c, a, b); cross(c, b, a);
    end;
    end.[/pascal]
    From brazil (:

    Pascal pownz!

  10. #10

    Premature optimization time: Cross product

    Quote Originally Posted by arthurprs
    This runs a little faster here
    Inlining the function copies it to the place, where it's used - so you will save a function call. On the other hand, it makes your code longer, and many inlines in a row make the code slower.
    I've made a version that uses vector SSE instructions instead of scalar SSE like FPC does: fewer instructions, but it's slower in my tests Here's the code nevertheless:
    [pascal]{
    parameter passing (this is how FPC 2.2.0 at -O3 passes parameters to pascal Cross function; all pointers):
    Var A located in register eax
    Var B located in register edx
    Var $result located in register ecx
    }
    procedure Cross_vec(out Result: TVector4f; const A, B: TVector4f); assembler; nostackframe;
    asm
    movdqu xmm2, [edx] // load A1
    movdqu xmm1, [ecx] // load B
    movdqa xmm0, xmm2 // load A
    movdqa xmm3, xmm1 // load B1

    shufps xmm2, xmm2, $C9 // shuffle A1
    shufps xmm1, xmm1, $C9 // shuffle B

    mulps xmm0, xmm1 // A * B = C
    mulps xmm3, xmm2 // B1 * A1 = D

    subps xmm0, xmm3 // C - D = R
    shufps xmm0, xmm0, $C9 // shuffle R

    movdqu [eax], xmm0 // return R
    end;[/pascal]

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •