Premature optimization time: Cross product

**JSoftware** · 06-11-2007, 08:57 PM

Ok it seems like it's that time again. It's time for all use nanosecond shavers to get together and figure out to shave a few cycles of the common operations. Last time, the dot product was squashed by sse

This time it's time to optimize the crossproduct
The rules are simple: You can reuse your opponents code as long as you can add a fair bit of your own thought to it. You may use any simd tech up to and with SSE2. Assembler has to be Intel-style

Here's the function to optimize:

[pascal]type
TVector4f = record
x,y,z,w: single;
end;

function Cross(A, B: TVector4f): TVector4f;
begin
cross.x := A.y * B.z - B.y * A.z;
cross.y := A.z * B.x - B.z * A.x;
cross.z := A.x * B.y - B.x * A.y;
cross.w := 0;
end;[/pascal]

**Andreaz** · 07-11-2007, 06:34 AM

Single cycle optimizations is a thing of the past tbh,

Let's pass the vectors by reference and be done with it.
[pascal]
function Cross(const A, B: TVector4f): TVector4f;
begin
cross.x := A.y * B.z - B.y * A.z;
cross.y := A.z * B.x - B.z * A.x;
cross.z := A.x * B.y - B.x * A.y;
cross.w := 0;
end;[/pascal]

I did some profiling on the above, using the original function, the const version and inlined version

[pascal] for i:=0 to NUM_TESTS do begin
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);

C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
end;
[/pascal]

Code:

Testing 1000000 iterations, 10 loop unroll
First test, standard&#58; 4,59062534 s
Second test, const parameter&#58; 4,47177979 s
Third test, inlined&#58; 4,53005200 s

Testing 1000000 iterations, 10 loop unroll
First test, standard&#58; 4,58233044 s
Second test, const parameter&#58; 4,46661824 s
Third test, inlined&#58; 4,53930653 s

Testing 1000000 iterations, 10 loop unroll
First test, standard&#58; 4,58337498 s
Second test, const parameter&#58; 4,45365933 s
Third test, inlined&#58; 4,53174216 s

And this is using my somewhat bugged XP3200+.

Interesting to see here is that the inline version actually is slower (inline as in copy + paste) then the other ones.

As you see in this test you can punch up quite a few cross products per frame without affecting the framerate, are you shure that the cross product is you'r bottleneck ?

**JSoftware** · 07-11-2007, 07:59 AM

Oh, there's no bottlenecks ofcourse, but I just think it's fun optimizing functions

But yesterday evening I was beaten by the GNU C++ compiler, which generated an obscure sequence of sse calls(most of them not even using parallel operations) which was extraordinarily fast. I will post it here and then probably consider the competition over

**User137** · 07-11-2007, 11:14 AM

Originally Posted by Andreaz

Let's pass the vectors by reference and be done with it.
[pascal]function Cross(const A, B: TVector4f): TVector4f;
begin
cross.x := A.y * B.z - B.y * A.z;
cross.y := A.z * B.x - B.z * A.x;
cross.z := A.x * B.y - B.x * A.y;
cross.w := 0;
end;[/pascal]

I don't know details about parameter passing, is using const faster than if passed vectors as just pointers, or is there even difference?

LP · 07-11-2007, 02:37 PM

If you are using this method for 3D rendering, then you'd assign 1 to w instead of 0.

**Andreaz** · 07-11-2007, 04:17 PM

Originally Posted by User137

I don't know details about parameter passing, is using const faster than if passed vectors as just pointers, or is there even difference?

Same thing, but without the messy escaping, and the compiler checks that you dont change the parameters when using constant, to change use var instead.

**Mirage** · 08-11-2007, 07:58 AM

Most effective form will be procedure:

Code:

procedure Cross&#40;out Result&#58; TVector4f; const A, B&#58; TVector4f&#41;&#58; TVector4f;

with function compiler will copy result from stack when function returns.
Unfortunately, Delphi doesn't support inlining normally (not to inline assembler routines is most stupid restriction) so not possible to write result directly to destination variable.

Concerning SSE optimization - I think it's a compiler's task. And I heard that modern FPC versions can do SSE optmizations.

**imcold** · 10-11-2007, 11:20 AM

This got my attention. I've done some benchmarks with FPC 2.2.0 myself. As mirage said, FPC can generate code that uses scalar SSE/SSE2 instructions to do floating point calculations, just like GCC (fpc -CfSSE2 vs gcc -msse -mfpmath=sse).

Code:

procedure Cross&#40;out Result&#58; TVector4f; const A, B&#58; TVector4f&#41;;

is the fastest indeed - though you can't have the destination vector appearing in one of parameters (eg. cross(c, a, c)). Results:

Code:

fpc -O3 cross.pas
FPU &#58; 14,5s

fpc -O3 -CfSSE2 cross.pas
SSE2&#58; 13s                       -a bit faster than fpu
SSE2 + const param&#58; 6s          -avoids copying the param vectors to procedure stack
SSE2 + const param + out param&#58; 2s  -avoids copying the return value from stac

Tested code:
[pascal]{$mode objfpc}
type
TVector4f = record
x,y,z,w: single;
end;

//function Cross(A, B: TVector4f): TVector4f;
//function Cross(const A, B: TVector4f): TVector4f;
procedure Cross(out Result: TVector4f; const A, B: TVector4f);
begin
Result.x := A.y * B.z - B.y * A.z;
Result.y := A.z * B.x - B.z * A.x;
Result.z := A.x * B.y - B.x * A.y;
Result.w := 0;
end;

const
a: TVector4f = (x:1.2; y:1.4; z:1.5; w:2.0);
b: TVector4f = (x:2.2; y:2.4; z:2.5; w:4.0);

var
c: TVector4f;
i: integer;

begin
for i := 0 to 100000000 do begin
//c := cross(a, b); c := cross(b, a);
cross(c, a, b); cross(c, b, a);
end;
end.[/pascal]
Cpu: amd k8 1,6GHz

**arthurprs** · 10-11-2007, 02:18 PM

Very interesting, out is new for me

This runs a little faster here

[pascal]type
TVector4f = record
x,y,z,w: single;
end;

procedure Cross(out Result: TVector4f; const A, B: TVector4f); inline;
begin
Result.x := A.y * B.z - B.y * A.z;
Result.y := A.z * B.x - B.z * A.x;
Result.z := A.x * B.y - B.x * A.y;
Result.w := 0;
end;

const
a: TVector4f = (x:1.2; y:1.4; z:1.5; w:2.0);
b: TVector4f = (x:2.2; y:2.4; z:2.5; w:4.0);

var
c: TVector4f;
i: integer;

begin
for i := 0 to 100000000 do begin
cross(c, a, b); cross(c, b, a);
end;
end.[/pascal]

**imcold** · 10-11-2007, 05:27 PM

Originally Posted by arthurprs

This runs a little faster here

Inlining the function copies it to the place, where it's used - so you will save a function call. On the other hand, it makes your code longer, and many inlines in a row make the code slower.
I've made a version that uses vector SSE instructions instead of scalar SSE like FPC does: fewer instructions, but it's slower in my tests

Here's the code nevertheless:
[pascal]{
parameter passing (this is how FPC 2.2.0 at -O3 passes parameters to pascal Cross function; all pointers):
Var A located in register eax
Var B located in register edx
Var $result located in register ecx
}
procedure Cross_vec(out Result: TVector4f; const A, B: TVector4f); assembler; nostackframe;
asm
movdqu xmm2, [edx] // load A1
movdqu xmm1, [ecx] // load B
movdqa xmm0, xmm2 // load A
movdqa xmm3, xmm1 // load B1

shufps xmm2, xmm2, $C9 // shuffle A1
shufps xmm1, xmm1, $C9 // shuffle B

mulps xmm0, xmm1 // A * B = C
mulps xmm3, xmm2 // B1 * A1 = D

subps xmm0, xmm3 // C - D = R
shufps xmm0, xmm0, $C9 // shuffle R

movdqu [eax], xmm0 // return R
end;[/pascal]