PDA

View Full Version : Premature optimization time: Cross product



JSoftware
06-11-2007, 08:57 PM
Ok it seems like it's that time again. It's time for all use nanosecond shavers to get together and figure out to shave a few cycles of the common operations. Last time, the dot product was squashed by sse

This time it's time to optimize the crossproduct
The rules are simple: You can reuse your opponents code as long as you can add a fair bit of your own thought to it. You may use any simd tech up to and with SSE2. Assembler has to be Intel-style

Here's the function to optimize:

type
TVector4f = record
x,y,z,w: single;
end;

function Cross(A, B: TVector4f): TVector4f;
begin
cross.x := A.y * B.z - B.y * A.z;
cross.y := A.z * B.x - B.z * A.x;
cross.z := A.x * B.y - B.x * A.y;
cross.w := 0;
end;

Andreaz
07-11-2007, 06:34 AM
Single cycle optimizations is a thing of the past tbh,

Let's pass the vectors by reference and be done with it.

function Cross(const A, B: TVector4f): TVector4f;
begin
cross.x := A.y * B.z - B.y * A.z;
cross.y := A.z * B.x - B.z * A.x;
cross.z := A.x * B.y - B.x * A.y;
cross.w := 0;
end;

I did some profiling on the above, using the original function, the const version and inlined version

for i:=0 to NUM_TESTS do begin
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);

C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
C:=Cross(A,B);
end;




Testing 1000000 iterations, 10 loop unroll
First test, standard: 4,59062534 s
Second test, const parameter: 4,47177979 s
Third test, inlined: 4,53005200 s

Testing 1000000 iterations, 10 loop unroll
First test, standard: 4,58233044 s
Second test, const parameter: 4,46661824 s
Third test, inlined: 4,53930653 s

Testing 1000000 iterations, 10 loop unroll
First test, standard: 4,58337498 s
Second test, const parameter: 4,45365933 s
Third test, inlined: 4,53174216 s

And this is using my somewhat bugged XP3200+.

Interesting to see here is that the inline version actually is slower (inline as in copy + paste) then the other ones.

As you see in this test you can punch up quite a few cross products per frame without affecting the framerate, are you shure that the cross product is you'r bottleneck ?

JSoftware
07-11-2007, 07:59 AM
Oh, there's no bottlenecks ofcourse, but I just think it's fun optimizing functions :)

But yesterday evening I was beaten by the GNU C++ compiler, which generated an obscure sequence of sse calls(most of them not even using parallel operations) which was extraordinarily fast. I will post it here and then probably consider the competition over :(

User137
07-11-2007, 11:14 AM
Let's pass the vectors by reference and be done with it.
function Cross(const A, B: TVector4f): TVector4f;
begin
cross.x := A.y * B.z - B.y * A.z;
cross.y := A.z * B.x - B.z * A.x;
cross.z := A.x * B.y - B.x * A.y;
cross.w := 0;
end;
I don't know details about parameter passing, is using const faster than if passed vectors as just pointers, or is there even difference?

LP
07-11-2007, 02:37 PM
If you are using this method for 3D rendering, then you'd assign 1 to w instead of 0.

Andreaz
07-11-2007, 04:17 PM
I don't know details about parameter passing, is using const faster than if passed vectors as just pointers, or is there even difference?

Same thing, but without the messy escaping, and the compiler checks that you dont change the parameters when using constant, to change use var instead.

Mirage
08-11-2007, 07:58 AM
Most effective form will be procedure:


procedure Cross(out Result: TVector4f; const A, B: TVector4f): TVector4f;

with function compiler will copy result from stack when function returns.
Unfortunately, Delphi doesn't support inlining normally (not to inline assembler routines is most stupid restriction) so not possible to write result directly to destination variable.

Concerning SSE optimization - I think it's a compiler's task. And I heard that modern FPC versions can do SSE optmizations.

imcold
10-11-2007, 11:20 AM
This got my attention. I've done some benchmarks with FPC 2.2.0 myself. As mirage said, FPC can generate code that uses scalar SSE/SSE2 instructions to do floating point calculations, just like GCC (fpc -CfSSE2 vs gcc -msse -mfpmath=sse).
procedure Cross(out Result: TVector4f; const A, B: TVector4f); is the fastest indeed - though you can't have the destination vector appearing in one of parameters (eg. cross(c, a, c)). Results:

fpc -O3 cross.pas
FPU : 14,5s

fpc -O3 -CfSSE2 cross.pas
SSE2: 13s -a bit faster than fpu
SSE2 + const param: 6s -avoids copying the param vectors to procedure stack
SSE2 + const param + out param: 2s -avoids copying the return value from stac
Tested code:
{$mode objfpc}
type
TVector4f = record
x,y,z,w: single;
end;

//function Cross(A, B: TVector4f): TVector4f;
//function Cross(const A, B: TVector4f): TVector4f;
procedure Cross(out Result: TVector4f; const A, B: TVector4f);
begin
Result.x := A.y * B.z - B.y * A.z;
Result.y := A.z * B.x - B.z * A.x;
Result.z := A.x * B.y - B.x * A.y;
Result.w := 0;
end;


const
a: TVector4f = (x:1.2; y:1.4; z:1.5; w:2.0);
b: TVector4f = (x:2.2; y:2.4; z:2.5; w:4.0);

var
c: TVector4f;
i: integer;

begin
for i := 0 to 100000000 do begin
//c := cross(a, b); c := cross(b, a);
cross(c, a, b); cross(c, b, a);
end;
end.
Cpu: amd k8 1,6GHz

arthurprs
10-11-2007, 02:18 PM
Very interesting, out is new for me

This runs a little faster here

type
TVector4f = record
x,y,z,w: single;
end;

procedure Cross(out Result: TVector4f; const A, B: TVector4f); inline;
begin
Result.x := A.y * B.z - B.y * A.z;
Result.y := A.z * B.x - B.z * A.x;
Result.z := A.x * B.y - B.x * A.y;
Result.w := 0;
end;


const
a: TVector4f = (x:1.2; y:1.4; z:1.5; w:2.0);
b: TVector4f = (x:2.2; y:2.4; z:2.5; w:4.0);

var
c: TVector4f;
i: integer;

begin
for i := 0 to 100000000 do begin
cross(c, a, b); cross(c, b, a);
end;
end.

imcold
10-11-2007, 05:27 PM
This runs a little faster here
Inlining the function copies it to the place, where it's used - so you will save a function call. On the other hand, it makes your code longer, and many inlines in a row make the code slower.
I've made a version that uses vector SSE instructions instead of scalar SSE like FPC does: fewer instructions, but it's slower in my tests :( Here's the code nevertheless:
{
parameter passing (this is how FPC 2.2.0 at -O3 passes parameters to pascal Cross function; all pointers):
Var A located in register eax
Var B located in register edx
Var $result located in register ecx
}
procedure Cross_vec(out Result: TVector4f; const A, B: TVector4f); assembler; nostackframe;
asm
movdqu xmm2, [edx] // load A1
movdqu xmm1, [ecx] // load B
movdqa xmm0, xmm2 // load A
movdqa xmm3, xmm1 // load B1

shufps xmm2, xmm2, $C9 // shuffle A1
shufps xmm1, xmm1, $C9 // shuffle B

mulps xmm0, xmm1 // A * B = C
mulps xmm3, xmm2 // B1 * A1 = D

subps xmm0, xmm3 // C - D = R
shufps xmm0, xmm0, $C9 // shuffle R

movdqu [eax], xmm0 // return R
end;