Quote Originally Posted by arthurprs
This runs a little faster here
Inlining copies the function body to every place it's called, so you save the function call. On the other hand, it makes your code larger, and many inlined calls in a row can make the code slower, since the bigger code puts more pressure on the instruction cache.
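For reference, here is a minimal sketch of what requesting inlining looks like in FPC. The TVector4f layout (four Singles named x, y, z, w) and the Dot helper are assumptions for illustration only, not code from this thread. Note that inline is just a hint, so the compiler may still emit a normal call.
[pascal]{$inline on} // make sure inlining is allowed in the current mode

type
  // assumed layout, for illustration only
  TVector4f = record
    x, y, z, w: Single;
  end;

// hypothetical helper: "inline" asks the compiler to paste the body at each call site
function Dot(const A, B: TVector4f): Single; inline;
begin
  Dot := A.x * B.x + A.y * B.y + A.z * B.z;
end;[/pascal]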
I've made a version that uses packed (vector) SSE instructions instead of the scalar SSE instructions FPC generates. It needs fewer instructions, but it's slower in my tests. Here's the code nevertheless:
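[pascal]{
parameter passing (this is how FPC 2.2.0 at -O3 passes parameters to the pascal Cross function; all pointers):
Var A located in register eax
Var B located in register edx
Var $result located in register ecx

Cross_vec below takes Result as its first parameter, so the registers shift:
Result is in eax, A in edx, B in ecx
}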
procedure Cross_vec(out Result: TVector4f; const A, B: TVector4f); assembler; nostackframe;
asm
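movdqu xmm2, [edx] // xmm2 = A (unaligned load)
movdqu xmm1, [ecx] // xmm1 = B (unaligned load)
movdqa xmm0, xmm2 // xmm0 = copy of A
movdqa xmm3, xmm1 // xmm3 = copy of B

shufps xmm2, xmm2, $C9 // xmm2 = (A.y, A.z, A.x, A.w)
shufps xmm1, xmm1, $C9 // xmm1 = (B.y, B.z, B.x, B.w)

mulps xmm0, xmm1 // C = A * B.yzx
mulps xmm3, xmm2 // D = B * A.yzx

subps xmm0, xmm3 // R = C - D = (cross.z, cross.x, cross.y, 0)
shufps xmm0, xmm0, $C9 // rotate R back to (cross.x, cross.y, cross.z, 0)

movdqu [eax], xmm0 // store R to Result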
end;[/pascal]
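To check a hand-written SSE routine like the one above, it helps to have a plain scalar version to compare against. This is only a sketch: the TVector4f layout (x, y, z, w as Singles) and the name Cross_ref are my assumptions, not code from the thread. It takes its parameters in the same order as Cross_vec, so the two can be swapped in a test.
[pascal]{$mode objfpc} // needed for "out" parameters

type
  // assumed layout, for illustration only
  TVector4f = record
    x, y, z, w: Single;
  end;

// hypothetical scalar reference with the same parameter order as Cross_vec
procedure Cross_ref(out Res: TVector4f; const A, B: TVector4f);
var
  T: TVector4f; // temporary, so the call stays correct if Res aliases A or B
begin
  T.x := A.y * B.z - A.z * B.y;
  T.y := A.z * B.x - A.x * B.z;
  T.z := A.x * B.y - A.y * B.x;
  T.w := 0; // the SSE version computes A.w*B.w - B.w*A.w, which is 0 for finite inputs
  Res := T;
end;[/pascal]
Feed both routines the same inputs and compare the x, y and z components. A quick sanity check: (1, 0, 0) crossed with (0, 1, 0) should give (0, 0, 1).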