OK, it seems like it's that time again: time for all us nanosecond shavers to get together and figure out how to shave a few cycles off the common operations. Last time, the dot product was squashed by SSE.

This time it's time to optimize the cross product.
The rules are simple: you can reuse your opponents' code as long as you add a fair bit of your own thought to it. You may use any SIMD tech up to and including SSE2. Assembler has to be Intel-style.

Here's the function to optimize:

[pascal]type
  TVector4f = record
    x, y, z, w: single;
  end;

function Cross(A, B: TVector4f): TVector4f;
begin
  Cross.x := A.y * B.z - B.y * A.z;
  Cross.y := A.z * B.x - B.z * A.x;
  Cross.z := A.x * B.y - B.x * A.y;
  Cross.w := 0;
end;[/pascal]
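
Since entries will be compared against this reference, a small correctness check might save some arguments later. The following is only a sketch, assuming Free Pascal in Delphi mode or Delphi itself. The helpers Vec, SameVec and Check and the tolerance Eps are made up for illustration and are not part of the challenge. The reference function is copied here with Result instead of the function name, which should behave the same.

[pascal]program CrossCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TVector4f = record
    x, y, z, w: single;
  end;

// Reference implementation from the post (written with Result).
function Cross(A, B: TVector4f): TVector4f;
begin
  Result.x := A.y * B.z - B.y * A.z;
  Result.y := A.z * B.x - B.z * A.x;
  Result.z := A.x * B.y - B.x * A.y;
  Result.w := 0;
end;

// Hypothetical helper: builds a vector with w = 0.
function Vec(ax, ay, az: single): TVector4f;
begin
  Result.x := ax;
  Result.y := ay;
  Result.z := az;
  Result.w := 0;
end;

// Hypothetical helper: component-wise comparison with a small tolerance.
function SameVec(const U, V: TVector4f): boolean;
const
  Eps = 1e-6; // assumed tolerance, adjust to taste
begin
  Result := (Abs(U.x - V.x) < Eps) and (Abs(U.y - V.y) < Eps) and
            (Abs(U.z - V.z) < Eps) and (Abs(U.w - V.w) < Eps);
end;

// Hypothetical helper: prints whether Got matches Expected.
procedure Check(const Name: string; const Got, Expected: TVector4f);
begin
  if SameVec(Got, Expected) then
    WriteLn(Name, ': ok')
  else
    WriteLn(Name, ': FAILED');
end;

begin
  // Unit axes: x cross y = z, and swapping the operands flips the sign.
  Check('x cross y', Cross(Vec(1, 0, 0), Vec(0, 1, 0)), Vec(0, 0, 1));
  Check('y cross x', Cross(Vec(0, 1, 0), Vec(1, 0, 0)), Vec(0, 0, -1));
  // A vector crossed with itself is the zero vector.
  Check('a cross a', Cross(Vec(1, 2, 3), Vec(1, 2, 3)), Vec(0, 0, 0));
  // Hand-computed case: (1,2,3) cross (4,5,6) = (-3, 6, -3).
  Check('general', Cross(Vec(1, 2, 3), Vec(4, 5, 6)), Vec(-3, 6, -3));
end.[/pascal]

To test an optimized entry, swap its function in for Cross in the Check calls and keep the expected values as they are.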