This got my attention, so I ran some benchmarks with FPC 2.2.0 myself. As mirage said, FPC can generate code that uses scalar SSE/SSE2 instructions for floating-point calculations, just like GCC (fpc -CfSSE2 vs. gcc -msse -mfpmath=sse).
Code:
procedure Cross(out Result: TVector4f; const A, B: TVector4f);
is indeed the fastest, though the destination vector must not also appear as one of the input parameters (e.g. cross(c, a, c)), because the inputs would be overwritten while they are still being read. Results:
Code:
fpc -O3 cross.pas
FPU : 14.5s

fpc -O3 -CfSSE2 cross.pas
SSE2: 13s                           - a bit faster than the FPU
SSE2 + const param: 6s              - avoids copying the parameter vectors to the procedure's stack
SSE2 + const param + out param: 2s  - avoids copying the return value from the stack
Tested code:
[pascal]{$mode objfpc}
type
  TVector4f = record
    x, y, z, w: single;
  end;

// Slower variants that were benchmarked (swap in one at a time):
//function Cross(A, B: TVector4f): TVector4f;
//function Cross(const A, B: TVector4f): TVector4f;

// Fastest variant: inputs are const, the result is written directly
// into the caller's vector. Result must not be the same variable as A or B.
procedure Cross(out Result: TVector4f; const A, B: TVector4f);
begin
  Result.x := A.y * B.z - B.y * A.z;
  Result.y := A.z * B.x - B.z * A.x;
  Result.z := A.x * B.y - B.x * A.y;
  Result.w := 0;
end;


const
  a: TVector4f = (x:1.2; y:1.4; z:1.5; w:2.0);
  b: TVector4f = (x:2.2; y:2.4; z:2.5; w:4.0);

var
  c: TVector4f;
  i: integer;

begin
  // Two cross products per iteration
  for i := 0 to 100000000 do begin
    // Call form for the function variants:
    //c := cross(a, b); c := cross(b, a);
    cross(c, a, b); cross(c, b, a);
  end;
end.[/pascal]
CPU: AMD K8, 1.6 GHz
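About the cross(c, a, c) restriction mentioned above: if you do need the destination to be one of the inputs, a possible workaround is to compute into local temporaries first and only then write the result. This is just a sketch and I have not benchmarked it. CrossSafe is a made-up name, and the extra stores will probably cost some of the speed gained from the out parameter.
[pascal]program crossalias;
{$mode objfpc}

type
  TVector4f = record
    x, y, z, w: single;
  end;

// Hypothetical variant (not part of the benchmark above):
// all inputs are read into locals before anything is written to Result,
// so Result may safely be the same variable as A or B.
procedure CrossSafe(out Result: TVector4f; const A, B: TVector4f);
var
  rx, ry, rz: single;
begin
  rx := A.y * B.z - B.y * A.z;
  ry := A.z * B.x - B.z * A.x;
  rz := A.x * B.y - B.x * A.y;
  Result.x := rx;
  Result.y := ry;
  Result.z := rz;
  Result.w := 0;
end;

var
  a, c: TVector4f;

begin
  // a is the x unit vector, c is the y unit vector
  a.x := 1; a.y := 0; a.z := 0; a.w := 0;
  c.x := 0; c.y := 1; c.z := 0; c.w := 0;

  // Destination is also the second input; mathematically x cross y = z
  CrossSafe(c, a, c);

  WriteLn(c.x:0:1, ' ', c.y:0:1, ' ', c.z:0:1, ' ', c.w:0:1);
end.[/pascal]
When the destination is known to be a separate variable, the plain Cross with the out parameter is still the one to use.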