TL;DR: even when generating x86-64 code, which uses SSE by default, FPC can mix in x87 FPU instructions for the sake of complete, bit-for-bit reproducibility, which murders performance.
The online compiler explorer https://godbolt.org/
disassembles this:
Code:
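{$mode objfpc}
unit quack;
{$fputype sse64}
interface
type
  float = single;
var
  a, b, c: float;
procedure testit;
implementation

// Constant wrapped in a type-cast to single: compiles to SSE (mulss).
procedure testit;
begin
  b := a * float(3.14);
end;

// Bare constant: kept as an 80-bit extended value, so the multiply goes through the x87 FPU.
procedure testit1;
begin
  b := a * 3.14;
end;

// Two single variables: compiles to SSE (mulss).
procedure testit2;
begin
  b := a * c;
end;

end.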
into this:
Code:
testit():
movss xmm0,DWORD PTR ds:0x431ef8
mulss xmm0,DWORD PTR ds:0x4254b0
movss DWORD PTR ds:0x431efc,xmm0
ret
nop DWORD PTR [rax+0x0]
testit1():
fld DWORD PTR ds:0x431ef8
fld TBYTE PTR ds:0x4254c0
fmulp st(1),st
fstp DWORD PTR ds:0x431efc
ret
nop DWORD PTR [rax+rax*1+0x0]
testit2():
movss xmm0,DWORD PTR ds:0x431ef8
mulss xmm0,DWORD PTR ds:0x431f00
movss DWORD PTR ds:0x431efc,xmm0
ret
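Note the vintage Fxxx instructions in testit1(): the bare constant is loaded as an 80-bit TBYTE value, so the whole multiplication goes through the x87 FPU. This is what doing floating point calculations the old, 1980s way looks like.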
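If the cast feels noisy at every use site, a typed constant is another option that should have the same effect, since the value is then stored as a 32-bit single, just like the variable c in testit2(). A minimal sketch, reusing only the float alias from the unit above; the names PiTyped, viaCast and viaTyped are made up for illustration, and the disassembly of this exact program has not been checked here:

```pascal
{$mode objfpc}
program constdemo;

type
  float = single;

const
  // Hypothetical typed constant: stored as a 32-bit single,
  // so no extended-precision operand enters the expression.
  PiTyped: float = 3.14;

var
  a, viaCast, viaTyped: float;

begin
  a := 2.0;
  viaCast := a * float(3.14);  // same form as testit()
  viaTyped := a * PiTyped;     // same shape as testit2(): single * single
  // Prints whether the two variants produced the same single value.
  WriteLn('cast and typed constant agree: ', viaCast = viaTyped);
end.
```

Whether both lines really come out as mulss is easy to confirm by pasting the program into the compiler explorer and looking for F-prefixed instructions.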
My reproducibility test program (see https://www.pascalgamedevelopment.co...l=1#post149998 and https://www.pascalgamedevelopment.co...l=1#post149991) shows the appalling consequences of this: while maintaining strict binary reproducibility (the results of the calculations match to the bit!), the multiplication by a constant NOT wrapped in a type-cast is a whole 3.007 times slower (0.286 vs 0.86 GFLOPS) than the same formula using a typecast-wrapped constant.
..checking x * 3.141592653589793 (inline const)
.................................
..ok, in 73 (pure 14,7) seconds (0,286 GFLOPS)
..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6
..checking x * float(3.141592653589793) (inline const with type-cast)
.................................
..ok, in 64 (pure 4,8) seconds (0,86 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
While 32-bit code has no such penalty:
..checking x * 3.141592653589793 (inline const)
.................................
..ok, in 52 (pure 5,42) seconds (0,774 GFLOPS)
..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6
..checking x * float(3.141592653589793) (inline const with type-cast)
.................................
..ok, in 52 (pure 5,24) seconds (0,8 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
No big surprise considering that for 64-bit code the FPU is an unnatural, legacy thing that is allowed to live out of mercy and backward compatibility.
I am sure there are even more FUN slowdowns when code mixes those two methods, forcing the CPU to switch gears and scrub all its XMM registers with chlorine after each such unclean call.
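For anyone who wants to check the gap on their own machine without the full test program linked above, a rough timing loop could look like the sketch below. This is not the test program that produced the numbers in this post: the loop body, the iteration count N and the variable names are all assumptions, and it only measures wall-clock milliseconds with GetTickCount64 from SysUtils instead of computing GFLOPS or checksums.

```pascal
{$mode objfpc}
program constbench;

uses
  SysUtils;

type
  float = single;

const
  N = 100000000;  // arbitrary iteration count, chosen for illustration

var
  i: longint;
  x, acc: float;
  t0: QWord;

begin
  x := 1.0001;

  // Variant 1: bare constant, as in testit1().
  acc := 0;
  t0 := GetTickCount64;
  for i := 1 to N do
    acc := acc + x * 3.141592653589793;
  WriteLn('inline const:           ', GetTickCount64 - t0, ' ms (acc = ', acc, ')');

  // Variant 2: constant wrapped in a type-cast, as in testit().
  acc := 0;
  t0 := GetTickCount64;
  for i := 1 to N do
    acc := acc + x * float(3.141592653589793);
  WriteLn('const with type-cast:   ', GetTickCount64 - t0, ' ms (acc = ', acc, ')');
end.
```

Printing acc keeps the result in use, so the loops cannot be discarded as dead code. Building it once as 64-bit and once as 32-bit should show whether the penalty appears only in the 64-bit build, as described above.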