Surprise! Why multiplication by inline const may work 3 times slower in 64-bit code

**Chebmaster** · 14-05-2023, 03:57 PM

TL;DR: even when generating x86-64 code, which uses SSE by default, FPC can mix in x87 FPU instructions for the sake of complete, bit-for-bit reproducibility, and that murders performance.

The online compiler explorer https://godbolt.org/
disassembles this:

Code:

{$mode objfpc}
unit quack;
{$fputype sse64}
interface
type float = single;
var a, b, c: float;
procedure testit;
implementation
procedure testit;
begin
  b:= a * float(3.14);
end;
procedure testit1;
begin
  b:= a * 3.14;
end;
procedure testit2;
begin
  b:= a * c;
end;
end.

into this:

Code:

testit():
 movss  xmm0,DWORD PTR ds:0x431ef8
 mulss  xmm0,DWORD PTR ds:0x4254b0
 movss  DWORD PTR ds:0x431efc,xmm0
 ret    
 nop    DWORD PTR [rax+0x0]
testit1():
 fld    DWORD PTR ds:0x431ef8
 fld    TBYTE PTR ds:0x4254c0
 fmulp  st(1),st
 fstp   DWORD PTR ds:0x431efc
 ret    
 nop    DWORD PTR [rax+rax*1+0x0]
testit2():
 movss  xmm0,DWORD PTR ds:0x431ef8
 mulss  xmm0,DWORD PTR ds:0x431f00
 movss  DWORD PTR ds:0x431efc,xmm0
 ret

Note the vintage Fxxx instructions in testit1() - this is what doing floating-point calculations the old, 1980s way looks like.

My reproducibility test program (see here:
https://www.pascalgamedevelopment.co...l=1#post149998
https://www.pascalgamedevelopment.co...l=1#post149991
)

shows the appalling consequences of this: while strict binary reproducibility is maintained (the results match to the bit!), multiplication by a constant NOT wrapped in a typecast is a whole 3.007 times slower than the same formula using a typecast-wrapped constant.

..checking x * 3.141592653589793 (inline const)
.................................
..ok, in 73 (pure 14,7) seconds (0,286 GFLOPS)
..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6

..checking x * float(3.141592653589793) (inline const with type-cast)
.................................
..ok, in 64 (pure 4,8) seconds (0,86 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729

While 32-bit code has no such penalty:

..checking x * 3.141592653589793 (inline const)
.................................
..ok, in 52 (pure 5,42) seconds (0,774 GFLOPS)
..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6

..checking x * float(3.141592653589793) (inline const with type-cast)
.................................
..ok, in 52 (pure 5,24) seconds (0,8 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729

No big surprise considering that for 64-bit code the FPU is an unnatural, legacy thing that is allowed to live out of mercy and backward compatibility.
I am sure there are even more FUN slowdowns when code mixes the two methods, forcing the CPU to switch gears and scrub all its XMM registers with chlorine after each such unclean call.
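
For anyone who wants to sidestep this, here is a minimal sketch that puts two spellings which should keep the constant in single precision next to the inline one. The program name constdemo, the constant K and the variable names are made up for illustration; only the float(3.14) typecast comes from the test above. That a typed constant behaves like the typecast is an assumption (it is declared as a float, so the multiplication should stay single * single), so check the disassembly for your FPC version and target before relying on it.

```pascal
{$mode objfpc}
program constdemo;

type
  float = single;

const
  // Typed constant (hypothetical name): declared as float, so it is stored as a single.
  K: float = 3.14;

var
  a, viaCast, viaTyped, viaInline: float;

begin
  a := 2.0;

  // Typecast-wrapped literal: the form that compiled to movss/mulss in testit().
  viaCast := a * float(3.14);

  // Typed constant: assumed to stay in single precision as well.
  viaTyped := a * K;

  // Bare literal: the form that compiled to fld/fmulp/fstp in testit1().
  viaInline := a * 3.14;

  WriteLn(viaCast, ' ', viaTyped, ' ', viaInline);
end.
```

Compile it with the same settings as the unit above and look for fld/fmulp in the output to see which of the three lines went through the FPU.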
