Thread: Surprise! Why multiplication by inline const may work 3 times slower in 64-bit code

  1. #1

    TL;DR: even when generating x86-64 code, which uses SSE by default, FPC can mix in x87 FPU instructions for the sake of complete, bit-for-bit reproducibility, which murders performance.

    The online Compiler Explorer (https://godbolt.org/) compiles and disassembles this:
    Code:
    {$mode objfpc}
    unit quack;
    {$fputype sse64}
    interface
    type float = single;
    var a, b, c: float;
    procedure testit;
    implementation
    procedure testit;
    begin
      b:= a * float(3.14);
    end;
    procedure testit1;
    begin
      b:= a * 3.14;
    end;
    procedure testit2;
    begin
      b:= a * c;
    end;
    end.
    into this:
    Code:
    testit():
     movss  xmm0,DWORD PTR ds:0x431ef8
     mulss  xmm0,DWORD PTR ds:0x4254b0
     movss  DWORD PTR ds:0x431efc,xmm0
     ret    
     nop    DWORD PTR [rax+0x0]
    testit1():
     fld    DWORD PTR ds:0x431ef8
     fld    TBYTE PTR ds:0x4254c0
     fmulp  st(1),st
     fstp   DWORD PTR ds:0x431efc
     ret    
     nop    DWORD PTR [rax+rax*1+0x0]
    testit2():
     movss  xmm0,DWORD PTR ds:0x431ef8
     mulss  xmm0,DWORD PTR ds:0x431f00
     movss  DWORD PTR ds:0x431efc,xmm0
     ret
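    Note the vintage Fxxx instructions in testit1(): this is what doing floating-point calculations the old, 1980s way looks like. The untyped constant is loaded as an 80-bit TBYTE value, so the whole multiplication goes through the x87 FPU, while the other two procedures stay in SSE (movss/mulss).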
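
    For anyone who wants to poke at this locally, here is a minimal sketch (not part of the test program; the names constdemo, PiTyped, MulCast and MulTyped are made up for illustration). It assumes that a typed constant declared as float is stored as a single, like the variable c in testit2(), so the multiplication should stay in single precision just as with the typecast. That is an assumption, not something verified in this thread; inspecting the generated assembly (e.g. compiling with -al) is the way to confirm it for a given compiler version.

```pascal
{$mode objfpc}
program constdemo;
type
  float = single;
const
  // Typed constant: stored in memory as a single (assumed to be handled
  // like an ordinary float variable by the code generator).
  PiTyped: float = 3.14;

// Literal forced to single by the typecast, as in testit().
function MulCast(a: float): float;
begin
  Result := a * float(3.14);
end;

// Typed constant, no cast needed at the use site.
function MulTyped(a: float): float;
begin
  Result := a * PiTyped;
end;

begin
  if MulCast(2.5) = MulTyped(2.5) then
    WriteLn('typecast and typed constant agree')
  else
    WriteLn('typecast and typed constant differ');
end.
```

    The typed constant keeps the call sites free of casts, which may be easier to maintain when the same value is used in many formulas.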

    My reproducibility test program (see
    https://www.pascalgamedevelopment.co...l=1#post149998
    https://www.pascalgamedevelopment.co...l=1#post149991
    )
    shows the appalling consequences. Each variant stays strictly reproducible (its checksum matches to the bit between the 32-bit and 64-bit builds), but in 64-bit code the multiplication by a constant NOT wrapped in a typecast is a whole 3.007 times slower (0,286 vs 0,86 GFLOPS) than the same formula using a typecast-wrapped constant. Note that the two variants do not produce the same checksum as each other.
    ..checking x * 3.141592653589793 (inline const)
    .................................
    ..ok, in 73 (pure 14,7) seconds (0,286 GFLOPS)
    ..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6

    ..checking x * float(3.141592653589793) (inline const with type-cast)
    .................................
    ..ok, in 64 (pure 4,8 seconds (0,86 GFLOPS)
    ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729

    While 32-bit code has no such penalty:
    ..checking x * 3.141592653589793 (inline const)
    .................................
    ..ok, in 52 (pure 5,42) seconds (0,774 GFLOPS)
    ..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6

    ..checking x * float(3.141592653589793) (inline const with type-cast)
    .................................
    ..ok, in 52 (pure 5,24) seconds (0,8 GFLOPS)
    ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
    No big surprise considering that for 64-bit code the FPU is an unnatural, legacy thing that is allowed to live out of mercy and backward compatibility.
    I am sure there are even more FUN slowdowns when code mixes those methods, forcing the CPU to switch gears and scrub all its XMM registers with chlorine after each such unclean call.
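
    A rough way to see the gap on your own machine, as a sketch only: the program below is not my test program, the names (multime, MulInline, MulCast, TimeIt) and the sizes N and Passes are arbitrary choices for illustration, and GetTickCount64 only has millisecond resolution. It times the same multiplication with and without the typecast over an array.

```pascal
{$mode objfpc}
program multime;
uses
  SysUtils; // GetTickCount64
type
  float = single;
  TProc = procedure;
const
  N = 1 shl 20;   // elements per pass (arbitrary)
  Passes = 200;   // repeats, to get a measurable duration (arbitrary)
var
  src, dst: array[0..N - 1] of float;
  i: longint;

// Untyped literal: the variant that was slow in the results above.
procedure MulInline;
var j: longint;
begin
  for j := 0 to N - 1 do
    dst[j] := src[j] * 3.141592653589793;
end;

// Literal wrapped in a typecast to single.
procedure MulCast;
var j: longint;
begin
  for j := 0 to N - 1 do
    dst[j] := src[j] * float(3.141592653589793);
end;

// Runs p Passes times and returns the elapsed milliseconds.
function TimeIt(p: TProc): QWord;
var
  t0: QWord;
  k: longint;
begin
  t0 := GetTickCount64;
  for k := 1 to Passes do
    p();
  Result := GetTickCount64 - t0;
end;

begin
  for i := 0 to N - 1 do
    src[i] := i * float(0.001);
  WriteLn('inline const:   ', TimeIt(@MulInline), ' ms');
  WriteLn('typecast const: ', TimeIt(@MulCast), ' ms');
end.
```

    Build it once for each target (i386 and x86-64) and compare the two lines; the absolute numbers will depend on the CPU and the optimization settings.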

  2. #2
    My methodology wasn't entirely fair. Why? Because some functions like Frac() run *much* slower on very large numbers, which is exactly the range where they produce meaningless noise.
    I limited some tests to +/- one million (a reasonably expected range in a game) and added another version of the fake quick sin, based on Trunc() instead of Frac(), which runs nearly twice as fast on x86 as my Frac()-based fake sin (but only marginally faster on x86-64).
    I did not test whether it really returns a sine and not some pink polka-dotted cryptid, though.

    Good news: Round() and Trunc() are nearly as fast as a multiplication (0.62 gigaflops vs 0.82 in 32-bit code, 0.74 vs 0.83 in 64-bit code).
    And Frac(), in the -1000000.0..1000000.0 range, is only 4.7 times slower than multiplication in 64-bit code and only 8.9 times slower in 32-bit code.

    P.S. Where are my manners...
    Code:
      function ebd_sin(a: float): float; inline;
      begin
        a:= Frac(a * float(0.318309886183790671537767526745031)); // 1 / 3.141592653589793
        a:= (float(1.0) - a) * a;
        Result:= float (129600.0) * a / (float(40500.0) - a);
      end;
    Code:
      function tricky_sin(a: float): float; inline;
      var b: float;
      begin
        b:= a * float(0.318309886183790671537767526745031);// 1 / 3.141592653589793
        a:= b - Trunc(b);
        a:= (float(1.0) - a) * a;
        Result:= float (129600.0) * a / (float(40500.0) - a);
      end;
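
    A note on checking the formula, since it was not tested above. The constants 129600 (= 4 * 180^2) and 40500 look like Bhaskara I's sine approximation, sin(x°) ≈ 4x(180 - x) / (40500 - x(180 - x)). If that is the intent (an assumption, the post does not say so), then with a = t(1 - t) and t = x/Pi the denominator would be 40500 - 32400*a, which simplifies the whole expression to 4a / (1.25 - a). The posted code subtracts plain a instead, which gives about 0.8 rather than 1 at Pi/2. The sketch below prints both next to Sin() for comparison; bhaskara_sin and sincheck are hypothetical names, and the variant is not the code that produced the checksums in this thread.

```pascal
{$mode objfpc}
program sincheck;
type
  float = single;

// The posted Frac()-based formula, unchanged.
function ebd_sin(a: float): float;
begin
  a := Frac(a * float(0.318309886183790671537767526745031)); // a / Pi
  a := (float(1.0) - a) * a;
  Result := float(129600.0) * a / (float(40500.0) - a);
end;

// Hypothetical variant using Bhaskara's denominator:
// 129600*a / (40500 - 32400*a) reduced to 4*a / (1.25 - a).
function bhaskara_sin(a: float): float;
begin
  a := Frac(a * float(0.318309886183790671537767526745031)); // a / Pi
  a := (float(1.0) - a) * a;
  Result := float(4.0) * a / (float(1.25) - a);
end;

var
  i: integer;
  x: float;
begin
  // Sample 0, Pi/4, Pi/2, 3Pi/4, Pi.
  for i := 0 to 4 do
  begin
    x := i * float(0.25) * float(3.141592653589793);
    WriteLn('x=', x:8:4, '  sin=', Sin(x):8:4,
            '  posted=', ebd_sin(x):8:4,
            '  bhaskara=', bhaskara_sin(x):8:4);
  end;
end.
```

    Changing the formula changes the checksums, so any comparison with the outputs in this thread only makes sense with the posted version.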
    Last edited by Chebmaster; 15-05-2023 at 12:07 PM.

  3. #3
    Are you testing this on only one machine or on multiple different machines? I learned a long time ago that there can be a huge difference in how the same code runs on different CPUs, especially if the code relies on some extended CPU features. Why?
    Most of these features were developed by one of the CPU makers. The company that designed a feature usually has an advantage over the others and manages to get better performance out of it, though not always. In some cases a CPU maker cannot fully integrate such a feature due to licencing and may be forced to enable it on their CPUs in what is sometimes called software mode. This results in much worse performance, but at least code that relies on the feature does not fail to work.

    So you should do your testing on as many different devices as you can before coming to any conclusion as to which code is better.

  4. #4
    Originally posted by SilverWarior:
    Are you testing this on only one machine or on multiple different machines?
    Ka-whoops!
    I was only testing on an i5-2450M.

    Let's try @ Ryzen 7 5800X...
    ..multiplication not wrapped in type-cast (and thus using FPU) is 2.7 times slower in 32-bit code (0.81 vs 2.18 gigaflops) and 4.21 times slower in 64-bit code (0.46 vs 1.94)
    So on Ryzens this hits even harder, affecting 32-bit code as well as 64-bit.

    P.S. You can try it yourself, as I mentioned before
    (note that you need to make sure your browser doesn't rewrite http to https, since I still haven't fixed my server's Let's Encrypt setup and the https version has an invalid certificate)
    pure source http://chentrah.chebmaster.com/downloads/determchk.zip (7Kb)
    with binaries compiled for x86 and x86-64 using both Free Pascal 3.2.2 and Free Pascal 2.6.4 : http://chentrah.chebmaster.com/downl...thbinaries.zip (199Kb)

    Output from http://chentrah.chebmaster.com/downl...ple_output.txt:
    Microsoft Windows [Version 10.0.19044.2965]
    (c) Microsoft Corporation. All rights reserved.

    x:\stuff\determchk>determchk_322_x86.exe

    Determinism checker, built using 3.2.2 for Win32 i386
    (c) 2016, 2023 ChebMaster
    This program calculates md5 checksums over the entire float range
    (4 billion something calculations per formula) to test
    if reproducibility is possible using Free Pascal
    -----------------------------------------
    Init timer...
    Setting hardware timer to 1ms... Ok
    Setting THREAD_PRIORITY_TIME_CRITICAL... Ok
    Measuring TSC frequency... Ok
    Resetting thread priority back to normal... Ok
    Calling timeEndPeriod(1)...Ok
    Ultra-res timer at 4,19 GHz (error of 0,239 nanoseconds)
    -----------------------------------------

    ..checking round(x) (-1 million to +1 million)
    .................................
    ..ok, in 8 (pure 1) seconds (1,21 GFLOPS)
    ..md5 checksum = 71AD5C546C02DCE7A1804554B2ACE0BA

    ..checking trunc(x) (-1 million to +1 million)
    .................................
    ..ok, in 8 (pure 1) seconds (1,21 GFLOPS)
    ..md5 checksum = A5AEE527EC2F8F587A5294C5D9D999A7

    ..checking frac(x) (-1 million to +1 million)
    .................................
    ..ok, in 14 (pure 6,43) seconds (0,188 GFLOPS)
    ..md5 checksum = CA2119DA4E2ECEC02F00B78116120B86

    ..checking sin(x) (0 to Pi)
    .................................
    ..ok, in 42 (pure 35,2) seconds (0,0304 GFLOPS)
    ..md5 checksum = 4DE8EFC27CBB692E5E3DEB7A7E561EAB

    ..checking fake quick sin() (0 to Pi)
    .................................
    ..ok, in 30 (pure 23,9) seconds (0,0447 GFLOPS)
    ..md5 checksum = 78E20BDF40F0D2352EFB0F50427AAFC0

    ..checking tricky fake quick sin() based on Trunc() instead of Frac() (0 to Pi)
    .................................
    ..ok, in 14 (pure 7,7) seconds (0,139 GFLOPS)
    ..md5 checksum = 78E20BDF40F0D2352EFB0F50427AAFC0

    ..checking x * y (two values)
    .................................
    ..ok, in 14 (pure 0,919) seconds (2,32 GFLOPS)
    ..md5 checksum = 3D703727DCD17C3EDCE64B89560A98E9

    ..checking float(x * y) (two values wrapped in type-cast)
    .................................
    ..ok, in 14 (pure 0,916) seconds (2,32 GFLOPS)
    ..md5 checksum = 3D703727DCD17C3EDCE64B89560A98E9

    ..checking x * 3.141592653589793 (inline const)
    .................................
    ..ok, in 31 (pure 5,1 seconds (0,81 GFLOPS)
    ..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6

    ..checking x * float(3.141592653589793) (inline const with type-cast)
    .................................
    ..ok, in 27 (pure 1,92) seconds (2,18 GFLOPS)
    ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729

    ..checking 1/x
    .................................
    ..ok, in 15 (pure 1,61) seconds (1,32 GFLOPS)
    ..md5 checksum = 00144058D1BFF4A090304684F39E6020

    ..checking sqrt(x)
    .................................
    ..ok, in 16 (pure 2,53) seconds (0,844 GFLOPS)
    ..md5 checksum = 10B012DFF8522837F45FBC1DA821B545

    ..checking 1/sqrt(x)
    .................................
    ..ok, in 17 (pure 4,31) seconds (0,494 GFLOPS)
    ..md5 checksum = 7BA70F1439D5E2955151CC565477E924

    ..checking SSE SIMD4 1/sqrt(x)
    .................................
    ..ok, in 14 (pure 1,06) seconds (2 GFLOPS)
    ..md5 checksum = 7BA70F1439D5E2955151CC565477E924

    ..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)
    .................................
    ..ok, in 13 (pure 0,274) seconds (7,78 GFLOPS)
    ..md5 checksum = EF9B294032F7BA3051A1025B06EA3C96


    Press Enter to close.


    x:\stuff\determchk>determchk_322_x86-64.exe

    Determinism checker, built using 3.2.2 for Win64 x86_64
    (c) 2016, 2023 ChebMaster
    This program calculates md5 checksums over the entire float range
    (4 billion something calculations per formula) to test
    if reproducibility is possible using Free Pascal
    -----------------------------------------
    Init timer...
    Setting hardware timer to 1ms... Ok
    Setting THREAD_PRIORITY_TIME_CRITICAL... Ok
    Measuring TSC frequency... Ok
    Resetting thread priority back to normal... Ok
    Calling timeEndPeriod(1)...Ok
    Ultra-res timer at 4,2 GHz (error of 0,238 nanoseconds)
    -----------------------------------------

    ..checking round(x) (-1 million to +1 million)
    .................................
    ..ok, in 10 (pure 0,642) seconds (1,88 GFLOPS)
    ..md5 checksum = 71AD5C546C02DCE7A1804554B2ACE0BA

    ..checking trunc(x) (-1 million to +1 million)
    .................................
    ..ok, in 10 (pure 0,683) seconds (1,77 GFLOPS)
    ..md5 checksum = A5AEE527EC2F8F587A5294C5D9D999A7

    ..checking frac(x) (-1 million to +1 million)
    .................................
    ..ok, in 16 (pure 6,31) seconds (0,191 GFLOPS)
    ..md5 checksum = CA2119DA4E2ECEC02F00B78116120B86

    ..checking sin(x) (0 to Pi)
    .................................
    ..ok, in 17 (pure 8,37) seconds (0,128 GFLOPS)
    ..md5 checksum = 4DE8EFC27CBB692E5E3DEB7A7E561EAB

    ..checking fake quick sin() (0 to Pi)
    .................................
    ..ok, in 11 (pure 2,06) seconds (0,52 GFLOPS)
    ..md5 checksum = 78E20BDF40F0D2352EFB0F50427AAFC0

    ..checking tricky fake quick sin() based on Trunc() instead of Frac() (0 to Pi)
    .................................
    ..ok, in 10 (pure 1,5 seconds (0,677 GFLOPS)
    ..md5 checksum = 78E20BDF40F0D2352EFB0F50427AAFC0

    ..checking x * y (two values)
    .................................
    ..ok, in 18 (pure 1,07) seconds (2 GFLOPS)
    ..md5 checksum = 3D703727DCD17C3EDCE64B89560A98E9

    ..checking float(x * y) (two values wrapped in type-cast)
    .................................
    ..ok, in 18 (pure 1,07) seconds (2 GFLOPS)
    ..md5 checksum = 3D703727DCD17C3EDCE64B89560A98E9

    ..checking x * 3.141592653589793 (inline const)
    .................................
    ..ok, in 43 (pure 8,96) seconds (0,468 GFLOPS)
    ..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6

    ..checking x * float(3.141592653589793) (inline const with type-cast)
    .................................
    ..ok, in 36 (pure 2,16) seconds (1,94 GFLOPS)
    ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729

    ..checking 1/x
    .................................
    ..ok, in 19 (pure 1,7) seconds (1,25 GFLOPS)
    ..md5 checksum = 00144058D1BFF4A090304684F39E6020

    ..checking sqrt(x)
    .................................
    ..ok, in 20 (pure 2,56) seconds (0,833 GFLOPS)
    ..md5 checksum = 10B012DFF8522837F45FBC1DA821B545

    ..checking 1/sqrt(x)
    .................................
    ..ok, in 22 (pure 4,33) seconds (0,492 GFLOPS)
    ..md5 checksum = 7BA70F1439D5E2955151CC565477E924

    ..checking SSE SIMD4 1/sqrt(x)
    Unknown check kind (dck_sse_one_div_sqrt)
    ..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)
    Unknown check kind (dck_sse_rsqrtps)

    Press Enter to close.


    x:\stuff\determchk>
    Note that all checksums should match, on all platforms and CPU models, *except* the SSE SIMD4 RSQRTPS one. That one will have a different checksum on each CPU model, so it is not deterministic and not suitable for physics, but it is very useful for secondary things like animation.

    P.S. Now I'm itching to unearth my Core2 Duo rig and check there as well (only the 32-bit version, alas, because although the CPU itself is 64-bit, the WinXP reigning there is not).
    Last edited by SilverWarior; 20-05-2023 at 12:42 PM. Reason: Fixed second download link
