Google translate, I call upon you to let me bridge the language gap for free!

(from https://freepascal-ru.translate.goog..._x_tr_sch=http )

(my reply to discussion about reproducibility and how to achieve it)

Re: Cheb's Game Engine

Message Cheb » 02.03.2023 15:10:10

The trick is to:

a) use strictly 32-bit floats.

b) wrap *any* constant in the code in a typecast to a float. Any. Anytime and anywhere: a:= b * Single(2.0); Otherwise, Pascal tries to calculate in as wide a format as possible, and does it in a platform-dependent way: doubles, extendeds, black magic ...
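A minimal sketch of rule (b), written as a standalone Free Pascal program. The `float` alias, the function names and the test value are assumptions made for illustration; they are not taken from the engine. It prints the raw bits of both results so they can be compared exactly. Whether the two lines actually differ depends on the compiler version, the target and the constant chosen.

```pascal
program ConstCastDemo;
{$mode objfpc}

type
  float = Single; // assumption: "float" is an alias for 32-bit Single

// Constant explicitly narrowed to 32 bits before the multiplication.
function ScaleCast(b: float): float;
begin
  Result := b * float(0.1);
end;

// Constant left bare: the compiler chooses the evaluation width itself.
function ScalePlain(b: float): float;
begin
  Result := b * 0.1;
end;

var
  x, a, b: float;
begin
  x := 3.0;
  a := ScaleCast(x);
  b := ScalePlain(x);
  // Reinterpret each result as a 32-bit integer to show its exact bits.
  WriteLn('cast:  ', HexStr(PLongInt(@a)^, 8));
  WriteLn('plain: ', HexStr(PLongInt(@b)^, 8));
end.
```

Running the same source on each target and comparing the printed hex values is a quick way to see whether a bare constant changes the result there.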

Added after 3 hours 54 minutes 43 seconds:

PS. I do not take anything for granted, I experiment: I have a built-in tester in the engine that calculates md5 over the results for the entire 32-bit range (4 billion values in total).
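A rough sketch of what such a whole-range check could look like, assuming Free Pascal's bundled `md5` unit. It is not the engine's tester: the block size, the variable names and the decision to skip NaN and infinity inputs are all assumptions of this sketch. It walks every 32-bit pattern, reinterprets it as a float, applies `Sin`, and feeds the results to MD5 block by block. Expect it to run for minutes.

```pascal
program RangeChecksum;
{$mode objfpc}

uses
  md5; // MD5 unit shipped with Free Pascal

type
  float = Single; // assumption: "float" is an alias for 32-bit Single

const
  BlockLen = 2048;              // values hashed per MD5Update call (arbitrary)
  Total: QWord = $100000000;    // 2^32 bit patterns

var
  Ctx: TMD5Context;
  Digest: TMD5Digest;
  Buf: array[0..BlockLen - 1] of float;
  Bits: QWord;
  Pattern, i: LongWord;
begin
  MD5Init(Ctx);
  Bits := 0;
  while Bits < Total do
  begin
    for i := 0 to BlockLen - 1 do
    begin
      Pattern := LongWord(Bits) + i;
      if (Pattern and $7F800000) = $7F800000 then
        Buf[i] := 0 // NaN or infinity: skipped in this sketch
      else
        // Reinterpret the bit pattern as a float and store the rounded result.
        Buf[i] := Sin(PSingle(@Pattern)^);
    end;
    MD5Update(Ctx, Buf, SizeOf(Buf));
    Inc(Bits, BlockLen);
  end;
  MD5Final(Ctx, Digest);
  WriteLn('md5 checksum = ', MD5Print(Digest));
end.
```

Two builds agree bit for bit on every input exactly when they print the same checksum, which makes the single line of output easy to compare across machines.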

Damn, this is exactly when it's inconvenient that the engine does not build at all.

AFAIR, I compared x86, x86-64 and arm on Raspberries - and everywhere the sine matched down to the bit.

Added after 1 minute 16 seconds:

P.P.S. BUT! Back then I built with 2.6.4 for x86-64 and, AFAIR, also with 2.6.4 for arm.

Added after 5 hours 37 minutes 26 seconds:

P.P.P.S. I started a separate test program consisting of a single source file, ripped from the engine - but when it will be ready I really dunno; there is no time at all, with things piling up from all sides.

Cheb

enthusiast

Messages: 985

Registered: 06/06/2005 15:54:34

Re: Cheb's Game Engine

Message Cheb » 04.03.2023 15:44:36

Oh, how many wonderful discoveries we have! :shock: :x :evil:

(note: if you looked at your processor's score in the Intel Burn Test / Lintel and started dreaming - prepare for dashed expectations. On a processor with a limit of 20 gigaflops, a Pascal program will deliver around 0.8. Because there are spherical cows coded in the most exalted AVX by special people - and then there are one-at-a-time calculations with guaranteed bitwise reproducibility)

1. Frac() is a monstrously slow function. Lowest of the low, down at the Sin() level. If you were hoping to make an accelerated fake sine like

Code:
    function ebd_sin(a: float): float; inline;
    begin
      a:= frac(a * float(0.318309886183790671537767526745031)); // 1 / 3.141592653589793
      a:= (float(1.0) - a) * a;
      Result:= float(129600.0) * a / (float(40500.0) - a);
    end;

- forget it, it will wallow in the same ditch with the sine and they will be oinking head to head (sin() 0.04 gigaflops, ebd_sin() 0.05).

Which is 13 times slower than multiplication and one and a half times slower than 1/sqrt(x).
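For anyone who wants to reproduce this kind of comparison, here is a hedged timing sketch. It is not the engine's benchmark: the array size, the pass count, the `TimeLoop` helper and the use of `GetTickCount64` from `SysUtils` are all assumptions. It times the same loop with and without `Frac()`, so the loop overhead and the multiplication can be subtracted out.

```pascal
program FracBench;
{$mode objfpc}

uses
  SysUtils; // GetTickCount64

type
  float = Single; // assumption: "float" is an alias for 32-bit Single

const
  N = 1 shl 24;  // elements per pass (arbitrary)
  Passes = 16;   // repeats, to get a measurable duration

var
  Data: array of float;

// Runs the loop with or without Frac() and returns elapsed milliseconds.
function TimeLoop(UseFrac: Boolean): QWord;
var
  i, pass: LongInt;
  t0: QWord;
begin
  t0 := GetTickCount64;
  for pass := 1 to Passes do
    if UseFrac then
      for i := 0 to N - 1 do
        Data[i] := Frac(i * float(0.37))
    else
      for i := 0 to N - 1 do
        Data[i] := i * float(0.37);
  Result := GetTickCount64 - t0;
end;

var
  i: LongInt;
  check: float;
begin
  SetLength(Data, N);
  WriteLn('multiply only:   ', TimeLoop(False), ' ms');
  WriteLn('multiply + Frac: ', TimeLoop(True), ' ms');
  // Consume the results so the stores are not dead code.
  check := 0;
  for i := 0 to N - 1 do
    check := check + Data[i];
  WriteLn('control sum: ', check:0:3);
end.
```

The difference between the two timings, divided by `N * Passes`, gives a per-call cost for `Frac()` on the machine at hand.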

2. In 64-bit code, some things are much slower, and some things are much faster - but the reproducibility is ideal. Checksums always match those from the 32-bit code. In order to get a mismatch, you need to climb into assembly language and stick your fingers in the electric socket of RSQRTPS (quick and dirty inverse square root). That one - yes, that one will have a different checksum on each CPU model, not just on each compile target.

AFAIR, on the Cortex A7, the checksums were exactly the same - although you would expect otherwise. I can't check right now, all my raspberries and oranges are gathering dust on the shelf. And even more so, I can't check arm 64: I simply don't have one. I bought an orange last year - I even wondered why it was so cheap. It turned out that inside there is the same Cortex A7 in an embrace with Mali 400. That is: Orange Pi PC is a Chinese analogue of Raspberry Pi 2B, not higher. And it is still being sold!

Anyway, on x86-64 (compared to x86):

- Frac() got exactly three times faster, making ebd_sin() outperform Sin() by 3.4 times - because that function slowed down *even more*, to 0.035 gigaflops. Do they have a special competition or wut?

- multiplication by a constant not wrapped in a typecast to float is 2.78 times slower than with a wrapped one. Moreover, the checksums of both variants match their counterparts from the 32-bit code (while differing from each other).

More details (including the test source) - once I fix my server and have somewhere to post it.

Added after 21 hours 10 minutes 8 seconds:

Furthering the topic of speed: SQRTPS + DIVPS with 1.0s preloaded into the registers are *exactly* four times faster than the standard 1/sqrt(x). Obviously, the compiler uses exactly the same instructions - only scalar, not vector. Doing four operations at a time accelerates calculations by exactly four times. I have RCPPS commented out there - obviously, the checksum did not match; bitwise it turned out differently from an honest 1/x through DIVPS.

But just look at RSQRTPS going at it! (four and a half times faster than the reproducible SSE and eighteen times faster than the regular 1/sqrt(x)) - and it becomes obvious that this is not a bad compiler, this is a processor getting lost in thought when you require bitwise conformance to standards.

..checking 1/sqrt(x)
..................................
ok, in 45 (pure 21.2) seconds (0.1 GFLOPS)
..md5 checksum = 7BA70F1439D5E2955151CC565477E924

..checking SSE SIMD4 1/sqrt(x)
.................................
..ok, in 29 (pure 5.31) seconds (0.401 GFLOPS)
..md5 checksum = 7BA70F1439D5E2955151CC565477E924

..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)
.................................
..ok, in 25 (pure 1.18) seconds (1.81 GFLOPS)
..md5 checksum = F881C03FB2C6F5BBDFF57AE5532CFFFD

Let me remind you, this is on a CPU for which Lintel reports 20 gigaflops per core (and 30 for two, because both cores do not fit into the TDP at full tilt, making it effectively a 1.5-core CPU).

Added after 3 minutes 45 seconds:

Code:
    dck_one_div_sqrt: begin
      for m:= 0 to (mm div 8) - 1 do begin
        pointer(pv):= p + m * 8 * sizeof(float);
        pv[0]:= 1/sqrt(pv[0]);
        pv[1]:= 1/sqrt(pv[1]);
        pv[2]:= 1/sqrt(pv[2]);
        pv[3]:= 1/sqrt(pv[3]);
        pv[4]:= 1/sqrt(pv[4]);
        pv[5]:= 1/sqrt(pv[5]);
        pv[6]:= 1/sqrt(pv[6]);
        pv[7]:= 1/sqrt(pv[7]);
      end;
    end;
    {$if defined(cpu386)}
    dck_sse_one_div_sqrt: begin
      for m:= 0 to (mm div 8) - 1 do begin
        pointer(pv):= p + m * 8 * sizeof(float);
        asm
          mov eax, [fourones]
          MOVAPS xmm5, [eax]
          mov eax, [pv]
          MOVAPS xmm6, [eax]
          SQRTPS xmm6, xmm6
          MOVAPS xmm4, xmm5
          DIVPS xmm4, xmm6
          //RCPPS xmm6, xmm6 //Reciprocal Parallel Scalars or, simply speaking, 1.0/x
          MOVAPS xmm7, [eax + 16]
          SQRTPS xmm7, xmm7
          MOVAPS [eax], xmm4
          DIVPS xmm5, xmm7
          //RCPSS xmm7, xmm7
          MOVAPS [eax + 16], xmm5
        end['eax', 'xmm6', 'xmm7', 'xmm4', 'xmm5'];
      end;
    end;
    dck_sse_rsqrtps: begin
      for m:= 0 to (mm div 8) - 1 do begin
        pointer(pv):= p + m * 8 * sizeof(float);
        asm
          mov eax, [pv]
          MOVAPS xmm6, [eax]
          RSQRTPS xmm6, xmm6
          MOVAPS xmm7, [eax + 16]
          RSQRTPS xmm7, xmm7
          MOVAPS [eax], xmm6
          MOVAPS [eax + 16], xmm7
        end['eax', 'xmm6', 'xmm7'];
      end;
    end;
    {$endif}

where mm = 2048 in most cases.

Re: Cheb's Game Engine

Message Cheb » 10.03.2023 22:53:15

Updated the requirements and cleaned unnecessary variability out of the definitions in the code.

Reason: my minimum targets include the Athlon 64 X2 (2005, alas, I don't have one) and the Pentium E2140 (2007, a computer named Gray Goose). Both of these dual-core processors are 64-bit (alas, WinXP has no usable 64-bit version) and support SSE3.

Then what the (insert expletive here) was I doing basing my code on SSE2 instead of SSE3?

From now on, all code for x86 and x86-64, including any inline assembly, assumes that SSE3 is guaranteed to be available.

I am not going to consider SSE4 and higher, because if the E2140 with its two 1.6 GHz cores has enough horsepower, then any modern CPU would fly into orbit and there is simply no point in wearing myself out over this. My good intentions towards AVX/AVX512 will likely remain intentions.

That's it, all done.

Further, for Linux SBCs my minimum is the Cortex A7. It has VFPv4-16, and I declare the same in my code as the only supported option - if I ever get to writing assembler for arm.

And that's that.

TL;DR: Free Pascal is optimized for *reproducibility*: bitwise matching results on all platforms. It seems to sacrifice a lot of performance to reach that goal.
