Thread: Cheb's project will be here.

  1. #31
    Google translate, I call upon you to let me bridge the language gap for free!
    (from https://freepascal-ru.translate.goog..._x_tr_sch=http )

    (my reply to discussion about reproducibility and how to achieve it)

    Re: Cheb's Game Engine

    Message Cheb » 02.03.2023 15:10:10
    The trick is to:
    a) use strictly 32-bit floats.
    b) wrap *any* constant in the code in a typecast to a float. Any. Anytime and anywhere. a:= b * Single(2.0); Otherwise, Pascal tries to calculate in as wide a format as possible and does it in a platform-dependent way: doubles, extendeds, black magic ...
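
    A minimal sketch of the two spellings contrasted in (b), assuming Free Pascal in objfpc mode. The names ScaleWrapped, ScaleBare and BitsOf are hypothetical and exist only for this illustration. Whether the two printed bit patterns actually differ depends on the compiler version, the target and the FPU settings, so no particular output is claimed here.

```pascal
program ConstCastSketch;
{$mode objfpc}

// Constant wrapped in a typecast, as recommended in (b).
function ScaleWrapped(b: Single): Single;
begin
  Result := b * Single(0.1);
end;

// Bare constant: the compiler is free to pick a wider intermediate type.
function ScaleBare(b: Single): Single;
begin
  Result := b * 0.1;
end;

// Raw bit pattern of a Single, so results can be compared bit for bit.
function BitsOf(x: Single): LongWord;
begin
  Move(x, Result, SizeOf(Result));
end;

begin
  WriteLn('wrapped: ', HexStr(Int64(BitsOf(ScaleWrapped(3.0))), 8));
  WriteLn('bare:    ', HexStr(Int64(BitsOf(ScaleBare(3.0))), 8));
end.
```

    Comparing the raw bits rather than the printed decimal values is the point: two results can print identically and still differ in the last bit.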

    Added after 3 hours 54 minutes 43 seconds:
    PS. I do not take anything for granted, I experiment: I have a built-in tester in the engine that calculates md5 over the entire 32-bit range (4 billion values in total).
    Damn, this is when it's inconvenient that the engine currently doesn't build at all.
    AFAIR, I compared x86, x86-64 and ARM on Raspberry Pis, and everywhere the sine matched down to the bit.

    Added after 1 minute 16 seconds:
    P.P.S. BUT! Back then I built with 2.6.4 for x86-64 and, AFAIR, 2.6.4 for ARM as well.

    Added after 5 hours 37 minutes 26 seconds:
    P.P.P.S. I started a separate test program consisting of a single source file, ripped from the engine, but when it will be ready I really dunno: there is no time at all, with lots of things coming from all sides.
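
    The kind of tester described in the PS can be sketched roughly as follows. This is not the engine's code: it assumes the md5 unit shipped with Free Pascal, and ChecksumRange, InvSqrt and TFloatFunc are hypothetical names. It only walks a small slice of positive bit patterns, because feeding negative values or NaNs to Sqrt would need floating-point exceptions to be masked first.

```pascal
program DetermCheckSketch;
{$mode objfpc}
uses md5;

type
  TFloatFunc = function(x: Single): Single;

function InvSqrt(x: Single): Single;
begin
  Result := Single(1.0) / Sqrt(x);
end;

// Feeds Count consecutive bit patterns, starting at FirstBits, through f
// and returns the md5 of all results as a hex string.
function ChecksumRange(f: TFloatFunc; FirstBits, Count: LongWord): string;
const
  ChunkLen = 2048;
var
  ctx: TMD5Context;
  digest: TMD5Digest;
  buf: array[0..ChunkLen - 1] of Single;
  bits, done, n, i: LongWord;
begin
  MD5Init(ctx);
  bits := FirstBits;
  done := 0;
  while done < Count do
  begin
    n := Count - done;
    if n > ChunkLen then
      n := ChunkLen;
    for i := 0 to n - 1 do
    begin
      Move(bits, buf[i], SizeOf(Single)); // reinterpret the integer as a float
      buf[i] := f(buf[i]);
      Inc(bits);
    end;
    MD5Update(ctx, buf, n * SizeOf(Single));
    Inc(done, n);
  end;
  MD5Final(ctx, digest);
  Result := MD5Print(digest);
end;

begin
  // $3F800000 is the bit pattern of 1.0; 2^20 patterns keep the run short.
  WriteLn(ChecksumRange(@InvSqrt, $3F800000, 1 shl 20));
end.
```

    Running the same binary logic on two platforms and comparing the printed strings is the whole test; covering the full 32-bit range is a matter of widening the range and handling the exceptional inputs.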

    Cheb
    enthusiast

    Messages: 985
    Registered: 06/06/2005 15:54:34

    Re: Cheb's Game Engine

    Message Cheb » 04.03.2023 15:44:36
    Oh, how many wonderful discoveries we have! :shock: :x :evil:

    (note: if you looked at the indicator of your processor in the Intel Burn Test / Lintel and dreamed - prepare for dashed expectations. On a processor with a limit of 20 gigaflops, the Pascal program will give out around 0.8. Because there are spherical cows coded in the most exalted AVX by special people - and then there are one-at-a-time calculations with guaranteed bitwise reproducibility)

    1. Frac() is a monstrously slow function. Lowest of the low, down at the Sin() level. If you were hoping to make an accelerated fake sine like

    Code:
          function ebd_sin(a: float): float; inline;
          begin
            a:= frac(a * float(0.318309886183790671537767526745031)); // 1 / 3.141592653589793
            a:= (float(1.0) - a) * a;
            Result:= float(129600.0) * a / (float(40500.0) - a);
          end;
    - forget it, it will wallow in the same ditch with the sine and they will be oinking head to head (sin() 0.04 gigaflops, ebd_sin() 0.05).
    Which is 13 times slower than multiplication and one and a half times slower than 1/sqrt(x).

    2. In 64-bit code, some things are much slower, and some things are much faster, but the reproducibility is ideal. Checksums always match those from the 32-bit code. In order to get a mismatch, you need to climb into assembly language and stick your fingers in the electric socket of RSQRTPS (quick and dirty inverse square root). That one, yes, that one will have a different checksum on each CPU model, not just each compile target.

    AFAIR, on the Cortex A7 the checksums were exactly the same, although you'd think otherwise. I can't check right now, all my raspberries and oranges are gathering dust on the shelf. And even more so, I can't check ARM 64: I simply don't have one. I bought an orange last year and even wondered why it was so cheap. It turned out that inside there is the same Cortex A7 in an embrace with a Mali 400. That is: the Orange Pi PC is a Chinese analogue of the Raspberry Pi 2B, nothing higher. And it's still being sold!

    Anyway, on x86-64 (compared to x86):
    - Frac() got exactly three times faster, making ebd_sin() outperform Sin() by 3.4 times, because that function slowed down *even more*, to 0.035 gigaflops. Do they have a special competition or wut?
    - multiplication by a constant not wrapped in a typecast to float is 2.78 times slower than by a wrapped one. Moreover, the checksums of both variants match their counterparts from the 32-bit code (and they differ from each other).

    More details (including the test source) will follow once I fix my server and have somewhere to post them.

    Added after 21 hours 10 minutes 8 seconds:
    Furthering the topic of speed: SQRTPS + DIVPS with 1.0s preloaded into the registers are *exactly* four times faster than the standard 1/ sqrt(x). Obviously, the compiler uses exactly the same instructions - only scalar, not vector. Doing four operations at a time accelerates calculations by exactly four times. I have RCPPS commented out there - obviously, the checksum did not match, bitwise it turned out differently than honest 1 / x through DIVPS.

    But just look at RSQRTPS going at it! (four and a half times faster than the reproducible sse and eighteen times faster than the regular 1/ sqrt (x)) - and it becomes obvious that this is not a bad compiler, this is a processor getting lost in thought when you require bitwise conformance to standards.

    ..checking 1/sqrt(x)
    ..................................
    ok, in 45 (pure 21.2) seconds (0.1 GFLOPS)
    ..md5 checksum = 7BA70F1439D5E2955151CC565477E924

    ..checking SSE SIMD4 1/sqrt(x)
    .................................
    ..ok, in 29 (pure 5.31) seconds (0.401 GFLOPS)
    ..md5 checksum = 7BA70F1439D5E2955151CC565477E924

    ..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)
    .................................
    ..ok, in 25 (pure 1.18) seconds (1.81 GFLOPS)
    ..md5 checksum = F881C03FB2C6F5BBDFF57AE5532CFFFD


    Let me remind you, this is on a CPU for which Lintel reports 20 gigaflops per core (and 30 for two, because both do not fit into TDP at full tilt making effectively a 1.5 core CPU).

    Added after 3 minutes 45 seconds:

    Code:
                  dck_one_div_sqrt: begin
                    for m:= 0 to (mm div 8) - 1  do begin
                      pointer(pv):= p + m * 8 * sizeof(float);
                      pv[0]:= 1/sqrt(pv[0]);
                      pv[1]:= 1/sqrt(pv[1]);
                      pv[2]:= 1/sqrt(pv[2]);
                      pv[3]:= 1/sqrt(pv[3]);
                      pv[4]:= 1/sqrt(pv[4]);
                      pv[5]:= 1/sqrt(pv[5]);
                      pv[6]:= 1/sqrt(pv[6]);
                      pv[7]:= 1/sqrt(pv[7]);
                    end;
                  end;
                {$if defined(cpu386)}
                  dck_sse_one_div_sqrt: begin
                    for m:= 0 to (mm div 8) - 1  do begin
                      pointer(pv):= p + m * 8 * sizeof(float);
                      asm
                        mov eax, [fourones]
                        MOVAPS xmm5, [eax]
                        mov eax, [pv]
                        MOVAPS xmm6, [eax]
                        SQRTPS xmm6, xmm6
                        MOVAPS xmm4, xmm5
                        DIVPS xmm4, xmm6 // RCPPS xmm6, xmm6 // packed reciprocal, simply speaking 1.0/x, but approximate
                        MOVAPS xmm7, [eax + 16]
                        SQRTPS xmm7, xmm7
                        MOVAPS [eax], xmm4
                        DIVPS xmm5, xmm7 // RCPPS xmm7, xmm7
                        MOVAPS [eax + 16], xmm5
                      end['eax', 'xmm6', 'xmm7', 'xmm4', 'xmm5'];
                    end;
                  end;
                  dck_sse_rsqrtps: begin
                    for m:= 0 to (mm div 8) - 1  do begin
                      pointer(pv):= p + m * 8 * sizeof(float);
                      asm
                        mov eax, [pv]
                        MOVAPS xmm6, [eax]
                        RSQRTPS xmm6, xmm6
                        MOVAPS xmm7, [eax + 16]
                        RSQRTPS xmm7, xmm7
                        MOVAPS [eax], xmm6
                        MOVAPS [eax + 16], xmm7
                      end['eax', 'xmm6', 'xmm7'];
                    end;
                  end;
                {$endif}

    , where mm in most cases = 2048

    Re: Cheb's Game Engine

    Message Cheb » 10.03.2023 22:53:15
    Updated requirements, cleaned definitions in the code from unnecessary variability

    Reason: my minimums include Athlon 64 X2 (2005, alas, I don't have it) and Pentium E2140 (2007, computer named Gray Goose). Both of these dual-core processors are 64-bit (alas, WinXP has no usable 64-bit version) and support SSE3.
    Then what the (insert expletive here) was I doing basing my code on SSE2 instead of SSE3?
    From now on, any code for x86 and x86-64, in any assembler inserts, assumes that SSE3's availability is guaranteed.

    I am not going to consider SSE4 and higher, because if the E2140 with its two 1.6 GHz cores has enough horsepower, then any modern CPU would fly into orbit and there is simply no point in working myself hard over this. My good intentions towards AVX/AVX512 will likely remain intentions.
    That's it, all done.

    Further, for Linux SBCs my minimum is the Cortex A7. It has VFPv4-16, and I declare the same in my code as the only supported option, if I ever get to assembler under ARM.
    That's it, we've arrived.

    TL;DR: Free Pascal is optimized for *reproducibility*, bitwise matching results on all platforms. It seems it sacrifices lots of performance to reach that goal.
    Last edited by Chebmaster; 24-04-2023 at 09:42 PM.

  2. #32
    Thanks for sharing. Although I didn't grasp the details, despite Google's effort to bridge the gap, I think the conclusion is reasonable.

  3. #33
    Argh. [headdesk] Argh.
    Corrected Google's translation by hand. In so many places it raises the question: why even bother using it? It's much better than 10 years ago, but there are still so many things it fails to understand and convey.
    Who am I kidding. Correcting is so much easier than translating 100% by myself.

    Reproducibility is important for me, since my multiplayer model will be an evolved lockstep:
    Code:
    TLayerRole = (
        lro_Bottom, {
          In multiplayer, runs at -500ms using perfect inputs finalized by the server.
          This is also the only layer that can be serialized. }
        lro_DeepUpwell, {
          Propagates changes from the bottom to the thermocline, thus lazily correcting for late inputs }
        lro_Thermocline, {
          Holds steady at -150ms, assuming most inputs arrive *above* it }
        lro_FastSurfacing, {
          Bubbles the changes from the thermocline to the surface thus doing the bulk of lag compensation }
        lro_PresentSurface {
          Runs on local player inputs }
      );
    As always, lots of stuff distracting me (mainly work at work) leaving me no time to move the project further. Frustrating.

  4. #34
    Quote Originally Posted by Jonax
    Although I didn't grasp the details, despite Google's effort to bridge the gap
    Don't be hard on yourself if you don't understand all the details. Google Translate seems to have done quite a good job. But the topic that Chebmaster is talking about is very complex.

    He is talking about hardware-level optimization and making use of extended CPU features to accelerate specific processing. This is very complex stuff, especially if you take into account that some of these extended features might be vendor-specific (proprietary to Intel or AMD). This means that if you want to use some Intel proprietary feature on an AMD CPU or vice versa, the feature might not be directly supported by that CPU, so a fallback method, which is usually slower, is used to at least get the desired results. Otherwise such code would simply break.

    Another important thing is to make sure that you feed the CPU data in the format required by the specific extended feature. Failing to do so could also make the CPU resort to some fallback method, hurting performance.

  5. #35
    Quote Originally Posted by SilverWarior
    Don't be hard on yourself if you don't understand all the details. Google translate seems to have done quite a good job. But the topic that Chebmaster is talking about is very complex..

    Yeah, my problem is not the quality of the translation, which I can't comment on other than that the sentences seem to have good spelling and structure.


    It's good to see some high-tech activity in the Pascal game-making field. I, on the other hand, am still trying to familiarize myself with the basics. There are still a lot of unexplored possibilities for me in the world of standard 2D Pascal components. Though I admit the audience potentially interested in my stuff is pretty limited.

  6. #36
    Quote Originally Posted by Jonax
    Though I admit the audience potentially interested in my stuff is pretty limited.
    Well, the main reason why not many people are interested in your games is that similar games can be found all over the internet in web format. So many people may think: why would I go and download his game if I can find the same or a similar game on one of those online-games web pages?

    But don't put too much thought into this. We all have to start somewhere. At least you are finishing and publishing some games.
    I, on the other hand, have probably been learning game development for far longer (over 15 years now), but since I'm always aiming for ideas that are too big, I still haven't published any game so far. It is not that I lack ideas or knowledge. I have too many ideas, but still not enough knowledge to turn one of my big ideas into reality.

  7. #37
    Indeed an interesting discussion. It's quite a challenge to reach and please an audience. However, I'm afraid we're close to hijacking Chebmaster's project thread. Sorry, Cheb.



    How about starting a new thread somewhere with some general how-to-become-a-successful-game-creator theme? Maybe the last few posts could be a good starting point.

  8. #38
    I, definitely, want to push Free Pascal to its limits and achieve the impossible.

    Here's the determinism check as a standalone project
    (note: you need to make sure your browser doesn't correct http into https, since I still haven't fixed my server's Let's Encrypt setup and the https has an invalid certificate)
    pure source http://chentrah.chebmaster.com/downloads/determchk.zip (7Kb)
    with binaries compiled for x86 and x86-64 using both Free Pascal 3.2.2 and Free Pascal 2.6.4 : http://chentrah.chebmaster.com/downl...thbinaries.zip (199Kb)

    As you can see, the lion's share of processing time goes to calculating those md5 sums.

    A reminder: determinism is required for my planned multiplayer code to work at all. If the checksums do not match between platforms, those platforms wouldn't be able to play together and you'd need a separate server for each of them.

    My friend, who works in the game industry full time, had to deal with the lack of determinism in Unity. Namely, you cannot count on monsters behaving identically when presented with identical player actions. He had to improvise, adding a distributed server of sorts where each of the clients in a multiplayer game acted as a server for a fraction of the monsters and just broadcast the behavior of those monsters to all other clients.

    Full determinism, on the other hand, allows sending *only* the player inputs over the network. This is MMO-grade stuff: no matter how many monsters there are (even a million) or how massive the changes to the game world (I want the ability to reduce the whole map to a huge crater), the network traffic would remain zilch.
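
    The idea of sending only inputs can be illustrated with a toy sketch. It is not the planned multiplayer code: TWorld, Step, RunPeer and the monster rule are hypothetical, and both "peers" run in the same process here. Each peer replays the same short list of player moves and ends up with a byte-identical world, even though the world state is far larger than the inputs.

```pascal
program LockstepSketch;
{$mode objfpc}

type
  TWorld = record
    PlayerX: Single;
    MonsterX: array[0..999] of Single;
  end;

procedure InitWorld(out w: TWorld);
var
  i: Integer;
begin
  w.PlayerX := Single(0.0);
  for i := 0 to High(w.MonsterX) do
    w.MonsterX[i] := i;
end;

// One simulation tick: the only external data is the player's move.
procedure Step(var w: TWorld; MoveX: Single);
var
  i: Integer;
begin
  w.PlayerX := w.PlayerX + MoveX * Single(0.5);
  for i := 0 to High(w.MonsterX) do
    w.MonsterX[i] := w.MonsterX[i] + (w.PlayerX - w.MonsterX[i]) * Single(0.125);
end;

// A peer rebuilds the whole world from the input log alone.
function RunPeer(const Moves: array of Single): TWorld;
var
  t: Integer;
begin
  InitWorld(Result);
  for t := 0 to High(Moves) do
    Step(Result, Moves[t]);
end;

function SameWorld(const a, b: TWorld): Boolean;
begin
  Result := CompareByte(a, b, SizeOf(TWorld)) = 0;
end;

const
  Moves: array[0..3] of Single = (1.0, -0.5, 2.0, 0.25);
var
  serverSide, clientSide: TWorld;
begin
  serverSide := RunPeer(Moves);
  clientSide := RunPeer(Moves);
  WriteLn('input bytes: ', SizeOf(Moves), ', state bytes: ', SizeOf(TWorld));
  WriteLn('states match: ', SameWorld(serverSide, clientSide));
end.
```

    The match is only guaranteed across machines if Step produces bit-identical floats everywhere, which is exactly what the checksum tests are meant to establish.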

  9. #39
    Have you perhaps considered using some other hashing algorithm instead of MD5? The CRC32 algorithm is way faster, but might result in more collisions, where different inputs produce the same hash. On the other hand, many modern CPUs have hardware support for SHA-based hashing algorithms, which could make them much faster than MD5, which, if my memory serves me correctly, is rarely hardware accelerated.

    Anyway, there is a good thread on Stack Overflow comparing various hashing algorithms: https://stackoverflow.com/questions/...st-performance
    Granted, the question poster was interested in the performance difference in a .NET environment, but some of the people who answered did their own testing in other programming languages, even Delphi.
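
    For comparison, a small sketch of hashing the same buffer with both algorithms, assuming the md5 and crc units from Free Pascal's hash package. Crc32Of is a hypothetical helper and the buffer contents are arbitrary. It makes no timing claim; it only shows that swapping the checksum is a local change.

```pascal
program HashCompareSketch;
{$mode objfpc}
uses md5, crc;

// CRC32 of an arbitrary buffer, seeded the way the crc unit expects.
function Crc32Of(var Buf; Len: Cardinal): Cardinal;
begin
  Result := crc32(0, nil, 0);
  Result := crc32(Result, PByte(@Buf), Len);
end;

var
  data: array[0..2047] of Single;
  i: Integer;
begin
  for i := 0 to High(data) do
    data[i] := Single(1.0) / Sqrt(i + 1);
  WriteLn('md5:   ', MD5Print(MD5Buffer(data, SizeOf(data))));
  WriteLn('crc32: ', HexStr(Int64(Crc32Of(data, SizeOf(data))), 8));
end.
```

    The trade-off is the one described above: a 32-bit result is cheaper to compute but far more likely to hide a mismatch than a 128-bit one.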

  10. #40
    I just grabbed the one that was easiest to slap on and had a reasonably sized hash, since this code is not going to be part of normal execution but will only be used for research during development (or, maybe, as an optional "check your CPU for compatibility" feature).
