Thread: Cheb's project will be here.

  1. #1
    That broom is quite enlightening.
    Yes, using long rifles and swinging zweihanders in tight passages... That's arcade, not sim. Who *ever* implements the inability to turn around because your shillelagh is longer than the corridor is wide...
    AFAIR only Tribes: Vengeance even had a mechanic that visibly moved your gun back if you faced a wall (and also called their rocket launcher "spinfusor" which is seriously badass).

    At the very least, firearms could be balanced along a movement-vs-accuracy axis. If you are on the move or change your aim rapidly, you get atrocious random spread (which shotguns and SMGs partly negate by having their own spread). If you want an accurate shot, you switch to an aiming stance, either by stopping and reducing your mouse movements, or by pressing a dedicated button that hampers your movement and zooms.
    If a game doesn't have that, it's an arcade and should take a long, hard look at the Quake series.
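
    (A minimal sketch of that movement-vs-accuracy idea; the function, the names and the numbers are purely illustrative, not taken from any existing engine:)
    Code:
      type
        TAimStance = (as_Hip, as_Aimed);

      // Spread cone (in degrees) grows with movement speed and rapid mouse turns,
      // and tightens sharply when the player switches to the aiming stance.
      function WeaponSpread(BaseSpread, MoveSpeed, TurnRate: Single;
                            Stance: TAimStance): Single;
      begin
        Result := BaseSpread + 2.0 * MoveSpeed + 0.5 * TurnRate;
        if Stance = as_Aimed then
          Result := Result * 0.25;
      end;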

    Hmm... Maybe I should revisit my concept: not forcing a movement penalty while a spell is being charged, but inflicting a large random sway instead (and hiding the crosshair). Then "charging your enemy while charging your shot" becomes a valid strategy. Also, directing homing projectiles while sprinting (the controllable fireball from Dark Messiah of Might and Magic is my shining ideal).

    Loadout opportunities arise.
    If the spell that serves the role of a shotgun (120% damage total, scattered in a wide cone) could be pre-charged to fire instantly on release while sprinting, and its alt-fire works like the Q3 nailgun (long-range, with very little spatial spread but a large velocity spread), penalized with a loud sound and standing still...
    If the spell that mimics Q3 plasma, at the same time, has a sizable firing delay and no way to pre-charge it because its alt fire consists of controllable single shots for long-range harassing instead...
    That gives depth to the rock-paper-scissors interplay between those two.

    Pair with more class-specific spells, like a controllable fireball with a hefty mana cost, and you get seriously fun gameplay with very few actual "weapons".

  2. #2
    Quote Originally Posted by Chebmaster View Post
    AFAIR only Tribes: Vengeance even had a mechanic that visibly moved your gun back if you faced a wall
    Actually there are several games that have this mechanic. If my memory serves me correctly, both Crysis and Far Cry 3 have it.

  3. #3
    Actually I never play that type of game anymore. The last game where I was running around shooting uglies was the great adventure game 'Legacy' from, I think, 1993. It ran well on a brave 1 MB of RAM, a 386 processor and a VGA monitor. In fact I've still got the game running in DOSBox. Though I don't think there were any sniper rifles in it.

    Point is, as I often say, I'm happy to see activity in the Pascal crowd, but I can't really comment much on the current topic: the sniper rifle and its properties.

  4. #4
    Still *deep* in overhauling the very foundations.

    Who would have thought that browsing Wikipedia about supercontinental cycles could give you ideas!

    My former Logic, bloated to unsustainability and stifled by being the root managed object of the graph, split apart like Pangaea -- and things are becoming so, so much simpler!
    Each of the resulting entities is quite manageable; I am in the process of stuffing them full of methods scavenged from my old TAbstractLogic and organizing their interactions.
    Also, the root managed object of the graph that goes into a sav file is now a transient thing, created just before serialization and disposed of after deserialization, thus decoupling the save file structure from the actual data structure.
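
    (Roughly, the transient-root idea in code; all names here are made up for illustration, not the engine's actual classes:)
    Code:
      uses Classes;

      type
        // Throw-away root that exists only for the duration of (de)serialization,
        // so the sav layout no longer dictates how the live object graph is organized.
        TSaveRoot = class
          World:   TObject;   // references to the live entities are gathered here
          Players: TObject;
        end;

      procedure SaveGame(Stream: TStream);
      var
        Root: TSaveRoot;
      begin
        Root := TSaveRoot.Create;          // created just before serialization...
        try
          // Root.World := ...;            // gather references to the live entities
          // SerializeGraph(Root, Stream); // engine-specific, not shown here
        finally
          Root.Free;                       // ...and disposed of right afterwards
        end;
      end;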

    I would never get anywhere with layered lag compensation had I not made this split.

    It will also help me nicely to separate the GUI (a local client entity, not existing on a dedicated server) from the game world.
    I am positive I could present a lag-compensated multi-player rotating cube this autumn.

    About first person shooters: with the exception of occasional delves into Brutal Doom, I prefer team multiplayer games of the run-and-gun variety, namely Jagex's Ace of Spades (before it went down) and TF2. Unlike the mindless NPC slaying of single-player shooters, those are tactical struggles against fellow humans, your equals in cunning, working with your team to achieve set goals (usually capturing/holding control points, capture the flag, or defending against the other team dragging a bomb towards your base).

    When I finally release my design document for my planned game, you will see it's basically an AoS clone with ideas borrowed from TF2 and some of my own.
    When I initially laid foundations for my engine, I wanted to make a 4X game -- maybe that, too, in time. Too ambitious, just like me struggling for years trying to one-up Unreal Engine instead of making a game.

    TL;DR: snipers are the antithesis of run-and-gun. Like in OpenArena: you're having a fun rocket duel, then some killjoy with a railgun shows up. Not on my watch. All my planned weapons are projectile-based.
    Last edited by Chebmaster; 22-03-2023 at 10:37 AM.

  5. #5
    Google translate, I call upon you to let me bridge the language gap for free!
    (from https://freepascal-ru.translate.goog..._x_tr_sch=http )

    (my reply to discussion about reproducibility and how to achieve it)

    Re: Cheb's Game Engine

    Message Cheb » 02.03.2023 15:10:10
    The trick is to:
    a) use strictly 32-bit floats;
    b) wrap *any* constant in the code in a typecast to a float. Any. Anytime and anywhere. a := b * Single(2.0); Otherwise, Pascal tries to calculate in as wide a format as possible and does it in a platform-dependent way: doubles, extendeds, black magic...
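
    (A tiny self-contained illustration of that rule - a made-up example, not engine code:)
    Code:
      program typecast_demo;
      var
        a, b: Single;
      begin
        b := 0.1;
        a := b * 2.0;          // bare constant: the expression may be evaluated in a
                               // wider, platform-dependent type (Double/Extended) first
        a := b * Single(2.0);  // wrapped constant: the whole expression stays 32-bit,
                               // giving the same bits on every target
        writeln(a);
      end.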

    Added after 3 hours 54 minutes 43 seconds:
    PS. I do not take anything for granted, I experiment; I have a built-in tester in the engine that computes an MD5 checksum over the entire 32-bit range (4 billion values in total).
    Damn, this is when it's inconvenient that the engine currently won't build at all.
    AFAIR, I compared x86, x86-64 and ARM on my raspberries - and the sine matched to the bit everywhere.
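
    (A stripped-down sketch of what such a tester could look like - a guess at its shape using FPC's md5 unit, not the engine's actual code:)
    Code:
      program fullrangesweep;  // hypothetical name: walk all 2^32 bit patterns
      uses math, md5;
      var
        ctx: TMD5Context;
        digest: TMD5Digest;
        i: qword;
        x: single;
        xi: dword absolute x;  // same memory as x, lets us set the raw bit pattern
        y: single;
      begin
        // the full range includes NaN/Inf patterns, so mask FPU exceptions
        SetExceptionMask([exInvalidOp, exDenormalized, exZeroDivide,
                          exOverflow, exUnderflow, exPrecision]);
        MD5Init(ctx);
        for i := 0 to high(dword) do begin
          xi := dword(i);                // reinterpret the bit pattern as a float
          y := sin(x);                   // or ebd_sin(x), 1/sqrt(x), etc.
          MD5Update(ctx, y, sizeof(y));  // checksum the results, not the inputs
        end;
        MD5Final(ctx, digest);
        writeln(MD5Print(digest));
      end.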

    Added after 1 minute 16 seconds:
    P.P.S. BUT! Back then I was building with 2.6.4 for x86-64 and, AFAIR, 2.6.4 for ARM as well.

    Added after 5 hours 37 minutes 26 seconds:
    P.P.P.S. I started a separate test program consisting of a single source file ripped from the engine - but when it will be ready I really don't know; there is no time at all, things piling on from all sides.

    Re: Cheb's Game Engine

    Message Cheb » 04.03.2023 15:44:36
    Oh, how many wonderful discoveries we have! :shock: :x :evil:

    (note: if you ever looked at your processor's rating in Intel Burn Test / Lintel and dreamed - prepare for dashed expectations. On a processor rated at 20 gigaflops, the Pascal program will give out around 0.8. Because those ratings are spherical cows coded in the most exalted AVX by dedicated people - while this is one-at-a-time calculations with guaranteed bitwise reproducibility.)

    1. Frac() is a monstrously slow function. A slowpoke of slowpokes, down at the Sin() level. If you were hoping to make an accelerated fake sine like

    Code:
          // Bhaskara I's sine approximation, sin(x deg) ~ 4x(180-x) / (40500 - x(180-x)),
          // with a taken as the fraction of a half-turn:
          function ebd_sin(a: float): float; inline;
          begin
            a:= frac(a * float(0.318309886183790671537767526745031)); // 1 / 3.141592653589793
            a:= (float(1.0) - a) * a;
            Result:= float(129600.0) * a / (float(40500.0) - float(32400.0) * a);
          end;
    - forget it; it will wallow in the same ditch as the sine, the two of them oinking head to head (sin() 0.04 gigaflops, ebd_sin() 0.05).
    Which is 13 times slower than multiplication and one and a half times slower than 1/sqrt(x).

    2. In 64-bit code some things are much slower and some things are much faster - but the reproducibility is ideal. The checksums always match those from the 32-bit code. To get a mismatch, you need to climb into assembly language and stick your fingers into the electric socket of RSQRTPS (the quick and dirty inverse square root). That one - yes, that one will have a different checksum on each CPU model, not just each compile target.

    AFAIR, on the Cortex A7 the checksums were exactly the same - surprising as that may seem. I can't check right now, all my raspberries and oranges are gathering dust on the shelf. And I certainly can't check 64-bit ARM: I simply don't have any. I bought an orange last year - I was even wondering why it was so cheap. It turned out that inside it is the same Cortex A7 in an embrace with a Mali 400. That is, the Orange Pi PC is a Chinese analogue of the Raspberry Pi 2B, no higher. And it's still being sold!

    Anyway, on x86-64 (compared to x86):
    - Frac() got exactly three times faster, making ebd_sin() outperform Sin() by 3.4 times - because that function got *even slower*, down to 0.035 gigaflops. Do they hold a special competition or what?
    - multiplication by a constant not wrapped in a typecast to float slowed down by 2.78 times compared to the wrapped one. Moreover, the checksums of both variants match their counterparts from the 32-bit code (while differing from each other).

    More details (including the test source) will come when I fix my server and there is somewhere to post them.

    Added after 21 hours 10 minutes 8 seconds:
    Furthering the topic of speed: SQRTPS + DIVPS, with 1.0s preloaded into the registers, is *exactly* four times faster than the standard 1/sqrt(x). Obviously, the compiler uses exactly the same instructions - only scalar, not vector - and doing four operations at a time accelerates the calculation by exactly four times. I have RCPPS commented out there - its checksum did not match, of course; bitwise it comes out different from an honest 1/x through DIVPS.

    But just look at RSQRTPS go! (four and a half times faster than the reproducible SSE and eighteen times faster than the regular 1/sqrt(x)) - and it becomes obvious that this is not a bad compiler; this is the processor getting lost in thought when you require bitwise conformance to standards.

    ..checking 1/sqrt(x)
    ..................................
    ..ok, in 45 (pure 21.2) seconds (0.1 GFLOPS)
    ..md5 checksum = 7BA70F1439D5E2955151CC565477E924

    ..checking SSE SIMD4 1/sqrt(x)
    ..................................
    ..ok, in 29 (pure 5.31) seconds (0.401 GFLOPS)
    ..md5 checksum = 7BA70F1439D5E2955151CC565477E924

    ..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)
    ..................................
    ..ok, in 25 (pure 1.18) seconds (1.81 GFLOPS)
    ..md5 checksum = F881C03FB2C6F5BBDFF57AE5532CFFFD


    Let me remind you, this is on a CPU for which Lintel reports 20 gigaflops per core (and 30 for two, because both cores don't fit into the TDP at full tilt, making it effectively a 1.5-core CPU).

    Added after 3 minutes 45 seconds:

    Code:
                  dck_one_div_sqrt: begin
                    for m:= 0 to (mm div 8) - 1  do begin
                      pointer(pv):= p + m * 8 * sizeof(float);
                      pv[0]:= 1/sqrt(pv[0]);
                      pv[1]:= 1/sqrt(pv[1]);
                      pv[2]:= 1/sqrt(pv[2]);
                      pv[3]:= 1/sqrt(pv[3]);
                      pv[4]:= 1/sqrt(pv[4]);
                      pv[5]:= 1/sqrt(pv[5]);
                      pv[6]:= 1/sqrt(pv[6]);
                      pv[7]:= 1/sqrt(pv[7]);
                    end;
                  end;
                {$if defined(cpu386)}
                  dck_sse_one_div_sqrt: begin
                    for m:= 0 to (mm div 8) - 1  do begin
                      pointer(pv):= p + m * 8 * sizeof(float);
                      asm
                        mov eax, [fourones]
                        MOVAPS xmm5, [eax]
                        mov eax, [pv]
                        MOVAPS xmm6, [eax]
                        SQRTPS xmm6, xmm6
                        MOVAPS xmm4, xmm5
                        DIVPS xmm4, xmm6 //RCPPS   xmm6, xmm6 //Reciprocal Parallel Scalars or, simply speaking, 1.0/x
                        MOVAPS xmm7, [eax + 16]
                        SQRTPS xmm7, xmm7
                        MOVAPS [eax], xmm4
                        DIVPS xmm5, xmm7 //RCPSS xmm7, xmm7
                        MOVAPS [eax + 16], xmm5
                      end['eax', 'xmm6', 'xmm7', 'xmm4', 'xmm5'];
                    end;
                  end;
                  dck_sse_rsqrtps: begin
                    for m:= 0 to (mm div 8) - 1  do begin
                      pointer(pv):= p + m * 8 * sizeof(float);
                      asm
                        mov eax, [pv]
                        MOVAPS xmm6, [eax]
                        RSQRTPS xmm6, xmm6
                        MOVAPS xmm7, [eax + 16]
                        RSQRTPS xmm7, xmm7
                        MOVAPS [eax], xmm6
                        MOVAPS [eax + 16], xmm7
                      end['eax', 'xmm6', 'xmm7'];
                    end;
                  end;
                {$endif}

    , where mm in most cases = 2048

    Re: Cheb's Game Engine

    Message Cheb » 10.03.2023 22:53:15
    Updated the requirements, cleaned unnecessary variability out of the definitions in the code.

    Reason: my minimum targets include the Athlon 64 X2 (2005; alas, I don't have one) and the Pentium E2140 (2007, the computer named Gray Goose). Both of these dual-core processors are 64-bit (alas, WinXP has no usable 64-bit version) and support SSE3.
    Then what the (insert expletive here) was I doing basing my code on SSE2 instead of SSE3?
    From now on, any code for x86 and x86-64, in any assembler inserts, assumes that SSE3 availability is guaranteed.
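
    (One possible way to pin that decision down in code - a sketch only; the define name is invented here, not taken from the engine:)
    Code:
      {$if defined(cpu386) or defined(cpux86_64)}
        {$define assume_sse3}  // per the decision above: SSE3 is taken for granted on x86/x86-64
      {$endif}

      // ...later, around each assembler insert:
      {$ifdef assume_sse3}
        // free to use SSE3 instructions such as HADDPS or MOVSLDUP here
      {$else}
        // plain Pascal path for every other target (ARM etc.)
      {$endif}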

    I am not going to consider SSE4 and higher, because if the E2140 with its two 1.6 GHz cores has enough horsepower, then any modern one will fly into orbit, and there is simply no point in wearing myself out over it. My good intentions towards AVX/AVX-512 will likely remain just intentions.
    That's it, all done.

    Further, for Linux SBCs my minimum is the Cortex A7. It has VFPv4-16, and I declare that as the only supported option in my code - if I ever get to writing assembler for ARM.
    End of the line.

    TL;DR: Free Pascal is optimized for *reproducibility* - bitwise matching results on all platforms. It seems it sacrifices a lot of performance to reach that goal.
    Last edited by Chebmaster; 24-04-2023 at 09:42 PM.

  6. #6
    Thanks for sharing. Although I didn't grasp the details, despite Google's effort to bridge the gap, I think the conclusion is reasonable.

  7. #7
    Argh. [headdesk] Argh.
    I corrected Google's translation by hand. In so many places it raises the question: why even bother using it? It's much better than 10 years ago, but there are still so many things it fails to understand and convey.
    Who am I kidding. Correcting is so much easier than translating 100% by myself.

    Reproducibility is important for me, since my multiplayer model will be an evolved lockstep:
    Code:
    TLayerRole = (
        lro_Bottom, {
          In multiplayer, runs at -500ms using perfect inputs finalized by the server.
          This is also the only layer that can be serialized. }
        lro_DeepUpwell, {
          Propagates changes from the bottom to the thermocline, thus lazily correcting for late inputs }
        lro_Thermocline, {
          Holds steady at -150ms, assuming most inputs arrive *above* it }
        lro_FastSurfacing, {
          Bubbles the changes from the thermocline to the surface thus doing the bulk of lag compensation }
        lro_PresentSurface {
          Runs on local player inputs }
      );
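
    (To make the intent concrete, a rough sketch of how an incoming input might pick the layer where its correction starts - the delay constants and the helper are illustrative, not engine code:)
    Code:
      const
        ThermoclineDelayMs = 150;  // from the declaration above
        BottomDelayMs      = 500;

      // Pick the layer where a remote input's correction begins. Anything older
      // than the bottom should not happen: the server has already finalized those ticks.
      function LayerForInput(InputAgeMs: integer): TLayerRole;
      begin
        if InputAgeMs >= BottomDelayMs then
          Result := lro_Bottom          // finalized territory: full authoritative replay
        else if InputAgeMs >= ThermoclineDelayMs then
          Result := lro_DeepUpwell      // arrived below the thermocline: lazy deep correction
        else if InputAgeMs > 0 then
          Result := lro_FastSurfacing   // the common case: bubble it up toward the surface
        else
          Result := lro_PresentSurface; // local input at the present
      end;
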
    As always, lots of stuff is distracting me (mainly work at work), leaving me no time to move the project further. Frustrating.

  8. #8
    Quote Originally Posted by Jonax View Post
    Although I didn't grasp the details, despite Google's effort to bridge the gap
    Don't be hard on yourself if you don't understand all the details. Google translate seems to have done quite a good job. But the topic that Chebmaster is talking about is very complex.

    He is talking about hardware-level optimization and making use of extended CPU features to accelerate specific processing. This is very complex stuff, especially if you take into account that some of these extended features might be vendor-specific (proprietary to Intel or AMD). This means that if you want to make use of some Intel-specific feature on an AMD CPU or vice versa, the feature might not be directly supported by that CPU, so a fallback method, which is usually slower, is used to at least get the desired results. Otherwise such code would simply break.

    Another important thing is to make sure that you feed the CPU data in the correct format required by the specific extended feature. Failing to do so could also force the CPU to fall back to slower methods and thus hurt performance.
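
    (As a generic illustration of that fallback idea - not code from Cheb's engine - a program can pick an implementation at startup based on what the CPU reports, so the same binary runs everywhere and only takes the fast path where it is actually supported:)
    Code:
      type
        TInvSqrtProc = procedure(p: PSingle; count: integer);

      // Plain Pascal fallback: works on any CPU, reproducible, slower.
      procedure InvSqrtScalar(p: PSingle; count: integer);
      var
        i: integer;
      begin
        for i := 1 to count do begin
          p^ := 1.0 / sqrt(p^);
          inc(p);
        end;
      end;

      var
        InvSqrt: TInvSqrtProc = @InvSqrtScalar;  // safe default

      // At startup: if the CPU advertises the needed SIMD extension, swap in a
      // vectorized routine (e.g. an assembler version like the one quoted earlier);
      // otherwise the scalar fallback stays in place.
      // if HasNeededSIMD then InvSqrt := @InvSqrtFast;   (both names hypothetical)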
