Thread: FPC4GBA vs GPC4GBA :P

  1. #1

    I have run some comparison tests. (Almost) the same code, compiled with fpc and gpc, gives very different results in both running speed and executable size. Here you can find a small demo compiled with fpc, gpc and gcc, so you can look at the differences.
    I'm a bit confused. My question is: should we stay with FPC, hoping for better speed in the future, or should we switch to GPC?
    Get your fpc4gba copy now!
    Get your fpc4nds copy now!

  2. #2

    Please post the source, so I can see what goes wrong. I'm not really familiar with the ARM stuff, but according to Florian, FPC produces better code than GCC on ARM.

    What I can see from the executables is that the FPC executable contains RTTI, which most probably means the FPC executable was compiled without smartlinking, which would explain the large executable.

  3. #3

    Quote Originally Posted by dmantione
    What I can see from the executables is that the FPC executable contains RTTI, which most probably means the FPC executable was compiled without smartlinking, which would explain the large executable.

    Yes, that's true. BTW, does this mean that smartlinking now works fine in FPC?

    Here is the code I used for Free Pascal:
    [pascal]
    program fillscreen_fpc;

    type
      // Unsigned types
      u8 = byte;
      u16 = word;
      u32 = cardinal;

      // Signed types
      s8 = shortint;
      s16 = smallint;
      s32 = longint;

    var
      VideoBuffer : ^u16 = pointer($6000000); (* Display Memory (the screen) *)
      DISPCNT : ^u16 = pointer($4000000); (* Initialization register *)
      i,j : integer;

    begin
      DISPCNT^ := $403; // video mode 3, background 2 enabled
      for i:=0 to 159 do
        for j:=0 to 239 do
          VideoBuffer[j + 240 * i] := i * j mod 31;
    end.
    [/pascal]

    and here the code used for gpc:
    [pascal]
    program fillscreen_gpc;
    {$X+}

    type
      u8 = byte;
      u16 = ShortWord;
      u32 = cardinal;

      s8 = ByteInt;
      s16 = ShortInt;
      s32 = integer;

    const
      fb = $6000000;

    var
      VideoBuffer : array [0..0] of u16 absolute fb; (* Display Memory (the screen) *)
      DISPCNT : ^u16 = pointer($4000000); (* Initialization register *)
      i, j : integer;

    begin
      DISPCNT^ := $403; (* video mode 3, background 2 enabled *)
      for i:=0 to 159 do
        for j:=0 to 239 do
          VideoBuffer[j + 240 * i] := i * j mod 31;
    end.

    [/pascal]

    The only big difference is that in fpc I can't figure out a suitable way to declare VideoBuffer as 'absolute'.

  4. #4

    Quote Originally Posted by Legolas
    Yes, that's true. BTW, does this mean that smartlinking now works fine in FPC?

    Yes, it does. All executables in 2.0.2 are compiled with smartlinking.

    I compiled the code with ppcarm. I noticed that the code didn't look very good without register variables. For example, the prologue of the outer for loop is:

    Code:
    # [23] for i:=0 to 159 do
            mov     r1,#0
            ldr     r0,.Lj10
            strh    r1,[r0]
            ldr     r0,.Lj11
            ldrsh   r1,[r0]
            sub     r1,r1,#1
            ldr     r0,.Lj12
            strh    r1,[r0]
            .balign 4
    .Lj9:
            ldr     r0,.Lj13
            ldrsh   r0,[r0]
            add     r0,r0,#1
            ldr     r1,.Lj14
            strh    r0,[r1]
    The reason the compiler generates so much code is that the ARM cannot address a global variable directly: the compiler first has to load the variable's address (the ldr from the .Lj labels) before it can read or write the variable. However, with register variables enabled the result looks quite good:

    Code:
    # [24] for j:=0 to 239 do
            mov     r4,#0
            sub     r4,r4,#1
            .balign 4
    .Lj12:
            add     r4,r4,#1
    Since GPC uses register variables by default this could be one of the causes of the difference. I haven't checked what GPC generates, so it is a bit of guesswork.

    The code has opportunities for global optimizations. For example, a compiler that supports induction variables converts the array index into a pointer that is advanced each iteration, which saves recomputing the screen address every time. I don't think GPC does this, because I have never seen GCC use an induction variable. FPC cannot do induction variables either, but Delphi, for example, can.
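
    Since neither compiler is expected to do this automatically, the transformation can be applied by hand. The following is only a sketch, assuming a desktop Free Pascal target: the two static arrays stand in for the frame buffer at $6000000, and the names Indexed, Walked, ScreenW and ScreenH are made up for the illustration. It fills one buffer with the indexed form and one with a walking pointer, then checks that they match.

    ```pascal
    program induction_sketch;

    type
      u16 = word;
      pu16 = ^u16;

    const
      ScreenW = 240;
      ScreenH = 160;

    var
      { Static arrays stand in for the GBA frame buffer (assumption:
        this runs on a desktop target, not on the console). }
      Indexed, Walked : array[0..ScreenW * ScreenH - 1] of u16;
      p : pu16;
      i, j, k : longint;
      Same : boolean;

    begin
      { Original form: the index j + 240 * i is recomputed for every pixel. }
      for i := 0 to ScreenH - 1 do
        for j := 0 to ScreenW - 1 do
          Indexed[j + ScreenW * i] := i * j mod 31;

      { Hand-applied induction variable: p advances one element per pixel,
        so no multiply or add is needed to find the destination address. }
      p := @Walked[0];
      for i := 0 to ScreenH - 1 do
        for j := 0 to ScreenW - 1 do
        begin
          p^ := i * j mod 31;
          Inc(p);
        end;

      { Both loops must have produced the same picture. }
      Same := True;
      for k := 0 to High(Indexed) do
        if Indexed[k] <> Walked[k] then
          Same := False;
      WriteLn('buffers identical: ', Same);
    end.
    ```

    On the GBA the pointer would simply start at the frame buffer address instead of at a local array.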

    However, the most likely cause for the slowdown can be found if we look at the loop body. FPC's code generation for the loop body looks very reasonable:
    Code:
    # [25] VideoBuffer[j + 240 * i] := i * j mod 31;
            ldr     r0,.Lj15
            ldr     r6,[r0]
            mov     r0,r4
            mov     r1,r5
            mov     r2,#240
            mul     r1,r2,r1
            add     r7,r1,r0
            mov     r7,r7,lsl #1
            mov     r0,r5
            mov     r2,r4
            mul     r1,r2,r0
            mov     r0,#31
            bl      fpc_mod_longint
            mov     r0,r0,lsl #16
            mov     r0,r0,lsr #16
            strh    r0,[r6, r7]
    ...however, FPC calls a helper to calculate the modulo. Look at the arm.inc file in the RTL source code (rtl/arm/arm.inc): there is no fpc_mod_longint there.

    So the fpc_mod_longint actually used is the one in rtl/inc/generic.inc, a 100% Pascal routine that calculates the modulo in a CPU-independent way. Most likely it isn't very efficient.

    So somebody needs to code a fast assembler version of fpc_mod_longint for the ARM, and most likely the problem will be solved.
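
    Until such a routine exists, this particular loop can also avoid the helper call at source level. This is a different technique from the assembler routine suggested above, shown only as a sketch: because j increases by 1 each iteration, i * j mod 31 can be kept as a running remainder that grows by i mod 31 and wraps at 31. The names Step, R, Modulus and Mismatches are made up for the illustration, and the program checks the running value against the direct computation on a desktop target.

    ```pascal
    program running_remainder_sketch;

    const
      ScreenW = 240;
      ScreenH = 160;
      Modulus = 31;

    var
      i, j, Step, R, Mismatches : longint;

    begin
      Mismatches := 0;
      for i := 0 to ScreenH - 1 do
      begin
        Step := i mod Modulus;  { one mod per row instead of one per pixel }
        R := 0;                 { i * 0 mod 31 }
        for j := 0 to ScreenW - 1 do
        begin
          { Invariant at this point: R = i * j mod 31. }
          if R <> i * j mod Modulus then
            Inc(Mismatches);
          { R < 31 and Step < 31, so a single subtraction is enough. }
          Inc(R, Step);
          if R >= Modulus then
            Dec(R, Modulus);
        end;
      end;
      WriteLn('mismatches: ', Mismatches);
    end.
    ```

    In the real fill loop the comparison would be dropped and R written to the screen directly. This only works because both factors are non-negative and j advances in steps of 1.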

    Quote Originally Posted by Legolas
    The only big difference is that in fpc I can't figure out a suitable way to declare VideoBuffer as 'absolute'.
    That's simple to solve: Free Pascal can do absolute, but it is only enabled for DOS. We need to enable it for the GBA as well.
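
    In the meantime, a typed pointer to an array type gives array-style access without absolute. This is only a sketch under assumptions: TScreen and PScreen are hypothetical names, and a heap block stands in for the frame buffer so the program can run on a desktop target.

    ```pascal
    program framebuffer_pointer_sketch;

    type
      u16 = word;
      TScreen = array[0..240 * 160 - 1] of u16;
      PScreen = ^TScreen;

    var
      VideoBuffer : PScreen;
      i, j : longint;

    begin
      { On the GBA this would be: VideoBuffer := PScreen($6000000);
        here a heap block stands in for display memory (assumption). }
      New(VideoBuffer);
      for i := 0 to 159 do
        for j := 0 to 239 do
          VideoBuffer^[j + 240 * i] := i * j mod 31;
      { Print two sample pixels to show the buffer was written. }
      WriteLn(VideoBuffer^[1 + 240 * 1], ' ', VideoBuffer^[239 + 240 * 159]);
      Dispose(VideoBuffer);
    end.
    ```

    Compared with the plain ^u16 pointer, the array type also documents the size of the screen in the declaration.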

  5. #5

    You should get a nomination for the "Best answer ever" Oscar on this forum.
    Seriously, now I understand what is happening. With register variables enabled I get similar results, so the only remaining problem should be the mod function. I *badly* need to learn some assembly.

  6. #6

    Well... the problem really was the mod function.
    I have found that the GBA BIOS provides some math functions; among these, a mod function. This is my implementation:
    [pascal]
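    function modulus(number: s32; denom: s32): s32; assembler;
    { BIOS Div call: takes number in r0 and denom in r1,
      returns the quotient in r0 and the remainder in r1. }
    asm
      swi #0x060000
      mov r0, r1
      bx lr
    end;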
    [/pascal]

    Now, changing my previous code:

    [pascal]
    program fillscreen_fpc;

    {...cut...}

    function modulus(number: s32; denom: s32): s32; assembler;
    asm
      swi #0x060000
      mov r0, r1
      bx lr
    end;

    begin
      DISPCNT^ := $403;
      for i:=0 to 159 do
        for j:=0 to 239 do
          VideoBuffer[j + 240 * i] := modulus(i * j, 31);
    end.
    [/pascal]

    Everything works fine and the speed is similar to gpc and gcc.

  7. #7

    Good

    I have no experience at all with ARM assembler code, but I'm wondering why you return manually, i.e. the "bx lr". That means the compiler gets no chance to clean up the stack frame.

    It looks like a good idea to use the BIOS as much as possible on the GBA, since memory is tight. You can add this code to the system unit for the GBA; put this in system.pp:

    [pascal]
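    {$DEFINE FPC_SYSTEM_HAS_MOD_LONGINT}
    function fpc_mod_longint(n,z : longint) : longint;[public,alias:'FPC_MOD_LONGINT'];compilerproc;assembler;
    { Note the operand order: in the listing in post #4 the compiler passes
      the divisor (31) in r0 and i * j in r1, while modulus() takes the number
      in r0 and the denominator in r1, so the registers may need swapping
      before the swi. }
    asm
      swi #0x060000
      mov r0, r1
      bx lr
    end;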
    [/pascal]

    ... and the compiler should use it.

    You can even try to inline it, since the code is so short that the procedure call itself is significant overhead. I don't know if inlining this procedure will actually work (the compiler might assume that it must be able to call the helper); you would have to check that.

  8. #8

    Uhm... OK! It works even after stripping out "bx lr". This code comes from libgba (which is part of devkitPro). I simply put it inside an asm...end block, so I don't really know how or why it works (maybe magic?).
    There are a bunch of other BIOS functions too, so I'll try to put them in system.pp.

  9. #9

    Yes, you should leave out the bx lr, because after the last instruction the compiler automatically adds code to clean up the stack frame and the variables and to return to the caller. Returning yourself is not recommended; otherwise you might lose stack memory.

  10. #10

    Quote Originally Posted by dmantione
    You can add this code to the system unit for the GBA; put this in system.pp:

    [pascal]
    {$DEFINE FPC_SYSTEM_HAS_MOD_LONGINT}
    function fpc_mod_longint(n,z : longint) : longint;[public,alias:'FPC_MOD_LONGINT'];compilerproc;assembler;
    asm
      swi #0x060000
      mov r0, r1
      bx lr
    end;
    [/pascal]

    ... and the compiler should use it.
    If I try to add your code, I get
    Code:
    E:\fpc\fpc-2.0.x.source\rtl\gba>make CPU_TARGET=arm OS_TARGET=gba PP=ppcarm OPT="-Tgba"
    ppcarm.exe -Tgba -XParm-gba- -Xc -Xr -Fi../inc -Fi../arm -Fi../unix -Fiarm -FE. -FU../../rtl/units/arm-gba -Tgba -darm  -Us -Sg system.pp
    ..\..\rtl\units\arm-gba\system.s: Assembler messages:
    ..\..\rtl\units\arm-gba\system.s:49992: Error: symbol `fpc_mod_longint' is already defined
    ..\..\rtl\units\arm-gba\system.s:49994: Error: symbol `FPC_MOD_LONGINT' is already defined
    system.pp(124,3) Error: Error while assembling exitcode 1
    system.pp(124,3) Fatal: There were 2 errors compiling module, stopping
    system.pp(124,3) Fatal: Compilation aborted
    make: *** [system.ppu] Error 1
    I'm trying to modify the Linux RTL. Maybe I should start porting the RTL from scratch?
