FPC4GBA vs GPC4GBA :P



Legolas
12-01-2006, 03:24 PM
I have run some comparison tests... (Almost) the same code, compiled with fpc and gpc, gives very different results (in both running speed and executable size)... Here (http://itaprogaming.free.fr/download/fpc_vs_gpc.zip) you can find a small demo compiled with fpc, gpc and gcc, so you can take a look at the differences.
I'm a bit confused... my question is: should we stay with FPC, hoping for better speed in the future, or should we switch to GPC?
:?:

dmantione
12-01-2006, 03:50 PM
Please post the source, so I can see what goes wrong. I'm not really familiar with the ARM stuff, but according to Florian FPC produces better code than GCC on ARM.

What I can see from the executables is that the FPC executable contains RTTI, which most probably means the FPC executable was compiled without smartlinking, which would explain the large executable.

Legolas
12-01-2006, 07:33 PM
[quote="dmantione"]
What I can see from the executables is that the FPC executable contains RTTI, which most probably means the FPC executable was compiled without smartlinking, which would explain the large executable.
[/quote]

Yes, that's true. BTW, does this mean that smartlinking now works fine in FPC?

Here is the code I have used for Free Pascal:

program fillscreen_fpc;

type
  // Unsigned types
  u8  = byte;
  u16 = word;
  u32 = cardinal;

  // Signed types
  s8  = shortint;
  s16 = smallint;
  s32 = longint;

var
  VideoBuffer : ^u16 = pointer($6000000); (* Display Memory (the screen) *)
  DISPCNT     : ^u16 = pointer($4000000); (* Initialization register *)
  i, j        : integer;

begin
  DISPCNT^ := $403;
  for i := 0 to 159 do
    for j := 0 to 239 do
      VideoBuffer[j + 240 * i] := i * j mod 31;
end.


and here the code used for gpc:

program fillscreen_gpc;
{$X+}

type
  u8  = byte;
  u16 = ShortWord;
  u32 = cardinal;

  s8  = ByteInt;
  s16 = ShortInt;
  s32 = integer;

const
  fb = $6000000;

var
  VideoBuffer : array [0..0] of u16 absolute fb; (* Display Memory (the screen) *)
  DISPCNT     : ^u16 = pointer($4000000); (* Initialization register *)
  i, j        : integer;

begin
  DISPCNT^ := $403;
  for i := 0 to 159 do
    for j := 0 to 239 do
      VideoBuffer[j + 240 * i] := i * j mod 31;
end.



The only big difference is that in fpc I can't figure out a suitable way to declare an 'absolute' VideoBuffer.

dmantione
12-01-2006, 08:34 PM
[quote="Legolas"]
Yes, that's true. BTW, does this mean that smartlinking now works fine in FPC?
[/quote]

Yes, it does :) All executables in 2.0.2 are compiled with smartlinking.

I compiled the code with ppcarm. I noticed that the code didn't look very good without register variables; for example, the prologue of the outer for loop is:



# [23] for i:=0 to 159 do
mov r1,#0
ldr r0,.Lj10
strh r1,[r0]
ldr r0,.Lj11
ldrsh r1,[r0]
sub r1,r1,#1
ldr r0,.Lj12
strh r1,[r0]
.balign 4
.Lj9:
ldr r0,.Lj13
ldrsh r0,[r0]
add r0,r0,#1
ldr r1,.Lj14
strh r0,[r1]


The reason the compiler generates such huge code is that on the ARM you cannot access variables directly and the compiler has to build a pointer to the variable first before it can access it. However, with register variables enabled the result looks quite good:



# [24] for j:=0 to 239 do
mov r4,#0
sub r4,r4,#1
.balign 4
.Lj12:
add r4,r4,#1


Since GPC uses register variables by default this could be one of the causes of the difference. I haven't checked what GPC generates, so it is a bit of guesswork.

The code has opportunities for global optimizations. For example, a compiler that supports induction variables converts the array index into a pointer and saves the screen address calculation on each iteration. I don't think GPC does this, because I have never seen GCC use an induction variable. FPC cannot do induction variables either, but Delphi, for example, can.
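
To make the induction-variable idea concrete, here is a minimal host-side sketch. It assumes ordinary arrays as stand-ins for the GBA frame buffer (nothing here touches address $6000000), and the names A, B and P are made up for the example. It compares the indexed loop with a hand-written pointer walk, which is roughly what such an optimization would produce:

```pascal
program induction_sketch;
{$mode objfpc}

const
  W = 240;
  H = 160;

type
  u16 = word;
  TScreen = array[0..W * H - 1] of u16;

var
  A, B: TScreen;   // stand-ins for the frame buffer, NOT real GBA memory
  i, j: longint;
  P: ^u16;

begin
  // Indexed form: the offset j + 240 * i is recomputed on every iteration.
  for i := 0 to H - 1 do
    for j := 0 to W - 1 do
      A[j + W * i] := (i * j) mod 31;

  // Hand-applied induction variable: one pointer that is simply advanced.
  P := @B[0];
  for i := 0 to H - 1 do
    for j := 0 to W - 1 do
    begin
      P^ := (i * j) mod 31;
      Inc(P);   // moves on by SizeOf(u16) bytes
    end;

  // Both loops should have produced exactly the same pixels.
  writeln('identical: ', CompareByte(A, B, SizeOf(A)) = 0);
end.
```

The pointer version trades the multiply and add per pixel for a single increment; whether that matters next to the cost of the mod is a separate question.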

However, the most likely cause for the slowdown can be found if we look at the loop body. FPC's code generation for the loop body looks very reasonable:


# [25] VideoBuffer[j + 240 * i] := i * j mod 31;
ldr r0,.Lj15
ldr r6,[r0]
mov r0,r4
mov r1,r5
mov r2,#240
mul r1,r2,r1
add r7,r1,r0
mov r7,r7,lsl #1
mov r0,r5
mov r2,r4
mul r1,r2,r0
mov r0,#31
bl fpc_mod_longint
mov r0,r0,lsl #16
mov r0,r0,lsr #16
strh r0,[r6, r7]


...however, FPC calls a helper to calculate the modulo. Look at the arm.inc file in the RTL source code (rtl/arm/arm.inc): there is no fpc_mod_longint there.

So the fpc_mod_longint actually used is the one in rtl/inc/generic.inc: a 100% Pascal routine that calculates the modulo in a CPU-independent way, and most likely it isn't very efficient.

So, somebody needs to code a fast assembler version of fpc_mod_longint for the ARM and most likely the problem will be solved.
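
To give a feeling for why a pure software modulo can be costly on a CPU without a divide instruction, here is a minimal sketch of a shift-and-subtract remainder. It is a hypothetical illustration and NOT the routine from generic.inc; the name SoftMod is made up, and it assumes an unsigned dividend and a divisor greater than zero:

```pascal
program softmod_sketch;
{$mode objfpc}

// Hypothetical illustration, NOT the routine from generic.inc:
// remainder of z divided by n (n > 0), using one shift/compare/subtract
// step per bit, i.e. 32 loop iterations for a single "mod".
function SoftMod(z, n: cardinal): cardinal;
var
  r: qword;      // 64-bit so that (r shl 1) cannot overflow
  k: longint;
begin
  r := 0;
  for k := 31 downto 0 do
  begin
    r := (r shl 1) or ((z shr k) and 1);  // bring down the next bit of z
    if r >= n then
      r := r - n;
  end;
  Result := cardinal(r);
end;

var
  i, j: longint;
  ok: boolean;

begin
  // Check the sketch against the built-in operator for the demo's values.
  ok := true;
  for i := 0 to 159 do
    for j := 0 to 239 do
      if SoftMod(cardinal(i * j), 31) <> cardinal((i * j) mod 31) then
        ok := false;
  writeln('agrees with built-in mod: ', ok);
end.
```

With 240 x 160 pixels, a loop like this runs 32 steps for each of the 38400 mod operations, which is the kind of overhead a dedicated assembler or BIOS routine is meant to avoid.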

[quote="Legolas"]
The only big difference is that in fpc I can't figure out a suitable way to declare an 'absolute' VideoBuffer.
[/quote]

That's simple to solve: Free Pascal can do absolute, but it is only enabled for Dos. We need to enable it for the GBA as well.

Legolas
12-01-2006, 09:50 PM
You should get a nomination for the "Best answer ever" Oscar on this forum :mrgreen:
Seriously, now I can figure out what happens. With register variables enabled I get similar results, so the only problem should be the mod function. I *badly* need to learn some assembly :read:

Legolas
13-01-2006, 12:18 AM
Well... The problem really was the mod function :D
I have found that the GBA BIOS embeds some math functions; among these, a mod function. This is my implementation:

function modulus(number: s32; denom: s32): s32; assembler;
asm
swi #0x060000
mov r0, r1
bx lr
end;


Now, changing my previous code:


program fillscreen_fpc;

{...cut...}

function modulus(number: s32; denom: s32): s32; assembler;
asm
swi #0x060000
mov r0, r1
bx lr
end;


begin
DISPCNT^ := $403;
for i:=0 to 159 do
for j:=0 to 239 do
VideoBuffer[j + 240 * i] := modulus(i * j, 31);
end.


all works fine and the speed is similar to gpc and gcc :thumbup:

dmantione
13-01-2006, 08:21 AM
Good :D

I have no experience at all with Arm assembler code, but I'm wondering why you return manually, i.e. the "bx lr". That means the compiler gets no chance to clean up the stack frame.

It looks like a good idea to use the BIOS as much as possible on the GBA, since memory is tight. You can add this code to the system unit for the GBA; put this in system.pp:


{$DEFINE FPC_SYSTEM_HAS_MOD_LONGINT}
function fpc_mod_longint(n,z : longint) : longint;[public,alias:'FPC_MOD_LONGINT'];compilerproc;assembler;
asm
swi #0x060000
mov r0, r1
bx lr
end;


... and the compiler should use it.

You can even try to inline it, since the code is so short that a procedure call is already overhead. I don't know if inlining this procedure will actually work (the compiler might assume that it should be able to call the helper); you would have to check that.

Legolas
13-01-2006, 02:10 PM
Uhm... ok! Even with "bx lr" stripped out it works. :) This code comes from libgba (which is part of devkitpro)... I have simply put it inside an asm-end block, so I don't really know how'n'why it works (maybe magic?) :lol:
There are a bunch of other bios functions too, so I'll try to put them in system.pp

dmantione
13-01-2006, 03:20 PM
Yes, you should leave out the bx lr, because the compiler automatically adds code to clean up the stack frame and the variables and return to the caller after the last instruction. It is recommended not to return yourself, otherwise you might lose stack memory.

Legolas
14-01-2006, 03:46 PM
[quote="dmantione"]
You can add this code to the system unit for the GBA; put this in system.pp:

{$DEFINE FPC_SYSTEM_HAS_MOD_LONGINT}
function fpc_mod_longint(n,z : longint) : longint;[public,alias:'FPC_MOD_LONGINT'];compilerproc;assembler;
asm
swi #0x060000
mov r0, r1
bx lr
end;

... and the compiler should use it.
[/quote]


If I try to add your code, I get


E:\fpc\fpc-2.0.x.source\rtl\gba>make CPU_TARGET=arm OS_TARGET=gba PP=ppcarm OPT="-Tgba"
ppcarm.exe -Tgba -XParm-gba- -Xc -Xr -Fi../inc -Fi../arm -Fi../unix -Fiarm -FE. -FU../../rtl/units/arm-gba -Tgba -darm -Us -Sg system.pp
..\..\rtl\units\arm-gba\system.s: Assembler messages:
..\..\rtl\units\arm-gba\system.s:49992: Error: symbol `fpc_mod_longint' is already defined
..\..\rtl\units\arm-gba\system.s:49994: Error: symbol `FPC_MOD_LONGINT' is already defined
system.pp(124,3) Error: Error while assembling exitcode 1
system.pp(124,3) Fatal: There were 2 errors compiling module, stopping
system.pp(124,3) Fatal: Compilation aborted
make: *** [system.ppu] Error 1


I'm trying to modify the linux rtl. Maybe I should start an rtl port from scratch :?

dmantione
14-01-2006, 04:01 PM
The magic here is the {$DEFINE FPC_SYSTEM_HAS_MOD_LONGINT}, which instructs the system unit not to include the default fpc_mod_longint. It must be defined before generic.inc gets processed; it might help to move this define up to the top of the file.

Legolas
14-01-2006, 04:10 PM
[quote="dmantione"]
The magic here is the {$DEFINE FPC_SYSTEM_HAS_MOD_LONGINT}, which instructs the system unit not to include the default fpc_mod_longint. It must be defined before generic.inc gets processed; it might help to move this define up to the top of the file.
[/quote]

And - of course - this way it works... :D
I was losing myself in tons of includes :doh:

It is impressive how quickly you answer my questions... Next time I think your reply will arrive even before my (stupid) question. :lol:
Thanks :wink:

Legolas
15-01-2006, 01:34 PM
Next chapter! :D
Now the previous code gives a black screen. The asm file looks good, because it calls fpc_mod_longint, but it seems that something goes wrong in the rtl.
Another question: I have seen that among the generic functions there are fpc_mod_longint and fpc_mod_qword. The gba bios has only one function for mod... Is it good enough to replace fpc_mod_qword with the same code as for longint too?
BTW, I have tried to compile the rtl for smartlinking and it works pretty fine ^_^

dmantione
15-01-2006, 10:07 PM
Hmmm... That is bad.... :? If the screen stays black that most likely means that mod doesn't work at all.

There are two possibilities:
* The return value of fpc_mod_longint gets lost somehow. Perhaps we did something wrong with the "bx lr" somehow (I'm an apprentice at ARM assembler :think: )
* fpc_mod_longint overwrites a register that it isn't allowed to overwrite causing the loops to end prematurely or something.

We need to find out which situation is the case; otherwise we need to have Florian take a look at the assembler code, he's a bit more experienced here.

Regarding fpc_mod_qword: no, it has to calculate the modulo of 64-bit unsigned numbers, as opposed to the modulo of 32-bit signed numbers. Code designed for one calculation does not automagically work for the other... :(
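
A small host-side sketch of why the two helpers are not interchangeable. The values are arbitrary examples chosen for illustration:

```pascal
program modwidth_sketch;
{$mode objfpc}

var
  a: longint;
  q: qword;

begin
  // Signed 32-bit case: the result takes the sign of the dividend.
  a := -7;
  writeln(a mod 3);              // -1

  // Unsigned 64-bit case: 2^32 + 5 does not fit in a 32-bit register.
  q := qword($100000000) + 5;
  writeln(q mod 7);              // 2, computed on the full 64-bit value

  // Feeding only the low 32 bits to a 32-bit routine gives another answer.
  writeln(cardinal(q) mod 7);    // 5
end.
```

A 32-bit signed routine reused for qword would therefore get both the sign handling and the upper 32 bits wrong.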

What kind of exe size did you get with smartlinking?

Legolas
15-01-2006, 10:24 PM
[quote="dmantione"]
There are two possibilities:
* The return value of fpc_mod_longint gets lost somehow. Perhaps we did something wrong with the "bx lr" somehow (I'm an apprentice at ARM assembler :think: )
* fpc_mod_longint overwrites a register that it isn't allowed to overwrite causing the loops to end prematurely or something.
[/quote]

Seems like the executable goes into a bad kind of loop, because I have noticed a loss of frame rate on the emulator (the working executable runs at 100%, the bad one runs at 70%).

[quote="dmantione"]
What kind of exe size did you get with smartlinking?
[/quote]
Well, about 30/35 kb instead of 160. That's fine :D

dmantione
15-01-2006, 10:45 PM
[quote="Legolas"]
Seems like the executable goes into a bad kind of loop, because I have noticed a loss of frame rate on the emulator (the working executable runs at 100%, the bad one runs at 70%).
[/quote]

That points in the direction of the second explanation, which is a bit what I was afraid of. Assuming the GBA bios does not destroy registers other than input and output, you should check if the compiler has data stored in r1 before it calls fpc_mod_longint, since r1 is destroyed by your implementation of fpc_mod_longint.

[quote="Legolas"]
[quote="dmantione"]
What kind of exe size did you get with smartlinking?
[/quote]
Well, about 30/35 kb instead of 160. That's fine :D
[/quote]

Yes, but it should be possible to do better. It might be an idea to check what kind of code is called by the system.pp unit initialization and kick some cruft out. But on the other hand, it's not a big priority at the moment, fast math is much more important.

Legolas
16-01-2006, 01:43 PM
[quote="dmantione"]
[quote="Legolas"]
Seems like the executable goes into a bad kind of loop, because I have noticed a loss of frame rate on the emulator (the working executable runs at 100%, the bad one runs at 70%).
[/quote]
That points in the direction of the second explanation, which is a bit what I was afraid of. Assuming the GBA bios does not destroy registers other than input and output, you should check if the compiler has data stored in r1 before it calls fpc_mod_longint, since r1 is destroyed by your implementation of fpc_mod_longint.
[/quote]

Urgh!!! That hurts... I have found some ASM tutorials for ARM... Maybe this is the time to start reading them :)

[quote="dmantione"]
[quote="Legolas"]
[quote="dmantione"]
What kind of exe size did you get with smartlinking?
[/quote]
Well, about 30/35 kb instead of 160. That's fine :D
[/quote]
Yes, but it should be possible to do better. It might be an idea to check what kind of code is called by the system.pp unit initialization and kick some cruft out. But on the other hand, it's not a big priority at the moment, fast math is much more important.
[/quote]

I have used the linux rtl and I haven't removed a lot of lines, indeed.

Legolas
16-01-2006, 09:02 PM
[quote="dmantione"]
Assuming the GBA bios does not destroy registers other than input and output, you should check if the compiler has data stored in r1 before it calls fpc_mod_longint, since r1 is destroyed by your implementation of fpc_mod_longint.
[/quote]

Uhm... well, following your suggestions I have tried to save the r1 value:


mov r8, r1
swi #0x060000
mov r0, r1
mov r1, r8


I don't really know if this is a suitable way to save and restore the r1 value; however, it does not work at all :)

Looking at this doc (http://community.freepascal.org:10000/docs-html/prog/progse12.html#x122-1210003.4) I have tried to tell the fpc compiler which registers are affected by the fpc_mod_longint function:

function fpc_mod_longint(n,z: longint):longint; compilerproc; assembler; [public, alias: 'FPC_MOD_LONGINT'];
asm
swi #0x060000
mov r0, r1
end['r0','r1','r2','r3'];


or even


function fpc_mod_longint(n,z: longint):longint; compilerproc; [public, alias: 'FPC_MOD_LONGINT'];
begin
asm
swi #0x060000
mov r0, r1
end;
end;


...but no way. What a headache!! :fuzzy:

JSoftware
17-01-2006, 08:18 AM
are you able to maybe push r1 on a stack?

Legolas
17-01-2006, 01:53 PM
[quote="JSoftware"]
are you able to maybe push r1 on a stack?
[/quote]

Not really. Push and pop are allowed only in thumb mode, while fpc handles only arm mode. I have tried to translate them into a couple of store and load instructions, but that does not work either.

dmantione
17-01-2006, 08:28 PM
The save into r8 should be a proper save, unless the compiler expects you to preserve r8, which you destroy. Hmmm.... I'm going to need to take a look at the code myself; you can send me the code if you wish... I have no time tomorrow though.

In the meantime, please compare a version with the "mod" defined as a function in the program with one where it is in the RTL. Can you see differences?

Legolas
17-01-2006, 10:31 PM
[quote="dmantione"]
The save into r8 should be a proper save, unless the compiler expects you to preserve r8, which you destroy. Hmmm.... I'm going to need to take a look at the code myself; you can send me the code if you wish... I have no time tomorrow though.

In the meantime, please compare a version with the "mod" defined as a function in the program with one where it is in the RTL. Can you see differences?
[/quote]

Looking at the compiler-generated .s files I can find some small differences, mainly due to the different implementation, I think.
This comes from function version:


# [41] c := modulus((i * j), 31);
mov r1,#31
# Register r0 allocated
mov r0,r4
# Register r2 allocated
mov r2,r5
# Register r0,r2 released
# Register r0 allocated
mul r0,r2,r0
# Register r2,r3,r12,r13,r14,r15 allocated
bl P$FILLSCREEN_FPC_MODULUS$LONGINT$LONGINT$$LONGINT
# Register r1,r2,r3,r12,r13,r14,r15,r0 released
mov r6,r0
# Register r0 allocated
# [43] VideoBuffer[j + 240 * i] := c;


and this other one comes from rtl version:


# [39] c := (i * j) mod 31;
mov r0,r4
# Register r2 allocated
mov r2,r5
# Register r0,r2 released
# Register r1 allocated
mul r1,r2,r0
# Register r0 allocated
mov r0,#31
# Register r2,r3,r12,r13,r14,r15 allocated
bl fpc_mod_longint
# Register r1,r2,r3,r12,r13,r14,r15,r0 released
mov r6,r0
# Register r0 allocated
# [43] VideoBuffer[j + 240 * i] := c;


This is the 'modulus' function:


.globl P$FILLSCREEN_FPC_MODULUS$LONGINT$LONGINT$$LONGINT
P$FILLSCREEN_FPC_MODULUS$LONGINT$LONGINT$$LONGINT:
# Temps allocated between r11-44 and r11-44
# Register r13,r11,r12 allocated
mov r12,r13
stmfd r13!,{r11,r12,r14,r15}
sub r11,r12,#4
# Register r12 released
sub r13,r13,#44
# Var number located in register
# Var denom located in register
# Temp -44,4 allocated
# Var $result located at r11-44
# Register r0,r1,r2,r3,r12,r13,r14,r15 allocated
# [27] swi #0x060000
swi #393216
# [28] mov r0, r1
mov r0,r1
# Register r0,r1,r2,r3,r12,r13,r14,r15 released
# Temp -44,4 released
ldmea r11,{r11,r13,r15}
# Register r0 released
.Le0:
.size P$FILLSCREEN_FPC_MODULUS$LONGINT$LONGINT$$LONGINT, .Le0 - P$FILLSCREEN_FPC_MODULUS$LONGINT$LONGINT$$LONGINT

It does some "strange" things with r11, r12 and r13 that I can't understand (maybe something related to the stack, given that r13 is the stack pointer?).

I can zip all rtl sources and this small example and send it to your mailbox, if you want. :)

dmantione
17-01-2006, 11:01 PM
I think I see the problem! :twisted:

Function:


mul r0,r2,r0 # In other words: the function stores i*j in r0
bl P$FILLSCREEN_FPC_MODULUS$LONGINT$LONGINT$$LONGINT

RTL:


mov r0,#31 # In other words, the rtl stores #31 in r0
bl fpc_mod_longint


In other words, the parameters are reversed!!

Try this:


function fpc_mod_longint(n,z: longint):longint; compilerproc; assembler; [public, alias: 'FPC_MOD_LONGINT'];
asm
// Reverse the parameters!
mov r2,r0
mov r0,r1
mov r1,r2
// Do software interrupt.
swi #0x060000
mov r0, r1
end['r0','r1','r2','r3'];
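
The effect of the reversed parameters can be reproduced on any host with a small sketch. BiosMod below is a made-up stand-in for the BIOS call (it is plain Pascal, not a real swi), assumed to take the number first and the denominator second, while the helper receives its arguments in the (n, z) order seen at the RTL call site:

```pascal
program argorder_sketch;
{$mode objfpc}

// Made-up stand-in for the BIOS routine: expects (number, denom).
function BiosMod(number, denom: longint): longint;
begin
  Result := number mod denom;
end;

// Helper called the way the RTL call site does it: divisor first.
function ModHelperUnswapped(n, z: longint): longint;
begin
  Result := BiosMod(n, z);   // the divisor is passed as the number: wrong
end;

function ModHelperSwapped(n, z: longint): longint;
begin
  Result := BiosMod(z, n);   // swap so that the dividend comes first
end;

begin
  // The call site computes 100 mod 31, passing n = 31 and z = 100.
  writeln(ModHelperUnswapped(31, 100));  // 31, i.e. 31 mod 100
  writeln(ModHelperSwapped(31, 100));    // 7, i.e. 100 mod 31
end.
```

Note that in the fill loop the first product i * j is 0, so the unswapped version would ask for 31 divided by 0 on the very first pixel.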

Legolas
17-01-2006, 11:10 PM
YAY!!! It works!!! :mrgreen: :mrgreen: :mrgreen:
You, THE genius! :clap:

In other words, I don't understand ASM :lol:

Thanks a lot, Daniel. I should offer a beer to you ^_^

savage
18-01-2006, 09:19 AM
So any screen shots or something to show it working?

Legolas
18-01-2006, 01:29 PM
[quote="savage"]
So any screen shots or something to show it working?
[/quote]

Oh, well... nothing so fun to show. I only have this screenshot:

http://img496.imageshack.us/img496/4618/fpcfill5mp.png

The code is taken from an example in Mr. Harbour's book (http://www.jharbour.com/gameboy/default.aspx). The interesting thing is that now the fpc executable runs faster than the gcc one :o

BTW, I have found a nice trick for swapping two registers without involving a third one:

eor r0, r0, r1
eor r1, r1, r0
eor r0, r0, r1
:D
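
The same three-step exclusive-or swap, written as a minimal Pascal sketch with two ordinary variables standing in for r0 and r1:

```pascal
program xorswap_sketch;
{$mode objfpc}

var
  a, b: longint;   // stand-ins for r0 and r1

begin
  a := 31;
  b := 100;

  a := a xor b;    // eor r0, r0, r1
  b := b xor a;    // eor r1, r1, r0  (b now holds the old a)
  a := a xor b;    // eor r0, r0, r1  (a now holds the old b)

  writeln(a, ' ', b);   // 100 31
end.
```

One caveat: the trick only works for two distinct locations. Applied to a register and itself, the first step already zeroes the value.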

Legolas
20-01-2006, 02:26 PM
Work goes on... I have discovered why the 2 registers for mod are swapped: in thumb mode SWI 6 should be used (r0->number, r1->denom); in arm mode, SWI 7 (r1->number, r0->denom). So, no need to swap registers... I simply used the wrong SWI :oops:

BTW, now I have some problems with the asm 'dialect' in fpc. It seems that it is not very standard-compliant. For example, it does not understand asm comments (@, ;), only Pascal ones (//); labels should always start with .L; and in some cases asm code that works in gas is not understood by fpc:


mov r0, #0x4000006


generates an "invalid constant" error;


mov r0, r0, lsl #0x10


generates an obscure "internal error 200501051"; elsewhere 'lsl' is reported as an invalid opcode. My asm skills are close to 0, so I'm probably doing something wrong in the code. :?:

dmantione
21-01-2006, 01:24 PM
This is really a question for Florian. The second one is definitely a bug, so feel free to submit a bug report. The first one I don't know; perhaps try to use Pascal syntax, $ instead of 0x? Anyway, please ask Florian what to do here.

Legolas
25-01-2006, 02:45 PM
[quote="Legolas"]
mov r0, #0x4000006

generates an "invalid constant" error;
[/quote]

Oh, I can answer this myself: ARM does not allow loading an arbitrary immediate value into a register, but only values of the form (8 bit value) << (x*2), according to this faq (http://devrs.com/gba/files/gbadevfaqs.php#Shift8bConst).
As for the other question, I have submitted a bug report, so I'm waiting for a fix.
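
The constant rule can be checked on the host with a short sketch. It assumes the classic ARM-mode data-processing immediate, i.e. an 8-bit value rotated right by an even number of bits; the function name IsArmImmediate is made up for this example:

```pascal
program armimm_sketch;
{$mode objfpc}

// Returns true if v equals some 8-bit value rotated right by an even
// number of bits (0, 2, ..., 30).
function IsArmImmediate(v: cardinal): boolean;
var
  s: longint;
  r: cardinal;
begin
  Result := false;
  s := 0;
  while s <= 30 do
  begin
    // Rotating v left by s undoes a rotate-right by s.
    if s = 0 then
      r := v
    else
      r := cardinal((qword(v) shl s) or (v shr (32 - s)));
    if r <= $FF then
    begin
      Result := true;
      exit;
    end;
    s := s + 2;
  end;
end;

begin
  writeln(IsArmImmediate($4000000));  // TRUE:  $04 moved up by 24 bits
  writeln(IsArmImmediate($4000006));  // FALSE: set bits do not fit in 8 bits
  writeln(IsArmImmediate($FF00));     // TRUE:  $FF moved up by 8 bits
end.
```

A constant that fails this test has to be built in several instructions or loaded from memory (for example with a pc-relative ldr) instead of a single mov.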

In the meantime I'm trying to make some nicer-looking demos, hoping that this can attract more people to join the fpc4gba project... :P

dmantione
26-01-2006, 08:27 AM
Just a suggestion: you're posting news on your own site. There is nothing wrong with that, but the FPC4GBA URL is a little better publicized. That is where you need to show the world your progress.

Legolas
26-01-2006, 09:26 AM
[quote="dmantione"]
Just a suggestion: you're posting news on your own site. There is nothing wrong with that, but the FPC4GBA URL is a little better publicized. That is where you need to show the world your progress.
[/quote]

I know, but the fpc4gba main site is owned by WILL. I don't have access to it... :)

savage
26-01-2006, 01:50 PM
I'll have a word with WILL when he gets back so that when you post a news item, it gets replicated on fpc4gba, your site and the pgd news. I'm sure PHP could be used to automate all that. Then it would be less hassle for all concerned.

Legolas
26-01-2006, 03:06 PM
[quote="savage"]
I'll have a word with WILL when he gets back so that when you post a news item, it gets replicated on fpc4gba, your site and the pgd news. I'm sure PHP could be used to automate all that. Then it would be less hassle for all concerned.
[/quote]

Ok, good news! :D
Let me know if I can help you and WILL in some way :rambo: