[quote="Legolas"]
Yes, that's true. BTW, do this mean that smart[/quote]
Yes, it does. All executables in 2.0.2 are compiled with smartlinking.
I compiled the code with ppcarm. I noticed that the code doesn't look very good without register variables. For example, the prologue of the outer for loop is:
Code:
# [23] for i:=0 to 159 do
mov r1,#0
ldr r0,.Lj10
strh r1,[r0]
ldr r0,.Lj11
ldrsh r1,[r0]
sub r1,r1,#1
ldr r0,.Lj12
strh r1,[r0]
.balign 4
.Lj9:
ldr r0,.Lj13
ldrsh r0,[r0]
add r0,r0,#1
ldr r1,.Lj14
strh r0,[r1]
The reason the compiler generates such large code is that on the ARM you cannot access a variable in memory directly: the compiler first has to load a pointer to the variable into a register (the ldr r0,.Lj... instructions above) before it can read or write it. However, with register variables enabled, the result looks quite good:
Code:
# [24] for j:=0 to 239 do
mov r4,#0
sub r4,r4,#1
.balign 4
.Lj12:
add r4,r4,#1
Since GPC uses register variables by default, this could be one of the causes of the difference. I haven't checked what GPC generates, so this is a bit of guesswork.
The code also has opportunities for global optimizations. For example, a compiler that supports induction variables converts the array index into a pointer and saves the screen address calculation on each iteration. I don't think GPC does this, because I have never seen GCC use an induction variable. FPC cannot do induction variables either, but Delphi, for example, can.
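To make the induction variable idea concrete, here is a minimal sketch of the transformation applied by hand. It is not compiler output. The names FillIndexed, FillWithPointer, TScreen and PScreen are made up for this example, and a heap buffer stands in for the real VideoBuffer so the program also runs on a desktop.

```pascal
program InductionSketch;

const
  ScreenWidth  = 240;
  ScreenHeight = 160;

type
  { Hypothetical stand-in for the GBA screen: 240 x 160 pixels, 16 bits each. }
  TScreen = array[0..ScreenWidth * ScreenHeight - 1] of Word;
  PScreen = ^TScreen;

{ Straightforward version: recomputes j + 240 * i for every pixel. }
procedure FillIndexed(Buf: PScreen);
var
  i, j: LongInt;
begin
  for i := 0 to ScreenHeight - 1 do
    for j := 0 to ScreenWidth - 1 do
      Buf^[j + ScreenWidth * i] := i * j mod 31;
end;

{ Hand-applied induction variable: P walks through the buffer one pixel
  at a time, so no index multiplication or addition is needed per pixel. }
procedure FillWithPointer(Buf: PScreen);
var
  i, j: LongInt;
  P: PWord;
begin
  P := PWord(Buf);
  for i := 0 to ScreenHeight - 1 do
    for j := 0 to ScreenWidth - 1 do
    begin
      P^ := i * j mod 31;
      Inc(P);  { advances by SizeOf(Word) }
    end;
end;

var
  A, B: PScreen;
  k: LongInt;
  Same: Boolean;
begin
  New(A);
  New(B);
  FillIndexed(A);
  FillWithPointer(B);
  { Both versions must produce exactly the same picture. }
  Same := True;
  for k := Low(TScreen) to High(TScreen) do
    if A^[k] <> B^[k] then
      Same := False;
  WriteLn('buffers identical: ', Same);
  Dispose(A);
  Dispose(B);
end.
```

This only removes the address calculation. The modulo in the loop body is untouched, so on its own it should not be expected to close the whole gap.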
However, the most likely cause for the slowdown can be found if we look at the loop body. FPC's code generation for the loop body looks very reasonable:
Code:
# [25] VideoBuffer[j + 240 * i] := i * j mod 31;
ldr r0,.Lj15
ldr r6,[r0]
mov r0,r4
mov r1,r5
mov r2,#240
mul r1,r2,r1
add r7,r1,r0
mov r7,r7,lsl #1
mov r0,r5
mov r2,r4
mul r1,r2,r0
mov r0,#31
bl fpc_mod_longint
mov r0,r0,lsl #16
mov r0,r0,lsr #16
strh r0,[r6, r7]
...however, FPC calls a helper to calculate the modulo. Look at the arm.inc file in the RTL source code (rtl/arm/arm.inc): there is no fpc_mod_longint there.
So the fpc_mod_longint that actually gets used is the one in rtl/inc/generic.inc: a 100% Pascal routine that calculates the modulo in a CPU-independent way, and most likely it isn't very efficient.
So, somebody needs to code a fast assembler version of fpc_mod_longint for the ARM and most likely the problem will be solved.
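To give an idea of what a software modulo has to do on a CPU without a divide instruction, here is a sketch of the classic shift-and-subtract method in plain Pascal. I have not checked how generic.inc implements it, so this is not the RTL's code: ShiftSubMod is a hypothetical name, and the routine assumes a positive divisor.

```pascal
program ModSketch;

{ Remainder by binary long division (shift and subtract).
  Hypothetical illustration only, not the RTL's fpc_mod_longint.
  Assumes D > 0. The sign of the result follows N, as with Pascal's mod. }
function ShiftSubMod(N, D: LongInt): LongInt;
var
  R, Num, Den: LongWord;
  Bit: Integer;
begin
  { Work on the magnitude of N; the Int64 detour also covers Low(LongInt). }
  if N < 0 then
    Num := LongWord(-Int64(N))
  else
    Num := LongWord(N);
  Den := LongWord(D);
  R := 0;
  { One step per bit of the dividend, most significant bit first. }
  for Bit := 31 downto 0 do
  begin
    R := (R shl 1) or ((Num shr Bit) and 1);
    if R >= Den then
      R := R - Den;
  end;
  if N < 0 then
    ShiftSubMod := -LongInt(R)
  else
    ShiftSubMod := LongInt(R);
end;

var
  i, j: LongInt;
  Ok: Boolean;
begin
  { Compare against the built-in mod for every pixel of the test loop. }
  Ok := True;
  for i := 0 to 159 do
    for j := 0 to 239 do
      if ShiftSubMod(i * j, 31) <> (i * j) mod 31 then
        Ok := False;
  { One negative dividend, to check the sign handling. }
  if ShiftSubMod(-7, 3) <> (-7) mod 3 then
    Ok := False;
  WriteLn('matches built-in mod: ', Ok);
end.
```

The loop always runs 32 steps, once per pixel, which shows why a tuned assembler version for the ARM could make a noticeable difference.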
[quote="Legolas"]
The only big difference is that in fpc I can't figure a suitable way to declare 'absolute' VideoBuffer.[/quote]
That's simple to solve: Free Pascal supports absolute, but it is only enabled for DOS. We need to enable it for the GBA as well.