[quote="Legolas"]
Yes, that's true. BTW, does this mean that smart...[/quote]

Yes, it does. All executables in 2.0.2 are compiled with smartlinking.

I compiled the code with ppcarm. I noticed that the generated code doesn't look very good without register variables; for example, the prologue of the outer for loop is:

Code:
# [23] for i:=0 to 159 do
        mov     r1,#0
        ldr     r0,.Lj10
        strh    r1,[r0]
        ldr     r0,.Lj11
        ldrsh   r1,[r0]
        sub     r1,r1,#1
        ldr     r0,.Lj12
        strh    r1,[r0]
        .balign 4
.Lj9:
        ldr     r0,.Lj13
        ldrsh   r0,[r0]
        add     r0,r0,#1
        ldr     r1,.Lj14
        strh    r0,[r1]
The reason the compiler generates such huge code is that on the ARM you cannot access a global variable directly: the compiler first has to load the variable's address into a register (the ldr r0,.LjNN instructions above) before it can read or write the variable. However, with register variables enabled the result looks quite good:

Code:
# [24] for j:=0 to 239 do
        mov     r4,#0
        sub     r4,r4,#1
        .balign 4
.Lj12:
        add     r4,r4,#1
Since GPC uses register variables by default this could be one of the causes of the difference. I haven't checked what GPC generates, so it is a bit of guesswork.

The code also has opportunities for global optimizations. For example, a compiler that supports induction variables converts the array index into a pointer that is advanced on each iteration, which saves recalculating the screen address every time. I don't think GPC does this, because I have never seen GCC use an induction variable. FPC cannot do induction variables either, but Delphi, for example, can.
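To illustrate what an induction variable buys you, here is a rough sketch that applies the transformation by hand. It is not compiler output; the names FillIndexed and FillPointer and the ordinary array standing in for the video buffer are made up for the example, so that it also runs on a desktop target. Keeping the counters as local variables also gives the compiler a chance to hold them in registers.

```pascal
program InductionSketch;
{$mode objfpc}

const
  Width  = 240;
  Height = 160;

type
  TScreen = array[0..Width * Height - 1] of Word;

var
  A, B: TScreen;  { stand-ins for the real video buffer (assumption) }

{ Straightforward version: the index j + Width * i is recomputed,
  including a multiplication, on every pass of the inner loop. }
procedure FillIndexed(var Buf: TScreen);
var
  i, j: LongInt;
begin
  for i := 0 to Height - 1 do
    for j := 0 to Width - 1 do
      Buf[j + Width * i] := i * j mod 31;
end;

{ Hand-applied induction variable: P walks through the buffer one
  element at a time, so no index calculation is left in the loop. }
procedure FillPointer(var Buf: TScreen);
var
  i, j: LongInt;
  P: PWord;
begin
  P := @Buf[0];
  for i := 0 to Height - 1 do
    for j := 0 to Width - 1 do
    begin
      P^ := i * j mod 31;
      Inc(P);  { advances by SizeOf(Word) }
    end;
end;

var
  k: LongInt;
  Same: Boolean;
begin
  FillIndexed(A);
  FillPointer(B);
  { Both versions must produce exactly the same buffer contents. }
  Same := True;
  for k := Low(A) to High(A) do
    if A[k] <> B[k] then
      Same := False;
  WriteLn('identical: ', Same);
end.
```

The pointer version only works because the loops visit the elements in memory order; a compiler doing this automatically has to prove the same thing.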

However, the most likely cause for the slowdown can be found if we look at the loop body. FPC's code generation for the loop body looks very reasonable:
Code:
# [25] VideoBuffer[j + 240 * i] := i * j mod 31;
        ldr     r0,.Lj15
        ldr     r6,[r0]
        mov     r0,r4
        mov     r1,r5
        mov     r2,#240
        mul     r1,r2,r1
        add     r7,r1,r0
        mov     r7,r7,lsl #1
        mov     r0,r5
        mov     r2,r4
        mul     r1,r2,r0
        mov     r0,#31
        bl      fpc_mod_longint
        mov     r0,r0,lsl #16
        mov     r0,r0,lsr #16
        strh    r0,[r6, r7]
...however, FPC calls a helper to calculate the modulo. Look at the arm.inc file in the RTL source code (rtl/arm/arm.inc): there is no fpc_mod_longint in it.

So the fpc_mod_longint that actually gets used is the one in rtl/inc/generic.inc: a 100% Pascal routine that calculates the modulo in a CPU-independent way and is most likely not very efficient.

So, somebody needs to code a fast assembler version of fpc_mod_longint for the ARM and most likely the problem will be solved.
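For anyone who wants to try: the ARM7TDMI has no divide instruction, so such a routine has to do the division in software. Below is a rough Pascal sketch of the usual shift-and-subtract approach that a hand-written assembler version could follow. It is not the code from generic.inc; ShiftSubMod is a hypothetical name, and it only handles unsigned operands with a nonzero divisor.

```pascal
program ModSketch;
{$mode objfpc}

{ Hypothetical helper, not part of the FPC RTL: remainder of N divided
  by D (D > 0), computed by binary shift-and-subtract long division. }
function ShiftSubMod(N, D: LongWord): LongWord;
var
  Shifted: LongWord;
begin
  Shifted := D;
  { Scale the divisor up by powers of two while it still fits into N.
    Comparing against N shr 1 avoids overflowing the 32-bit value. }
  while Shifted <= (N shr 1) do
    Shifted := Shifted shl 1;
  { Work back down to D, subtracting the scaled divisor where it fits. }
  while Shifted >= D do
  begin
    if N >= Shifted then
      N := N - Shifted;
    Shifted := Shifted shr 1;
  end;
  Result := N;  { what is left is smaller than D: the remainder }
end;

var
  i, j: LongWord;
  Ok: Boolean;
begin
  { Compare against the built-in mod for every value the loop body uses. }
  Ok := True;
  for i := 0 to 159 do
    for j := 0 to 239 do
      if ShiftSubMod(i * j, 31) <> (i * j) mod 31 then
        Ok := False;
  WriteLn('matches built-in mod: ', Ok);
end.
```

A real fpc_mod_longint also has to deal with negative operands and division by zero, which this sketch leaves out.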

Quote Originally Posted by Legolas
The only big difference is that in fpc I can't figure a suitable way to declare 'absolute' VideoBuffer.
That's simple to solve: Free Pascal can do absolute, but it is only enabled for DOS. We need to enable it for the GBA as well.
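Until that is enabled, a typed pointer can serve as a stand-in. This is only a sketch under assumptions: the type names are made up, and the frame buffer address $06000000 mentioned in the comment is my assumption about the GBA video mode being used, not something taken from the posted code. Heap memory is used here so the example also runs on a desktop target.

```pascal
program AbsoluteWorkaround;
{$mode objfpc}

const
  ScreenWidth  = 240;
  ScreenHeight = 160;

type
  TVideoBuffer = array[0..ScreenWidth * ScreenHeight - 1] of Word;
  PVideoBuffer = ^TVideoBuffer;

var
  VideoBuffer: PVideoBuffer;
  i, j: LongInt;
begin
  { Assumption: on the GBA the pointer would be set to the hardware
    address instead, e.g. VideoBuffer := PVideoBuffer($06000000).
    Here it points at heap memory so the sketch runs anywhere. }
  New(VideoBuffer);

  { Same loop as in the test program, going through the pointer. }
  for i := 0 to ScreenHeight - 1 do
    for j := 0 to ScreenWidth - 1 do
      VideoBuffer^[j + ScreenWidth * i] := i * j mod 31;

  { Show the element written for i = 1, j = 1. }
  WriteLn(VideoBuffer^[ScreenWidth + 1]);

  Dispose(VideoBuffer);
end.
```

The cost compared to a real absolute variable is one extra load to fetch the pointer, which is the ldr r0,.Lj15 / ldr r6,[r0] pair at the top of the loop body listing.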