Optimizing in Free Pascal.

**Ñuño Martínez** · 16-12-2009, 02:20 PM

Now that we have fixed the last known bug in Allegro, I wonder if I can optimize it before the next release.

First, I've read that the INLINE keyword only has an effect inside the unit where it's used. That would mean I can't use it to avoid the extra call at all. Is that true? Most functions and procedures are just wrappers around the actual call. For example, the sprite drawing procedure:

Code:

PROCEDURE al_draw_sprite_ex (bmp, sprite: AL_BITMAPptr; x, y, mode, flip: LONGINT);
BEGIN
 bmp^.vtable^.draw_sprite_ex (bmp, sprite, x, y, mode, flip);
END;

Another group of procedures uses wrappers to make the API more Pascal-like and/or to use Pascal data types:

Code:

(* Function for messages. *)
 PROCEDURE _allegro_message (CONST msg: PCHAR); CDECL;
  EXTERNAL ALLEGRO_SHARED_LIBRARY_NAME NAME 'allegro_message';

(* Outputs a message. *)
 PROCEDURE al_message (CONST msg: STRING);
 BEGIN
  _allegro_message (PCHAR (msg));
 END;

I think that just "inlining" those calls (and there are a lot of them) should raise performance a lot, because almost all the sprite and polygon drawing routines are like the previous ones. Actually, they're implemented as macros in C (you know, "#define ...").

Another question: how can I profile FPC programs? I used gprof some years ago, but I don't remember how it works. Can I use it with FPC?

**jdarling** · 16-12-2009, 03:22 PM

I've been looking for a good Profiler for FPC for a while now to little or no avail. You can use Valgrind on Linux (http://lazarusroad.blogspot.com/2007...ofile-fpc.html shows a start) but for Windows there really isn't an answer. Lots of people have talked about porting DelphiTools' Sampling Profiler (http://delphitools.info/samplingprofiler/) but I've not seen a single one complete the task (don't know how hard it would really be).

This page (http://www.freepascal.org/docs-html/user/userse56.html) seems to allude to using gprof with the -pg compiler flag, but again this seems to be a Linux-only solution.

I'll be interested to see if anyone else finds anything beyond this.

UPDATE:
Looks like built-in profiling is broken (http://wiki.freepascal.org/Profiling...ofiler_support) except in trunk 251

- Jeremy

**User137** · 16-12-2009, 03:43 PM

Originally Posted by Ñuño Martínez

I think that just "inlining" those calls (and there are a lot of them) should raise performance a lot, because almost all the sprite and polygon drawing routines are like the previous ones. Actually, they're implemented as macros in C (you know, "#define ...").

How much is "a lot"? I can hardly imagine that optimizing this would give any noticeable performance increase.

But I try to avoid wrapping single functions as much as possible. If I can call a function directly from a unit, like the OpenGL header, I do that.

**Ñuño Martínez** · 16-12-2009, 04:26 PM

Originally Posted by jdarling

UPDATE:
Looks like built-in profiling is broken (http://wiki.freepascal.org/Profiling...ofiler_support) except in trunk 251

It says it's broken only for gprof. I'll try with Valgrind.

Thanks for the suggestion.

Originally Posted by User137

How much is "a lot"? I can hardly imagine that optimizing this would give any noticeable performance increase.

"A lot" is "more than a little"... or something. Actually, I don't know how much, but it should help in some cases. A game typically draws/blits more than 100 bitmaps per frame, so with the wrappers it currently makes 200 calls (with parameter passing). Using INLINE, that would be reduced to 100 calls. Add the calls for input testing (mouse, joystick...) and sound. Yesterday I was rewriting a voxel renderer I wrote some years ago in C. It tests several thousand voxels each frame, at 30 frames per second. That is a lot of calls.

Originally Posted by User137

But I try to avoid wrapping single functions as much as possible. If I can call a function directly from a unit, like the OpenGL header, I do that.

My fault. I tried to make the API Pascal-like and use Pascal types (e.g. STRING, ARRAY OF SOMETHING) instead of the C ones (pointers, pointers, and more pointers...), which would force typecasting in a lot of calls.

**arthurprs** · 16-12-2009, 06:21 PM

Inline those, not only faster, but smaller binary

**de_jean_7777** · 16-12-2009, 06:41 PM

Originally Posted by arthurprs

Inline those, not only faster, but smaller binary

While the above statement might sometimes be true, it is usually wrong. The entire code of an inline routine is placed wherever the routine is called, and if you have sufficiently complex routines, this results in bigger code, not smaller.

Inline routines are inlined throughout the entire program, not only in the units that contain them; this is true at least for FPC (version 2.0.2 or greater). However, it depends on the compiler, which may decide that some routines cannot be inlined; those are then executed as regular routine calls.

Depending on the nature of a routine, an inline routine may be up to 2x faster than a non-inline routine (because the overhead of calling the routine disappears). This can be verified with a simple check: write an inline routine in a unit that performs some mathematical operation (e.g. normalization of a vector) and call it from a program many times (1,000,000 or more). Measure the time taken by an inlined and a non-inlined version of the routine, and the difference should be visible.
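The check described above can be sketched roughly as follows. This is only a minimal sketch, not a rigorous benchmark: `TVec`, `Dot` and `DotInline` are hypothetical names, a dot product stands in for the vector normalization mentioned above, both routines live in the program itself rather than in a separate unit, and it assumes a recent FPC whose `SysUtils` unit provides `GetTickCount64`. The actual timings depend on the compiler version, target and optimization settings, so none are shown.

```pascal
program InlineBench;

{$MODE OBJFPC}
{$INLINE ON}  { enables inlining, like the -Si command-line option }

uses
  SysUtils;

type
  TVec = record
    x, y, z: Double;
  end;

{ Two hypothetical routines with identical bodies;
  only the INLINE directive differs. }
function Dot(const a, b: TVec): Double;
begin
  Result := a.x * b.x + a.y * b.y + a.z * b.z;
end;

function DotInline(const a, b: TVec): Double; inline;
begin
  Result := a.x * b.x + a.y * b.y + a.z * b.z;
end;

const
  Iterations = 10000000;

var
  a, b: TVec;
  i: LongInt;
  sum: Double;
  start: QWord;

begin
  b.x := 0.5; b.y := 0.25; b.z := 0.125;
  a.y := 2.0; a.z := 3.0;

  { Time the regular (non-inline) version. }
  sum := 0.0;
  start := GetTickCount64;
  for i := 1 to Iterations do
  begin
    a.x := i;  { vary the input on every iteration }
    sum := sum + Dot(a, b);
  end;
  WriteLn('regular call: ', GetTickCount64 - start, ' ms, checksum ', sum:0:2);

  { Time the version marked INLINE. }
  sum := 0.0;
  start := GetTickCount64;
  for i := 1 to Iterations do
  begin
    a.x := i;
    sum := sum + DotInline(a, b);
  end;
  WriteLn('inline call:  ', GetTickCount64 - start, ' ms, checksum ', sum:0:2);
end.
```

The checksum is printed so the compiler cannot discard the loops as dead code. Running each loop several times and comparing the best results gives a fairer picture than a single run.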

**arthurprs** · 20-12-2009, 02:28 PM

In this case it would probably save a few bytes: instead of calling the wrapper, which then calls the target, the code just calls the target.

You should inline those very small functions that are called many times.

**Ñuño Martínez** · 21-12-2009, 03:38 PM

After adding "INLINE" to a lot of procedures and functions, I ran some tests compiling with and without the "-Si" option, but I can see almost no difference. I'm not sure whether the compiler is deciding that my code can't be "inlined".
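One way to investigate this is to reduce the problem to a small stand-alone program and compile it with notes enabled (`-vn`). As far as I know, FPC reports a note when a call to a routine marked INLINE is not actually inlined, but treat that as an assumption to verify against your compiler version. The sketch below mimics the shape of the vtable wrapper shown earlier in the thread; `TBitmap`, `TVTable`, `FakeDraw` and `draw_sprite_wrapped` are hypothetical stand-ins, not the real Allegro declarations.

```pascal
program InlineCheck;

{$MODE OBJFPC}
{$INLINE ON}  { enables inlining, like the -Si command-line option }

type
  PBitmap = ^TBitmap;
  TDrawProc = procedure (bmp, sprite: PBitmap; x, y: LongInt);

  { Hypothetical stand-in for a table of driver routines. }
  TVTable = record
    draw_sprite: TDrawProc;
  end;

  TBitmap = record
    vtable: ^TVTable;
    drawn: LongInt;  { counts how many sprites were "drawn" }
  end;

{ Stand-in for the real drawing routine: it only counts calls. }
procedure FakeDraw(bmp, sprite: PBitmap; x, y: LongInt);
begin
  Inc(bmp^.drawn);
end;

{ The wrapper under test: a single indirect call, marked INLINE. }
procedure draw_sprite_wrapped(bmp, sprite: PBitmap; x, y: LongInt); inline;
begin
  bmp^.vtable^.draw_sprite(bmp, sprite, x, y);
end;

var
  vt: TVTable;
  screen, sprite: TBitmap;

begin
  vt.draw_sprite := @FakeDraw;
  screen.vtable := @vt;
  screen.drawn := 0;
  sprite.vtable := @vt;
  sprite.drawn := 0;

  draw_sprite_wrapped(@screen, @sprite, 10, 20);
  WriteLn('sprites drawn: ', screen.drawn);
end.
```

If the compiler stays silent for this small case but the timings of the real library do not change, the call overhead may simply be small compared with the drawing work itself, which would match the doubts raised earlier in the thread.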

**paul_nicholls** · 22-12-2009, 04:40 AM

I don't know if it works for Free Pascal and/or Lazarus (probably not), but there is a free Delphi profiler program you can find here:

http://delphitools.info/

cheers,
Paul

**phibermon** · 10-04-2010, 05:57 PM

I'm guilty of not profiling my code. I just do my best: I write each new processor-intensive task in a separate test app, set some arbitrary but fixed usage pattern for the test, and get it running as quickly as I can before I get fed up with optimizing.

That way, if I ever decide that I need to optimize some more, I can just go back to any suspiciously expensive test app and poke and prod it a bit more.

Obviously this won't work for any task that has multiple steps that can't be divided out into separate test apps due to mutual dependence, and it's not a true test of a typical usage pattern in the system as a whole.

But it works for me and encourages a good modular design.

edit : I'm informed that this technique is similar to extreme programming (http://www.extremeprogramming.org/).

An interesting idea.