Quote Originally Posted by imcold
EBX should be saved and restored by the compiler if it's listed in the registerlist after an asm block (Ref. guide, 10.3 Assembler statements). If it's not, it's a bug in fpc, I believe.
Two tips for some extra speed:
- align src/dest/corr to addresses that are multiples of 16 (and replace movdqu with movdqa, it should help mainly on Intel cpus)
- try to prefetch the next row (or couple of rows) of pixels (helps in some cases, sometimes it doesn't).
src and dest are loaded in 8-byte quantities, and are aligned to 4 bytes on D7 and to 8 bytes on D2006 (FastMM aligns to 8 bytes). I can't change that easily, except by replacing the heap manager, and if you load in 8-byte values you won't stay 16-byte aligned for long anyway. That alignment is the reason I use D2006 for the speed-dependent projects (that, and its SSE3 support for LDDQU).
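
To make the alignment point concrete, here is a rough sketch in C (not the Delphi code this thread is about). The helper names is_aligned and count_aligned16_loads are made up for illustration. It only does the address arithmetic: walk a buffer in fixed-size steps and count how many load addresses fall on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of align; align must be a power of two. */
static int is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Counts how many of `count` loads, starting at `base` and advancing by
   `stride` bytes each time, begin on a 16-byte boundary. Hypothetical
   helper, only meant to illustrate the arithmetic. */
static size_t count_aligned16_loads(uintptr_t base, size_t stride, size_t count)
{
    size_t hits = 0;
    for (size_t i = 0; i < count; i++) {
        if (is_aligned(base + i * stride, 16))
            hits++;
    }
    return hits;
}
```

With an 8-byte stride, half of the loads are 16-byte aligned no matter whether the base is 16-byte or only 8-byte aligned. With a 16-byte stride it is all or nothing, depending entirely on the base address the heap manager hands out.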

I aligned the cor array to 32 bytes, and have an ifdef to load it and use LDDQU, but saw no improvement. I'm probably hitting memory bandwidth limits.

I doubt prefetch will do much, since I simply walk through a 4MB memory block from 0 to 4MB-1. If the predictor in the CPU can't predict that, there is no point in having prefetch in the first place ;_)
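
For reference, this is roughly what an explicit prefetch in a linear walk looks like, sketched in C rather than Delphi assembler. The assumptions are mine: a 64-byte cache line, a look-ahead of 256 bytes (PREFETCH_AHEAD, not a tuned value), and the GCC/Clang builtin __builtin_prefetch, which is compiled out on other compilers. The function name sum_bytes_linear is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed look-ahead distance in bytes; illustrative, not tuned. */
#define PREFETCH_AHEAD 256

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch with low temporal locality (data is used once). */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
/* No-op on compilers without the builtin. */
#define PREFETCH_READ(p) ((void)0)
#endif

/* Walks buf from 0 to len-1 and sums the bytes, issuing one prefetch
   hint per assumed 64-byte cache line. */
static uint64_t sum_bytes_linear(const uint8_t *buf, size_t len)
{
    uint64_t total = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if ((i % 64) == 0 && i + PREFETCH_AHEAD < len)
            PREFETCH_READ(buf + i + PREFETCH_AHEAD);
        total += buf[i];
    }
    return total;
}
```

The hint never changes the result, only (possibly) the timing, so the only way to know whether it helps on a plain 0 to 4MB-1 walk is to time it with and without.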

I'm currently trying this in Delphi btw, so no registerlist. BTW: afaik registerlist works for blocks, but has no effect on assembler procedures.