EBX should be saved and restored by the compiler if it's listed in the register list after the asm block (see the Reference guide, 10.3 Assembler statements). If the compiler doesn't do that, I believe it's a bug in FPC.
Two tips for some extra speed:
- align src/dest/corr to addresses that are multiples of 16 and replace movdqu with movdqa (it should help mainly on Intel CPUs). Note that movdqa faults on an unaligned address, so only switch once the alignment is guaranteed.
- try to prefetch the next row (or the next couple of rows) of pixels. It helps in some cases and makes no difference in others, so measure it.
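
To make both tips concrete, here is a rough sketch in C rather than FPC asm, since the SSE2 intrinsics map directly onto the instructions mentioned above (_mm_load_si128 / _mm_store_si128 become movdqa). Everything named here is my own assumption: align16 and add_rows are hypothetical helpers, the saturating add is only a stand-in for whatever your routine really does with src/corr, the row stride is assumed to be a multiple of 16, and __builtin_prefetch is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics: aligned load/store compile to movdqa */
#endif

/* Hypothetical helper: round a pointer up to the next multiple of 16.
   Allocate 15 extra bytes and keep the raw pointer for free(). */
static uint8_t *align16(void *p) {
    return (uint8_t *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
}

/* Hypothetical kernel: dst = saturating add of src and corr, row by row.
   Assumes all three pointers and the stride (in bytes) are multiples of 16. */
static void add_rows(uint8_t *dst, const uint8_t *src, const uint8_t *corr,
                     size_t rows, size_t stride) {
    /* Aligned loads fault on misaligned addresses, so check up front. */
    assert(((uintptr_t)dst & 15u) == 0);
    assert(((uintptr_t)src & 15u) == 0);
    assert(((uintptr_t)corr & 15u) == 0);
    assert((stride & 15u) == 0);

    for (size_t y = 0; y < rows; ++y) {
        const uint8_t *s = src + y * stride;
        const uint8_t *c = corr + y * stride;
        uint8_t *d = dst + y * stride;

        /* Hint the start of the next row into cache while this one is processed. */
        if (y + 1 < rows) {
            __builtin_prefetch(s + stride);
            __builtin_prefetch(c + stride);
        }

        for (size_t x = 0; x < stride; x += 16) {
#if defined(__SSE2__)
            __m128i a = _mm_load_si128((const __m128i *)(s + x));
            __m128i b = _mm_load_si128((const __m128i *)(c + x));
            _mm_store_si128((__m128i *)(d + x), _mm_adds_epu8(a, b));
#else
            /* Scalar fallback with the same result on non-SSE2 targets. */
            for (size_t i = 0; i < 16; ++i) {
                unsigned v = (unsigned)s[x + i] + (unsigned)c[x + i];
                d[x + i] = (uint8_t)(v > 255u ? 255u : v);
            }
#endif
        }
    }
}
```

The same rounding trick works in FPC: GetMem 15 bytes more than you need and round the pointer up. The prefetch only touches the first cache line of the next row, so for wide rows you may want to issue several, and as said above it's worth timing with and without it.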