src and dest are loaded in 8-byte quantities, and are aligned to 4 bytes on D7 and to 8 bytes on D2006 (FastMM aligns to 8 bytes). I can't change that easily, except by replacing the heap manager, but if you load 8-byte values, you'll never stay 16-byte aligned for long. That is the reason I use D2006 for the speed-dependent projects (and also for the SSE3 support, for LDDQU).
Originally Posted by imcold
I aligned the cor array to 32 bytes, and have an ifdef to load it with lddqu, but no improvement. Probably hitting memory bandwidth limits.
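For anyone wanting to try the same manual alignment: a minimal sketch, in C rather than Delphi, of over-allocating and rounding the pointer up when the heap manager only guarantees 8-byte alignment. The names `align_up` and `alloc_aligned` are made up for illustration, and this is not necessarily how the cor array above is aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round p up to the next multiple of align.
   Assumption: align is a power of two (16, 32, ...). */
static void *align_up(void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *)((addr + (align - 1)) & ~(uintptr_t)(align - 1));
}

/* Allocate size bytes aligned to align by over-allocating.
   The pointer returned by malloc is stored in *raw_out; that one,
   not the aligned pointer, must be passed to free() later. */
static void *alloc_aligned(size_t size, size_t align, void **raw_out)
{
    void *raw = malloc(size + align - 1);
    *raw_out = raw;
    if (raw == NULL)
        return NULL;
    return align_up(raw, align);
}
```

Usage would be something like `buf = alloc_aligned(1024, 32, &raw)`, working on `buf`, then `free(raw)`. At most `align - 1` bytes are wasted per block, which is nothing for a 4MB buffer.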
I doubt prefetch will do much, since I simply walk through a 4MB memory block from 0 to 4MB-1. If the predictor in the CPU can't predict that, there is no point in having prefetch in the first place ;-)
I'm currently trying this in Delphi btw, so no register list. Also, afaik the register list works for asm blocks, but has no effect on assembler procedures.