Speeds with an AMD Athlon 4800 X2 (from your download, Chebs), times in ms:
Delphi - 3782
Turbo Delphi, Single - 3609
Turbo Delphi, Double - 5422
MSVC Single SSE - 2156
FPC Double - 6047
FPC Double SSE2 - 2922 (!!!)
FPC Single - 3234
FPC Single SSE - 3515 (??)

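It looks to me like the best speeds overall go to the SSE2-optimized code, and especially to the Double code with SSE2: FPC Double drops from 6047 to 2922. The size of the boost for doubles is kind of surprising, but pleasing. That's only about 750 ms behind C++ (MSVC Single SSE at 2156). The odd one out is FPC Single SSE, which at 3515 is actually slower than plain FPC Single at 3234.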

My only question: if you enable single/double optimization with SSE2, how do you guarantee that the program will still run on a system without SSE2? I'm thinking you'd need a separate executable compiled without the SSE2 optimizations.
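
That said, one alternative to a second executable that I'm aware of is runtime dispatch: keep both a plain and an SSE2 version of the hot routine in the same program, check the CPU once at startup, and call through a function pointer. Below is a rough sketch in C rather than Pascal, under some loud assumptions: the names sum_scalar, sum_sse2, cpu_has_sse2 and select_sum are made up for illustration, the "SSE2" routine is just a plain-C stand-in for code that would really be built with SSE2 enabled, and the check uses the GCC/Clang builtin __builtin_cpu_supports, falling back to "no SSE2" on other compilers. I haven't checked what FPC or Delphi offer for this.

```c
#include <stddef.h>

/* Hypothetical kernel: plain scalar sum, safe on any CPU. */
double sum_scalar(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += v[i];
    }
    return s;
}

/* Hypothetical stand-in for the SSE2 build of the same routine.
   In a real project this would sit in its own source file compiled
   with SSE2 enabled; here it is ordinary C that only mimics the
   two-lane layout of a 128-bit register holding two doubles. */
double sum_sse2(const double *v, size_t n) {
    double lane0 = 0.0, lane1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    if (i < n) {
        lane0 += v[i]; /* leftover element when n is odd */
    }
    return lane0 + lane1;
}

/* Returns nonzero when the CPU reports SSE2.
   Assumption: GCC or Clang on x86; anything else conservatively says no. */
int cpu_has_sse2(void) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2");
#else
    return 0;
#endif
}

typedef double (*sum_fn)(const double *, size_t);

/* Pick the implementation once; callers keep the returned pointer. */
sum_fn select_sum(void) {
    return cpu_has_sse2() ? sum_sse2 : sum_scalar;
}
```

The idea would be to call select_sum() once at startup and use the returned pointer in the inner loop. The catch is that only the SSE2 routine may be compiled with SSE2 enabled; if the whole program is built that way, the compiler can emit SSE2 instructions anywhere and the fallback never gets a chance to run.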