You could take advantage of SIMD instructions (such as SSE) to speed up the calculations, but you would have to write that code in assembly yourself, which is not a trivial task.

You would likely see an immediate speed improvement by compiling the above code for a 64-bit platform (which requires Delphi XE2 or Free Pascal), as there are more registers that the compiler can use to optimize the generated code.

Instead, I would suggest trying a multi-threaded approach first, to take advantage of the CPU's multiple cores.
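
As a rough illustration of the multi-threaded idea, here is a minimal sketch that splits an array into contiguous slices and processes each slice on its own `TThread`. The workload (scaling an array of `Single`), the `TSliceWorker` class, and all the other names are hypothetical stand-ins, since the right way to partition your own loop depends on the code in question. It also assumes that `TThread.ProcessorCount` is available in your compiler version and that the slices can be processed independently of each other.

```pascal
program ParallelSketch;

{$APPTYPE CONSOLE}

uses
  // With Free Pascal, use SysUtils and Classes (no unit scope prefix).
  System.SysUtils, System.Classes;

type
  TSingleArray = array of Single;

  // Hypothetical worker: scales one contiguous slice of a shared array.
  TSliceWorker = class(TThread)
  private
    FData: TSingleArray;
    FFirst, FLast: Integer;
    FFactor: Single;
  protected
    procedure Execute; override;
  public
    constructor Create(const AData: TSingleArray;
      AFirst, ALast: Integer; AFactor: Single);
  end;

constructor TSliceWorker.Create(const AData: TSingleArray;
  AFirst, ALast: Integer; AFactor: Single);
begin
  // Set the fields before the thread is created, so Execute never sees
  // uninitialized values. Dynamic arrays are references, so no copy is made.
  FData := AData;
  FFirst := AFirst;
  FLast := ALast;
  FFactor := AFactor;
  inherited Create(False); // not suspended: the thread starts running
end;

procedure TSliceWorker.Execute;
var
  I: Integer;
begin
  // Each worker touches only its own index range, so no locking is needed.
  for I := FFirst to FLast do
    FData[I] := FData[I] * FFactor;
end;

var
  Data: TSingleArray;
  Workers: array of TSliceWorker;
  I, N, Chunk, First, Last: Integer;
begin
  SetLength(Data, 1000000);
  for I := 0 to High(Data) do
    Data[I] := I;

  // Assumption: one worker per logical core is a sensible starting point.
  N := TThread.ProcessorCount;
  if N < 1 then
    N := 1;

  Chunk := (Length(Data) + N - 1) div N; // slice size, rounded up
  SetLength(Workers, N);
  for I := 0 to N - 1 do
  begin
    First := I * Chunk;
    Last := First + Chunk - 1;
    if Last > High(Data) then
      Last := High(Data);
    Workers[I] := TSliceWorker.Create(Data, First, Last, 2.0);
  end;

  // Wait for every slice to finish before using the results.
  for I := 0 to N - 1 do
  begin
    Workers[I].WaitFor;
    Workers[I].Free;
  end;

  Writeln('Processed ', Length(Data), ' elements on ', N, ' threads');
end.
```

Giving each thread one large contiguous slice keeps thread start-up overhead small relative to the work done and avoids several threads writing to neighbouring elements. If the iterations of your loop depend on each other, this simple partitioning does not apply as-is.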

Further optimizations would involve addressing the memory bandwidth bottleneck, where you would do better by using GPGPU techniques with tools such as CUDA.