WILL, you've posted this before, but as far as I can see you didn't take any of my feedback into account, so here it goes again:
  • You're going the wrong way, X first and Y second. It should be the other way around (Y in the outer loop, X in the inner one) so memory access is sequential and pleasing for your cache.
  • Using precalculated values just once in a procedure gains you nothing; you might as well use the real functions for more precision.
  • Rounding in the innermost loop is costly. Round aCos and aSin outside of it, multiplied by a power of two, and in the loop just use fixed-point shifts.
  • Function calls in the innermost loop are bad. Extract those put/get routines, or better yet make different versions of your procedure for 32bit -> 32bit, 8bit -> 8bit, etc., and in the main one check the surface formats and call the appropriate one.