Apparently it takes a huge amount of time to read .bits and .pitch.
Nope, reading them should be like reading any other variable. To explain the speedup: even without counting hidden additions and the like, the first variant had 4 visible extra multiplications and 2 extra additions per iteration, so for a 256x256 texture (65536 iterations) we'd get 262144 extra multiplications and 131072 extra additions. Also, something like r0.Bits turns into "address of r0 + offset of Bits", so it's better to store it in a variable before the loop to avoid the unnecessary additions. Another thing is that it's better to loop over the data in the order it is stored in memory (images are stored in rows) to minimize cache misses.
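To make the two points concrete, here is a rough sketch in C. It is not the code from the question: `LockedRect`, `fill_slow` and `fill_fast` are made-up names, and I'm assuming 32-bit pixels and a pitch that is a multiple of 4 bytes. The first function recomputes the full address for every pixel and walks down columns; the second reads `bits` and `pitch` once, walks along rows, and only adds the pitch once per row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the locked-surface struct discussed above. */
typedef struct {
    uint8_t  *bits;   /* address of the first pixel */
    ptrdiff_t pitch;  /* bytes from the start of one row to the next */
} LockedRect;

/* Slow variant: full address computed per pixel (multiplications inside
   the inner loop) and column-by-column traversal, which jumps a whole
   row ahead in memory on every step. */
void fill_slow(const LockedRect *r, int width, int height, uint32_t color)
{
    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            uint32_t *p = (uint32_t *)(r->bits + y * r->pitch + x * 4);
            *p = color;
        }
    }
}

/* Faster variant: struct fields are copied to locals before the loop,
   rows are visited in memory order, and the row pointer is advanced by
   one addition per row instead of a multiplication per pixel. */
void fill_fast(const LockedRect *r, int width, int height, uint32_t color)
{
    uint8_t        *row   = r->bits;
    const ptrdiff_t pitch = r->pitch;

    /* Assumption: a row holds at least `width` 32-bit pixels. */
    assert(pitch >= (ptrdiff_t)width * 4);

    for (int y = 0; y < height; y++) {
        uint32_t *p = (uint32_t *)row;
        for (int x = 0; x < width; x++) {
            p[x] = color;
        }
        row += pitch;
    }
}
```

Both functions write exactly the same pixels, so you can swap one for the other and time them. An optimizing compiler may already hoist some of this for you, so measure rather than take my word for how big the difference is on your setup.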