I guess this code has to run on the CPU (rather than the GPU).
So on my 2.8 GHz P4, it took about 140 cycles per pixel (2.8e9 Hz / 2e7 pixels/s) to do 4 additions and 4 multiplications.
Getting rid of the multiplications (and traversing in the right direction) reduced it to 28 cycles per pixel (2.8e9 Hz / 1e8 pixels/s).
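
For anyone wondering what "getting rid of the multiplications" can look like: this is not the actual loop, just a minimal sketch of the idea, assuming the per-pixel value is a linear function a*x + b*y + c. The function names and the parameters a, b, c are made up for illustration. The second version replaces the multiplications with running sums and walks the buffer in storage order.

```python
def fill_with_multiplies(width, height, a, b, c):
    """Evaluate a*x + b*y + c at every pixel, multiplying each time."""
    out = [0] * (width * height)
    for y in range(height):
        for x in range(width):
            out[y * width + x] = a * x + b * y + c
    return out


def fill_with_additions(width, height, a, b, c):
    """Same values, but only additions inside the loops."""
    out = [0] * (width * height)
    row_start = c   # value at (0, y); grows by b for each row
    i = 0           # linear index, advances in storage order
    for _ in range(height):
        value = row_start
        for _ in range(width):
            out[i] = value
            value += a   # one step in x
            i += 1
        row_start += b   # one step in y
    return out
```

With integer inputs both versions give identical results; with floats the running sums can drift slightly from the multiplied values.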

I think the P4 pipeline is 20 stages long, so one would expect about 20 cycles saved per multiplication; the measured saving is (140 - 28) / 4 = 28 cycles per multiplication.

Makes sense.
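
The arithmetic behind those numbers, as a quick sketch. It assumes the two throughputs are 2e7 and 1e8 pixels per second (the second is what 28 cycles at 2.8 GHz implies) and that all four multiplications were removed; `cycles_per_pixel` is just a helper name for this example.

```python
def cycles_per_pixel(clock_hz, pixels_per_second):
    """Clock cycles spent on each pixel at a given throughput."""
    return clock_hz / pixels_per_second


clock_hz = 2.8e9                               # 2.8 GHz P4
before = cycles_per_pixel(clock_hz, 2e7)       # with multiplications
after = cycles_per_pixel(clock_hz, 1e8)        # additions only (assumed rate)
saved_per_multiply = (before - after) / 4      # 4 multiplications removed

print(before, after, saved_per_multiply)
```

That works out to 140 and 28 cycles per pixel, so about 28 cycles saved per multiplication, in the same ballpark as the 20 guessed from the pipeline length.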