I don't know if this belongs here, but my current solution to the problem below is too slow, and I want to speed it up using SSE2 (or SSE3; the machines are recent Core 2s).

So now I'm looking for MMX/SSE/SSE2 tutorials with commented examples in assembler (not Intel intrinsics). Most of what I've been able to find is heaps of sites that merely list the instructions, and a very few sites that use Intel intrinsics.

Note that suggestions to do this on the GPU are not constructive. Transferring each image to and from the GPU is more expensive than even the straight floating-point calculation, even before I introduce SSE. (8600GT: 20 ms to the card, probably the same to get it back.)

-----------------
For those interested, here is my problem (from image processing, where the problem is getting decent FPS too :-)

- I have a 28 Mpixel image (4096x7000), 8bpp gray.
- I need to correct each pixel for lighting circumstances, which means multiplying it by a float.
- However, the lighting correction is not an image but a vector with one entry per pixel of a row (the same correction vector applies to every row).

So I have
- 7000 vectors of 4096 bytes.
- And I need to multiply each vector, element by element, with another 4096-item vector of corrections.

The correction is currently an array[0..4095] of single, but it could be an unsigned 16-bit fixed-point value.

So for each byte one would do something like

Code:
var b : byte;
    w,w2: word;

for x :=0 to 4095 do
begin
  b:=imagebyte[x];
  w:=correctionword[x];
  w2:= saturate16(b* w); // saturated 16bit.
  imagebyte[x]:=saturate8(w2 shr fixedpointbits);  // scale down and fit into 8 bits, again using saturation
end;
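
For reference, the loop above can be written out as a small self-contained Free Pascal program. This is only a sketch under assumptions: the 8 fractional bits (FracBits), the type names and the helpers ToFixed, CorrectPixel and CorrectRow are all made up for illustration, and the test correction factor of 1.5 is arbitrary. Note that it uses a 32-bit intermediate product and a single clamp to 255 instead of the saturate16/saturate8 pair, so it is a plain scalar baseline to compare an SSE2 version against, not a one-to-one model of the SSE2 instructions.

```pascal
program LightingCorrection;

{$mode objfpc}

const
  RowWidth = 4096;  // pixels per row, as in the post
  FracBits = 8;     // ASSUMPTION: 8.8 unsigned fixed point for the correction

type
  TRow       = array[0..RowWidth - 1] of Byte;
  TFloatCorr = array[0..RowWidth - 1] of Single;
  TFixedCorr = array[0..RowWidth - 1] of Word;

// Convert one float factor to unsigned 16-bit fixed point,
// clamping to the range of Word (hypothetical helper).
function ToFixed(f: Single): Word;
var
  v: Int64;
begin
  v := Round(f * (1 shl FracBits));
  if v < 0 then v := 0;
  if v > High(Word) then v := High(Word);
  Result := Word(v);
end;

// Multiply one pixel by its fixed-point correction, scale back down
// and saturate to 8 bits. The product is kept in 32 bits here.
function CorrectPixel(b: Byte; w: Word): Byte;
var
  p: LongWord;
begin
  p := (LongWord(b) * w) shr FracBits;
  if p > 255 then p := 255;
  Result := Byte(p);
end;

// Apply the same correction vector to one row of the image.
procedure CorrectRow(var row: TRow; const corr: TFixedCorr);
var
  x: Integer;
begin
  for x := 0 to RowWidth - 1 do
    row[x] := CorrectPixel(row[x], corr[x]);
end;

var
  row:   TRow;
  corrF: TFloatCorr;
  corr:  TFixedCorr;
  x:     Integer;
begin
  // Made-up test data: a gray ramp and a constant factor of 1.5.
  for x := 0 to RowWidth - 1 do
  begin
    row[x]   := x and 255;
    corrF[x] := 1.5;
  end;

  // Convert the float corrections once; they are reused for all 7000 rows.
  for x := 0 to RowWidth - 1 do
    corr[x] := ToFixed(corrF[x]);

  CorrectRow(row, corr);

  // 100 * 1.5 = 150 fits in a byte; 200 * 1.5 = 300 saturates to 255.
  WriteLn(row[100], ' ', row[200]);
end.
```

The conversion to fixed point happens once per correction vector, so its cost does not matter next to the 7000 row passes; only CorrectRow would need to be replaced by the SSE2 routine.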