Warning: this post contains vague, fuzzy optimisation info.

First of all, make sure you're working with a system memory surface rather than a video memory surface. This is very important: reading back from a video memory surface is extremely slow. If possible, also specify the DDLOCK_READONLY or DDLOCK_WRITEONLY flag when locking, as these sometimes help.
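For reference, creating and locking such a surface looks roughly like the sketch below - the exact unit, type and method names depend on which Delphi DirectX header translation you're using, so treat it as an outline rather than drop-in code:

[pascal]// Sketch only; DDraw: IDirectDraw7 is assumed to exist already.
var
  ddsd: TDDSurfaceDesc2;
  Surface: IDirectDrawSurface7;
begin
  // Ask for the work surface in SYSTEM memory so the CPU can read it
  // at a sane speed.
  FillChar(ddsd, SizeOf(ddsd), 0);
  ddsd.dwSize := SizeOf(ddsd);
  ddsd.dwFlags := DDSD_CAPS or DDSD_WIDTH or DDSD_HEIGHT;
  ddsd.ddsCaps.dwCaps := DDSCAPS_OFFSCREENPLAIN or DDSCAPS_SYSTEMMEMORY;
  ddsd.dwWidth := 640;
  ddsd.dwHeight := 480;
  DDraw.CreateSurface(ddsd, Surface, nil);

  // Tell DirectDraw how you intend to touch the bits when locking.
  FillChar(ddsd, SizeOf(ddsd), 0);
  ddsd.dwSize := SizeOf(ddsd);
  if Surface.Lock(nil, ddsd, DDLOCK_WAIT or DDLOCK_READONLY, 0) = DD_OK then
  try
    // ... read pixels via ddsd.lpSurface; the stride is ddsd.lPitch ...
  finally
    Surface.Unlock(nil);
  end;
end;[/pascal]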

The first trick is based on the assumption that your GUI is relatively static; it's not applicable if it changes every frame. If it is static, you can precompute the blending for that half of the equation - instead of calculating the GUI's contribution each time, you read back a stored, already-50%-blended value. That's potentially half of your work cut out right there! If the GUI changes occasionally you can still get away with recalculating the stored values on each change, though there's obviously a cut-off point where that stops paying off. In short: store your GUI in 50% pre-blended form rather than at full intensity, if possible, so you can simply add it directly to the other side of the equation without any more thought.
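Here's a minimal sketch of the idea, assuming 32-bit XRGB pixels and made-up names (GuiPix, HalfGui, ScenePix) - the halving is a shift-and-mask done once, whenever the GUI actually changes:

[pascal]// Sketch only: 32-bit XRGB pixels assumed, names are hypothetical.
// Call this ONCE when the GUI changes, not every frame.
procedure PreHalveGui(const GuiPix: array of Cardinal;
                      var HalfGui: array of Cardinal);
var
  i: Integer;
begin
  for i := 0 to High(GuiPix) do
    // shr 1 halves every channel at once; the mask stops bits
    // leaking from one channel into its neighbour.
    HalfGui[i] := (GuiPix[i] shr 1) and $7F7F7F7F;
end;

// Per frame, the 50/50 blend is then just a halve-and-add:
//   Dest[i] := HalfGui[i] + ((ScenePix[i] shr 1) and $7F7F7F7F);[/pascal]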

Next, the value [iX + iY * dwSrcOffset] is mostly invariant inside the inner x loop - the iY * dwSrcOffset part doesn't change at all in there, and the full index only changes by one per pixel. Calculate it once and reuse it in the multiple places it appears; that can save you 3 muls and 3 additions per pixel, which is quite handy. Better yet, notice that each pixel is simply the next array index: array[0], array[1], and so on. There's no reason to recalculate it for each pixel at all - initialise it to 0 outside all the loops and inc it once per inner x iteration. Much simpler! You can also use pointers directly to the elements and inc those instead, which sometimes helps. As always, though, compare the FPS before and after to make sure it *is* an improvement - don't assume!
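Something like this, with made-up names (Pixels, DoBlend, dwWidth, dwHeight) and assuming dwSrcOffset is the surface pitch measured in pixels:

[pascal]// Sketch: one running index instead of iX + iY * dwSrcOffset per pixel.
var
  idx, iX, iY: Integer;
begin
  idx := 0;
  for iY := 0 to dwHeight - 1 do
  begin
    for iX := 0 to dwWidth - 1 do
    begin
      DoBlend(Pixels[idx]);  // whatever the inner loop actually does
      inc(idx);              // next pixel: one add, no multiply
    end;
    inc(idx, dwSrcOffset - dwWidth); // skip any padding at the row end
  end;
end;[/pascal]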
Looping down to zero can also give a small speed boost: "for ix := 0 to whatever - 1 do" becomes "for ix := whatever - 1 downto 0 do". This is a little micro-optimisation, though, so it may not buy you much, and it's only possible if you don't rely on the loop variable as an array index (the indices would run back to front). Instead, point at the first element and inc the pointer in the inner loop, so the data is still walked in the same direction.
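A sketch of that arrangement, again with made-up names and assuming 32-bit pixels:

[pascal]// Count DOWN for the loop test, but walk the data FORWARD with a
// pointer, so the pixels are still visited left to right.
type
  PPixel = ^Cardinal;                // 32-bit pixels assumed
var
  p: PPixel;
  ix: Integer;
begin
  p := @Pixels[iY * dwSrcOffset];    // start of the current row
  for ix := dwWidth - 1 downto 0 do
  begin
    DoBlend(p^);                     // same work as before
    inc(p);                          // advances by SizeOf(Cardinal)
  end;
end;[/pascal]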

You might also want to unroll the inner loop. This sometimes helps, but sometimes doesn't. Unrolling simply means repeating your inner loop code several times so that you do more work per iteration. A quick example:

[pascal]for i := 1 to 100 do
begin
  DoSomething;
end;

// becomes

for i := 1 to 25 do
begin
  DoSomething;
  DoSomething;
  DoSomething;
  DoSomething;
end;[/pascal]

This can help because it reduces the loop overhead - you do roughly a quarter as many loop-counter checks (remember that the loop has to test whether it's finished on every iteration!). Test it first, though; if you make the inner loop too large you can blow the instruction cache, which makes things slower!

It's not a good plan to have an if statement in an inner loop - a mispredicted branch is very slow. The CPU is always fetching ahead in the expectation that what it grabs will be used; that gives a speed boost when the branch goes the way it expected, but when it doesn't, the CPU has to throw away the prefetched work (a speed hit) and fetch the real path. Branch prediction tries to guess which way each jump will go, but it isn't perfect. If you are going to have if statements in the inner loop, try to make the if test the least likely case - for example, if blended pixels are more likely than empty ones, you might say "if this_pixel = 0 then continue". In general: "if something_unlikely then do_unlikely_thing else do_likely_thing", or better yet, "do_likely_thing; if very_unlikely_thing then begin undo_likely_thing; do_very_unlikely_thing; end;". That arrangement of if/else is from memory, mind - confirm it yourself rather than believing me immediately, please.
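For instance (made-up names, and assuming blended pixels really are the common case):

[pascal]// The rare, transparent pixel takes the jump; the common, blended
// pixel falls straight through.
for ix := 0 to dwWidth - 1 do
begin
  if Pixels[ix] = 0 then
    Continue;          // unlikely: nothing to blend here, skip it
  BlendPixel(ix);      // likely: no taken branch on this path
end;[/pascal]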

It's probably possible to get rid of that if statement entirely with a little precalc (for example, fiddle things so that a transparent GUI pixel "blends" to just the underlying surface at full intensity, i.e. the blend becomes a no-op). You could also consider using run-length encoding to get rid of the if. It's awkward to explain concisely, but the idea is to fiddle with the surface format so that it stores the number of consecutive transparent pixels, then the number of non-transparent pixels, then the pixel values themselves. In pseudocode it might look like this:

[pascal]var
  this_index, however_many, i: Integer;
begin
  this_index := 0;

  // skip the run of transparent pixels in one jump
  however_many := read_amount_of_transparent_pixels(surface);
  inc(this_index, however_many);

  // then blend the run of GUI pixels without testing each one
  however_many := read_amount_of_gui_pixels(surface);
  for i := 0 to however_many - 1 do
  begin
    blend_this_pixel(this_index);
    inc(this_index);
  end;
end;[/pascal]

(Repeat that for each transparent-then-blended pair of runs across the surface.)

You could probably optimise the above significantly, of course. The point of the pseudocode is to avoid the if statement: because you know in advance how many blended and non-blended pixels each run contains, you don't have to test every pixel and you can skip straight over the empty ones!

By far the most important tip: DO NOT ASSUME ANYTHING. Test it! For example, I assumed that precalculating a certain var for my effect would be quicker, but it wasn't. Sometimes reading a precalculated value from memory is slower than just recalculating it, because of memory transfer speed - and sometimes the precalc read is quicker!

The only way to be certain is to try out the different possibilities; the fastest method often isn't the most obvious one. As an aside, you could consider using the GDI/VCL as a quick test cradle. The results will be affected by the blit speed, but you should still be able to try out different ideas and see whether they give a speed boost. The real trick after that, though, is transferring those results to DX.

Here's the main loop from my effect (after the convolution filter and other stuff)... the actual blending is something similar to this: "this_pixel := FTransPic[y * BITMAP_WIDTH + x] + FTransValues[DestPixel^];". This is the precalc bit I talked about - in this case it was actually quicker doing the y * BITMAP_WIDTH + x than having a var set to 0 and inc'ing it. Unintuitive! The background picture is stored in pre-blended format (you load it up and store the 50% version of it, rather than the proper picture). The particle palette is also precalc'ed - it's stored as 256 different 50% colours. As a result, the blending is reduced to a couple of reads and an addition, rather than anything more complicated. Woot.
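Roughly, the precalc side could look like the sketch below. FTransPic, FTransValues and BITMAP_WIDTH come from the line above; everything else (Palette, BackgroundPic, BITMAP_HEIGHT) is made up for the sketch, which assumes a 256-colour particle palette and 32-bit colours:

[pascal]// Build the precalc tables once, at load time (names largely hypothetical).
procedure BuildTransTables;
var
  i: Integer;
begin
  // 256 half-intensity palette entries: at blend time, one table read
  // replaces any per-channel arithmetic for the particle's colour.
  for i := 0 to 255 do
    FTransValues[i] := (Palette[i] shr 1) and $7F7F7F7F;

  // The background stored pre-halved, so it never needs touching again.
  for i := 0 to BITMAP_WIDTH * BITMAP_HEIGHT - 1 do
    FTransPic[i] := (BackgroundPic[i] shr 1) and $7F7F7F7F;
end;

// The inner loop then collapses to two reads and an add:
//   this_pixel := FTransPic[y * BITMAP_WIDTH + x] + FTransValues[DestPixel^];[/pascal]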

I don't know whether the above is tremendously helpful, mind you, since my effect had certain constraints (e.g. the background not changing). Constraints are always a massive boost for optimisation, since you can precalc a bunch of stuff. In your case, the main opportunity is the GUI not changing often - aim there and precalc whatever you can.

Bear in mind that system memory blits are much slower than video memory blits. You might want to use MMX to copy over the results, four pixels at a time, onto your back buffer (rather than using a standard Blt).
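A sketch of such a copy routine is below. It assumes a Delphi version whose built-in assembler accepts MMX opcodes (older ones need DB byte sequences instead), the default register calling convention, and that Count is a positive number of 8-byte quadwords - at 16bpp that's four pixels per quadword, at 32bpp only two:

[pascal]// Copy Count quadwords (8 bytes each) from Src to Dst using MMX.
// Register convention assumed: Src in EAX, Dst in EDX, Count in ECX.
procedure MMXCopy(Src, Dst: Pointer; Count: Integer); register;
asm
@loop:
  movq  mm0, [eax]      // load 8 bytes from the source
  movq  [edx], mm0      // store them to the destination
  add   eax, 8
  add   edx, 8
  dec   ecx
  jnz   @loop
  emms                  // make the FPU usable again afterwards
end;[/pascal]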

I wish you the best of luck! If my effect can manage > 100 FPS at 32-bit using the GDI, you should be able to do the blending at a suitable speed with DX, which rocks for pixel manipulation.