Hmm, good question. I just tried three components, and SSE seems to be pretty slow with the code above.

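I thought SSE would check boundaries when you use movups, but it seems it doesn't. It only drops the alignment requirement and still loads all 16 bytes, so with a three-component vector it reads 4 bytes past the end.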

@Setharian, my initial benchmarks show you beat me by 7%.
Seems I need to redesign some of the other SSE functions in my vector library.

Edit: wait a minute. What's going on in my code...
Edit2: Further optimizing got me this superfast code:
[pascal]
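function Normalize3(vec: tvector4f): tvector4f;
asm
  movups  xmm0, [vec]      // xmm0 = (x, y, z, w), unaligned load
  movaps  xmm3, xmm0       // keep a copy of the original vector
  mulps   xmm0, xmm0       // xmm0 = (x*x, y*y, z*z, w*w)
  shufps  xmm1, xmm0, $00  // lanes 2 and 3 of xmm1 = x*x; lanes 0 and 1 are unused leftovers
  shufps  xmm2, xmm0, $10  // lane 2 of xmm2 = y*y
  addps   xmm1, xmm0       // lane 2 of xmm1 = x*x + z*z
  addps   xmm2, xmm1       // lane 2 of xmm2 = x*x + y*y + z*z
  rsqrtps xmm2, xmm2       // approximate 1/sqrt in every lane
  shufps  xmm2, xmm2, $AA  // broadcast lane 2 to all four lanes
  mulps   xmm3, xmm2       // scale the original vector (w gets scaled too)
  movups  [result], xmm3   // unaligned store of the result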
end;[/pascal]
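
For checking the output of the SSE routine, a plain Pascal version can serve as a reference. This is only a sketch: the field layout of tvector4f (four consecutive singles) is my assumption, and Normalize3Ref is a made-up name. Keep in mind that rsqrtps is only an approximation (roughly 12 bits of precision), so the two results will agree to about three decimal places, not exactly.
[pascal]
program NormalizeRef;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Assumed layout: four consecutive singles (16 bytes).
  tvector4f = record
    x, y, z, w: Single;
  end;

// Scalar reference: scales all four lanes by 1/sqrt(x*x + y*y + z*z),
// mirroring the SSE version (w is scaled as well).
// No guard against zero-length vectors.
function Normalize3Ref(const vec: tvector4f): tvector4f;
var
  inv: Single;
begin
  inv := 1.0 / Sqrt(vec.x * vec.x + vec.y * vec.y + vec.z * vec.z);
  Result.x := vec.x * inv;
  Result.y := vec.y * inv;
  Result.z := vec.z * inv;
  Result.w := vec.w * inv;
end;

var
  v, n: tvector4f;
begin
  v.x := 3; v.y := 0; v.z := 4; v.w := 0;
  n := Normalize3Ref(v);
  WriteLn(n.x:0:3, ' ', n.y:0:3, ' ', n.z:0:3);
end.
[/pascal]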

You will need a fourth component to use SSE. If you use Turbo Delphi (or FPC, or any Pascal dialect with operator overloading), you could add implicit conversion operators to a record, which will transparently turn a three-component vector into a four-component one and back.
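
A minimal sketch of what that could look like, using Delphi-style class operators (FPC needs Delphi mode for this syntax). TVector3f and the field names are assumptions for illustration, and setting w to 0 on the way in is just one possible choice.
[pascal]
program VecConvert;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical three-component vector used by the rest of the code.
  TVector3f = record
    x, y, z: Single;
  end;

  // Four-component vector; the extra w pads it to 16 bytes for SSE.
  TVector4f = record
    x, y, z, w: Single;
    class operator Implicit(const v: TVector3f): TVector4f;
    class operator Implicit(const v: TVector4f): TVector3f;
  end;

// 3 -> 4: copy x, y, z and zero the padding component.
class operator TVector4f.Implicit(const v: TVector3f): TVector4f;
begin
  Result.x := v.x;
  Result.y := v.y;
  Result.z := v.z;
  Result.w := 0.0;
end;

// 4 -> 3: drop the padding component.
class operator TVector4f.Implicit(const v: TVector4f): TVector3f;
begin
  Result.x := v.x;
  Result.y := v.y;
  Result.z := v.z;
end;

var
  a: TVector3f;
  b: TVector4f;
begin
  a.x := 3; a.y := 4; a.z := 0;
  b := a;  // implicit 3 -> 4 conversion, w becomes 0
  a := b;  // implicit 4 -> 3 conversion
  WriteLn(b.x:0:1, ' ', b.y:0:1, ' ', b.z:0:1, ' ', b.w:0:1);
end.
[/pascal]
With this in place, a TVector3f can be passed straight to a function that expects a TVector4f, at the cost of a copy on every call.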