Wow... again, this is really interesting. If I had more time, I could indulge in this field of expertise as well.

You are using a separate assembler, right? Because writing plain IA-32 machine code is sheer hell. Also, do you already use a linker, or does it only generate a single module with all the right addresses directly in it? If this project goes any deeper, you could write a nice article about it. Any plans of showing us the source?

Also, your timings are very surprising to me. I also doubt that it will be significantly faster than C, but I am very curious about its performance anyway (and the things you have to give up in order to squeeze more code into the same nanoseconds).