I'm more and more tempted by the idea of 16-bit integer physics. 32768 is actually a lot, if you use it right. I have experience, after all - that game for MS-DOS used 16-bit physics.
I've also learned a lot since then. The problem of velocity discretization at low speeds is easily circumvented by defining speed not per tic but per interval of N tics, so that slow objects move in rare jumps, one jump per hundreds of tics (with the position in between just interpolated by any object interacting with them).
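Roughly what I mean by per-interval speed, as a minimal sketch. The struct layout, the names and the 1/256-unit interpolation scale are all made up for illustration, not taken from any existing engine:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a slow body moves `step` units once every `interval`
   tics instead of a fractional amount every tic. */
typedef struct {
    int16_t  pos;       /* position in world units */
    int16_t  step;      /* displacement applied once per interval */
    uint16_t interval;  /* N: tics between jumps, must be > 0 */
    uint16_t elapsed;   /* tics since the last jump, 0..interval-1 */
} SlowBody;

/* Advance one tic; the stored position only changes when the interval ends. */
static void slow_body_tick(SlowBody *b)
{
    assert(b->interval > 0);
    if (++b->elapsed >= b->interval) {
        b->elapsed = 0;
        b->pos = (int16_t)(b->pos + b->step);
    }
}

/* Interpolated position in 1/256 units, for an object that interacts with
   the body between jumps. The fraction is truncated toward zero. */
static int32_t slow_body_pos_q8(const SlowBody *b)
{
    int64_t frac = (int64_t)b->step * 256 * b->elapsed / b->interval;
    return (int32_t)b->pos * 256 + (int32_t)frac;
}
```

With step = 1 and interval = 200, the body sits on the same integer coordinate for 200 tics, but anything that queries slow_body_pos_q8() halfway through sees it half a unit further along.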
SSE offers unique possibilities for speeding things up: PMULHW is tailor-made for such things, multiplying 8 pairs of 16-bit numbers per instruction in its 128-bit SSE2 form and up to 32 in its AVX-512 incarnation.
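To show what PMULHW buys you, here is a small sketch that scales eight 16-bit values by one fixed-point factor. The function names are hypothetical; _mm_mulhi_epi16 is the SSE2 intrinsic for PMULHW, and the scalar fallback assumes that >> on a negative int is an arithmetic shift, which holds on mainstream compilers:

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* What PMULHW computes in each 16-bit lane: the high half of the signed
   32-bit product, i.e. floor(a * b / 65536). */
static int16_t mulhi16(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}

/* Scale eight values at once: out[i] = (v[i] * factor) >> 16.
   The factor is a signed fraction of 65536, so 16384 means 0.25. */
static void scale8(const int16_t v[8], int16_t factor, int16_t out[8])
{
#if defined(__SSE2__)
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    __m128i f = _mm_set1_epi16(factor);
    _mm_storeu_si128((__m128i *)out, _mm_mulhi_epi16(x, f)); /* PMULHW */
#else
    for (int i = 0; i < 8; ++i)
        out[i] = mulhi16(v[i], factor);
#endif
}
```

Note that the result rounds toward minus infinity, so -1 scaled by 0.25 gives -1, not 0. Whether that matters depends on what the value is used for.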
Also, sines, cosines and reciprocal square roots: all of these could be computed using lookup tables with linear interpolation, maybe with the argument normalized using BSR, and in any case much faster than any floating-point counterparts.
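For the sine and cosine part, a minimal sketch of a table with linear interpolation. Everything here is an assumption for illustration: angles are 16-bit "binary" angles (65536 = full turn), results are scaled so that 32767 means 1.0, and the quarter-wave table has only 17 entries to keep the listing short. A real table would likely be larger for better accuracy:

```c
#include <stdint.h>

/* sin(k * pi / 32) * 32767 for k = 0..16, i.e. one quarter of the wave. */
static const int16_t sin_table[17] = {
        0,  3212,  6393,  9512, 12539, 15446, 18204, 20787, 23170,
    25329, 27245, 28898, 30273, 31356, 32137, 32609, 32767
};

/* Sine of a binary angle (0x4000 = 90 degrees), result in -32767..32767. */
static int16_t sin_q15(uint16_t angle)
{
    unsigned quadrant = angle >> 14;
    uint32_t phase = angle & 0x3FFFu;           /* position in the quadrant */
    if (quadrant & 1)
        phase = 0x4000u - phase;                /* mirror quadrants 1 and 3 */

    unsigned idx  = phase >> 10;                /* table segment, 0..16 */
    int32_t  frac = (int32_t)(phase & 0x3FFu);  /* position in segment, 0..1023 */
    int32_t  a = sin_table[idx];
    int32_t  b = sin_table[idx < 16 ? idx + 1 : 16];

    /* Linear interpolation; b - a is never negative on a quarter wave. */
    int32_t s = a + (((b - a) * frac) >> 10);
    return (int16_t)((quadrant & 2) ? -s : s);  /* lower half is negative */
}

/* Cosine is sine shifted by a quarter turn; the uint16_t cast wraps around. */
static int16_t cos_q15(uint16_t angle)
{
    return sin_q15((uint16_t)(angle + 0x4000u));
}
```

With 16 segments per quadrant the worst-case interpolation error is a few dozen units out of 32767; each doubling of the table size cuts it by roughly four.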