Small addition: doing matrix multiplication with plain nested loops performs really badly on CPUs. For performance reasons, you need to unroll your loops.
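
To make the point concrete, here is a minimal sketch in C of what unrolling could look like. It is only an illustration under assumptions of mine: square row-major `double` matrices, an unroll factor of 4, and the function names `matmul_naive` and `matmul_unrolled` are all made up for this example, not taken from any particular code base.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: plain triple nested loop computing C = A * B.
 * All matrices are n x n and stored row-major (an assumption of this sketch). */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Same product with the innermost k loop unrolled by 4 (hypothetical factor).
 * Four independent accumulators shorten the dependency chain on the sum. */
void matmul_unrolled(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            size_t k = 0;

            /* Main unrolled body: four multiply-adds per iteration. */
            for (; k + 4 <= n; k += 4) {
                s0 += a[i * n + k]     * b[k * n + j];
                s1 += a[i * n + k + 1] * b[(k + 1) * n + j];
                s2 += a[i * n + k + 2] * b[(k + 2) * n + j];
                s3 += a[i * n + k + 3] * b[(k + 3) * n + j];
            }

            double sum = (s0 + s1) + (s2 + s3);

            /* Remainder loop for the case where n is not a multiple of 4. */
            for (; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];

            c[i * n + j] = sum;
        }
    }
}
```

Both functions take the same arguments, so you can call them on the same inputs and time them against each other. Note that the unrolled version adds the terms in a different order, so floating-point results can differ in the last bits. Optimizing compilers can often unroll loops on their own, so it is worth measuring before unrolling by hand.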