I tried implementing the Julia technique in Cython using BLAS dgemm, but I wasn't able to beat Jake's original version. I'm not sure whether I'm doing something wrong or linking against a non-optimal BLAS library on my machine. Comments and feedback are welcome:
https://gist.github.com/synapticarbors/5790459
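
For anyone who wants to help rule out the BLAS question on their own machine, here is a rough sketch of a sanity check. It is not taken from the gist. It assumes NumPy is installed and that NumPy's float64 matrix product is handed off to the linked BLAS (normally dgemm), which only holds when NumPy was built against a BLAS library. It also only tells you about the BLAS that NumPy uses, which may not be the one the Cython extension links against. The helper name `gemm_gflops` is made up for this example.

```python
import time

import numpy as np


def gemm_gflops(n=1000, repeats=5):
    """Return the best observed GFLOP/s for an n x n float64 matrix product."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    a @ b  # warm-up run so one-time setup cost is not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    # A dense n x n matrix product costs roughly 2 * n**3 floating point operations.
    return 2.0 * n**3 / best / 1e9


if __name__ == "__main__":
    np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against
    print(f"{gemm_gflops():.1f} GFLOP/s")
```

If the reported rate is far below what the CPU should manage, or `np.show_config()` shows no optimized BLAS, a slow reference BLAS is a plausible suspect. If the rate looks healthy, the overhead is more likely somewhere in the Cython wrapper itself.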