c++performance matrix multiplication strassen

Why is Strassen matrix multiplication so much slower than standard matrix multiplication?

I have written programs in C++, Python and Java for matrix multiplication and tested their speed for multiplying two 2000 x 2000 matrices (see post). The standard ikj-implentation - which is in enter image description here - took:

C++: 15 seconds (Source)
Python: 6 minutes 13 seconds (Source)

Now I have implemented the Strassen algorithm for matrix multiplication - which is in enter image description here - in Python and C++ as it was on wikipedia. These are the times I've got:

C++: 45 minutes (Source)
Python: Killed after 10 hours (Source)

Why is Strassen matrix multiplication so much slower than standard matrix multiplication?

Ideas:

Some cache effects
Implementation:
- error (the resulting 2000 x 2000 matrix is correct)
- null-multiplication (shouldn't be that important for 2000 x 2000 -> 2048 x 2048)

This is especially astonishing, as it seems to contradict the experiences of others:

Why is my Strassen Matrix multiplier so fast?
Matrix multiplication: Strassen vs. Standard - Strassen was also slower for him, but it was at least in the same order of magnitude.

edit: The reason why Strassen matrix multiplication was slower in my case, were:

I made it fully recursive (see tam)
I had two functions strassen and strassenRecursive. The first one resized the matrix to a power of two, if required and called the second one. But strassenRecursive didn't recursively call itself, but strassen.

Solution

The basic problem is that you're recursing down to a leaf size of 1 with your strassen implementaiton. Strassen's algorithm has a better Big O complexity, but constants do matter in reality, which means in reality you're better off with a standard n^3 matrix multiplication for smaller problem sizes.

So to greatly improve your program instead of doing:

if (tam == 1) {
        C[0][0] = A[0][0] * B[0][0];
        return;
    }

use if (tam == LEAF_SIZE) // iterative solution here. LEAF_SIZE should be a constant that you have to experimentally determine for your given architecture. Depending on the architecture it may be larger or smaller - there are architectures where the constant factors for strassen are so large that it's basically always worse than a simpler n^3 implementation for sensible matrix sizes. It all depends.