xtensor's "operator/" slower than numpy's "/"

I'm trying to transfer some code I've previously written in python into C++, and I'm currently testing xtensor to see if it can be faster than numpy for doing what I need it to.

One of my functions takes a square matrix d and a scalar alpha, and performs the elementwise operation alpha/(alpha+d). Background: this function is used to test which value of alpha is 'best', so it is in a loop where d is always the same, but alpha varies.

All of the following time scales are an average of 100 instances of running the function.

In numpy, it takes around 0.27 seconds to do this, and the code is as follows:

def kfun(d,alpha):
    k = alpha /(d+alpha)
    return k

but xtensor takes about 0.36 seconds, and the code looks like this:

xt::xtensor<double,2> xk(xt::xtensor<double,2> d, double alpha){
    return alpha/(alpha+d);
}

I've also attempted the following version using std::vector but this something I do not want to use in long run, even though it only took 0.22 seconds.

std::vector<std::vector<double>> kloops(std::vector<std::vector<double>> d, double alpha, int d_size){
    for (int i = 0; i<d_size; i++){
        for (int j = 0; j<d_size; j++){
            d[i][j] = alpha/(alpha + d[i][j]);
        }
    }
    return d;
}

I've noticed that the operator/ in xtensor uses "lazy broadcasting", is there maybe a way to make it immediate?

EDIT:

In Python, the function is called as follows, and timed using the "time" package

t0 = time.time()
for i in range(100):
    kk = k(dsquared,alpha_squared)
print(time.time()-t0)

In C++ I call the function has follows, and is timed using chronos:

//d is saved as a 1D npy file, an artefact from old code
auto sd2 = xt::load_npy<double>("/path/to/d.npy");

shape = {7084, 7084};
    xt::xtensor<double, 2> xd2(shape);
    for (int i = 0; i<7084;i++){
        for (int j=0; j<7084;j++){
            xd2(i,j) = (sd2(i*7084+j));
        }
    }

auto start = std::chrono::steady_clock::now();
for (int i = 0;i<10;i++){
    matrix<double> kk = kfun(xd2,4000*4000,7084);
}
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "k takes: " << elapsed_seconds.count() << "\n";

If you wish to run this code, I'd suggest using xd2 as a symmetric 7084x7084 random matrix with zeros on the diagonal.

The output of the function, a matrix called k, then goes on to be used in other functions, but I still need d to be unchanged as it will be reused later.

END EDIT

To run my C++ code I use the following line in the terminal:

cd "/path/to/src/" && g++ -mavx2 -ffast-math -DXTENSOR_USE_XSIMD -O3 ccode.cpp -o ccode -I/path/to/xtensorinclude && "/path/to/src/"ccode

Thanks in advance!

Solution

A problem with the C++ implementation may be that it creates one or possibly even two temporary copies that could be avoided. The first copy comes from not passing the argument by reference (or perfect forwarding). Without looking at the rest of the code its hard to judge if this has an impact on the performance or not. The compiler may move d into the method if its guaranteed to be not used after the method xk(), but it is more likely to copy the data into d.

To pass by reference, the method could be changed to

xt::xtensor<double,2> xk(const xt::xtensor<double,2>& d, double alpha){
    return alpha/(alpha+d);
}

To use perfect forwarding (and also enable other xtensor containers like xt::xarray or xt::xtensor_fixed), the method could be changed to

template<typename T>
xt::xtensor<double,2> xk(T&& d, double alpha){
    return alpha/(alpha+d);
}

Furthermore, its possible that you can save yourself from reserving memory for the return value. Again, its hard to judge without seeing the rest of the code. But if the method is used inside a loop, and the return value always has the same shape, then it can be beneficial to create the return value outside of the loop and return by reference. To do this, the method could be changed to:

template<typename T, typename U>
void xk(T& r, U&& d, double alpha){
    r = alpha/(alpha+d);
}

If it is guaranteed that d and r do not point to the same memory, you can further wrap r in xt::noalias() to avoid a temporary copy before assigning the result. The same is true for the return value of the function in case you do not return by reference.

Good luck and happy coding!