Say I have an MxN matrix (SIG) and a list of Nx1 fractional indices (idxt). Each fractional index in idxt uniquely corresponds to the same position column in SIG. I would like to index to the appropriate value in SIG using the indices stored in idxt, take that value and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
double& bfVal) {
Eigen::VectorXd bfPTVec(idxt.size());
#pragma omp simd
for (int j = 0; j < idxt.size(); j++) {
int vIDX = static_cast<int>(idxt(j));
double interp1 = vIDX + 1 - idxt(j);
double interp2 = idxt(j) - vIDX;
bfPTVec(j) = (SIG(vIDX,j)*interp1 + SIG(vIDX+1,j)*interp2);
}
bfVal = ((idxt.array() > 0.0).select(bfPTVec,0.0)).sum();
}
I suspect there is a better way to implement the body of the loop here that would help the compiler better exploit SIMD operations. For example, as I understand it, forcing the compiler to cast between types, both explicitly as the first line does and implicitly as some of the mathematical operations do is not a vectorizable operation.
Additionally, by making the access to SIG dependent on values in idxt which are calculated at runtime I'm not clear if the type of memory read-write I'm performing here is vectorizable, or how it could be vectorized. Looking at the big picture description of my problem where each idxt corresponds to the same "position" column as SIG, I get a sense that it should be a vectorizable operation, but I'm not sure how to translate that into good code.
Clarification
Thanks to the comments, I realized I hadn't specified that certain values that I don't want contributing to the final summation in idxt are set to zero when idxt is initialized outside of this method. Hence the last line in the example given above.
Probably no, but I would expect manually vectorized version to be faster.
Below is an example of that inner loop, untested. It doesn’t use AVX only SSE up to 4.1, and should be compatible with these Eigen matrices you have there.
The pIndex
input pointer should point to the j-th element of your idxt
vector, and pSignsColumn
should point to the start of the j-th column of the SIG matrix. It assumes your SIG matrix is column major. It’s normally the default memory layout in Eigen but possible to override with template arguments, and probably with macros as well.
inline double computePoint( const double* pIndex, const int16_t* pSignsColumn )
{
// Load the index value into both lanes of the vector
__m128d idx = _mm_loaddup_pd( pIndex );
// Convert into int32 with truncation; this assumes the number there ain't negative.
const int iFloor = _mm_cvttsd_si32( idx );
// Compute fractional part
idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
// Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
// Load two int16_t values from sequential addresses
const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
// Upcast them to int32, then to fp64
const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
// Compute the result
__m128d res = _mm_mul_pd( idx, signs );
res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
// The above 2 lines (3 instructions) can be replaced with the following one:
// const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
// It may or may not be better, the dppd instruction is not particularly fast.
return _mm_cvtsd_f64( res );
}