After a ray vs. triangle intersection test in 8-wide SIMD, I'm left with updating t, u and v, which I've done in scalar below (find the lowest t and update t, u, v if it is lower than the previous t). Is there a way to do this in SIMD instead of scalar?
int update_tuv(__m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out)
{
alignas(32) float ts[8];_mm256_store_ps(ts, t);
alignas(32) float us[8];_mm256_store_ps(us, u);
alignas(32) float vs[8];_mm256_store_ps(vs, v);
int min_index{0};
for (int i = 1; i < 8; ++i) {
if (ts[i] < ts[min_index]) {
min_index = i;
}
}
if (ts[min_index] >= *t_out) { return -1; }
*t_out = ts[min_index];
*u_out = us[min_index];
*v_out = vs[min_index];
return min_index;
}
I haven't found a solution that finds the horizontal minimum t and shuffles/permutes its paired u and v along the way, other than permuting and min-testing 8 times.
First, find the horizontal minimum of the t vector. That value alone is enough to reject the batch with your first test. Then find the index of the first minimum element, and extract and store that lane from the u and v vectors.
// Horizontal minimum of the vector
inline float horizontalMinimum( __m256 v )
{
__m128 i = _mm256_extractf128_ps( v, 1 );
i = _mm_min_ps( i, _mm256_castps256_ps128( v ) );
i = _mm_min_ps( i, _mm_movehl_ps( i, i ) );
i = _mm_min_ss( i, _mm_movehdup_ps( i ) );
return _mm_cvtss_f32( i );
}
// SIMD version: compute the horizontal minimum of t, reject early if it does
// not improve on *t_out, then permute u and v so the winning lane lands in
// lane 0 and can be extracted directly.
// Returns the winning lane index, or -1 when the batch is rejected.
int update_tuv_avx2( __m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out )
{
    const float tMin = horizontalMinimum( t );
    // Nothing in this batch beats the best hit so far.
    if( tMin >= *t_out )
        return -1;
    *t_out = tMin;

    // Locate the first lane equal to the minimum.
    // Should compile into vbroadcastss
    const __m256 minBroadcast = _mm256_set1_ps( tMin );
    const __m256 eqMask = _mm256_cmp_ps( t, minBroadcast, _CMP_EQ_OQ );
    // If you don't yet have C++/20, use _tzcnt_u32 or _BitScanForward or __builtin_ctz intrinsics
    const int lane = std::countr_zero( (uint32_t)_mm256_movemask_ps( eqMask ) );

    // Build the index vector for vpermps; only its first integer lane is
    // meaningful, the upper 7 lanes are don't-cares for our extraction.
    const __m256i perm = _mm256_castsi128_si256( _mm_cvtsi32_si128( lane ) );
    // Move the winning element into lane 0 of each vector and read it out.
    *u_out = _mm256_cvtss_f32( _mm256_permutevar8x32_ps( u, perm ) );
    *v_out = _mm256_cvtss_f32( _mm256_permutevar8x32_ps( v, perm ) );
    return lane;
}
While relatively straightforward and probably faster than your current method (vector stores followed by scalar loads), the performance of the above function is only great when that `if` branch is well-predicted.
When that branch is unpredictable (statistically, the outcomes are random), a completely branchless implementation might be a better fit. It is going to be more complicated, though: load the old values with `_mm_load_ss`, conditionally update them with `_mm_blendv_ps`, and store back with `_mm_store_ss`.