I am using multiplication (combined with other operations) as a substitute for integer division. My solution eventually requires multiplying two 32-bit numbers and taking the top 32 bits of the product (just like the mulhi functions), but AVX2 does not offer a 32-bit variant of _mm256_mulhi_epu16 (i.e., there is no '_mm256_mulhi_epu32' function).
I have tried various approaches, such as looking through the AVX-512 intrinsics and splitting each 32-bit integer into high and low 16-bit halves. I'm very new to low-level programming, so I don't know what is optimal, or even what is possible.
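This can be done with a widening multiply followed by a 64-bit right shift. Note that _mm256_mul_epu32 reads only the low 32 bits of each 64-bit lane, so the two lines below give the high halves for the even-indexed elements (0, 2, 4, 6) only. The odd-indexed elements need a second multiply on inputs shifted right by 32 bits, and the two results then have to be blended together.
__m256i t1 = _mm256_mul_epu32(m, n);   // full 64-bit products of elements 0, 2, 4, 6
t1 = _mm256_srli_epi64(t1, 32);        // move each product's high 32 bits into the even slot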
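For completeness, here is a minimal sketch of how all eight lanes could be handled, assuming unsigned inputs. The names mulhi_epu32, mulhi_epu32_x8 and AVX2_TARGET are hypothetical helpers made up for this example, not real intrinsics, and the target attribute is a GCC/Clang-specific assumption that you can drop if you already build with -mavx2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC or Clang. This attribute enables AVX2 for the functions
   below without passing -mavx2; remove it if you compile with -mavx2. */
#define AVX2_TARGET __attribute__((target("avx2")))

/* Hypothetical helper: high 32 bits of each unsigned 32x32-bit product,
   for all 8 elements of m and n. */
AVX2_TARGET static inline __m256i mulhi_epu32(__m256i m, __m256i n)
{
    /* Even elements (0, 2, 4, 6): _mm256_mul_epu32 reads only the low
       32 bits of each 64-bit lane and yields a full 64-bit product. */
    __m256i even = _mm256_mul_epu32(m, n);
    even = _mm256_srli_epi64(even, 32);   /* high half -> even slot */

    /* Odd elements (1, 3, 5, 7): shift them down into the low halves
       first. The high half of each product then already sits in the
       odd slot, so no further shift is needed. */
    __m256i odd = _mm256_mul_epu32(_mm256_srli_epi64(m, 32),
                                   _mm256_srli_epi64(n, 32));

    /* 0xAA = 0b10101010: odd slots come from 'odd', even slots from 'even'. */
    return _mm256_blend_epi32(even, odd, 0xAA);
}

/* Hypothetical array wrapper: out[i] = (a[i] * b[i]) >> 32 for i in 0..7. */
AVX2_TARGET void mulhi_epu32_x8(const uint32_t *a, const uint32_t *b,
                                uint32_t *out)
{
    __m256i m = _mm256_loadu_si256((const __m256i *)a);
    __m256i n = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, mulhi_epu32(m, n));
}
```

This costs two multiplies, three shifts and one blend per vector. For signed inputs, _mm256_mul_epi32 would presumably take the place of _mm256_mul_epu32, but I have not tested that variant here.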