Tags: c++, visual-studio-code, arbitrary-precision, mmx

How do I use MMX mulH and mulL on two 64-bit integers to get one 128-bit integer?


Hello, I'm working on yet another arbitrary-precision integer library. I wanted to implement multiplication, but I got stuck when _m_pmulhw from <mmintrin.h> just didn't work; there is very little documentation on MMX instructions. When I test it out, it just gives me gibberish when I multiply two UINT64_MAXs.

#include <mmintrin.h>   // MMX intrinsics (_m_pmulhw, _m_pmullw)
#include <bitset>
#include <cstdint>
#include <iostream>

// Intended to return the high/low 64 bits of the 128-bit product of a and b.
uint_fast64_t mulH(const uint_fast64_t &a, const uint_fast64_t &b) {
    return (uint_fast64_t)_m_pmulhw((__m64)a, (__m64)b);
}
uint_fast64_t mulL(const uint_fast64_t &a, const uint_fast64_t &b) {
    return (uint_fast64_t)_m_pmullw((__m64)a, (__m64)b);
}

int main() {
    uint64_t a = UINT64_MAX;
    uint64_t b = UINT64_MAX;
    std::cout << std::bitset<64>(mulH(a, b)) << std::bitset<64>(mulL(a, b));
}

output: 00000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000100000000000000010000000000000001 

I don't know why it's not working. I have an A6-4400M APU...

Coreinfo's output: MMX  *  Supports MMX instruction set

So I think I can say it isn't unsupported. If anyone can give me some tips on how to make this work, thanks.

Compiler: gcc

IDE: visual studio code


Solution

  • I think you misunderstood what _m_pmulhw does. It's actually very clearly documented on Intel's Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_m_pmulhw&expand=4340. The corresponding instruction is pmulhw, which is also clearly documented on e.g. Felix Cloutier's x86 instructions guide: https://www.felixcloutier.com/x86/pmulhw

    It multiplies four pairs of 16-bit integers packed inside the two operands and produces the high half of each of the four products (Packed Multiply High - Word). This means that, for inputs 0x12345678abcdef01 and 0x9876543210fedcba, it would multiply 0x1234 * 0x9876, 0x5678 * 0x5432, 0xabcd * 0x10fe and 0xef01 * 0xdcba, and pack the high 16 bits of each result into the output (a short demonstration of this lane-wise behaviour appears at the end of this answer).

    For your example, you're multiplying 0xffff * 0xffff four times, producing the 32-bit result 0x00000001 (-1 * -1, since this is a signed 16-bit multiply), and therefore get 0x0000000000000000 in the high half and 0x0001000100010001 in the low half - which is exactly what you see in the bitset output.


    If you're looking for a 128-bit multiply, there isn't actually an intrinsic for that (except _mulx_u64, but that uses the newer mulx instruction, which isn't that widespread). Microsoft has the non-standard _mul128 and _umul128 intrinsics, but on other platforms you can just use a __int128 type (or the local equivalent) to get a 64x64 => 128-bit multiply; see the second sketch at the end of this answer.

    Also, I'd seriously recommend using the SSE instruction set rather than the older MMX set; the SSE instructions are faster in most cases and operate on wider 128-bit vectors (AVX widens that to 256 bits, and AVX-512 to 512 bits), which can provide a significant speed boost. The last sketch below shows the SSE2 counterparts of pmulhw/pmullw.
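
    To make that lane-wise behaviour concrete, here is a minimal sketch (my illustration, not part of the original answer) that pushes the example inputs above through _m_pmulhw/_m_pmullw. It assumes GCC or Clang on x86-64 with MMX available, and uses _mm_cvtsi64_m64/_mm_cvtm64_si64 to move the 64-bit values in and out of an MMX register:

    #include <mmintrin.h>   // MMX intrinsics
    #include <cstdint>
    #include <cstdio>

    int main() {
        uint64_t a = 0x12345678abcdef01ULL;
        uint64_t b = 0x9876543210fedcbaULL;

        __m64 va = _mm_cvtsi64_m64((long long)a);   // reinterpret the 64 bits as 4 x int16
        __m64 vb = _mm_cvtsi64_m64((long long)b);

        // Four independent signed 16x16-bit multiplies per call:
        uint64_t hi = (uint64_t)_mm_cvtm64_si64(_m_pmulhw(va, vb));  // high 16 bits of each product
        uint64_t lo = (uint64_t)_mm_cvtm64_si64(_m_pmullw(va, vb));  // low 16 bits of each product

        _mm_empty();  // leave MMX state before any x87 floating-point use

        std::printf("high halves: %016llx\n", (unsigned long long)hi);
        std::printf("low  halves: %016llx\n", (unsigned long long)lo);
        // Each 16-bit lane comes from one of the four products
        // 0x1234*0x9876, 0x5678*0x5432, 0xabcd*0x10fe, 0xef01*0xdcba (signed).
    }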
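
    For the 64x64 => 128-bit multiply the question is actually after, a sketch of mulH/mulL built on the unsigned __int128 extension (GCC/Clang; on MSVC, _umul128 plays the same role) could look like the following; on x86-64 the compiler typically lowers this to a single widening mul instruction:

    #include <bitset>
    #include <cstdint>
    #include <iostream>

    uint64_t mulH(uint64_t a, uint64_t b) {
        return (uint64_t)(((unsigned __int128)a * b) >> 64);  // high 64 bits of the product
    }
    uint64_t mulL(uint64_t a, uint64_t b) {
        return (uint64_t)((unsigned __int128)a * b);          // low 64 bits of the product
    }

    int main() {
        uint64_t a = UINT64_MAX;
        uint64_t b = UINT64_MAX;
        // UINT64_MAX * UINT64_MAX == 0xfffffffffffffffe'0000000000000001
        std::cout << std::bitset<64>(mulH(a, b)) << std::bitset<64>(mulL(a, b)) << '\n';
    }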
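
    And to illustrate the SSE suggestion: the SSE2 counterparts of pmulhw/pmullw are _mm_mulhi_epi16/_mm_mullo_epi16, which do eight signed 16x16-bit multiplies per instruction on 128-bit registers and don't share state with the x87 FPU. A small sketch of my own, using -1 lanes to mirror the UINT64_MAX example:

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>
    #include <cstdio>

    int main() {
        __m128i a = _mm_set1_epi16(-1);   // every 16-bit lane = 0xffff (-1 as signed)
        __m128i b = _mm_set1_epi16(-1);

        __m128i hi = _mm_mulhi_epi16(a, b);  // high 16 bits of each product -> 0x0000
        __m128i lo = _mm_mullo_epi16(a, b);  // low  16 bits of each product -> 0x0001

        alignas(16) uint16_t h[8], l[8];
        _mm_store_si128((__m128i *)h, hi);
        _mm_store_si128((__m128i *)l, lo);

        for (int i = 0; i < 8; ++i)
            std::printf("lane %d: high=%04x low=%04x\n", i, (unsigned)h[i], (unsigned)l[i]);
    }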