floating-point arm instructions half-precision-float

List of ARM instructions implementing half-precision floating-point arithmetic

Arm Architecture Reference Manual for A-profile architecture (emphasis added):

FPHP, bits [27:24]

0b0011 As for 0b0010, and adds support for half-precision floating-point arithmetic.
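As a small illustration of how such a field would be checked in software, here is a minimal C sketch that extracts bits [27:24] from a raw register value and tests for 0b0011. The function names are hypothetical, and the sketch does not show how to read the register itself (that is typically privileged); the raw value is simply passed in. It also assumes that any value above 0b0011 would keep the arithmetic capability, which the quote above does not state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract the FPHP field, bits [27:24], from a raw ID register value.
   (Hypothetical helper; the caller supplies the register contents.) */
static inline unsigned fphp_field(uint32_t reg_value)
{
    return (reg_value >> 24) & 0xFu;
}

/* True when the field reports half-precision arithmetic (0b0011).
   Assumption: larger field values keep this capability. */
static inline bool fphp_has_fp16_arithmetic(uint32_t reg_value)
{
    return fphp_field(reg_value) >= 0x3u;
}
```

For example, a raw value of 0x03000000 has the field set to 0b0011, while 0x02000000 does not reach the level that adds arithmetic.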

A simple question: where can I find a list of ARM instructions implementing half-precision floating-point arithmetic?

UPD. Per Clang for Arm (armclang) documentation:

The __fp16 data type is not an arithmetic data type. The __fp16 data type is for storage and conversion only.
The _Float16 data type is an arithmetic data type. Operations on _Float16 values use half-precision arithmetic.

Hence, when using Clang for Arm I need to use _Float16 (not __fp16).

Per GCC for Arm documentation:

The __fp16 type may only be used as an argument to intrinsics defined in <arm_fp16.h>, or as a storage format. For purposes of arithmetic and other operations, __fp16 values in C or C++ expressions are automatically promoted to float. It is recommended that portable code use the _Float16 type defined by ISO/IEC TS 18661-3:2015.

Hence, when using GCC for Arm I need to use _Float16 (not __fp16).

However, why does GCC for Arm generate vmul.f16 in this example from Nate Eldredge, instead of half<->float conversions followed by vmul.f32? Per the quote above, __fp16 values in C or C++ expressions are automatically promoted to float. Why are they not promoted to float in this case?

Solution

It's not really a separate list. When this feature is present, basically all the floating-point instructions that already exist gain support for half-precision.

In AArch64 state, you use the same floating-point instruction mnemonics, using h registers or vector element sizes to specify a half-precision operation. For example, fadd h0, h1, h2 does a half-precision floating-point add (scalar), and fadd v0.8h, v1.8h, v2.8h does eight such adds in parallel (vector).
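To see these instructions come out of a compiler, a rough C sketch like the following could be built for AArch64 with the FP16 extension enabled (for example with -march=armv8.2-a+fp16) and the generated assembly inspected. I would expect the scalar function to become an fadd on h registers, and the loop may be vectorized at higher optimization levels, but I have not verified the exact output. The names half_t, half_add and half_add8 are made up for this sketch, and the float fallback exists only so it still builds on compilers without _Float16.

```c
#include <stddef.h>

/* Assumption: __FLT16_MANT_DIG__ is predefined when the compiler supports
   _Float16 (GCC does this). Otherwise fall back to float so the sketch
   still compiles; in that case no half-precision arithmetic is used. */
#if defined(__FLT16_MANT_DIG__)
typedef _Float16 half_t;
#else
typedef float half_t;
#endif

/* Scalar add: the C-level counterpart of a single half-precision add. */
half_t half_add(half_t a, half_t b)
{
    return a + b;
}

/* Element-wise add over eight values, matching the 8h vector arrangement. */
void half_add8(half_t dst[8], const half_t a[8], const half_t b[8])
{
    for (size_t i = 0; i < 8; ++i)
        dst[i] = a[i] + b[i];
}
```

Compiling with -S (or using an online compiler explorer) and searching the output for fadd is a quick way to check which form was actually emitted.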

In AArch32 state, you use a .f16 suffix on the mnemonic. So a scalar half-precision add is vadd.f16 s0, s1, s2 (in 32-bit state the h register names are not used, and the result is zero-extended into the 32-bit s register). The vector forms should be (untested) vadd.f16 d0, d1, d2 for a four-element vector add, or vadd.f16 q0, q2, q4 for eight elements.

If you really want a list of all the instruction forms added by the FP16 feature, you can skim the tables in the Instruction Set Encoding chapters of the Architecture Reference Manual and look for FP16 in the Feature column, or search for (FEAT_FP16) in the instruction descriptions chapter.