On my various M4 and M7 Nucleo boards I use a trivial assembler timing loop (a SUBS and a BNE) in conjunction with a blinking LED. On Cortex-M4 these two instructions consume 3 processor clocks per iteration, which is readily confirmed. On my Nucleo-H723ZG Cortex-M7 board the two instructions together consume only a single clock cycle. This improvement is due to dual issuing of the two instructions, with the branch effectively having zero latency. However, on my Nucleo-H743ZI2 board the loop instructions take TWO processor clocks, not ONE. As these are both M7 processors running identical code, I need help understanding why dual issuing appears not to be working!
As I already mentioned in reply to your previous question, the cycle count of these instructions is variable, and there is no documented way to predict what it will be. It depends on the wider context of the program, especially the alignment of the instructions, their position within a cache line and their distance from branch targets, among other things.
Also, as I already mentioned, if you want to make the execution time more predictable (but slower), you must set the DISFOLD bit in the Auxiliary Control Register (ACTLR).
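As a rough sketch of what setting that bit looks like in C: this assumes ACTLR sits at 0xE000E008 and that DISFOLD is bit 2, which is what I understand the Cortex-M7 documentation to say, so check the Technical Reference Manual for your core revision. The macro and function names are my own, not from any vendor header.

```c
#include <stdint.h>

/* Assumed values for Cortex-M7; verify against the TRM for your part. */
#define ACTLR_ADDR        0xE000E008u   /* Auxiliary Control Register address */
#define ACTLR_DISFOLD_BIT (1u << 2)     /* DISFOLD: disables dual issue */

/* Set DISFOLD with a read-modify-write so the other ACTLR bits are preserved.
   The register is passed as a pointer so the same code can be exercised
   off-target against an ordinary variable. */
static inline void actlr_set_disfold(volatile uint32_t *actlr)
{
    *actlr |= ACTLR_DISFOLD_BIT;
}
```

On the target you would call actlr_set_disfold((volatile uint32_t *)ACTLR_ADDR) once, early in startup and from privileged mode, and it is probably sensible to follow it with DSB and ISB barriers before running the code you want to time.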
I assume that you are trying to make a delay loop, but using instructions without defined cycle counts to perform timing delays is wrong. You need to use some kind of clock or timer.
There are two options that I have used successfully on STM32 for this purpose, with very high accuracy and very low overhead: the SysTick timer and the Debug Cycle Counter. Both provide a register that you can read repeatedly, calculating the difference between readings.
Here is an example of some code for accurate delay functions on Cortex M.
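A minimal sketch of the cycle-counter approach follows. It assumes a free-running 32-bit up-counter such as the DWT cycle counter; the function names (cycles_elapsed, delay_cycles, us_to_cycles) are made up for illustration, and the counter is passed in as a pointer so nothing here depends on a particular device header.

```c
#include <stdint.h>

/* Cycles elapsed between two readings of a free-running 32-bit up-counter.
   Unsigned subtraction wraps modulo 2^32, so the result is still correct
   if the counter overflowed once between the two readings. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}

/* Convert microseconds to core clock cycles. The 64-bit intermediate
   avoids overflow in the multiplication. */
static inline uint32_t us_to_cycles(uint32_t us, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)us * core_hz) / 1000000u);
}

/* Busy-wait until at least `cycles` counter ticks have passed.
   `counter` points at the hardware counter register, which must already
   be enabled and counting, otherwise this loop never exits. */
static inline void delay_cycles(volatile const uint32_t *counter, uint32_t cycles)
{
    uint32_t start = *counter;
    while (cycles_elapsed(start, *counter) < cycles) {
        /* spin */
    }
}
```

With the CMSIS core header included, the counter would first be enabled with something like CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; DWT->CYCCNT = 0; DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk; (some M7 parts may also need the DWT lock register unlocked first). A 10 microsecond delay is then delay_cycles(&DWT->CYCCNT, us_to_cycles(10, SystemCoreClock)). The delay is a lower bound: interrupts and the loop's own overhead can only make it longer, never shorter.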