I have to write x86 assembly code that runs on an Intel x86 processor.
Specifically, I need to write sequences of instructions such as additions or moves and see how they affect the processor's performance with respect to temperature. That means my code should be capable of controlled heat generation from the processor.
If anyone has such code, or has experience writing this type of code, please share.
For maximum heat, you want as many transistors as possible changing state every clock cycle. The floating point FMA units have a lot of transistors; keeping them busy makes a lot of heat, especially for 256b AVX vectors.
For example, see the "stress testing" section of this Skylake overclocking guide, where you can see that Prime95 version 28 and Linpack are the hottest-running workloads. There's also a table of whole-system power consumption.
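As a rough illustration of keeping the FMA units busy, here is a minimal C sketch (not hand-written asm). It assumes the compiler contracts each multiply-add into an FMA and vectorizes the 8-float inner loop into 256-bit operations, which typically needs something like `gcc -O3 -march=native`; check the generated asm to confirm. The name `fma_burn` and the `LANES` / `CHAINS` constants are made up for this example, and the chain count is only a guess at what is enough to hide FMA latency.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES  8   /* 8 floats = one 256-bit vector (assumed target width) */
#define CHAINS 10  /* independent accumulators; assumed enough to hide FMA latency */

/* Hypothetical helper: run `iterations` rounds of multiply-add over
 * CHAINS independent accumulator vectors and return their sum, so the
 * compiler cannot discard the work. */
float fma_burn(uint64_t iterations)
{
    float acc[CHAINS][LANES];
    for (size_t c = 0; c < CHAINS; c++)
        for (size_t l = 0; l < LANES; l++)
            acc[c][l] = 1.0f + 0.001f * (float)(c * LANES + l);

    /* x = x * 0.999 + 0.001 converges towards 1.0, so values stay
     * finite and never become denormal. */
    const float mul = 0.999f;
    const float add = 0.001f;

    for (uint64_t i = 0; i < iterations; i++)
        for (size_t c = 0; c < CHAINS; c++)      /* chains do not depend on each other */
            for (size_t l = 0; l < LANES; l++)
                acc[c][l] = acc[c][l] * mul + add;

    float sum = 0.0f;
    for (size_t c = 0; c < CHAINS; c++)
        for (size_t l = 0; l < LANES; l++)
            sum += acc[c][l];
    return sum;
}
```

Call it with a large iteration count (e.g. `fma_burn(1000000000ULL)`) and use the return value somehow. Having several independent accumulators matters: a single dependency chain would be limited by FMA latency instead of throughput.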
See also http://agner.org/optimize/ to learn more about CPU internals, especially Agner's microarch guide. You should be able to generate more or less heat depending on whether your loop fits in the loopback buffer. The x86 decoders are much more power-intensive than reusing already-decoded uops. See this Q&A about uop throughput for various loop sizes, for the case where there aren't significant dependencies between the instructions so only the frontend limits throughput. (See also the x86 tag wiki).
I doubt you'll see much difference in heat between integer add reg, reg and mov reg, reg or something similar. Maybe saturating the throughput of the integer mul unit would make a measurable heat / power difference, but the difference in cost between an adder and a mov or a simple boolean op is probably dwarfed by the power cost of out-of-order execution tracking the add through the pipeline.
Loads or stores that keep the cache and store-buffer hardware active might be a different story, but add can have a memory source or dest too. Just make sure you don't bottleneck your loop on the store-forwarding latency of a single memory-destination add.
For minimum heat without actually sleeping, use the pause instruction in a loop. On Skylake, it sleeps much longer (~100 cycles) than on previous Intel microarchitectures (~5 cycles), IIRC.
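A minimal sketch of such a low-heat spin loop in C, assuming GCC or Clang (which provide the `__builtin_ia32_pause()` builtin on x86). The function names are made up for this example, and on non-x86 targets the hint degrades to a plain compiler barrier.

```c
#include <stdint.h>

/* Execute one spin-wait hint. On x86 this compiles to the `pause`
 * instruction (GCC/Clang builtin); elsewhere it is only a compiler
 * barrier so the loop below is not optimized away. */
static inline void cpu_relax_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#else
    __asm__ volatile("" ::: "memory");
#endif
}

/* Hypothetical helper: execute `iterations` pause hints back to back
 * and return how many were executed. */
uint64_t spin_with_pause(uint64_t iterations)
{
    uint64_t done = 0;
    for (uint64_t i = 0; i < iterations; i++) {
        cpu_relax_hint();
        done++;
    }
    return done;
}
```

In hand-written asm the same thing is just a `pause` followed by a loop branch. How long each iteration takes depends on the microarchitecture, so time it rather than assuming a fixed cycle count.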
According to powertop on Linux, the kernel uses mwait with different hints to enter different levels of sleep on Intel CPUs (e.g. my Skylake desktop). You might be able to do this from user-space if you want, or use nanosleep to alternate sleep/wake and run a heat-producing workload with a certain duty cycle.
Sleeping frequently may prevent the OS from ramping the CPU up to full clock speed, depending on your setup. See "Why does this delay-loop start to run faster after several iterations with no sleep?"
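The duty-cycle idea could look something like the following C sketch, assuming a POSIX system with `clock_gettime` and `nanosleep`. The names `run_duty_cycle`, `burn_until` and `now_ns` are made up for this example, and the integer multiply loop is only a placeholder for whatever heat-producing workload you actually want to run.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Written once per batch so the compiler keeps the busy work. */
static volatile uint64_t sink;

/* Placeholder workload: integer multiplies until the deadline passes.
 * Swap in a hotter loop here if you have one. */
static void burn_until(int64_t deadline_ns)
{
    uint64_t x = 1;
    while (now_ns() < deadline_ns) {
        for (int i = 0; i < 1000; i++)
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        sink = x;
    }
}

/* Run `periods` periods of `period_ns` each, busy for `duty_percent`
 * of every period and asleep for the rest. Returns the total busy
 * time requested, in nanoseconds. */
int64_t run_duty_cycle(int periods, int64_t period_ns, int duty_percent)
{
    assert(periods >= 0 && period_ns > 0);
    assert(duty_percent >= 0 && duty_percent <= 100);

    int64_t busy_ns = period_ns * duty_percent / 100;
    int64_t idle_ns = period_ns - busy_ns;

    for (int p = 0; p < periods; p++) {
        burn_until(now_ns() + busy_ns);
        if (idle_ns > 0) {
            struct timespec idle = {
                .tv_sec  = idle_ns / 1000000000LL,
                .tv_nsec = idle_ns % 1000000000LL,
            };
            nanosleep(&idle, NULL);
        }
    }
    return busy_ns * (int64_t)periods;
}
```

For example, `run_duty_cycle(1000, 10000000, 30)` would aim for roughly 10 seconds at about 30% load in 10 ms periods. The actual sleep lengths depend on timer granularity and the scheduler, so very short periods will not hit the requested ratio exactly.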
For other ideas on reducing throughput in a loop, see Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. Stalls that are just slow without flipping a lot of transistors to recover might be a good way to make a loop that doesn't make much heat.
Without pause, you'll see significant heating from just a simple infinite loop like .repeat: jmp .repeat, especially on a CPU that can "turbo" up to a high voltage/frequency for as long as thermal limits allow.