Why do we have instructions such as RDRAND instead of I/O that would give us similar results?

I'm wondering what the reason was for designing a CPU-specific instruction to generate random numbers.

The Intel processor has RDRAND and RDSEED. The PPC also has an equivalent instruction.

Wouldn't it make more sense to have a separate chip and just do some I/O to get those numbers? It seems to me that it makes the CPU even more complex for a very specialized instruction (most software never uses random numbers!) when I/O has been around for ages and should work just fine.

Solution

We have both.
The TPM can generate cryptographically secure random numbers (after all, it's a "crypto-chip"), and a TPM is present on many, if not all, Intel-based motherboards since Haswell.
Proprietary CSRNG PCI(e) cards are commercially available too.

I once attended a presentation of a home-made CSRNG built with an Arduino.
That guy had no notion of statistics or algebra. To be honest, the whole presentation was pathetic.
You can't just make a chip and claim it is a CSRNG: you must obtain certifications, and there are standards and methods to follow.
Getting those certifications is expensive and hard.

Furthermore, to handle a high bandwidth you also need a fast microprocessor.
One of the goals of the TPM committee was to make it cheap; the end result is that TPM chips are slow.

If you add the relatively low market demand for such chips, you can clearly see why CSRNG chips are indeed expensive.

External devices are also prone to physical attacks: the chip can easily be desoldered or decapped, and the bus can be tapped or replaced.
This is true even for the CSRNG inside the CPU: attacks are known, and one of them reduces its entropy by altering its transistors.
However, that requires a whole different class of tools.

PCI(e) CSRNGs will probably use DMA to transfer the requested number of bytes of entropy. That requires some coordination interface with the OS to know, for example, when a transfer is in progress.

And, of course, the payload would be in memory, which means a bigger software attack surface and an extra step to get it into the registers.
Accessing memory costs on the order of ~200-300 cycles.

Using port-mapped I/O (i.e. the in instruction) will bring the payload directly into a register, but only 32 bits at a time, and it's no faster than an ordinary load.

RDRAND is a user-mode instruction, giving user-mode applications access to a CSRNG with no extra burden other than checking that it is supported.
It comes with almost all recent CPUs, so it almost feels like it's free.
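
To make the "check for support, then just use it" point concrete, here is a minimal sketch in C, assuming GCC or Clang on x86 (it relies on their <cpuid.h> header and inline assembly syntax). The helper names rdrand_supported, rdrand32_once and get_random_u32 are made up for this illustration, and the retry limit of 10 is an assumption on my part, not something the instruction mandates. On other architectures the sketch simply reports that RDRAND is unavailable.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang wrapper around the CPUID instruction */

/* CPUID leaf 1 reports RDRAND support in bit 30 of ECX. */
static bool rdrand_supported(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ecx >> 30) & 1u) != 0;
}

/* One RDRAND attempt. The CPU sets the carry flag when a value was
 * actually delivered; if it is clear, the result must not be used. */
static bool rdrand32_once(uint32_t *out)
{
    unsigned char ok;

    __asm__ volatile("rdrand %0\n\tsetc %1"
                     : "=r"(*out), "=qm"(ok)
                     :
                     : "cc");
    return ok != 0;
}
#else
/* Not an x86 target: there is no RDRAND here. */
static bool rdrand_supported(void) { return false; }
static bool rdrand32_once(uint32_t *out) { (void)out; return false; }
#endif

/* Hypothetical helper: returns true and fills *out on success,
 * false if RDRAND is missing or kept failing. */
static bool get_random_u32(uint32_t *out)
{
    if (!rdrand_supported())
        return false;

    /* RDRAND may transiently fail, so retry a bounded number of times.
     * The limit of 10 is an assumed, arbitrary choice for this sketch. */
    for (int attempt = 0; attempt < 10; attempt++) {
        if (rdrand32_once(out))
            return true;
    }
    return false;
}
```

Note that no system call, driver or DMA buffer is involved: the value lands directly in a register. A caller would still want a fallback (for example the OS random source) for the case where get_random_u32 returns false.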

Finally, there is a marketing aspect at play.
If your manufacturing process has improved and has given you a few [insert-a-reasonable-length-unit-here] squared of space in the die, you can improve the micro-architecture or add a new feature.
The former is hard; the latter is relatively easy to design and may give you an edge over your competitors: common tools run faster on your CPU, and that is just because you could afford more space on the die.