Tags: assembly, x86, cpu-architecture, machine-code

Is it possible, and how hard would it be, to change the opcode of an instruction in the x86 architecture?


For instance, PUSH imm32 has the opcode 68h. Is it possible to use another number, for example 69h, to "represent" this instruction (assume this number is not used by any other instruction)?

By "represent", I mean wherever there is a PUSH instruction in the assembly, 69h will appear in the binary executable. When it is eventually being fetched and executed by the CPU, it will be transfer back to 68h.

I understand each opcode is specifically designed to match the CPU's circuitry, but is it possible to just use another hex number as a surrogate?

Of course I won't make any changes to the CPU, and I still want the instruction to execute on the x86 architecture.

Update: why am I asking this question?

You probably know of return-oriented programming (ROP) attacks, which purposefully misinterpret the stream of machine code and take advantage of the many C3 bytes (that is, ret) found in standard library code. My initial thought was that if we could change the opcode of ret from C3 to some other code, preferably 2 bytes, then ROP would not work. I am not an expert in CPU architecture, and I've just found that my idea won't work in reality. Thanks for all your responses.


Solution

  • My initial thought was that if we could change the opcode of ret from C3 to some other code, preferably 2 bytes, then ROP would not work.

    No, x86 instruction encodings are fixed, and mostly hard-wired in the silicon of the decoders inside the CPU. (Micro-coded instructions redirect to microcode ROM for the definition of the instruction, but the opcode that's recognized as an instruction is still hard-wired.)

    I think even a microcode update from Intel or AMD couldn't change their existing CPUs to not decode C3 as ret. (Although possibly they could make some other multi-byte sequence also decode as a very slow micro-coded ret, but probably only by taking over the encoding for an existing micro-coded instruction.)


    A CPU that didn't decode C3 as ret would not be an x86 CPU anymore. Or I guess you could make it a new mode, where instruction encodings were different. It wouldn't be binary-compatible with x86 anymore, though.

    It's an interesting idea, though. Single-byte RET on x86 makes it significantly easier to chain gadgets together (https://en.wikipedia.org/wiki/Return-oriented_programming#On_the_x86-architecture). (Or means there are many more gadgets that can be chained, giving you a larger toolbox.)
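
    A minimal sketch of why that's true: with variable-length encodings, an attacker can start decoding in the middle of an intended instruction, so c3 bytes inside immediates or displacements become extra gadgets. The byte values here are just an illustrative example:

    ```c
    #include <stdio.h>

    int main(void) {
        /* Assembled as intended (start at offset 0):
             b8 5b 5e c3 90   =  mov eax, 0x90c35e5b   (one 5-byte instruction)
           Decoded starting at offset 1, inside the immediate:
             5b               =  pop rbx
             5e               =  pop rsi
             c3               =  ret    <- an unintended gadget */
        unsigned char code[] = { 0xb8, 0x5b, 0x5e, 0xc3, 0x90 };

        printf("unintended gadget at &code[1]: %02x %02x %02x\n",
               code[1], code[2], code[3]);
        return 0;
    }
    ```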

    I wouldn't hold my breath waiting for CPU vendors to provide a new mode where ret uses a 2-byte opcode. It would be possible, though (for CPU vendors to make a new design, not for you to hack your existing CPU). By making it a separate mode (like 64-bit long mode vs. 32-bit compat mode under a 64-bit kernel, vs. "legacy mode" with a 32-bit kernel) OSes would still work on such CPUs, and you could mix/match user-space processes under the same kernel, some compiled for x86 and some for new86.

    If vendors were going to introduce a new incompatible mode that couldn't run existing binaries, hopefully they'd make other cleanups to the instruction set: e.g. removing the false dependency on FLAGS for variable-count shifts by having them always write FLAGS even when the count is 0, or redoing the opcodes entirely so as not to spend so much coding space on 1-byte xchg eax, r32, and to shorten the encodings for SIMD instructions. But then they couldn't share as many decoder transistors with the regular x86 decoders. And any changes like the EFLAGS semantics for shifts could require changes in the back-end, not just the decoders.
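
    For a concrete feel of that coding-space cost, here's the opcode row in question as a small sketch (note 0x90, which would be xchg eax, eax, is special-cased as nop):

    ```c
    #include <stdio.h>

    int main(void) {
        /* Opcodes 0x90..0x97 are all 1-byte "xchg eax, r32":
           a whole 8-opcode row spent on a rarely-used instruction. */
        const char *r32[] = { "eax", "ecx", "edx", "ebx",
                              "esp", "ebp", "esi", "edi" };
        for (int i = 0; i < 8; i++)
            printf("%02x = xchg eax, %s%s\n", 0x90 + i, r32[i],
                   i == 0 ? "   (special-cased as nop)" : "");
        return 0;
    }
    ```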

    They could also make [rsp+disp8/32] addressing modes 1 byte shorter, maybe using a different register as the one that always needs a SIB byte even with no index. (-fomit-frame-pointer is typical now, so it sucks that addressing relative to the stack-pointer costs an extra byte.)
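
    A sketch of that encoding difference (byte values follow from the standard ModRM/SIB rules; check with a disassembler):

    ```c
    #include <stdio.h>

    int main(void) {
        /* rm=100 in the ModRM byte means "SIB byte follows", and rsp is
           register encoding 100, so rsp-relative addressing always pays
           for a SIB byte (0x24 here) even with no index register. */
        unsigned char via_rbp[] = { 0x8b, 0x45, 0x08 };        /* mov eax, [rbp+8] */
        unsigned char via_rsp[] = { 0x8b, 0x44, 0x24, 0x08 };  /* mov eax, [rsp+8] */

        printf("[rbp+8]: %zu bytes, [rsp+8]: %zu bytes\n",
               sizeof via_rbp, sizeof via_rsp);
        return 0;
    }
    ```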

    See Agner Fog's Stop the instruction set war blog post for more details about how much of a mess x86 instruction encoding is.


    How much change to the CPU circuit design would be required at minimum to make c3 the start of a 2-byte instruction that required the 2nd byte to be 00?

    Intel CPUs decode in multiple stages:

    • The instruction-length pre-decoder finds instruction boundaries, placing instruction bytes in a queue (processing up to 16 bytes or 6 instructions, whichever is lower, per cycle). See https://www.realworldtech.com/sandy-bridge/3/ for a block diagram.

    • The decode stage grabs 4 (or 5 in Skylake) instructions from that queue and feeds them in parallel to the actual decoders. Each decoder outputs 1 or more uops. (See the next page in David Kanter's SnB writeup.)

    Some CPUs mark instruction boundaries in the L1i cache, and do this decoding as a line arrives from L2. (AMD did this more recently than Intel, but IIRC Ryzen doesn't, and Intel hasn't in P6 or SnB-family. See Agner Fog's microarch guide.)

    The fact that c3 is a one-byte opcode with no following bytes is hard-wired into the instruction-length decoders, so that would have to change.

    But then how to handle the 2nd byte? You could either have the decoder that gets c3 xx check that xx == 00 and raise a #UD exception if not (UnDefined instruction, aka illegal instruction).

    Or it could decode it as an imm8 operand, and have an execution unit check that the operand was 0.

    It's probably easier to have the decoders do this mode-dependent check on the next byte, because they have to decode other insns differently for different modes anyway.
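
    To make the two options concrete, here's a toy software sketch of the decode-time check (nothing like real hardware; the insn_length function and the ret_is_2_bytes mode flag are made up for illustration):

    ```c
    #include <stdio.h>

    /* Toy instruction-length decoder for a tiny opcode subset.
       Returns the length in bytes, or -1 to signal #UD. */
    static int insn_length(const unsigned char *p, int ret_is_2_bytes) {
        switch (p[0]) {
        case 0x90: return 1;               /* nop */
        case 0x68: return 5;               /* push imm32: opcode + 4 bytes */
        case 0xc3:                         /* ret */
            if (!ret_is_2_bytes)
                return 1;                  /* real x86: hard-wired as 1 byte */
            return p[1] == 0x00 ? 2 : -1;  /* hypothetical mode: must be c3 00 */
        default:   return -1;              /* not modeled here */
        }
    }

    int main(void) {
        unsigned char code[] = { 0xc3, 0x41 };
        printf("legacy mode: %d\n", insn_length(code, 0)); /* 1 */
        printf("new mode:    %d\n", insn_length(code, 1)); /* -1: 2nd byte != 0, #UD */
        return 0;
    }
    ```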


    00 isn't "special". The regular decoders probably receive instruction bytes in a wide input that's probably 15 bytes long (max x86 instruction length). But there's no reason to assume they would look at bits/bytes past the instruction length and fault if it wasn't zero-extended. It might be designed that way, but just as likely the handing for 1-byte opcodes like c3 is hard-wired and doesn't have any higher bits ANDed, ORed, or XORed with any of the opcode bits.

    An opcode or whole insn isn't an integer that has to be zero-extended. You can't assume that there's anything like an "instruction register".


    Making c3 xx not decode as ret for xx!=0 would still break essentially all existing binaries, and still require a new mode if you were making a CPU that could operate that way.

    On CPUs that mark instruction boundaries in L1i cache, always treating ret as a 2-byte instruction (not including prefixes) wouldn't work. It's not that rare for the byte right after a ret to be a jump target, or a different function. Jumping to the "middle" of another instruction would force such a CPU to redo the instruction-boundary marking, starting from that point in the cache line, and then you'd have another problem when you ran the ret again.
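
    A small sketch of the byte layout that causes the problem (illustrative values only):

    ```c
    #include <stdio.h>

    int main(void) {
        /* Offset 0: eb 01 = jmp short +1, i.e. to offset 3.
           Offset 2: c3    = ret ending one code path.
           Offset 3: c3    = the jump target -- the byte right after a ret.
           A CPU that had marked "c3 xx" at offset 2 as one 2-byte
           instruction would have to throw those boundary marks away
           when the jump lands at offset 3. */
        unsigned char code[] = { 0xeb, 0x01, 0xc3, 0xc3 };

        printf("bytes: %02x %02x | %02x | %02x  (jump target = last byte)\n",
               code[0], code[1], code[2], code[3]);
        return 0;
    }
    ```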

    Also, a c3 in the last byte of a page, followed by an unmapped page, must not page-fault. But that would happen if the instruction-length decoding stage always fetched another byte after c3 before letting it decode. (Running code from uncacheable memory would also make this an observable change; UC is the CPU equivalent of volatile.)
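
    That observation is easy to check from user-space; a minimal sketch for Linux/x86-64 (assumes mmap with PROT_EXEC is allowed, which hardened systems may refuse, and uses the usual data-to-function-pointer cast that POSIX tolerates but ISO C doesn't):

    ```c
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long pagesz = sysconf(_SC_PAGESIZE);

        /* Map two pages, then unmap the second so nothing follows the first. */
        unsigned char *buf = mmap(NULL, 2 * pagesz,
                                  PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        munmap(buf + pagesz, pagesz);

        buf[pagesz - 1] = 0xc3;              /* lone ret as the page's last byte */
        void (*f)(void) = (void (*)(void))(buf + pagesz - 1);
        f();                                 /* must return, not fault */

        puts("ret in the last byte of a page executed fine");
        return 0;
    }
    ```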

    I suppose you could maybe have the length-decoding stage tack on a fake 00 byte for the decoders when running in a mode where ret is a single byte. ret is an unconditional jump, but it can fault if [rsp] isn't readable. But I think the exception frame would just have the start address of the instruction, not a length, so it might be OK for the rest of the pipeline to think it was a 2-byte instruction when it was actually only 1.

    But it still has to go in the uop cache somehow, and the uop cache needs to care about instruction start/end addresses even for unconditional jumps. For an instruction that spans a 64-byte cache-line boundary, it would need to invalidate the cached instruction if either line changed.

    My understanding is that real-life CPU design is always harder and more complex than you imagine from looking at block diagrams like David Kanter's articles.


    And BTW, it's not particularly relevant how small a change in the decoders would be needed. The fact that only a CPU vendor could make this change in a new design makes your idea a total non-starter, outside of instruction-set design ideas. It's slightly more plausible than a complete re-organization of x86 machine code, because it can still share almost all of the decoder transistors with existing modes.

    Supporting a whole new mode for this would be significant, requiring changes to the CPU's code segment descriptor (GDT entry) decoding.

    It would be a much easier change to create a CPU that always requires c3 to be followed by 00, but then it wouldn't be an x86 and couldn't run the vast majority of code. There's zero chance of Intel or AMD ever selling a CPU like that.