assembly gcc cpu cpu-architecture

Is there any way to make GCC generate extra NOP instructions to align instruction execution to a certain block size?


Context: Hello, I have recently been building a custom CPU with 16-, 32-, and 48-bit instruction lengths. The CPU fetches data in 64-bit blocks, and this works fine until an instruction is caught between two blocks. In that case the CPU has to fetch two blocks, which hurts its performance.

Question: Is there any way to make GCC align instructions to 64-bit blocks with NOPs by adding additional parameters to the compilation process? If not, what is the correct approach to make GCC align instructions with NOPs?

This is how an instruction gets caught between two blocks:

          +---------------+ 
          |Unaligned ins  |
          +---------------+
 +---------------+ +---------------+
 |     64Bits    | |     64Bits    |
 +---------------+ +---------------+

This is the ideal layout for 16-bit and 48-bit instructions that I want GCC to achieve. Each small empty box represents a 16-bit instruction, and the large empty box in the second block represents a 48-bit instruction. A 48-bit instruction leaves the following instructions misaligned: if another 48-bit or 32-bit instruction follows it, that instruction gets caught between two data blocks. I want GCC to generate a NOP instruction to prevent unaligned instruction execution, as shown in the last empty box.


+---+---+---+---+ +-----------+---+
|   |   |   |   | |           |   |
+---+---+---+---+ +-----------+---+
+---------------+ +---------------+
|     64Bits    | |     64Bits    |
+---------------+ +---------------+

What I have already tried: I tried adding parameters to GCC such as -falign-loops=##, -falign-functions=##, and -falign-jumps=##, but they don't achieve what I'm looking for.


Solution

  • Can you have GCC print .p2align 3,,4 before every 48-bit instruction, and .p2align 3,,2 before every 32-bit instruction? I don't know exactly where to modify GCC's source code to do that, but it avoids needing to actually track instruction sizes and current alignment.

    That will pad for alignment to a 2^3 byte (64-bit) boundary, but only if it requires at most 4 bytes (or 2 bytes) of padding.

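    With those limits, it won't pad before a 6-byte instruction if it's 6 bytes ahead of a chunk boundary (and thus can fit), because reaching the boundary would take more than the 4-byte maximum. The same holds for a 4-byte instruction that is 4 or 6 bytes ahead of a boundary.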

    More optimal would be instruction-scheduling that's aware of boundaries and tries to re-order to pack into chunks without leaving large gaps to fill with NOPs.
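
The padding rule above can be checked with a small model. This is only a sketch in Python, not GCC or GAS code: the names `padding_before`, `straddles`, and `layout` are made up for illustration, and it assumes every instruction starts on a 2-byte boundary. It mimics `.p2align 3,,N`, which pads to the next 8-byte boundary only when that takes at most N bytes.

```python
CHUNK = 8  # bytes in one 64-bit fetch block

# Max-skip values from the answer: ".p2align 3,,4" before 48-bit (6-byte)
# instructions, ".p2align 3,,2" before 32-bit (4-byte) ones, nothing before
# 16-bit (2-byte) ones.
MAX_SKIP = {6: 4, 4: 2, 2: 0}


def padding_before(offset, max_skip, chunk=CHUNK):
    """Bytes of padding that '.p2align 3,,max_skip' would insert at `offset`."""
    needed = (-offset) % chunk          # distance to the next chunk boundary
    return needed if needed <= max_skip else 0


def straddles(offset, size, chunk=CHUNK):
    """True if an instruction of `size` bytes at `offset` crosses a boundary."""
    return offset // chunk != (offset + size - 1) // chunk


def layout(sizes):
    """Place instructions in order; return (offset, size, padding) per instruction."""
    offset = 0
    placed = []
    for size in sizes:
        pad = padding_before(offset, MAX_SKIP[size])
        offset += pad
        placed.append((offset, size, pad))
        offset += size
    return placed


if __name__ == "__main__":
    # Two 48-bit instructions back to back: the second gets 2 bytes of padding.
    for off, size, pad in layout([6, 6]):
        print(f"offset {off:2d}  size {size}  padding {pad}  "
              f"straddles={straddles(off, size)}")
```

Under these assumptions, no 4-byte or 6-byte instruction ends up crossing a block boundary, and padding is only inserted where the instruction would otherwise not fit.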


    If your GAS doesn't know how to generate 2- or 4-byte NOPs itself, the simple but crude way would be to use .p2alignw 3, 0x1234, 4 to tell it to fill with 2-byte sequences of 0x1234. (Where 0x1234 is a placeholder for the encoding of a 2-byte NOP instruction.)

    Slightly less bad would be teaching GAS to emit proper NOP instructions itself, using one 4-byte NOP instead of 2x 2-byte NOPs when 4 bytes of padding are needed. The .p2alignw fill is just a dirty hack you can do without modifying GAS.
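
To make the fill behaviour concrete, here is a rough Python model of the padding bytes that `.p2alignw 3, 0x1234, 4` would produce at a given offset. The function name `p2alignw_fill` is hypothetical, 0x1234 is still only the placeholder NOP encoding from above, and the little-endian byte order is an assumption (the real order follows the target's endianness).

```python
def p2alignw_fill(offset, nop16, max_skip, chunk=8, byteorder="little"):
    """Padding bytes modelled on '.p2alignw 3, nop16, max_skip' at `offset`.

    Assumes `offset` is even, i.e. instructions are 2-byte aligned.
    """
    needed = (-offset) % chunk          # bytes to the next chunk boundary
    if needed == 0 or needed > max_skip:
        return b""                      # already aligned, or too far: no padding
    # Repeat the 2-byte fill word until the boundary is reached.
    return nop16.to_bytes(2, byteorder) * (needed // 2)


if __name__ == "__main__":
    # 4 bytes short of a boundary: padded with two 2-byte placeholder NOPs.
    print(p2alignw_fill(4, 0x1234, 4).hex())
    # 6 bytes short: more than the 4-byte maximum, so nothing is emitted.
    print(p2alignw_fill(2, 0x1234, 4).hex())
```

The first case shows why this is only a hack: 4 bytes of padding cost two NOPs to execute where a single 4-byte NOP would do.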