parsing bytecode disassembly parser-generator

Parser generator for writing a disassembler for a proprietary VM format

I am trying to write a disassembler for Mindstorms EV3 VM binaries, preferably in Python, since I am very familiar with it. I have docs detailing the instructions set, opcodes, parameters, data types etc. which is available here but I am having trouble building an actual disassembler from that.

As I understand it, writing a disassembler is not much different from writing a compiler in that it's just a glorified deterministic finite-State machine. However, beyond that, I can not find any books or guides on writing one. I attempted to use compiler writing tools such as Lex and Yacc (or in python PLY), however all of those tools fail me in some way.

The Lex and Yacc combo does not appear to be something I can use because it does the reverse of what I want to do. It generates tokens from text, which you then turn into bytecode, whereas I have bytecode and want to turn it into tokens, which I then turn into text. As I see it the only option I have is writing my own parser generator, which I would like to avoid as it seems like a lot of work.

Another issue I have is that I do not know how to deal with things like null-terminated strings and the parameter encoding (some arguments are prefixed by a byte with certain bits set telling the VM what length and type to expect, which I presume means I can not parse the whole bytecode with a simple DFA at a byte level)

How would I write a disassembler for a format like this?

Solution

Writing binary disassemblers really mean you want to specify patterns of binary data, and relationships between decode entities (if you recognize two instructions, it is reasonable to assume they don't overlap).

Conventional parsers don't do this very well (although you can certainly write a grammar for a sequence of instructions assuming individual bits or bytes as tokens; you'll still have to handle gaps between instuction sequences).

The smartest approach I've seen for this is a tool called SLED by Norman Ramsey and his team, which lets you write succinct patterns for binary data, and assembles it into binary encoders and decoders automatically. This research paper discusses SLED and a number of similar systems. A key point: these are much more than a "parser generator" but the concept is similar: many patterns to describe the data, a code generator to assemble the patterns into an efficient engine to match them all.

To give you a flavor of what these tools look like, I offer a fragment of a partial x86-64 encoding based on some work I have done in this area. The idea is to define named patterns (with constraints) that compose so that you can eventually write an instruction definition. Here's a brief sample of a small part (the whole thing is some 1200 lines):

datatype SIB
 { ScaleSize:     unsigned encoding { Times1=0 Times2=1 Times4=2 Times8=3} 2 bits
   ScaledIndex:   unsigned encoding { EAX=0 ECX=1 EDX=2 EBX=3 none=4 EBP=5         ESI=6 EDI=7 } 3 bits
   IndexRegister: unsigned encoding { EAX=0 ECX=1 EDX=2 EBX=3 ESP=4  EBP,disp32=5  ESI=6 EDI=7 } 3 bits
 }

encoding Grp1     { ADD=0 OR ADC SBB AND SUB XOR CMP }
encoding Grp1A    { POP=0 * * * * * * * }
encoding Grp2     { ROL=0 ROR RCL RCR SHL,SAL SHR  * SAR }
encoding Grp3     { TESTIbIz=0 * NOT NEG MULAX,MULrAX IMULAL,IMULrAX DIVAL,DIVrAX IDIVAL,IDIVrAX }
encoding Grp4     { INCEb=0  DECEb * * * * * * }
encoding Grp5     { INCEv=0  DECEv CALLNEv CALLFEp  JMPNEv JMPFEp  PUSHEv * }
encoding Grp6     { SLDTRvMW=0 STRRvMw LLDTEw LTREw VERREw VERWEw * * }
encoding Grp7mem  { SGDTMs=0                          SIDTMs        LGDTMs LIDTMs SMSWMwRv * LMSWEw INVLPGMb }
encoding Grp7reg  { VMCALL,VMLAUNCH,VMRESUME,VMXOFF=0 MONITOR,MWAIT *      *      SMSWMwRv * LMSWEw SWAPGS }
encoding Grp8     { *=0 * * * BT BTS BTR BTC }
encoding Grp9mem  { * CMPXCH8BMq,CMPXCH16BMdq * * * * VMPTRLDMq,VMCLEARMq,VMXONMq VMPTRSTMq }
encoding Grp9reg  { *=0 * * * * * * * }
encoding Grp10    { * * * * * * * }
encoding Grp11Ib  { MOVEbIb * * * * * * * }
encoding Grp11Iz  { MOVEvIz * * * * * * * }
encoding Grp12mem { * * * * * * * * }
encoding Grp12reg { *=0 * * PSRLWNqIb,PSRLWUdqIb *             PSRAWNqIb,PSRAWUdqIb * PSLLWNqIb,PSLLWUdqIb * }
encoding Grp13mem { * * * * * * * * }
encoding Grp13reg { *=0 * * PSRLDNqIb,PSLRDUdqIb *             PSRADNqIb,PSRADUdqIb * PSLLDNqIb,PSLLDUdqIb * }
encoding Grp14mem { * * * * * * * * }
encoding Grp14reg { *=0 * * PSRLQNqIb,PSRLQUdqIb  PSRLDQUdqIb  *                    * PSLLQNqIb,PSLLQUdqIb PSLLDQUDqIb }
encoding Grp15mem { FXSAVE=0 FXRSTOR LDMXCSR STMXCSR * * * CFLUSH }
encoding Grp15reg { *=0 * * * LFENCE MFENCE SFENCE }
encoding Grp16mem { PREFETCHNTA=0 PREFETCHT0 PREFETCHT1 PREFETCHT2 * * * } 
encoding Grp16reg { * * * * * * * * }

...

instruction { ADCr64Immediate => Grp1.ADC
      ADDr64Immediate => Grp1.ADD
      ANDr64Immediate => Grp1.AND
      CMPr64Immediate => Grp1.CMP
      ORr64Immediate  => Grp1.OR
      SBBr64Immediate => Grp1.SBB
      SUBr64Immediate => Grp1.SUB
      XORr64Immediate => Grp1.XOR
    }
   (Target: Register32, Immediate: signed 32 bits)
   BasicInstruction
   & prefix_length0
   & if Intel64 => prefix_REX(W=On R=Target:3)
   & OneByteOpcode & Subcode=ImmGrp1EvIz
   & ModRM(Mod=reg RSlashM=Target:2-0 reg=*)

If you are really decoding just a simple virtual machine with a simple instruction set, maybe you don't need all this capability, because a "simple VM" doesn't pack bits or split data across byte boundaries, and/or you can hack your way through the few cases that violate these assumptions. As people's VMs get more complex (they evolve over years), they necessarily get more complicated. YMMV.