Search code examples
cassemblyx86jitvm-implementation

Simple bytecode translator


I'm trying to build a fast JIT using C/C++. It should translate mine bytecode to IA32. Yes, I know about libjit and similar, but I'm sure they're not simpler than this. I thought I've found a faster way to build the instruction, but I was wrong - the traditional switch/case method is ~2x faster than mine. My way is to copy a whole block and then fill a template. I know switch/case on modern compiler creates a jump table, so I didn't had a jump table implementation.

I've used a 50mb file that contains the contents of code[] looped to benchmark. My configuration is: i7, 8gb ram, compiled using VS2010.

Here's my code:

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <time.h>

//unsigned char code[] = {1,2,3,4,4};

void mycopy(unsigned char* to, unsigned char* from, int size) {
    int i = 0;
    while(i < size) {
        /*if (i < size-3) {
            if (*(unsigned int*) &from[i] == 0xFFFFFFFF) {
                *(unsigned int*) &to[i] = 0xC3C3C3C3;
                i += 4;
                continue;
            }
        }*/

        to[i] = from[i];
        i++;
    } 
}

void translateA(unsigned char* code, unsigned char* output, int size) {

    unsigned char A[] = { 3, 1, 1, 1 }; // { size, <bytes...> }
    unsigned char B[] = { 2, 2, 2 };
    unsigned char C[] = { 8, 3, 3, 3, 3, 0xFF, 0xFF, 0xFF, 0xFF };
    unsigned char D[] = { 1, 4 };

    void* templat[] = { &A, &B, &C, &D };

    int i = 0;
    int total = 0;
    while(i < size) {
        int op_index = (int) code[i] - 1;
        unsigned char* instr_buffer = (unsigned char*) templat[op_index];
        int size = (int) instr_buffer[0];
        instr_buffer++;
        mycopy(output+total, instr_buffer,size);
        total += size;
        i++;
    }
}




void translateB(unsigned char* code, unsigned char* output, int size) {
    for(int i = 0; i < size; i++) {
        switch(code[i]) {
            case 1:
                output[0] = 1;
                output[1] = 1;
                output[2] = 1;
                output += 3;
                break;
            case 2:
                output[0] = 2;
                output[1] = 2;
                output += 2;
                break;
            case 3: 
                output[0] = 3;
                output[1] = 3;
                output[2] = 3;
                output[3] = 3;
                output[4] = 0xC3;
                output[5] = 0xC3;
                output[6] = 0xC3;
                output[7] = 0xC3;               
                output += 8;            
                break;
            case 4:
                output[0] = 4;
                output++;
                break;
        }
    }
}

int main(int argc, char* argv[]) {
    // load the 'code' to an array
    FILE* f = fopen("testops.bin", "r+");

    fseek(f, 0, SEEK_END);
    long fsize = ftell(f);
    fseek(f, 0, SEEK_SET);

    unsigned char* code = (unsigned char*) malloc(fsize);
    fread(code, fsize, 1, f);
    fclose(f);

    unsigned char* output = (unsigned char*) malloc(fsize*10);
    memset(output, 0x7A, fsize*10);

    // benchmark it
    time_t start = clock();

    // Replace with translateB. It's ~2x faster
    translateA(code, output, fsize);

    printf("\nelapsed %fs\n\n", (float) (clock()-start) / 1000); 

    printf("OUTPUT: ");
    for(int i=0;i<1024;i++) {
        if (output[i] == 0x7A) break;
        printf("%X", output[i]); 
    }

    printf("\n");

    system("PAUSE");
    return 0;
}

EDIT: My question is, why is the switch code faster? Is there something I'm doing wrong?


Solution

  • Build then disassemble your program, you will have your response.

    Switch case is more efficient due to compiler optimization, because it creates a jump table as you already said it.

    Your method makes more address lookup and have one more function call and loop per bytecode.

    Moreover, why do you use while for a loop which iteration number you know? The compiler may have additional optimization for for loop.

    And with a 4 cases switch, I'm not sure your compiler will create a jump table, so real examples may be even faster.