compiler-construction, llvm-ir, boxing

How do compilers refer to unboxed types?


I am currently building my first compiler in Python. I have completed the lexer, parser, and analyzer, and I plan to use the llvmlite library to emit IR. I am having trouble translating my dynamically typed language into statically typed LLVM IR. My current approach is:

  1. During analysis, I statically infer types where possible and add this information to my AST nodes.
  2. During code generation, I box dynamic values in a simple LLVM struct with a type tag (an i32 constant) and a generic pointer to the data. All my functions can then take these boxes as arguments, and I "simply" unbox them before proceeding to my function body.

I am realizing the hard way that I might be slightly in over my head. I can box values, but I am having trouble unboxing them and using them in my functions while still adhering to LLVM's SSA form. I am currently switching on the type tag and creating variables for my arguments accordingly. Here is the code:

... 

; x is a parameter
switch i32 %type_tag, label %done [ 
  i32 1, label %handle_int_x 
  i32 2, label %handle_float_x 
  i32 3, label %handle_bool_x 
  i32 4, label %handle_string_x 
] 
handle_int_x:
  %int_ptr = bitcast i8* %value_ptr to i32* ; Cast pointer to i32*
  %unboxed_int = load i32, i32* %int_ptr ; Load the integer value 
  store i32 %unboxed_int, i32* %x_int 

...

After this switch has run, I will have four different variables (one per type), and the argument will be in one of them, depending on which case ran. My issue is this: how do I refer to the argument inside my function? Say I want to return x + y: how do I know which variable to use, x_int or x_float (or one of the others)? Do I add another switch on the box's type tag each time a parameter is referenced? This does not seem sustainable at all. How do actual compilers go about this?

I thought about using phi nodes, but they require every incoming value to have the same type, whereas in my case there are several.

Is there a way I could unify all these variables into one? I would like to refer to the parameter simply by its name rather than having to work out which variable it is in each time it is referenced.

How do actual compilers refer to unboxed variables (in a dynamic setting)?


Solution

  • I "simply" unbox them before proceeding to my function body

    That's not the way to do it.

    If you want to do it that way, you'll end up having to generate one version of the function body for each possible combination of parameter types, giving you t^p different versions of the function body, where t is the number of types in your language and p is the number of parameters (though your type inference may cut that number down somewhat). And that's not even taking into account that your function may contain calls to other functions whose return type is also unknown.
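To make the growth concrete, here is a small counting sketch. It assumes the four type tags from the question's switch (int, float, bool, string); the helper name `specializations` is made up for illustration.

```python
from itertools import product

# The four types behind tags 1..4 in the question's switch (an assumption
# about the language; a real language would likely have more).
TYPES = ["int", "float", "bool", "string"]


def specializations(num_params):
    """Every combination of parameter types a fully unboxed body must cover."""
    return list(product(TYPES, repeat=num_params))


for p in (1, 2, 3):
    # Prints 4, 16 and 64 versions for 1, 2 and 3 parameters.
    print(p, "parameter(s):", len(specializations(p)), "versions")
```

Even before counting unknown return types of callees, a three-parameter function would already need 64 bodies.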

    How it's usually done is that you unbox values only when a primitive operation that needs the unboxed value is applied to them, and then you put the result back into a boxed value.
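As a rough illustration of that idea, the following sketch models the boxed representation in plain Python rather than in LLVM IR. The `Box` class, `box_add`, and the coercion rules (int + float gives float, string + string concatenates) are assumptions made for the example, not something the question specifies; in a real compiler `box_add` would be a runtime helper or an inlined switch in the generated IR.

```python
from dataclasses import dataclass

# Tag values taken from the question's switch statement.
TAG_INT, TAG_FLOAT, TAG_BOOL, TAG_STRING = 1, 2, 3, 4


@dataclass(frozen=True)
class Box:
    tag: int
    payload: object  # stands in for the generic data pointer


def box_add(a: Box, b: Box) -> Box:
    """Primitive '+': the only place that looks at tags and unboxes."""
    if a.tag == TAG_INT and b.tag == TAG_INT:
        return Box(TAG_INT, a.payload + b.payload)
    numeric = (TAG_INT, TAG_FLOAT)
    if a.tag in numeric and b.tag in numeric:
        # Assumed rule: mixed int/float arithmetic produces a float.
        return Box(TAG_FLOAT, float(a.payload) + float(b.payload))
    if a.tag == TAG_STRING and b.tag == TAG_STRING:
        return Box(TAG_STRING, a.payload + b.payload)
    raise TypeError(f"unsupported operand tags for +: {a.tag} and {b.tag}")


def user_function(x: Box, y: Box) -> Box:
    # 'return x + y' becomes one call on boxes; the function body never
    # needs an x_int / x_float variable of its own.
    return box_add(x, y)
```

With this shape, a parameter is always referred to by its single boxed value, and the type dispatch lives inside each primitive operation instead of at the top of every function.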

    while still adhering to LLVM's SSA

    Usually, when generating LLVM code, you'd put every local variable into an alloca and then let LLVM (its mem2reg pass) take care of promoting them to registers. That way you don't have to worry about SSA. That's not going to help you with your current issue, but it will still make your life easier.
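The alloca pattern can be sketched with a tiny hand-written emitter. This is not llvmlite's API: `FunctionEmitter` and its methods are hypothetical, it only handles i32 values, and it produces textual IR in the same typed-pointer syntax as the snippet in the question. The point is only that each variable gets one stack slot, every read is a load, and every write is a store, so the front end never constructs phi nodes itself.

```python
class FunctionEmitter:
    """Emits textual LLVM IR for one i32 function, one alloca per variable."""

    def __init__(self, name, params):
        self.name = name
        self.params = params
        self.entry = []    # alloca instructions, kept at the top of the entry block
        self.body = []     # all other instructions, in emission order
        self.slots = {}    # variable name -> register holding its stack slot
        self.counter = 0
        for p in params:
            self.assign(p, f"%{p}")  # spill each parameter into its own slot

    def fresh(self):
        self.counter += 1
        return f"%t{self.counter}"

    def assign(self, var, value):
        # First assignment creates the slot; every assignment is a store.
        if var not in self.slots:
            self.slots[var] = f"%{var}.addr"
            self.entry.append(f"  {self.slots[var]} = alloca i32")
        self.body.append(f"  store i32 {value}, i32* {self.slots[var]}")

    def read(self, var):
        # Every use of a variable is a fresh load from its slot.
        tmp = self.fresh()
        self.body.append(f"  {tmp} = load i32, i32* {self.slots[var]}")
        return tmp

    def add(self, lhs, rhs):
        tmp = self.fresh()
        self.body.append(f"  {tmp} = add i32 {lhs}, {rhs}")
        return tmp

    def ret(self, value):
        self.body.append(f"  ret i32 {value}")

    def finish(self):
        args = ", ".join(f"i32 %{p}" for p in self.params)
        header = [f"define i32 @{self.name}({args}) {{", "entry:"]
        return "\n".join(header + self.entry + self.body + ["}"])


def build_example():
    f = FunctionEmitter("f", ["x"])
    f.assign("x", f.add(f.read("x"), "1"))          # x = x + 1
    f.assign("x", f.add(f.read("x"), f.read("x")))  # x = x + x
    f.ret(f.read("x"))
    return f.finish()


print(build_example())
```

Although `x` is assigned three times in the source program, the emitted IR defines each register exactly once; running LLVM's mem2reg pass over such output is what turns the slot back into SSA registers.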