I have built my own parser that, for now at least, will only deal with statements similar to,
return a * b * c * (d + e) / f
What I will see while walking along the Abstract Syntax Tree (AST) that it generates looks like this
(in the exact order that I see it, and in a hopefully descriptive pseudo-language):
RETURN_STMT
BO(left: a, right: BO, op: *)
BO(left: BO, right: f, op: /)
BO(left: BO, right: PE, op: *)
BO(left: b, right: c, op: *)
PE(BO)
BO(left: d, right: e, op: +)
where BO is an abbreviation for Binary Operator, and PE stands for Parenthesis Expression.
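For reference, here is a minimal Python sketch of such node shapes and a pre-order walk that reproduces the trace above exactly. The class and function names (Var, BO, PE, Return, walk) are illustrative, not the real parser, and the tree is my reading of the trace.

# Minimal sketch: AST nodes and a pre-order walk that prints the trace above.
# Var, BO, PE, Return and walk() are illustrative names, not the real parser.

class Var:
    def __init__(self, name): self.name = name

class BO:                                  # binary operator
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op

class PE:                                  # parenthesised expression
    def __init__(self, inner): self.inner = inner

class Return:
    def __init__(self, expr): self.expr = expr

def label(n):
    return n.name if isinstance(n, Var) else type(n).__name__

def walk(n):                               # pre-order: node first, then children
    if isinstance(n, Return):
        print("RETURN_STMT"); walk(n.expr)
    elif isinstance(n, BO):
        print(f"BO(left: {label(n.left)}, right: {label(n.right)}, op: {n.op})")
        walk(n.left); walk(n.right)
    elif isinstance(n, PE):
        print(f"PE({label(n.inner)})"); walk(n.inner)

# the grouping the trace above suggests: a * (((b * c) * (d + e)) / f)
a, b, c, d, e, f = (Var(x) for x in "abcdef")
walk(Return(BO(a, BO(BO(BO(b, c, "*"), PE(BO(d, e, "+")), "*"), f, "/"), "*")))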
I have now moved on to building the compiler that will take the AST above and run it on my custom hardware, which I have also generated.
A little info on the hardware: it is described in a VHDL-like language that provides a set of basic components such as multipliers, dividers, adders, etc. It could look like this (again in hopefully descriptive pseudo-code):
reg a, b, c, d, e, f
int i <- 0
buf <- buffer[4] // a simple FIFO queue
... // some initialization
m1 <- MUL(a, b, if: i == 0) // 2 cycles
m2 <- ADD(d, e, if: i == 0){i++} // 1 cycle
m3 <- MUL(c, m2, if: i == 1){i++} // 2 cycles
m4 <- MUL(m1, m3, if: i == 2){i++, buf.put(m4)} // 2 cycles
m5 <- DIV(buf.get(), f, if: i == 3){i++} // 6 cycles
The if specifier decides in which cycle an operation fires. In this way I can have my hardware execute operations in parallel.
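To make the timing concrete, here is a rough cycle-stepped emulation of such guarded operations in Python. It is only a sketch: the names, the data-availability check and the latencies are mine, and the FIFO is left out, so it covers a single evaluation rather than the streaming case.

# Rough cycle-stepped emulation of guarded operations with fixed latencies.
# An operation fires when the counter i matches its guard and its operands
# are available; its result becomes visible after its latency has elapsed.

LAT = {"MUL": 2, "ADD": 1, "DIV": 6}
APPLY = {"MUL": lambda x, y: x * y,
         "ADD": lambda x, y: x + y,
         "DIV": lambda x, y: x / y}

def run(ops, regs, n_cycles):
    """ops: list of (dest, kind, (srcA, srcB), fire_when_i_equals, bump_i)."""
    i, inflight = 0, []                       # inflight: (finish, dest, value)
    for clk in range(n_cycles):
        for item in [x for x in inflight if x[0] <= clk]:
            regs[item[1]] = item[2]           # result retires this cycle
            inflight.remove(item)
        bump = 0
        for dest, kind, (sa, sb), when, inc in ops:
            if i == when and sa in regs and sb in regs:
                inflight.append((clk + LAT[kind], dest,
                                 APPLY[kind](regs[sa], regs[sb])))
                bump += inc                   # i++ takes effect next cycle
        i += bump
    return regs

regs = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7}
ops = [("m1", "MUL", ("a", "b"), 0, 0),
       ("m2", "ADD", ("d", "e"), 0, 1),
       ("m3", "MUL", ("c", "m2"), 1, 1),
       ("m4", "MUL", ("m1", "m3"), 2, 1),
       ("m5", "DIV", ("m4", "f"), 3, 1)]
print(run(ops, regs, 20)["m5"])               # 2*3*4*(5+6)/7 = 37.714...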
If we imagine an infinite stream of a, b, c, d, e, f values, then we need to contend with the delay, in cycles, of each operator. If a MUL operation requires 2 cycles and a DIV requires 6, then I will have performed 3 multiplications before I get the chance to finish one division. Hence I can store the intermediate values in a buffer that ensures (in the streaming scenario discussed here) that nothing gets lost.
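One way to estimate that buffer, assuming a fixed static schedule per evaluation: for every intermediate value, count the cycles between the cycle it becomes ready and the cycle its last consumer fires, and take the peak number of values waiting at once. The schedule below is only a placeholder; for the streaming case you would run the same count over the overlapped iterations.

# Back-of-the-envelope buffer sizing from a static schedule (placeholder
# numbers): a value occupies a slot from the cycle it is ready until the
# cycle its last consumer starts.

LAT = {"MUL": 2, "ADD": 1, "DIV": 6}

# name -> (kind, start_cycle, consumers); illustrative schedule only
SCHED = {"m1": ("MUL", 0, ["m4"]),
         "m2": ("ADD", 0, ["m3"]),
         "m3": ("MUL", 1, ["m4"]),
         "m4": ("MUL", 3, ["m5"]),
         "m5": ("DIV", 5, [])}

start = {n: s for n, (_, s, _) in SCHED.items()}
ready = {n: s + LAT[k] for n, (k, s, _) in SCHED.items()}

def peak_live_values():
    """Maximum number of produced-but-not-yet-consumed values in any cycle."""
    peak = 0
    for clk in range(max(ready.values()) + 1):
        alive = sum(1 for n, (_, _, users) in SCHED.items()
                    if users and ready[n] <= clk < max(start[u] for u in users))
        peak = max(peak, alive)
    return peak

print(peak_live_values())   # buffer slots needed for a single evaluation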
In practice this is what my compiler will have to generate if I want to tie my custom C-like language to my custom (for now emulated) hardware.
A number of issues arise:
I can reuse hardware components, but what is the minimum number of components (MUL, DIV, etc.) that I need?
What would the minimum size of the required buffers be?
Given the output of my (recursive) walk algorithm originally shown above, how can I create an optimized series of basic operations, while at the same time respecting the hardware delay requirements?
You're essentially solving a constraint programming problem. You have constraints like "operation a depends on the result of b", "the result of operation a is available in 3 clock cycles", "a and c share the same resource" (e.g., a multiplier), etc. One quick and dirty way to get such solutions for free is to simply reuse an existing CLP(FD) engine, such as the one in SWI-Prolog. In my HLS compiler [1], for example, I just generate SWI-Prolog programs and run them to get the resources scheduled. If you want to implement your own constraint solver, be prepared to go down a very, very deep rabbit hole.
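As a rough stand-in for such a constraint program, here is a naive exhaustive search in Python over start cycles that encodes the same three kinds of constraints (dependencies, latencies, shared units). It is only an illustration with made-up data; it enumerates exponentially many candidates, which is exactly why a real CLP(FD) solver is the better tool.

# Toy scheduling by brute force: pick a start cycle for every operation so
# that (1) operands are ready before use and (2) no more operations of a
# kind are in flight than we own units of it. Data is illustrative only.

from itertools import product

LAT = {"MUL": 2, "ADD": 1, "DIV": 6}
OPS = {"m1": ("MUL", []),            # op -> (kind, operations it depends on)
       "m2": ("ADD", []),
       "m3": ("MUL", ["m2"]),
       "m4": ("MUL", ["m1", "m3"]),
       "m5": ("DIV", ["m4"])}
UNITS = {"MUL": 1, "ADD": 1, "DIV": 1}   # how many of each unit we own

def valid(assign):
    for op, (kind, deps) in OPS.items():
        # dependency + latency: every operand must be ready before op starts
        if any(assign[d] + LAT[OPS[d][0]] > assign[op] for d in deps):
            return False
    horizon = max(assign[o] + LAT[OPS[o][0]] for o in OPS)
    for clk in range(horizon):           # resource constraint per cycle
        for kind, count in UNITS.items():
            busy = sum(1 for o, t in assign.items()
                       if OPS[o][0] == kind and t <= clk < t + LAT[kind])
            if busy > count:
                return False
    return True

def best_schedule(max_cycle=12):
    best = None
    for starts in product(range(max_cycle), repeat=len(OPS)):
        assign = dict(zip(OPS, starts))
        if valid(assign):
            finish = max(assign[o] + LAT[OPS[o][0]] for o in OPS)
            if best is None or finish < best[0]:
                best = (finish, assign)
    return best

print(best_schedule())
# -> (12, {'m1': 0, 'm2': 0, 'm3': 2, 'm4': 4, 'm5': 6}) with one MUL unit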
Now comes the hard part - resource allocation. On one hand you have your expression (an inner loop body, I suppose?) that must be executed with the maximum possible throughput, ideally one result retired per clock cycle. On the other hand you have a limited area, so you may want to share expensive resources, such as dividers, square-root units, etc. One way to do it is to have a feedback loop in your compiler. First you split your expression into parts around a resource you picked for sharing, then you complete the compilation pipeline and measure the resulting throughput and area. If they're within your constraints, carry on; if not, backtrack and select another set of resources for sharing. Start by trying the most expensive ones first.
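A sketch of that feedback loop, with the real compile-and-measure step stubbed out by a made-up cost model (the areas, the expression's unit mix and the II penalty are all placeholders):

# Feedback loop sketch: try sharing the most expensive unit kinds first,
# compile, measure area and initiation interval (II), accept or backtrack.
# compile_and_measure() is a stub with a made-up cost model.

AREA = {"DIV": 120, "MUL": 20, "ADD": 2}          # made-up relative areas
USED = ["DIV", "MUL", "MUL", "MUL", "ADD"]        # units the expression needs

def compile_and_measure(shared):
    """Stub for the real pipeline: returns (area, initiation interval)."""
    area = sum(AREA[k] for k in USED if k not in shared)   # one unit per use
    area += sum(AREA[k] for k in shared)                   # one shared unit
    ii = 1 + 2 * len(shared)                               # toy II penalty
    return area, ii

def allocate(max_area, max_ii):
    # share nothing first; if that busts the area budget, start sharing the
    # most expensive kinds and re-measure, backing off if the II suffers
    by_cost = sorted(AREA, key=AREA.get, reverse=True)
    for n in range(len(by_cost) + 1):
        shared = set(by_cost[:n])
        area, ii = compile_and_measure(shared)
        if area <= max_area and ii <= max_ii:
            return shared, area, ii
    return None                                            # constraints unmet

print(allocate(max_area=150, max_ii=6))   # shares DIV and MUL in this toy model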
Another thing - don't do buffering. Don't represent your expression as an FSM; it's not efficient. Pipeline it instead, and store the results that come too early (ideally you won't have too many, as your constraint solver should schedule them all efficiently) in pipeline registers. This way your hardware-compiled expression will be suitable as an inner loop body, taking new input values every clock cycle and retiring results every clock cycle (or more slowly if you share resources and re-issue tasks back into the pipeline).
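In code, the pipeline-register view boils down to a per-edge count instead of a FIFO: a value produced in cycle p and consumed in cycle c simply rides through c - p registers. A minimal sketch with placeholder latencies and a placeholder schedule:

# Pipeline balancing sketch: how many registers each producer->consumer edge
# needs so that values arriving early are simply latched along, not buffered.

LAT = {"MUL": 2, "ADD": 1, "DIV": 6}
KIND = {"m1": "MUL", "m2": "ADD", "m3": "MUL", "m4": "MUL", "m5": "DIV"}
START = {"m1": 0, "m2": 0, "m3": 1, "m4": 3, "m5": 5}    # illustrative schedule
DEPS = {"m3": ["m2"], "m4": ["m1", "m3"], "m5": ["m4"]}

def balance_registers():
    """Registers on each edge = consumer start cycle - producer ready cycle."""
    regs = {}
    for consumer, producers in DEPS.items():
        for p in producers:
            ready = START[p] + LAT[KIND[p]]
            regs[(p, consumer)] = START[consumer] - ready
    return regs

print(balance_registers())
# ('m1', 'm4') gets 1 register: m1 is ready one cycle before m4 needs it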
[1] https://github.com/combinatorylogic/soc/tree/a1d282d793548030cba940496bed90ff3a29c0ba