What's the general procedure for compiling an HDL Program for an FPGA?

I have a question regarding the compilation of HDL programs within the context of FPGA design.

1) Why does the compilation process take so long? Is it really the compilation process that takes a long time, or is it the writing of individual logic gates that takes a long time?

2) Why are the compiled files generally referred to as 'bitfiles'? What is the format of these bitfiles? I'm picturing a two-dimensional matrix of gates that will either be opened or closed depending on the bits in the bitfile.

Thanks for any help!

Solution

1) Why does the compilation process take so long? Is it really the compilation process that takes a long time, or is it the writing of individual logic gates that takes a long time?

To begin, if you want to see all the toil and hard work your FPGA tools do, just turn on verbose mode/detailed reports, and skim/read them.

I'm going to answer with a Xilinx viewpoint, since that's what I know. Although the processes may have different names/groupings/ordering, the idea is the same across vendors.

The HDL->bitstream process differs slightly from how one would compile, say, Java. It's not just a conversion of each line to some bytecode, but an involved process in which the entire design is converted to a hardware implementation. You're not converting a program to hardware, but a description of hardware to hardware. You only call a pile of Verilog or VHDL a program when it's running a testbench in a simulator.

Remember that timing constraints are a thing, and thus optimization for timing/depth of logic is a top priority.

In practice, synthesis encompasses conversion of behavioral Verilog/VHDL to an RTL representation, including FSM synthesis, extraction of boolean functions, optimization, and recognition of decoders/encoders, muxes, ROMs, etc. Additionally, the synth step may duplicate registers whose values are needed in multiple areas on the FPGA, so that the routing delays to those areas are minimized. Some synth tools, such as XST, will provide a rough estimate of timing and device utilization at this stage.

Additionally, remember that synthesis involves some level of inference. HDL code that matches certain motifs/patterns will be converted to hardware macros or instantiations of certain primitives. If I write code that accesses a large reg[7:0] foo [2047:0] synchronously based on an address (and possibly a write enable), then the synth tool will want to detect that and put a block RAM in place. It will also try to optimize away unneeded logic and may do fairly in-depth logical analysis in that optimization.

Translation/mapping involves tons of hardware logic intricacies as well--at this stage the software will try to stuff your logic functions into lookup tables in optimal ways, fit those into slices alongside the flipflops that they may drive, and optimize again. At this step, redundant or superfluous components left over from optimization may be removed.
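To make the "stuff your logic functions into lookup tables" idea concrete, here is a minimal Python sketch of what a LUT's contents are: a truth table, indexed by the inputs. The helper names lut_init and lut_eval are made up for illustration and are not part of any vendor tool; real mappers also have to decide how to split large functions across several LUTs, which this ignores.

```python
def lut_init(func, n_inputs):
    """Return func's truth table as an integer.

    Bit i holds the output for the input combination whose binary
    encoding is i (input 0 is the least significant bit).
    """
    init = 0
    for index in range(2 ** n_inputs):
        bits = [(index >> k) & 1 for k in range(n_inputs)]
        if func(*bits):
            init |= 1 << index
    return init


def lut_eval(init, inputs):
    """Evaluate the LUT: the inputs form an address into the table."""
    address = sum(bit << k for k, bit in enumerate(inputs))
    return (init >> address) & 1


# Hypothetical example function: (a AND b) OR (c XOR d).
# Any function of four inputs fits in a single 4-input LUT.
def f(a, b, c, d):
    return (a & b) | (c ^ d)


init = lut_init(f, 4)
print(f"LUT contents: {init:016b}")
print("f(1, 1, 0, 0) via LUT:", lut_eval(init, [1, 1, 0, 0]))
```

The point is that the LUT does not care how complicated the expression looks; only the number of inputs matters, which is why the mapper's job is mostly about carving the design into chunks with few enough inputs.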

Placing and routing is one of the most intensive steps in many designs. Now that mapping has produced a sea of lookup tables and registers connected by a slew of wires, they all need to be placed and connected using limited interconnect resources. The limitations include the number of lines in a row/column, what bits can connect to other bits at certain distances, as well as clock distribution. Remember again that timing constraints exist. PAR may be able to place a design quickly, but spend a very long time trying to tweak the placement to fit those constraints. Placing and routing isn't an easy-to-solve problem, and involves tons of brute-force, random placement based on cost tables, and other unique approaches. Needless to say, this can take a long time.
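As a rough illustration of why placement eats so much time, here is a toy placer in Python. It uses simulated annealing, a classic textbook technique for placement, and I'm not claiming it is what any particular vendor's PAR does. Everything here is an assumption made for the sketch: the grid, the two-pin nets, the cooling schedule, and the cost function (total Manhattan wire length only, with no timing or routing-congestion model).

```python
import math
import random


def wirelength(placement, nets):
    """Total Manhattan length over two-pin nets; placement maps cell -> (x, y)."""
    total = 0
    for a, b in nets:
        (ax, ay), (bx, by) = placement[a], placement[b]
        total += abs(ax - bx) + abs(ay - by)
    return total


def anneal_placement(nets, width, height, steps=20000, seed=0):
    """Place width*height cells on a grid by randomly swapping pairs of cells."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(sites)
    placement = dict(enumerate(sites))      # random initial placement
    n_cells = len(sites)
    start_cost = cost = wirelength(placement, nets)
    best, best_cost = dict(placement), cost
    temperature = 5.0
    for _ in range(steps):
        a, b = rng.sample(range(n_cells), 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        delta = new_cost - cost
        # Always accept improvements; accept some regressions while "hot".
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo swap
        temperature = max(0.01, temperature * 0.9995)
    return best, best_cost, start_cost


# Hypothetical netlist: 16 cells on a 4x4 grid, wired as a chain plus a few extras.
nets = [(i, i + 1) for i in range(15)] + [(0, 15), (3, 12), (5, 10)]
best, best_cost, start_cost = anneal_placement(nets, 4, 4)
print("wire length: start", start_cost, "-> best found", best_cost)
```

Even this toy needs thousands of trial swaps for 16 cells and one simple cost. A real device has a vastly larger number of cells, and each trial has to account for timing and routability as well.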

Imagine trying to organize the circuit pictured below with no more than two crossings per wire and no more than 25 cm of wire in the timing-critical path, just on the scale of an FPGA:

[image: circuit; source linked in the original post]

2) Why are the compiled files generally referred to as 'bitfiles'? What is the format of these bitfiles? I'm picturing a two-dimensional matrix of gates that will either be opened or closed depending on the bits in the bitfile.

You're pretty close, though not exactly. The bitstream configures the following parameters:

Routing. What signals go where, over what wires. This typically sets multiplexers and cross-connections. Pretty spot-on to what you mention, though they're not so much gates as connections (although fully buffered to avoid capacitance effects).
Slices. Each slice contains a few lookup tables used as function generators, as well as more multiplexers and such. The bitstream also specifies the contents of the lookup tables, whether they should be bypassed or linked, whether the output should go straight to routing or to a flip-flop, whether that flip-flop should have an async reset, whether it should be posedge or negedge, and so on. For distributed memory slices, it also holds the configuration related to writing/shifting the LUT under external control.
Other function blocks. How DSP/multiplier tiles should be configured, parameters/connectivity for clock-handling circuitry such as DCMs/PLLs/MMCMs/etc., widths/fallthrough/initial contents of block RAMs, the parameters for transceivers, et cetera.
Metadata. Possibly a setting that prevents reading back the bitstream over the configuration port/JTAG, if it should not be copied.
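
To show the flavor of "bits configure things" without pretending to know a real format, here is a small Python sketch that decodes an invented 20-bit configuration word for a single logic cell. The layout (16 bits of LUT contents, one flip-flop enable bit, one clock-edge bit, two routing-select bits) and the names CellConfig and decode_cell are entirely hypothetical. Actual bitstream formats are vendor-specific and far more involved.

```python
from dataclasses import dataclass


@dataclass
class CellConfig:
    lut_init: int        # truth table of a 4-input LUT
    use_flipflop: bool   # register the LUT output, or send it straight to routing
    negedge_clock: bool  # flip-flop triggers on the falling clock edge
    route_select: int    # which of four (hypothetical) output wires to drive


def decode_cell(bits):
    """Decode a 20-character string of '0'/'1', most significant bit first.

    Invented layout: [19:18] route_select, [17] negedge_clock,
    [16] use_flipflop, [15:0] lut_init.
    """
    if len(bits) != 20 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 20 binary digits")
    value = int(bits, 2)
    return CellConfig(
        lut_init=value & 0xFFFF,
        use_flipflop=bool((value >> 16) & 1),
        negedge_clock=bool((value >> 17) & 1),
        route_select=(value >> 18) & 0b11,
    )


# route_select = 2, posedge clock, flip-flop enabled, arbitrary LUT contents.
cell = decode_cell("10" "0" "1" "1000111111111000")
print(cell)
```

A full bitstream is, loosely speaking, a very large number of fields like these for every slice, routing switch, and special-purpose block on the chip, which is why it is just called a file of bits.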