
Partitioning combinational and sequential logic for a reliable, low-latency butterfly module in a 4-stage FFT design


I am building a 4-stage FFT using a simple butterfly module to do the complex multiply and accumulate. I have built the simple butterfly module (see below). I need some input on achieving minimal latency while still having a stable design.

module simple_butterfly #(parameter WIDTH =16)(
    input clk, rst,
    input signed [WIDTH-1:0] a_real, a_imag,
    input signed [WIDTH-1:0] b_real, b_imag,
    input signed [WIDTH-1:0] w_real, w_imag, // must be signed, otherwise the multiplies below are evaluated as unsigned
    output reg signed [WIDTH-1:0] p_real, p_imag,
    output reg signed [WIDTH-1:0] q_real, q_imag
    );

   wire signed [2*WIDTH-1:0] bw_real_real, bw_imag_imag, bw_real_imag, bw_imag_real; // complex product outputs
   wire signed [WIDTH-1:0] bw_real, bw_imag;
   wire signed [WIDTH-1:0] p_real_w, p_imag_w, q_real_w, q_imag_w;


    assign bw_real_real = b_real * w_real; // ac
    assign bw_imag_imag = b_imag * w_imag; // bd
    assign bw_real_imag = b_real * w_imag; // ad
    assign bw_imag_real = b_imag * w_real; // bc 
 
    // The [27:12] slices drop 12 fractional bits and are hard-coded for WIDTH = 16
    assign bw_real = bw_real_real[27:12] - bw_imag_imag[27:12];  // ac - bd
    assign bw_imag = bw_real_imag[27:12] + bw_imag_real[27:12];  // ad + bc
   
    assign p_real_w = a_real + bw_real;
    assign p_imag_w = a_imag + bw_imag;

    assign q_real_w = a_real - bw_real;
    assign q_imag_w = a_imag - bw_imag;

    always@(posedge clk)
        if(rst) begin
            p_real <= {WIDTH{1'b0}};
            p_imag <= {WIDTH{1'b0}};
            q_real <= {WIDTH{1'b0}};
            q_imag <= {WIDTH{1'b0}};
        end else begin
            p_real <= p_real_w;
            p_imag <= p_imag_w;
            q_real <= q_real_w;
            q_imag <= q_imag_w;
        end
endmodule

My question is: can I keep all the multiplication and bit-slice extraction as combinational logic and register only the outputs of each stage in order to pipeline the design? Can the module above be considered a safe, reliable design? If not, can anyone with experience building DSP modules suggest improvements?


Solution

  • In an FPGA, when designing modules that multiply and add, you want those operations to synthesize into DSP blocks. The DSP blocks have a built-in multiply-accumulate. The code shown has 4 multiplies and 6 additions/subtractions; if it synthesizes into 4 DSP blocks, that is the optimum result. The pipeline registers end up built into the DSP blocks, so they are more or less free.

    This is an optimal use of resources; however, the part has a finite number of DSP blocks, which limits the size of the transform.

    Another limitation is the bit width: the multiplier in the Xilinx DSP48E1 block (7 series) is at most 25x18 bits. Check other vendors and device families for different sizes.

    The goal is to have the synthesis tool infer the DSP blocks. Generally, best practice is NOT to instantiate a DSP block in Verilog or VHDL RTL code. There are exceptions, but start with an inference workflow rather than an instantiation workflow.

    To that end, run the design through the synthesis flow and look at the utilization report. Ideally, synthesis infers 4 DSP blocks and a small number of LUTs for the butterfly shown. If it does something else (for example, uses a lot of LUTs and registers), re-code the module so that it maps onto the DSP blocks.

    Each vendor has coding style recommendations with the goal of mapping to DSP blocks.
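
    As a rough illustration of that kind of coding style, here is a minimal sketch of a signed multiply with registers on both sides of the multiplier. The module name `mult_pipe` and its port names are made up for this example. The registers are written without a reset on purpose: the DSP48E1 registers only support synchronous reset, and omitting the reset usually makes it easier for the tool to absorb them. Whether they actually end up inside the DSP block depends on the tool, its settings, and the device, so check the utilization report rather than assuming it.

    ```verilog
    // Hypothetical helper, not part of the original design:
    // signed multiply with input, multiplier, and output registers.
    module mult_pipe #(parameter WIDTH = 16)(
        input                           clk,
        input  signed [WIDTH-1:0]       x, y,
        output reg signed [2*WIDTH-1:0] p
    );
        reg signed [WIDTH-1:0]   x_r, y_r; // input registers
        reg signed [2*WIDTH-1:0] m_r;      // register directly after the multiplier

        always @(posedge clk) begin
            x_r <= x;
            y_r <= y;
            m_r <= x_r * y_r; // both operands signed, so the product is signed
            p   <= m_r;       // output register
        end
    endmodule
    ```

    The result appears three clocks after the inputs, so this trades latency for an easier timing path. If latency matters more than clock rate, drop one or two of the register stages and re-check timing.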

    This Xilinx user guide (UG479, covering the DSP48E1 blocks used in the 7 series) is somewhat helpful. I say somewhat because it focuses on FIR filters, but many of the same ideas apply to an FFT. https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf

    The clk-to-out delay (timing performance) of a DSP block is about that of a register.
    Generally, if a design meets the clock timing constraint after place and route, you are good to go to the lab (WRT timing :) ). Some design organizations recommend or require that all module I/O be registered. This may be overkill. I have found that 7-series designs keep meeting timing over the years, as the parts get fuller and fuller during development, if the number of levels of logic is about 10 or less. This is a rule of thumb, not a strict brick-wall rule. With 12 levels of logic it's starting to get marginal; at 15 it might be the critical path in the design.

    To determine the levels of logic, open the timing analyzer, right-click on the worst timing path on the clock, and choose 'analyze' or 'show schematic' for the path. The analysis shows the number of levels of logic. Vivado can show you a schematic view of the timing path, so you can see whether it spans several modules. That helps answer the question 'how much do I need to pipeline this module'.

    It also depends on the clock rate, the device, and the speed grade. If you are doing anything above 200 MHz in a device that is more than 50% utilized, then additional pipelining and keeping the combinational logic down to 2-3 levels is needed. This is also from my experience and is a rough guideline, not a rule or requirement. There may be FPGAs that have generous timing margins for the fabric at 400 MHz while relatively full; I have not seen one, though.
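
    If the single-register version from the question does not meet timing, one possible way to split it is sketched below. This is only an illustration under some assumptions: the module name `pipelined_butterfly` and the `FRAC` parameter are invented here, with `FRAC = 12` chosen so that the slices match the `[27:12]` bit-selects in the question when `WIDTH = 16`. The arithmetic is meant to be the same as in the question; only the register stages are added, and `a` is delayed so that it lines up with the product.

    ```verilog
    // Hypothetical 3-stage version of the butterfly from the question.
    module pipelined_butterfly #(
        parameter WIDTH = 16,
        parameter FRAC  = 12  // assumed number of fractional bits dropped after the multiply
    )(
        input                         clk,
        input  signed [WIDTH-1:0]     a_real, a_imag,
        input  signed [WIDTH-1:0]     b_real, b_imag,
        input  signed [WIDTH-1:0]     w_real, w_imag,
        output reg signed [WIDTH-1:0] p_real, p_imag,
        output reg signed [WIDTH-1:0] q_real, q_imag
    );
        // Stage 1: the four real products, plus a delayed copy of a
        reg signed [2*WIDTH-1:0] rr, ii, ri, ir;
        reg signed [WIDTH-1:0]   a_real_d1, a_imag_d1;
        // Stage 2: complex product b*w scaled back to WIDTH bits, plus a delayed again
        reg signed [WIDTH-1:0]   bw_real, bw_imag;
        reg signed [WIDTH-1:0]   a_real_d2, a_imag_d2;

        always @(posedge clk) begin
            // Stage 1: multiplies only
            rr <= b_real * w_real;
            ii <= b_imag * w_imag;
            ri <= b_real * w_imag;
            ir <= b_imag * w_real;
            a_real_d1 <= a_real;
            a_imag_d1 <= a_imag;

            // Stage 2: slice, then combine (same order as in the question)
            bw_real <= rr[FRAC+WIDTH-1:FRAC] - ii[FRAC+WIDTH-1:FRAC];
            bw_imag <= ri[FRAC+WIDTH-1:FRAC] + ir[FRAC+WIDTH-1:FRAC];
            a_real_d2 <= a_real_d1;
            a_imag_d2 <= a_imag_d1;

            // Stage 3: butterfly add/subtract
            p_real <= a_real_d2 + bw_real;
            p_imag <= a_imag_d2 + bw_imag;
            q_real <= a_real_d2 - bw_real;
            q_imag <= a_imag_d2 - bw_imag;
        end
    endmodule
    ```

    The outputs appear three clocks after the inputs instead of one, so the surrounding FFT control logic has to account for that. Start from the single-register version, look at the levels of logic on the worst path, and add only as many of these stages as the timing report calls for.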