Tags: verilog, fpga, quartus, intel-fpga

How can I prevent DSP blocks from being synthesized away when they are not connected to a top-level output?


I am using an Intel Stratix 10 FPGA and Quartus Prime Pro 21.4 to develop a power test project.

I cannot figure out how to keep Quartus from optimizing away my DSP blocks.

I want to use all 3000 DSP blocks in our FPGA so that I can measure the maximum current draw of the DSP blocks. Of course, we could use the power estimator, but we require a real-world physical test.

I actually don't need the output from the DSP block. I only care that they are running and using FPGA resources.

I have instantiated the Intel fixed DSP core IP as a multiplier:

https://www.intel.com/content/www/us/en/docs/programmable/683450/current/native-fixed-point-dsp-intel-stratix-51840.html

I am using a generate for loop to generate 3000 of these DSP IP blocks. My problem is that the DSP blocks are synthesized away unless I connect the output from each of the DSP blocks directly to a top level output. I only have ~1000 outputs available so this is not possible.

I thought I could just connect each output with a register array to catch the output. But it seems that if I don't actually use the output values or connect it outright to a top level output pin, then Quartus thinks we don't need it and optimizes it away.
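A capture register like this is normally removed because nothing downstream reads it. Quartus provides a `noprune` synthesis attribute intended for registers that have no fan-out, and the sketch below shows roughly how it could be applied. This was not tried in this post, so whether it keeps all 3000 DSP blocks is unknown. The module name `result_capture`, its ports, and the flat `results_i` bus are hypothetical.

```verilog
// Sketch only: capture registers that Quartus is asked not to prune.
// `noprune` is a Quartus synthesis attribute; the module and port
// names are hypothetical and not taken from the original design.
module result_capture #(
    parameter N = 4,   // number of DSP results to capture
    parameter W = 37   // width of each result
) (
    input wire           clk_i,
    input wire [N*W-1:0] results_i  // all results packed into one flat bus
);

    // These registers feed no output pin; noprune asks synthesis to keep them.
    (* noprune *) reg [N*W-1:0] capture_r;

    always @(posedge clk_i) begin
        capture_r <= results_i;
    end

endmodule
```

To use it, each DSP output would be wired into its own slice of the bus, for example `results_i[i*W +: W]` for instance `i`.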

The 2nd solution I tried is to use combinational logic:

top_output = DSP_out[0] || DSP_out[1] || DSP_out[2] || DSP_out[3]

This solution generates only 4 DSP blocks even though the generate loop runs 3000 times. I tried building the expression in a loop, but that did not work. Is there a way to trick the tool into synthesizing all the DSP blocks even if I don't connect each block to a top-level output?

I seem to be able to access the output of the DSP block with no issues. For instance, I was able to turn on or off an LED based on the numbers I fed into a single multiplier.

Here is the full code:

`timescale 1ps/1ps
`default_nettype none

module power_test_design (
    input  wire       clk_i,
    output reg  [0:0] outputa,
    output reg  [0:0] outputb
);

localparam           NUM_DSP_BLOCKS     = 3000;

genvar               i;
wire                 reset;
integer                        k;

//input stimulus signals for the DSP
reg [17:0]           ay_r;
reg [17:0]           by_r;
reg [17:0]           ax_r;
reg [17:0]           bx_r;
//create wires and registers to hold outputs from multiplier
(* keep = "true" *) wire [36:0]          resulta [NUM_DSP_BLOCKS-1:0];
(* keep = "true" *) reg [36:0]           resulta_r [NUM_DSP_BLOCKS-1:0];
(* keep = "true" *) wire [36:0]          resultb [NUM_DSP_BLOCKS-1:0];
(* keep = "true" *) reg [36:0]           resultb_r [NUM_DSP_BLOCKS-1:0];
reg [2:0]            ena_r;


// Stratix10 system reset
reset_release U_RESET (
    .ninit_done (reset )  //  output,  width = 1, ninit_done.ninit_done
);

// DSP stimulus
always @(posedge clk_i) begin : DSP_SET_FF          
    if (reset) 
    begin
        ay_r      <= {18{1'b0}};
        by_r      <= {18{1'b0}};
        ax_r      <= {18{1'b0}};
        bx_r      <= {18{1'b0}};
        ena_r     <= {3{1'b0}};
    end else
    begin
    
        ena_r <= 3'b001;
        ay_r <= $unsigned(ay_r) + 1;
        by_r <= $unsigned(by_r) + 1;
        ax_r <= $unsigned(ax_r) + 2;
        bx_r <= $unsigned(bx_r) + 3;
    
    end 
end
    
generate
    for (i=0; i<NUM_DSP_BLOCKS; i=i+1) begin : GEN_DSPS 

        dsp_fixed U_DSP  ( 
            .ay      (ay_r),      //   input,  width = 18,      ay.ay
            .by      (by_r),      //   input,  width = 18,      by.by
            .ax      (ax_r),      //   input,  width = 18,      ax.ax
            .bx      (bx_r),      //   input,  width = 18,      bx.bx
            .resulta (resulta[i]), //  output,  width = 37, resulta.resulta
            .resultb (resultb[i]), //  output,  width = 37, resultb.resultb
            .clk0    (clk_i),    //   input,   width = 1,    clk0.clk
            .clk1    (),    //   input,   width = 1,    clk1.clk
            .clk2    (),    //   input,   width = 1,    clk2.clk
            .ena     (ena_r)     //   input,   width = 3,     ena.ena
        );
        
        // bring each result into a register for the output logic
        // (a continuous assign cannot drive a reg in Verilog)
        always @(posedge clk_i) begin
            resulta_r[i] <= resulta[i];
            resultb_r[i] <= resultb[i];
        end

    end 
endgenerate

//output logic -this code generates 6 DSP blocks....I need to generate all 3000
always @(posedge clk_i) begin : outputLogic
    for (k=1; k<50; k=k+1) 
    begin
        outputa = resulta_r[k] || resulta_r[k+1] || resulta_r[k+2];
        outputb = resultb_r[k+3] || resultb_r[k+4] || resultb_r[k+5];
    end
end

endmodule
`resetall

So far, I have tried several ways to assign this output. First:

always @(resulta_r[0], resulta_r[1], resulta_r[2], resulta_r[3]) begin
    if (resulta_r[0] == 4) 
    begin
        outputa = 1;
    end 
    else if (resulta_r[1] == 6) 
    begin
        outputa = 1;
    end
    else if (resulta_r[2] == 6) 
    begin
        outputa = 1;
    end
    else if (resulta_r[3] == 6) 
    begin
        outputa = 1;
    end
    else 
    begin
        outputa = 0;
    end
 end

With this code, one DSP block is generated for each if statement. So the next idea was:

always @(posedge clk_i) begin : outputLogic
    for (k=1; k<50; k=k+1) 
    begin
        outputa = resulta_r[k] || resulta_r[k+1] || resulta_r[k+2];
        outputb = resultb_r[k+3] || resultb_r[k+4] || resultb_r[k+5];
    end
end

This works in a similar way: a DSP block is generated for each result referenced in the statement. But synthesis produces only 6 DSP blocks in total. The count depends only on how many DSP block outputs appear in the statement, not on how many instances the generate loop creates.
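In the loop above, every iteration assigns the same two outputs, so only the last iteration (k = 49) takes effect. outputa ends up depending on resulta_r[49] to resulta_r[51], and outputb on resultb_r[52] to resultb_r[54], which matches the six blocks reported. A minimal sketch of a reduction that instead folds every result into one output bit is shown here. The module name `result_fold`, its ports, and the flat `results_i` bus are hypothetical, and this approach was not tested in this post.

```verilog
// Sketch only: fold N results of W bits each into a single output bit.
// All names are hypothetical; the DSP outputs would need to be packed
// into results_i by the surrounding design.
module result_fold #(
    parameter N = 4,   // number of results to fold
    parameter W = 37   // width of each result
) (
    input  wire           clk_i,
    input  wire [N*W-1:0] results_i,
    output reg            fold_o
);

    integer k;
    reg     acc;

    // Accumulate across iterations instead of overwriting the output,
    // so every bit of every result can influence fold_o.
    always @* begin
        acc = 1'b0;
        for (k = 0; k < N; k = k + 1)
            acc = acc ^ (^results_i[k*W +: W]);
    end

    // Register the folded bit before it leaves the module.
    always @(posedge clk_i) begin
        fold_o <= acc;
    end

endmodule
```

XOR is used so that a change in any single bit changes the folded value. One caveat: the original design drives every instance with identical inputs. If the tool treats the instances as duplicates, any reduction over them could collapse, so giving each instance slightly different stimulus may be necessary. That is an assumption and was not verified here.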


Solution

  • I solved this issue using virtual pins in Quartus. Each output pin can be assigned as a virtual pin instead of a physical pin. With this setup the design can have as many output pins as it needs without connecting them to anything on the device.

    Quartus Virtual Pins

    The design still doesn't scale up to 3000 for some reason, but I have reached out to Intel for that. The original issue of optimizing away the DSP blocks unless they are connected to an output is solved.

    The other solution that solved this issue was to chain several of these DSP blocks together. It also doesn't scale, but solves the original question asked here as well.
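The answer does not say how the blocks were chained. One possible reading is that each multiplier takes the previous multiplier's result as an operand, so that only the last stage needs to reach an output. A minimal sketch under that assumption follows. The module name `mult_chain` and its ports are hypothetical, and a behavioural `*` stands in for the `dsp_fixed` IP instance.

```verilog
// Sketch only: a chain of N registered multipliers in which removing
// any stage would change the final output. Names are hypothetical and
// the `*` operator is a stand-in for the dsp_fixed IP.
module mult_chain #(
    parameter N = 4    // number of chained multiplier stages
) (
    input  wire        clk_i,
    input  wire [17:0] a_i,     // operand fed into the first stage
    input  wire [17:0] b_i,     // operand shared by all stages
    output wire [17:0] last_o   // result of the final stage
);

    // stage[0] is the chain input; stage[i+1] is the output of stage i.
    wire [17:0] stage [0:N];
    assign stage[0] = a_i;

    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : GEN_CHAIN
            reg [35:0] prod_r;

            always @(posedge clk_i) begin
                prod_r <= stage[i] * b_i;
            end

            // Mix both halves of the product so every product bit is used.
            assign stage[i+1] = prod_r[35:18] ^ prod_r[17:0];
        end
    endgenerate

    assign last_o = stage[N];

endmodule
```

Because each stage depends on the one before it, a single top-level output is enough to keep the whole chain from being removed. Latency grows by one clock cycle per stage.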