I am using an Intel Stratix 10 FPGA and Quartus Prime Pro 21.4 to develop a power test project.
I cannot figure out how to keep Quartus from optimizing away my DSP blocks.
I want to use all 3000 DSP blocks in our FPGA so that I can see the max current draw of the DSP block. Of course, we can use the power estimator, but we require a real-world physical test.
I actually don't need the output from the DSP block. I only care that they are running and using FPGA resources.
I have instantiated the Intel fixed DSP core IP as a multiplier.
I am using a generate for loop to generate 3000 of these DSP IP blocks. My problem is that the DSP blocks are synthesized away unless I connect the output from each of the DSP blocks directly to a top level output. I only have ~1000 outputs available, so this is not possible.
I thought I could just connect each output to a register array to catch the output. But it seems that if I don't actually use the output values or connect them outright to a top level output pin, then Quartus decides they are not needed and optimizes them away.
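One thing that may be worth trying with the register-capture idea is Quartus's noprune synthesis attribute, which asks synthesis to keep registers that have no fan-out. The sketch below is untested on this design: the module name capture_regs and its ports are hypothetical, and it is not certain that preserving the capture registers also preserves every DSP block feeding them.

```verilog
// Hypothetical helper (not part of the original design):
// captures one DSP result in a register that drives nothing.
module capture_regs #(
    parameter WIDTH = 37
) (
    input wire             clk_i,
    input wire [WIDTH-1:0] result_i
);
    // noprune asks Quartus synthesis not to remove this fan-out-free register.
    (* noprune *) reg [WIDTH-1:0] result_r;

    always @(posedge clk_i) begin
        result_r <= result_i;
    end
endmodule
```

One instance per DSP output inside the generate loop would be the intended use, with result_i tied to resulta[i] or resultb[i].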
The 2nd solution I tried is to use combinational logic:
top_output = DSP_out[0] || DSP_out[1] || DSP_out[2] || DSP_out[3]
This solution generates only 4 DSP blocks even though the generate loop runs 3000 times. I tried doing this in a loop, but it did not work. Is there a way to trick the system into synthesizing all the DSP blocks even if I don't connect each block to a top level output?
I seem to be able to access the output of the DSP block with no issues. For instance, I was able to turn on or off an LED based on the numbers I fed into a single multiplier.
Here is the full code:
`timescale 1ps/1ps
`default_nettype none

module power_test_design (
    input  wire       clk_i,
    output reg  [0:0] outputa,
    output reg  [0:0] outputb
);

    localparam NUM_DSP_BLOCKS = 3000;

    genvar  i;
    wire    reset;
    integer k;

    //input stimulus signals for the DSP
    reg [17:0] ay_r;
    reg [17:0] by_r;
    reg [17:0] ax_r;
    reg [17:0] bx_r;

    //create wires and registers to hold outputs from multiplier
    (* keep = "true" *) wire [36:0] resulta   [NUM_DSP_BLOCKS-1:0];
    (* keep = "true" *) reg  [36:0] resulta_r [NUM_DSP_BLOCKS-1:0];
    (* keep = "true" *) wire [36:0] resultb   [NUM_DSP_BLOCKS-1:0];
    (* keep = "true" *) reg  [36:0] resultb_r [NUM_DSP_BLOCKS-1:0];

    reg [2:0] ena_r;

    // Stratix10 system reset
    reset_release U_RESET (
        .ninit_done (reset) // output, width = 1, ninit_done.ninit_done
    );

    // DSP stimulus
    always @(posedge clk_i) begin : DSP_SET_FF
        if (reset) begin
            ay_r  <= {18{1'b0}};
            by_r  <= {18{1'b0}};
            ax_r  <= {18{1'b0}};
            bx_r  <= {18{1'b0}};
            ena_r <= {3{1'b0}};
        end else begin
            ena_r <= 3'b001;
            ay_r  <= $unsigned(ay_r) + 1;
            by_r  <= $unsigned(by_r) + 1;
            ax_r  <= $unsigned(ax_r) + 2;
            bx_r  <= $unsigned(bx_r) + 3;
        end
    end

    generate
        for (i=0; i<NUM_DSP_BLOCKS; i=i+1) begin : GEN_DSPS
            dsp_fixed U_DSP (
                .ay      (ay_r),       // input, width = 18, ay.ay
                .by      (by_r),       // input, width = 18, by.by
                .ax      (ax_r),       // input, width = 18, ax.ax
                .bx      (bx_r),       // input, width = 18, bx.bx
                .resulta (resulta[i]), // output, width = 37, resulta.resulta
                .resultb (resultb[i]), // output, width = 37, resultb.resultb
                .clk0    (clk_i),      // input, width = 1, clk0.clk
                .clk1    (),           // input, width = 1, clk1.clk
                .clk2    (),           // input, width = 1, clk2.clk
                .ena     (ena_r)       // input, width = 3, ena.ena
            );

            //bring result to a register to assign output logic
            assign resulta_r[i] = resulta[i];
            assign resultb_r[i] = resultb[i];
        end
    endgenerate

    //output logic - this code generates 6 DSP blocks... I need to generate all 3000
    always @(posedge clk_i) begin : outputLogic
        for (k=1; k<50; k=k+1) begin
            outputa = resulta_r[k] || resulta_r[k+1] || resulta_r[k+2];
            outputb = resultb_r[k+3] || resultb_r[k+4] || resultb_r[k+5];
        end
    end

endmodule
`resetall
So far, I have tried several ways to assign this output. First:
always @(resulta_r[0], resulta_r[1], resulta_r[2], resulta_r[3]) begin
    if (resulta_r[0] == 4) begin
        outputa = 1;
    end
    else if (resulta_r[1] == 6) begin
        outputa = 1;
    end
    else if (resulta_r[2] == 6) begin
        outputa = 1;
    end
    else if (resulta_r[3] == 6) begin
        outputa = 1;
    end
    else begin
        outputa = 0;
    end
end
With this code, DSP blocks are generated for each if statement. So, the next idea was:
always @(posedge clk_i) begin : outputLogic
    for (k=1; k<50; k=k+1) begin
        outputa = resulta_r[k] || resulta_r[k+1] || resulta_r[k+2];
        outputb = resultb_r[k+3] || resultb_r[k+4] || resultb_r[k+5];
    end
end
This works in a similar way: I get a DSP block generated for each result[k] in the combinational statement. But this only generates 6 DSP blocks in total when synthesizing. It only generates blocks based on how many DSP block outputs appear in this combinational statement.
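A likely reason for the count of six: inside the loop, each iteration's blocking assignment overwrites outputa and outputb, so only the last iteration (k = 49) determines the final values. That iteration reads resulta_r[49] to resulta_r[51] and resultb_r[52] to resultb_r[54], which is six results in total. A minimal sketch of a loop that accumulates across iterations instead is shown below. The module name reduce_results, its ports, and the flattened results_i bus are hypothetical, and XOR is used here in place of the original logical OR so that every bit influences the output. It is untested in Quartus, and a reduction this wide would probably need pipelining to meet timing.

```verilog
// Hypothetical reduction stage: folds every DSP result into one output bit.
module reduce_results #(
    parameter NUM   = 3000,
    parameter WIDTH = 37
) (
    input  wire                 clk_i,
    input  wire [NUM*WIDTH-1:0] results_i, // all results, flattened into one bus
    output reg                  any_o
);
    integer k;
    reg     acc;

    // Combinational fold: every iteration contributes to acc,
    // so no result is overwritten by a later iteration.
    always @* begin
        acc = 1'b0;
        for (k = 0; k < NUM; k = k + 1) begin
            // XOR-reduce the k-th WIDTH-bit slice and fold it in.
            acc = acc ^ (^results_i[k*WIDTH +: WIDTH]);
        end
    end

    // Register the folded bit before it leaves the module.
    always @(posedge clk_i) begin
        any_o <= acc;
    end
endmodule
```

To feed it from the generate loop, each result would be copied into its slice of a flat bus, for example with an assign of resulta[i] to results_flat[i*37 +: 37] (results_flat being another hypothetical name).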
I solved this issue using virtual pins in Quartus. I can assign each output pin to be a virtual pin rather than an actual pin. With this setup I can have as many output pins as I require without connecting them to anything physical.
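For reference, a minimal sketch of how a virtual pin assignment might look when written directly in the HDL through Quartus's altera_attribute synthesis attribute. The module name virtual_pin_sketch and its signals are hypothetical, and an inferred 18x18 multiplier stands in for the dsp_fixed IP so the example is self-contained. This is not the exact setup I used.

```verilog
// Hypothetical example: a wide result port marked as a virtual pin.
module virtual_pin_sketch (
    input  wire        clk_i,
    // Assumed to be equivalent to the QSF line:
    //   set_instance_assignment -name VIRTUAL_PIN ON -to product_o[*]
    (* altera_attribute = "-name VIRTUAL_PIN ON" *)
    output reg  [35:0] product_o
);
    // Free-running stimulus counters (power-up values only, no reset).
    reg [17:0] a_r = 18'd0;
    reg [17:0] b_r = 18'd0;

    always @(posedge clk_i) begin
        a_r       <= a_r + 18'd1;
        b_r       <= b_r + 18'd3;
        product_o <= a_r * b_r; // inferred multiplier, stand-in for dsp_fixed
    end
endmodule
```

The same assignment can also be made per pin in the Assignment Editor, which keeps the HDL free of tool-specific attributes.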
The design still doesn't scale up to 3000 for some reason, but I have reached out to Intel for that. The original issue of optimizing away the DSP blocks unless they are connected to an output is solved.
The other solution that solved this issue was to chain several of these DSP blocks together. It also doesn't scale, but solves the original question asked here as well.
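A rough sketch of what chaining can look like, assuming inferred multipliers rather than the dsp_fixed IP. The module name dsp_chain_sketch, the STAGES parameter, and the fold of the product back to 18 bits are all illustrative choices, not the design I actually built. Because each stage consumes the previous stage's result, only the last stage needs to reach an output.

```verilog
// Hypothetical chain: each multiplier stage feeds the next one.
module dsp_chain_sketch #(
    parameter STAGES = 8
) (
    input  wire        clk_i,
    output wire [17:0] chain_o
);
    reg  [17:0] seed_r = 18'd1;   // free-running stimulus
    wire [17:0] link [0:STAGES];  // link[i] is the input to stage i

    always @(posedge clk_i) begin
        seed_r <= seed_r + 18'd1;
    end

    assign link[0] = seed_r;

    genvar i;
    generate
        for (i = 0; i < STAGES; i = i + 1) begin : GEN_CHAIN
            reg [35:0] product_r;

            // Registered 18x18 multiply of the previous link and the seed.
            always @(posedge clk_i) begin
                product_r <= link[i] * seed_r;
            end

            // Fold the 36-bit product back to 18 bits for the next stage.
            assign link[i+1] = product_r[35:18] ^ product_r[17:0];
        end
    endgenerate

    // Only the end of the chain is exposed.
    assign chain_o = link[STAGES];
endmodule
```

Whether the data in a chain like this toggles enough to be representative for a power measurement is a separate question that the sketch does not address.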