I am implementing a max-pooling module on an FPGA in SystemVerilog. Each word is 64 bits wide, and the input is a 28x28 grid of words (a 28x28-pixel image). The filter is 2x2 words and the stride is 2.
The code runs well in simulation, and Vivado synthesis reported no non-synthesizable constructs. I also expected the code to synthesize because all the parameters are constants, so everything should be resolved at compile time. But when I asked ChatGPT about it, it said this code isn't synthesizable.
This is my code:
    `timescale 1ns / 1ps
    module max_pooling_module
    #(
        parameter N = 64,
        parameter WIDTH = 28,
        parameter FILTER_SIZE = 2,
        parameter STRIDE = 2
    )
    (
        input logic signed [WIDTH-1:0][WIDTH-1:0][N-1:0] inputImage,
        output logic signed [(WIDTH-FILTER_SIZE)/STRIDE:0][(WIDTH-FILTER_SIZE)/STRIDE:0][N-1:0] pooledImage
    );
        // generate hardwares to do pooling concurrently
        generate
            genvar row, col;
            for(row = WIDTH-1; row + 1 >= STRIDE; row -= STRIDE) begin : stride_down
                for(col = WIDTH-1; col + 1 >= STRIDE; col -= STRIDE) begin : stride_right
                    // create an array contains all the elements that will be passed to max module
                    logic signed [N-1:0] arr [FILTER_SIZE*FILTER_SIZE];
                    // add elements to arr
                    always_comb begin
                        for(int i = 0; i < FILTER_SIZE; i++) begin
                            for(int j = 0; j < FILTER_SIZE; j++) begin
                                arr[i*FILTER_SIZE+j] = inputImage[row-i][col-j];
                            end
                        end
                    end
                    // max module takes 2 signals, the input is an array of values to compare,
                    // and an output which receives max value
                    max pooling(.arr(arr), .maxValue(pooledImage[(row+1)/STRIDE-1][(col+1)/STRIDE-1]));
                end
            end
        endgenerate
    endmodule
This is max module:
    module max #(parameter FILTER_SIZE = 2, parameter N = 64) (
        input logic signed [N-1:0] arr[FILTER_SIZE*FILTER_SIZE],
        output logic signed [N-1:0] maxValue
    );
        logic signed [N-1:0] res;
        always_comb begin
            res = arr[0];
            for (int i = 0; i < FILTER_SIZE*FILTER_SIZE; i++) begin
                if (res < arr[i]) res = arr[i];
                else res = res;
            end
        end
        assign maxValue = res;
    endmodule
Is this code synthesizable? If not, what is the correct way of doing it?
This is the message I have after running synthesis and implementation:
I/O and LUT utilization are 31360% and 106%, respectively.
There is a synthesis status "complete" indication.
I think it's synthesizable, because synthesis finished (synthesis status "complete") and the tool inferred what was modeled in the RTL. The posted code models a bunch of compares, and (based on your comments) synthesis built and connected them using LUTs. The tool finished with no latches, which is good.
Other positive indications with respect to synthesis quality of results:
With the synthesized design open, pressing F4 in Vivado should produce a schematic that is recognizable as your design. In your case you should see many instances of the max module.
A post-synthesis netlist was generated. Vivado stores the post-synthesis netlist in a .dcp (design checkpoint) file container, which is a lot like a zip file. It can be viewed or extracted using 7-Zip or similar, and it should contain a .edn or .edf file. In project mode (GUI), Vivado puts this file wherever it wants to, so you might need to use find or similar to locate it.
A hard negative indication would be that the synthesis hangs for hours rather than completing the task.
The biggest issue I see with the posted code is utilization, shown by the UTLZ-1 error in implementation: the design needs more LUTs than the targeted part has. You have 56 pounds of stuff in a 53-pound box. Maybe reduce the word size from 64 to 32 bits. Alternatively, you may be able to use fewer max modules if you figure out a way to use one as an engine, multiplex the data in and out, and store the results in registers. If you need to do N operations, you don't necessarily need N operating engines; just time-share a single engine and store results as needed.
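To make the time-sharing idea concrete, here is a minimal sketch of a single reusable engine, assuming the surrounding logic presents one 2x2 window per clock. The module name max_pool_engine and the signals clk, rst, in_valid, window and out_valid are hypothetical and not part of the posted design; the logic that fetches windows from storage and writes results back is not shown.

```systemverilog
// Hypothetical time-shared engine (names are illustrative, not from the posted design).
// One set of comparators is reused: the caller presents one window per clock.
module max_pool_engine #(
    parameter int N           = 64,
    parameter int FILTER_SIZE = 2
) (
    input  logic                clk,
    input  logic                rst,        // synchronous, active high
    input  logic                in_valid,   // window holds valid data this cycle
    input  logic signed [N-1:0] window [FILTER_SIZE*FILTER_SIZE],
    output logic                out_valid,  // maxValue is valid this cycle
    output logic signed [N-1:0] maxValue
);
    logic signed [N-1:0] best;

    // Combinational maximum of the current window (same idea as the max module).
    always_comb begin
        best = window[0];
        for (int i = 1; i < FILTER_SIZE*FILTER_SIZE; i++) begin
            if (window[i] > best) best = window[i];
        end
    end

    // Register the result so each window costs one clock cycle.
    always_ff @(posedge clk) begin
        if (rst) begin
            out_valid <= 1'b0;
        end else begin
            out_valid <= in_valid;
            if (in_valid) maxValue <= best;
        end
    end
endmodule
```

With the posted parameters there are 14 x 14 = 196 windows, so one engine like this would need roughly 196 clocks per image instead of 196 parallel max instances. That trades throughput for area, and the windows would have to come from on-chip storage rather than from top-level ports.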
The design's I/O utilization is far greater than the number of I/O pins available on the targeted part. You will need to re-architect this to fit within the available I/O. This will probably involve the use of a RAM to store samples. Maybe this could be the start of another question about how to make it fit.
You won't be able to deploy a .bit file to hardware until the design fits in the part.
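To see where a number like 31360% comes from, the port bits can be counted directly, since every bit of a top-level port needs its own pin. The sketch below is simulation-only arithmetic; the module name io_budget_sketch is hypothetical, and IO_AVAILABLE = 200 is an assumption inferred from the reported percentage, not a figure taken from a datasheet.

```systemverilog
// Simulation-only arithmetic sketch. IO_AVAILABLE is an assumption inferred
// from the reported 31360%, not a value taken from a datasheet.
module io_budget_sketch;
    localparam int N            = 64;
    localparam int WIDTH        = 28;
    localparam int OUT_WIDTH    = 14;                         // (28 - 2) / 2 + 1
    localparam int IN_BITS      = WIDTH * WIDTH * N;          // 50176
    localparam int OUT_BITS     = OUT_WIDTH * OUT_WIDTH * N;  // 12544
    localparam int PORT_BITS    = IN_BITS + OUT_BITS;         // 62720
    localparam int IO_AVAILABLE = 200;                        // assumed pin count

    initial begin
        $display("port bits = %0d", PORT_BITS);                           // 62720
        $display("utilization = %0d%%", PORT_BITS * 100 / IO_AVAILABLE);  // 31360%
    end
endmodule
```

Under that assumption the module asks for 62720 pins. A narrower interface, for example one 64-bit word in and one out per clock, would need on the order of 128 data pins plus a few control signals.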
Regarding the division operator:
The / operator is used as part of the generate loop, which is unrolled at elaboration, so you are not asking the RTL to build a divider; the way / is used should be OK. The division you get is integer division.
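As a small illustration of that point, here is a sketch in which every / has only parameters or genvars as operands, so the tool can evaluate it while elaborating the design. The module pool_dims_sketch and the names OUT_WIDTH, OUT_ROW and g_row are hypothetical, made up for this example.

```systemverilog
// Hypothetical helper module (not from the posted design): every '/' below
// operates on parameters or genvars, so it is evaluated at elaboration.
module pool_dims_sketch #(
    parameter int WIDTH       = 28,
    parameter int FILTER_SIZE = 2,
    parameter int STRIDE      = 2
) ();
    // Constant integer division: (28 - 2) / 2 + 1 = 14 outputs per row.
    localparam int OUT_WIDTH = (WIDTH - FILTER_SIZE) / STRIDE + 1;

    for (genvar row = WIDTH - 1; row + 1 >= STRIDE; row -= STRIDE) begin : g_row
        // Computed once per unrolled iteration, e.g. row = 27 gives 13.
        localparam int OUT_ROW = (row + 1) / STRIDE - 1;
    end
endmodule
```

No divider hardware results from either expression. A / whose operands are signals that change at run time would be a different story.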