How to design a 64 x 64 bit array multiplier in Verilog?

I know how to design a 4x4 array multiplier , but if I follow the same logic , the coding becomes tedious.

4 x 4 - 16 partial products
64 x 64 - 4096 partial products.

Along with 8 full adders and 4 half adders, How many full adders and half adders do I need for 64 x 64 bit. How do I reduce the number of Partial products? Is there any simple way to solve this ?

Solution

Whenever tediously coding a repetitive pattern you should use a generate statement instead:

module array_multiplier(a, b, y);

parameter width = 8;
input [width-1:0] a, b;
output [width-1:0] y;

wire [width*width-1:0] partials;

genvar i;
assign partials[width-1 : 0] = a[0] ? b : 0;
generate for (i = 1; i < width; i = i+1) begin:gen
    assign partials[width*(i+1)-1 : width*i] = (a[i] ? b << i : 0) +
                                   partials[width*i-1 : width*(i-1)];
end endgenerate

assign y = partials[width*width-1 : width*(width-1)];

endmodule

I've verified this module using the following test-bench: http://svn.clifford.at/handicraft/2013/array_multiplier/array_multiplier_tb.v

EDIT:

As @Debian has asked for a pipelined version - here it is. This time using a for loop in an always-region for the array part.

module array_multiplier_pipeline(clk, a, b, y);

parameter width = 8;

input clk;
input [width-1:0] a, b;
output [width-1:0] y;

reg [width-1:0] a_pipeline [0:width-2];
reg [width-1:0] b_pipeline [0:width-2];
reg [width-1:0] partials [0:width-1];
integer i;

always @(posedge clk) begin
    a_pipeline[0] <= a;
    b_pipeline[0] <= b;
    for (i = 1; i < width-1; i = i+1) begin
        a_pipeline[i] <= a_pipeline[i-1];
        b_pipeline[i] <= b_pipeline[i-1];
    end

    partials[0] <= a[0] ? b : 0;
    for (i = 1; i < width; i = i+1)
        partials[i] <= (a_pipeline[i-1][i] ? b_pipeline[i-1] << i : 0) +
                partials[i-1];
end

assign y = partials[width-1];

endmodule

Note that with many synthesis tools it's also possible to just add (width) register stages after the non-pipelined adder and let the tools register balancing pass do the pipelining.