Trying to find Fmax in VHDL but getting extra cycle of delay


I want to see the speed of my VHDL design. As far as I know, it is indicated by Fmax in the Quartus II software. After compiling my design, it shows an Fmax of 653.59 MHz. I wrote a testbench and did some tests to make sure that the design is working as expected. The problem I have with the design is that at the rising edge of the clock, the inputs are set correctly, but the output only comes after one more cycle.

My question is: How can I check the speed of my design (longest delay between the input ports and the output port) and also get the output of the addition at the same time that the inputs are loaded/at the same cycle?

My testbench results are as follows:

a: 0001 and b: 0101 gives XXXX
a: 1001 and b: 0001 gives 0110 (the expected result from the previous calculation)
a: 1001 and b: 1001 gives 1010 (the expected result from the previous calculation)
etc

Code:

library ieee; 
use ieee.std_logic_1164.all; 
use ieee.numeric_std.all; 

entity adder is 
    port( 
        clk : in STD_LOGIC; 
        a : in unsigned(3 downto 0); 
        b : in unsigned(3 downto 0); 
        sum : out unsigned(3 downto 0)
    );  
end adder; 

architecture rtl of adder is 

signal a_r, b_r, sum_r : unsigned(3 downto 0); 

begin 
    sum_r <= a_r + b_r; 
    process(clk) 
    begin 
        if (rising_edge(clk)) then 
            a_r <= a;
            b_r <= b;
            sum <= sum_r;
        end if; 
    end process;
end rtl; 

Testbench:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity testbench is
end entity;

architecture behavioral of testbench is
    component adder is
        port( 
            clk : in STD_LOGIC; 
            a : in unsigned(3 downto 0); 
            b : in unsigned(3 downto 0); 
            sum : out unsigned(3 downto 0)
        ); 
    end component;
    signal a, b, sum : unsigned(3 downto 0);
    signal clk : STD_LOGIC;
begin
    uut: adder
        port map(
            clk => clk,
            a => a,
            b => b,
            sum => sum
        );
    stim_process : process
    begin
        wait for 1 ns;
        clk <= '0';
        wait for 1 ns;
        clk <= '1';
        a <= "0001";
        b <= "0101";
        wait for 1 ns;
        clk <= '0';
        wait for 1 ns;
        clk <= '1';
        a <= "1001";
        b <= "0001";
        wait for 1 ns;
        clk <= '0';
        wait for 1 ns;
        clk <= '1';
        a <= "1001";
        b <= "1001";
    end process;
end behavioral;

Solution

  • Is there any issue with using sum_r as your output?

    You don't need the input and output registers if you treat this ALU as pure combinatorial logic. Once you remove them, the Fmax figure disappears, because Fmax is only reported for register-to-register paths. It then depends on what the adder is connected from and what it is connected to, and it only applies if the inputs come from registers and the outputs go to registers. If the logic runs straight from input pin to output pin, I think it is extremely difficult to say what the propagation delay is, and as far as I know the vendor tools from Altera and others are not well suited to this kind of analysis.

    That's why you will hear people talk about the difficulties of designing asynchronous logic.

    I think such a fine-grained analysis is difficult to perform with certainty and accuracy, since for a design like yours the propagation delay would be in the picosecond range. Even in the literature it is difficult to find quantitative answers on propagation delay.

    Why is it difficult? Remember that propagation delay is determined by the total path capacitance. There is a way to estimate propagation delay for transistors, but I don't know the details of how the LUTs are constructed internally, so I cannot give you a good estimate. It depends heavily on the device family, the manufacturing process, the construction of the FPGA, and whether the load is connected to I/O.

    You can, however, make your own estimate by going to the logic planner, looking at the path, and assuming about 20-100 ps of propagation delay per LUT that the signal travels through.
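
    As a rough sketch of that estimate: the symbols below are illustrative shorthand rather than Quartus terminology, routing delay is lumped into a single term, and I/O buffer delay is ignored. The per-LUT range is the one quoted above, not a datasheet value. The second line only converts the Fmax reported in the question into the corresponding minimum register-to-register clock period, for comparison.

    ```latex
    t_{pd} \approx N_{LUT}\, t_{LUT} + t_{routing},
    \qquad 20\,\text{ps} \lesssim t_{LUT} \lesssim 100\,\text{ps}

    T_{\min} = \frac{1}{F_{\max}} = \frac{1}{653.59\,\text{MHz}} \approx 1.53\,\text{ns}
    ```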

    [Image from the original answer; not available.]

    What you are trying to design is an ALU. By definition, an ALU is in theory simply combinatorial logic.

    Therefore, strictly speaking, your adder code only needs to be this:

    library ieee; 
    use ieee.std_logic_1164.all; 
    use ieee.numeric_std.all; 
    
    entity adder is 
        port( 
            a : in unsigned(3 downto 0); 
            b : in unsigned(3 downto 0); 
            sum : out unsigned(3 downto 0)
        );  
    end adder; 
    
    architecture rtl of adder is 
    begin 
        sum <= a + b; 
    end rtl; 
    

    No clock is required here, since this function is purely combinatorial.
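
    To check the same-cycle behaviour asked about in the question, a minimal self-checking testbench sketch for the clockless adder could look like the following. The testbench name adder_comb_tb, the 1 ns settle time, and the direct instantiation of the adder from library work are assumptions for illustration. The input values are the ones from the question.

    ```vhdl
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity adder_comb_tb is
    end entity;

    architecture sim of adder_comb_tb is
        signal a, b, sum : unsigned(3 downto 0);
    begin
        -- Assumption: the clockless adder shown above is compiled into library work.
        dut : entity work.adder
            port map (a => a, b => b, sum => sum);

        stim : process
        begin
            a <= "0001";
            b <= "0101";
            wait for 1 ns;  -- arbitrary settle time; the RTL model updates in a delta cycle
            assert sum = "0110"
                report "expected 0001 + 0101 = 0110" severity failure;

            a <= "1001";
            b <= "0001";
            wait for 1 ns;
            assert sum = "1010"
                report "expected 1001 + 0001 = 1010" severity failure;

            a <= "1001";
            b <= "1001";
            wait for 1 ns;
            -- 9 + 9 = 18, which wraps to 2 in 4 bits because there is no carry out
            assert sum = "0010"
                report "expected 1001 + 1001 = 0010 (wrapped)" severity failure;

            report "all checks passed";
            wait;  -- suspend forever so the simulation can end
        end process;
    end architecture;
    ```

    An RTL simulation like this has no gate delay, so it confirms that the result appears together with the inputs but says nothing about the real propagation delay in the device.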

    However, if you want to put your ALU into a registered stage as I have described, what you should actually do is this:

    library ieee; 
    use ieee.std_logic_1164.all; 
    use ieee.numeric_std.all; 
    
    entity adder is 
        port( 
            clk : in STD_LOGIC; 
            a : in unsigned(3 downto 0); 
            b : in unsigned(3 downto 0); 
            sum : out unsigned(3 downto 0)
        );  
    end adder;
    
    architecture rtl of adder is 
    
    signal a_r, b_r, sum_r : unsigned(3 downto 0); 
    signal internal_sum : unsigned(3 downto 0);
    
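    begin 
        -- The output port is driven directly by the output register.
        sum <= sum_r;
        -- Combinatorial adder between the input and output registers.
        internal_sum <= a_r + b_r; 
    
        process(clk) 
        begin 
            if (rising_edge(clk)) then 
                a_r <= a;                -- input registers
                b_r <= b;
                sum_r <= internal_sum;   -- output register
            end if; 
        end process;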
    end rtl; 
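
    The latency of this registered version can be checked with a sketch like the following. The testbench name adder_reg_tb, the 10 ns clock period, and the done flag are illustrative assumptions, as is the direct instantiation from library work (only one of the two adder versions can be compiled under that name at a time). The inputs are driven on a falling edge so they are stable before the capturing rising edge, and the sum is expected after the second rising edge: the first edge loads a_r and b_r, the second loads sum_r.

    ```vhdl
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity adder_reg_tb is
    end entity;

    architecture sim of adder_reg_tb is
        constant CLK_PERIOD : time := 10 ns;  -- arbitrary value for simulation only
        signal clk  : std_logic := '0';
        signal a, b : unsigned(3 downto 0) := (others => '0');
        signal sum  : unsigned(3 downto 0);
        signal done : boolean := false;
    begin
        -- Assumption: the registered adder shown above is compiled into library work.
        dut : entity work.adder
            port map (clk => clk, a => a, b => b, sum => sum);

        -- Free-running clock that is held low once the check has finished.
        clk <= not clk after CLK_PERIOD / 2 when not done else '0';

        stim : process
        begin
            -- Change the inputs away from the rising edge.
            wait until falling_edge(clk);
            a <= "0001";
            b <= "0101";

            wait until rising_edge(clk);  -- edge 1: a_r and b_r capture the inputs
            wait until rising_edge(clk);  -- edge 2: sum_r captures a_r + b_r
            wait for 1 ns;                -- sample shortly after the edge
            assert sum = "0110"
                report "expected 0110 after the second rising edge" severity failure;

            report "latency check passed";
            done <= true;
            wait;  -- suspend forever so the simulation can end
        end process;
    end architecture;
    ```

    In the question's testbench the inputs change at the same simulation time as the clock edge, so the input registers already see the new values at that edge, which is why only one cycle of delay shows up there.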
    

    You have not mentioned a carry out, so I will not discuss that here.

    Finally, if you are using Altera, they have a very nice RTL viewer that you can use to look at your synthesized design, under Tools -> Netlist Viewers -> RTL Viewer.