
VHDL: Why is output delayed so much?


I'm learning VHDL in order to describe and demonstrate the operation of a superscalar-ish pipelined CPU with hazard detection, branch prediction, etc.

I'm starting small, so for practice I tried making a really simple "calculator" design, like this:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_signed.all;

entity calculator is
    port(
        in_A  : in  std_logic_vector(3 downto 0);
        in_B  : in  std_logic_vector(3 downto 0);
        add : in std_logic;
        sub : in std_logic;
        out_C : out std_logic_vector(3 downto 0)
    );
end entity calculator;

architecture RTL of calculator is
    signal next_output : std_logic_vector(3 downto 0);
begin
    process(in_A, in_B, add, sub)
        variable temp_x, temp_y, temp_z : integer;
    begin

        temp_x := conv_integer(in_A);
        temp_y := conv_integer(in_B);

        if(add = '1') and (sub = '0') then
            temp_z := temp_x + temp_y;
            next_output <= std_logic_vector(to_unsigned(temp_z, 4));
        elsif(add = '0') and (sub = '1') then
            temp_z := temp_x - temp_y;
            next_output <= std_logic_vector(to_unsigned(temp_z,4));
        else
            temp_z := 0;
            next_output <= std_logic_vector(to_unsigned(temp_z,4));
        end if;

        out_C <= next_output;
    end process;
end architecture RTL;

However, I can't figure out why the output only updates after the input changes again, as shown here (I don't think the testbench code is relevant):

[ModelSim waveform screenshot]

I would like to know what I should do to make the output correct and available without delay. If add is '1', the output should follow the inputs immediately (well, that's what I want; the way I wrote it, it doesn't :) ).

Also, can someone explain when the output would be stored in flip-flops, and whether it is being stored in flip-flops the way I wrote my description?

I would also really appreciate any advice, criticism, and guidance to help me out. This is only a simple ADD/SUB calculator, and I have to describe a whole processor with an instruction set in about two months! Maybe you can point me to good learning tutorials, because the classes I had were useless :(

Thanks in advance! :)


Solution

  • The easiest thing to do would be to move the assignment

    out_C <= next_output;
    

    outside the process (make it a concurrent signal assignment).
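
    For reference, here is a sketch of the architecture with that change (same
    logic as in the question; note to_signed instead of to_unsigned, so a
    negative subtraction result doesn't fail the conversion's range check):

        architecture RTL of calculator is
            signal next_output : std_logic_vector(3 downto 0);
        begin
            process(in_A, in_B, add, sub)
                variable temp_x, temp_y, temp_z : integer;
            begin
                temp_x := conv_integer(in_A);
                temp_y := conv_integer(in_B);

                if (add = '1') and (sub = '0') then
                    temp_z := temp_x + temp_y;
                elsif (add = '0') and (sub = '1') then
                    temp_z := temp_x - temp_y;
                else
                    temp_z := 0;
                end if;

                next_output <= std_logic_vector(to_signed(temp_z, 4));
            end process;

            -- Concurrent assignment: it has its own implicit process, so it
            -- re-runs whenever next_output changes, and out_C follows in the
            -- same simulation time step (one delta cycle later).
            out_C <= next_output;
        end architecture RTL;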

    You could also make next_output a variable declared in the process and leave the signal assignment where it is.

    The delay happens because signal assignments don't take effect in the simulation cycle in which they occur. Without a process sensitive to next_output, its new value will only be seen the next time the process executes for some other reason.

    A concurrent signal assignment statement has an equivalent process in which the signals on the right-hand side appear in the sensitivity list.
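
    For example, the concurrent assignment above behaves like this process:

        process(next_output)  -- sensitivity list = right-hand-side signals
        begin
            out_C <= next_output;
        end process;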

    Making next_output a variable makes its value immediately available.
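
    A sketch of that variant, assuming the signal declaration of next_output
    is removed from the architecture (again with to_signed so a negative
    result converts cleanly):

        process(in_A, in_B, add, sub)
            variable temp_x, temp_y, temp_z : integer;
            variable next_output : std_logic_vector(3 downto 0);  -- now a variable
        begin
            temp_x := conv_integer(in_A);
            temp_y := conv_integer(in_B);

            if (add = '1') and (sub = '0') then
                temp_z := temp_x + temp_y;
            elsif (add = '0') and (sub = '1') then
                temp_z := temp_x - temp_y;
            else
                temp_z := 0;
            end if;

            -- A variable assignment takes effect immediately...
            next_output := std_logic_vector(to_signed(temp_z, 4));

            -- ...so the new value is already visible here, in the same run
            out_C <= next_output;
        end process;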

    You could also re-write your process:

        process(in_A, in_B, add, sub)
            variable temp_x, temp_y, temp_z : integer;
        begin

            -- conv_integer from std_logic_signed treats the vectors as signed
            temp_x := conv_integer(in_A);
            temp_y := conv_integer(in_B);

            if (add = '1') and (sub = '0') then
                temp_z := temp_x + temp_y;
            elsif (add = '0') and (sub = '1') then
                temp_z := temp_x - temp_y;
            else
                temp_z := 0;
            end if;

            -- to_signed rather than to_unsigned: subtraction can produce a
            -- negative result, which to_unsigned would refuse to convert
            out_C <= std_logic_vector(to_signed(temp_z, 4));
        end process;
    

    And eliminate next_output.
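
    If you want to check the behaviour in ModelSim, a minimal testbench along
    these lines should do (the stimulus values are just an example):

        library ieee;
        use ieee.std_logic_1164.all;

        entity calculator_tb is
        end entity calculator_tb;

        architecture sim of calculator_tb is
            signal in_A, in_B : std_logic_vector(3 downto 0) := (others => '0');
            signal add, sub   : std_logic := '0';
            signal out_C      : std_logic_vector(3 downto 0);
        begin
            dut : entity work.calculator
                port map(in_A => in_A, in_B => in_B,
                         add => add, sub => sub, out_C => out_C);

            stimulus : process
            begin
                -- 3 + 2: out_C should show "0101" as soon as the inputs settle
                in_A <= "0011"; in_B <= "0010"; add <= '1'; sub <= '0';
                wait for 10 ns;

                -- 5 - 2: out_C should show "0011"
                in_A <= "0101"; add <= '0'; sub <= '1';
                wait for 10 ns;

                wait;  -- end of stimulus
            end process;
        end architecture sim;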