VHDL - variable vs. signal behaviour in queue

In a university course on configurable embedded systems (on a ZYNQ-7010) we recently implemented a (naive) low-pass image filter that applies a 1-dimensional Gaussian kernel (0.25*[1 2 1]) to data coming from block RAM.

We decided to cache (i.e. queue) three pixels and then operate on them on-line in the data output process. Our first approach was to have three process variables and have them roll over in a

pixel[k-2] := pixel[k-1];
pixel[k-1] := pixel[k];
pixel[k]   := RAM(address);

fashion. The full process follows:

process (clk25)
    -- queue
    variable pixelMinus2  : std_logic_vector(11 downto 0) := (others => '0');
    variable pixelMinus1  : std_logic_vector(11 downto 0) := (others => '0');
    variable pixelCurrent : std_logic_vector(11 downto 0) := (others => '0');

    -- temporaries
    variable r : unsigned(3 downto 0);
    variable g : unsigned(3 downto 0);
    variable b : unsigned(3 downto 0);
begin
    if clk25'event and clk25 = '1' then
        pixelMinus2  := pixelMinus1;
        pixelMinus1  := pixelCurrent;
        pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

        if slv_reg0(3) = '0' then
            -- bypass filter for debugging
            dob <= pixelCurrent;
        else
            -- colors are 4 bit each in a 12 bit vector
            -- division by 4 is done by right shifting by 2
            r := (
                          ("00" & unsigned(pixelMinus2(11 downto 10)))
                        + ("00" & unsigned(pixelMinus1(11 downto 10)))
                        + ("00" & unsigned(pixelMinus1(11 downto 10)))
                        + ("00" & unsigned(pixelCurrent(11 downto 10)))
                    );

            g :=  (
                          ("00" & unsigned(pixelMinus2(7 downto 6)))
                        + ("00" & unsigned(pixelMinus1(7 downto 6)))
                        + ("00" & unsigned(pixelMinus1(7 downto 6)))
                        + ("00" & unsigned(pixelCurrent(7 downto 6)))
                    );

            b :=  (
                          ("00" & unsigned(pixelMinus2(3 downto 2)))
                        + ("00" & unsigned(pixelMinus1(3 downto 2)))
                        + ("00" & unsigned(pixelMinus1(3 downto 2)))
                        + ("00" & unsigned(pixelCurrent(3 downto 2)))
                    );

            dob <= std_logic_vector(r) & std_logic_vector(g) & std_logic_vector(b);
        end if;
    end if;
end process;

However, this turned out to be horribly wrong: synthesis would take ages and result in an estimated LUT usage of approximately 130% of the device's capacity.

We later changed the implementation to use signals instead of variables, and this resolved all problems: the hardware behaved as expected and LUT usage went down to a few percent.

My question is: what causes the problem when using variables? From our understanding, the variable version should work as well.

Solution

When a variable is used for pixelCurrent in the process, its value is updated and available immediately, whereas a value assigned to a signal is not available until the next cycle.
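As a minimal sketch of that difference (the entity var_vs_sig and all of its port names are made up for illustration and do not come from the question), the following design routes the same input once through a variable and once through a signal:

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical demo entity: all names here are assumptions.
entity var_vs_sig is
    port (
        clk   : in  std_logic;
        d     : in  std_logic_vector(11 downto 0);
        q_var : out std_logic_vector(11 downto 0);
        q_sig : out std_logic_vector(11 downto 0)
    );
end entity var_vs_sig;

architecture rtl of var_vs_sig is
    signal s : std_logic_vector(11 downto 0) := (others => '0');
begin
    process (clk)
        variable v : std_logic_vector(11 downto 0) := (others => '0');
    begin
        if rising_edge(clk) then
            v := d;      -- variable: new value is visible on the next line
            s <= d;      -- signal: update is scheduled, s still holds the old value here
            q_var <= v;  -- reads the new d, so q_var lags d by one clock
            q_sig <= s;  -- reads the previous d, so q_sig lags d by two clocks
        end if;
    end process;
end architecture rtl;

In simulation q_var should follow d after one rising edge and q_sig after two. In hardware terms, v adds no register of its own here, while s becomes an extra pipeline stage.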

So when a variable is used, this line implements a RAM with asynchronous read based on addrb:

pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

An assignment to a signal, on the other hand, implements a RAM with synchronous read, where the value read from the RAM is not available until the next cycle.

Typical FPGA technologies have dedicated hardware for RAMs with synchronous read, but RAMs with asynchronous read are built from combinatorial logic (look-up tables / LUTs).

So the huge number of LUTs that appears when using a variable for pixelCurrent is because the synthesis tool tries to map the RAM with asynchronous read into LUTs, which typically requires a huge number of LUTs and makes the resulting RAM very slow.

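In the pipelined design it sounds like the asynchronous RAM read is not required, so if pixelCurrent is a signal, a synchronous RAM is used instead and the synthesis tool can map the RAM to an internal block RAM. Note that once pixelCurrent is a signal, pixelMinus1 and pixelMinus2 should be signals as well (as in the version the question reports as working); if they stay variables, pixelMinus1 takes the old value of pixelCurrent immediately and the filter sees the same pixel in two taps. The code then looks like:

pixelMinus2  <= pixelMinus1;
pixelMinus1  <= pixelCurrent;
pixelCurrent <= RAM(to_integer(UNSIGNED(addrb)));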
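For completeness, here is a sketch of how the whole filter could look with a synchronous read. It is not the asker's actual design. The entity name lowpass_bram, the RAM depth, the write port (wea, addra, dia), the filter_en input (standing in for slv_reg0(3)) and the blur helper are all assumptions made to keep the example self-contained. All three queue taps are signals, so every tap reads the value from before the clock edge and the taps stay distinct.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper: entity name, widths, depth and write port are assumptions.
entity lowpass_bram is
    port (
        clk25     : in  std_logic;
        filter_en : in  std_logic;  -- stands in for slv_reg0(3); '0' bypasses the filter
        wea       : in  std_logic;
        addra     : in  std_logic_vector(9 downto 0);
        dia       : in  std_logic_vector(11 downto 0);
        addrb     : in  std_logic_vector(9 downto 0);
        dob       : out std_logic_vector(11 downto 0)
    );
end entity lowpass_bram;

architecture rtl of lowpass_bram is
    type ram_t is array (0 to 1023) of std_logic_vector(11 downto 0);
    signal RAM : ram_t := (others => (others => '0'));

    -- queue, now held in signals instead of process variables
    signal pixelMinus2  : std_logic_vector(11 downto 0) := (others => '0');
    signal pixelMinus1  : std_logic_vector(11 downto 0) := (others => '0');
    signal pixelCurrent : std_logic_vector(11 downto 0) := (others => '0');

    -- 0.25*[1 2 1] on one 4-bit channel; as in the question, each tap is
    -- shifted right by 2 before summing, so the sum fits in 4 bits
    function blur(a, b, c : std_logic_vector(3 downto 0)) return std_logic_vector is
        variable s : unsigned(3 downto 0);
    begin
        s := ("00" & unsigned(a(3 downto 2)))
           + ("00" & unsigned(b(3 downto 2)))
           + ("00" & unsigned(b(3 downto 2)))
           + ("00" & unsigned(c(3 downto 2)));
        return std_logic_vector(s);
    end function blur;
begin
    process (clk25)
    begin
        if rising_edge(clk25) then
            -- write port (assumed; the question does not show how RAM is filled)
            if wea = '1' then
                RAM(to_integer(unsigned(addra))) <= dia;
            end if;

            -- synchronous read: pixelCurrent acts as the RAM output register
            pixelMinus2  <= pixelMinus1;
            pixelMinus1  <= pixelCurrent;
            pixelCurrent <= RAM(to_integer(unsigned(addrb)));

            if filter_en = '0' then
                -- bypass filter for debugging
                dob <= pixelCurrent;
            else
                dob <= blur(pixelMinus2(11 downto 8), pixelMinus1(11 downto 8), pixelCurrent(11 downto 8))
                     & blur(pixelMinus2(7 downto 4),  pixelMinus1(7 downto 4),  pixelCurrent(7 downto 4))
                     & blur(pixelMinus2(3 downto 0),  pixelMinus1(3 downto 0),  pixelCurrent(3 downto 0));
            end if;
        end if;
    end process;
end architecture rtl;

Compared with the variable version, dob arrives one clock cycle later for a given addrb, because the RAM output now passes through a register before it reaches the filter. Whether the tool actually picks a block RAM still depends on the tool and its settings, so it is worth checking the utilization report after synthesis.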