In a university course on configurable embedded systems (on a ZYNQ-7010), we recently implemented a (naive) low-pass image filter that applies a 1-dimensional Gaussian kernel (0.25*[1 2 1]) to data coming from block RAM.
We decided to cache (i.e. queue) three pixels and then operate on them on-line in the data output process. Our first approach was to use three process variables and roll them over like this:
pixel[k-2] := pixel[k-1];
pixel[k-1] := pixel[k];
pixel[k] := RAM(address);
The full process is:
process (clk25)
    -- queue
    variable pixelMinus2 : std_logic_vector(11 downto 0) := (others => '0');
    variable pixelMinus1 : std_logic_vector(11 downto 0) := (others => '0');
    variable pixelCurrent : std_logic_vector(11 downto 0) := (others => '0');
    -- temporaries
    variable r : unsigned(3 downto 0);
    variable g : unsigned(3 downto 0);
    variable b : unsigned(3 downto 0);
begin
    if clk25'event and clk25 = '1' then
        pixelMinus2 := pixelMinus1;
        pixelMinus1 := pixelCurrent;
        pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));
        IF slv_reg0(3) = '0' THEN
            -- bypass filter for debugging
            dob <= pixelCurrent;
        ELSE
            -- colors are 4 bit each in a 12 bit vector
            -- division by 4 is done by right shifting by 2
            r := (
                ("00" & unsigned(pixelMinus2(11 downto 10)))
                + ("00" & unsigned(pixelMinus1(11 downto 10)))
                + ("00" & unsigned(pixelMinus1(11 downto 10)))
                + ("00" & unsigned(pixelCurrent(11 downto 10)))
            );
            g := (
                ("00" & unsigned(pixelMinus2(7 downto 6)))
                + ("00" & unsigned(pixelMinus1(7 downto 6)))
                + ("00" & unsigned(pixelMinus1(7 downto 6)))
                + ("00" & unsigned(pixelCurrent(7 downto 6)))
            );
            b := (
                ("00" & unsigned(pixelMinus2(3 downto 2)))
                + ("00" & unsigned(pixelMinus1(3 downto 2)))
                + ("00" & unsigned(pixelMinus1(3 downto 2)))
                + ("00" & unsigned(pixelCurrent(3 downto 2)))
            );
            dob <= std_logic_vector(r) & std_logic_vector(g) & std_logic_vector(b);
        END IF;
    end if;
end process;
However, this turned out to be horribly wrong: synthesis would take ages and result in an estimated LUT usage of approximately 130% of the device's capacity.
We later changed the implementation to use signals instead of variables, and this resolved all problems: the hardware behaved as expected and LUT usage went down to a few percent.
My question is: what causes the problem when using variables? From our understanding, the variable version should work as well.
When a variable is used for pixelCurrent in the process, its value is updated and available immediately, whereas the value of a signal is not available until the next cycle.
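The difference can be illustrated with a minimal sketch. The entity var_vs_sig and all of its port names are hypothetical and exist only for this demonstration; they are not part of the design in the question.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical demo entity; all names are illustrative only.
entity var_vs_sig is
    port (
        clk   : in  std_logic;
        d     : in  std_logic_vector(3 downto 0);
        q_var : out std_logic_vector(3 downto 0);
        q_sig : out std_logic_vector(3 downto 0)
    );
end entity var_vs_sig;

architecture rtl of var_vs_sig is
    signal s : std_logic_vector(3 downto 0) := (others => '0');
begin
    process (clk)
        variable v : std_logic_vector(3 downto 0) := (others => '0');
    begin
        if rising_edge(clk) then
            -- Variable: the assignment takes effect immediately,
            -- so q_var registers d directly (one cycle of latency).
            v := d;
            q_var <= v;

            -- Signal: s is only updated after the process suspends,
            -- so q_sig reads the old s (two cycles of latency).
            s <= d;
            q_sig <= s;
        end if;
    end process;
end architecture rtl;
```

In simulation, q_var should follow d one clock edge later, while q_sig should follow it two clock edges later.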
So when a variable is used, this line implements a RAM with asynchronous read based on addrb:
pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));
An assignment to a signal, in contrast, implements a RAM with synchronous read, where the value read from the RAM is not available until the next cycle.
Typical FPGA technologies have dedicated hardware for RAMs with synchronous read, but RAMs with asynchronous read are built from combinatorial logic (look-up tables, LUTs).
So the huge number of LUTs that appears when using a variable for pixelCurrent arises because the synthesis tool tries to map the RAM with asynchronous read into LUTs, which typically requires a very large number of LUTs and makes the resulting RAM very slow.
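The two read styles can be contrasted in a small sketch. The entity ram_demo, its depth of 1024 words, and its port names are assumptions made for illustration. Whether a particular tool maps each architecture to LUTs or to a RAM block depends on the tool and its settings, so check the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port RAM; size and port names are assumptions.
entity ram_demo is
    port (
        clk  : in  std_logic;
        we   : in  std_logic;
        addr : in  std_logic_vector(9 downto 0);
        din  : in  std_logic_vector(11 downto 0);
        dout : out std_logic_vector(11 downto 0)
    );
end entity ram_demo;

-- Asynchronous read: dout follows addr combinationally.
-- This style typically cannot be mapped to a block RAM.
architecture async_read of ram_demo is
    type ram_t is array (0 to 1023) of std_logic_vector(11 downto 0);
    signal ram : ram_t := (others => (others => '0'));
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                ram(to_integer(unsigned(addr))) <= din;
            end if;
        end if;
    end process;

    dout <= ram(to_integer(unsigned(addr)));
end architecture async_read;

-- Synchronous read: dout is registered on the clock edge.
-- This is the style that block RAM inference usually expects.
architecture sync_read of ram_demo is
    type ram_t is array (0 to 1023) of std_logic_vector(11 downto 0);
    signal ram : ram_t := (others => (others => '0'));
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                ram(to_integer(unsigned(addr))) <= din;
            end if;
            dout <= ram(to_integer(unsigned(addr)));
        end if;
    end process;
end architecture sync_read;
```

The only functional difference is where the read sits relative to the clock edge: outside the clocked process in async_read, inside it in sync_read.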
In the pipelined design it sounds like the asynchronous RAM read is not required, so if pixelCurrent is a signal, a synchronous RAM is used instead and the synthesis tool will map the RAM to an internal RAM hardware block, with code like:
pixelMinus2 := pixelMinus1;
pixelMinus1 := pixelCurrent;
pixelCurrent <= RAM(to_integer(UNSIGNED(addrb)));
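For completeness, here is a sketch of what an all-signal version of the process might look like. It is not the asker's final code. The entity blur_filter, the write port (wea, addra, dia), the RAM depth, the input filter_en (standing in for slv_reg0(3)), and the helper function blur are all assumptions introduced to make the example self-contained. The arithmetic keeps the question's approach of truncating each term to its upper two bits before summing.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper; only clk25, addrb, dob and RAM come from the question.
entity blur_filter is
    port (
        clk25     : in  std_logic;
        -- assumed write port, stands in for whatever fills the RAM
        wea       : in  std_logic;
        addra     : in  std_logic_vector(9 downto 0);
        dia       : in  std_logic_vector(11 downto 0);
        -- read side
        addrb     : in  std_logic_vector(9 downto 0);
        filter_en : in  std_logic;  -- stands in for slv_reg0(3)
        dob       : out std_logic_vector(11 downto 0)
    );
end entity blur_filter;

architecture rtl of blur_filter is
    type ram_t is array (0 to 1023) of std_logic_vector(11 downto 0);
    signal RAM : ram_t := (others => (others => '0'));

    signal pixelMinus2  : std_logic_vector(11 downto 0) := (others => '0');
    signal pixelMinus1  : std_logic_vector(11 downto 0) := (others => '0');
    signal pixelCurrent : std_logic_vector(11 downto 0) := (others => '0');

    -- 0.25*[1 2 1] on one 4-bit channel; each term is divided by 4
    -- (upper two bits kept) before summing, as in the question.
    function blur (a, b, c : std_logic_vector(3 downto 0))
        return std_logic_vector is
        variable sum : unsigned(3 downto 0);
    begin
        sum := resize(unsigned(a(3 downto 2)), 4)
             + resize(unsigned(b(3 downto 2)), 4)
             + resize(unsigned(b(3 downto 2)), 4)
             + resize(unsigned(c(3 downto 2)), 4);
        return std_logic_vector(sum);
    end function blur;
begin
    process (clk25)
    begin
        if rising_edge(clk25) then
            if wea = '1' then
                RAM(to_integer(unsigned(addra))) <= dia;
            end if;

            -- Every right-hand side is the value from before this edge.
            pixelMinus2  <= pixelMinus1;
            pixelMinus1  <= pixelCurrent;
            pixelCurrent <= RAM(to_integer(unsigned(addrb)));  -- synchronous read

            if filter_en = '0' then
                dob <= pixelCurrent;  -- bypass filter for debugging
            else
                dob <= blur(pixelMinus2(11 downto 8), pixelMinus1(11 downto 8), pixelCurrent(11 downto 8))
                     & blur(pixelMinus2(7 downto 4),  pixelMinus1(7 downto 4),  pixelCurrent(7 downto 4))
                     & blur(pixelMinus2(3 downto 0),  pixelMinus1(3 downto 0),  pixelCurrent(3 downto 0));
            end if;
        end if;
    end process;
end architecture rtl;
```

Because dob now reads the registered pixelCurrent rather than the RAM output directly, the output lags the address by one more clock cycle than in the variable version. Any logic that consumes dob has to account for that extra cycle.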