Shift Register or FIFO in block RAM (Xilinx)

I have to buffer some data in a fairly large buffer. It is not an ordinary shift register or FIFO, because I also need to read data from the middle of the buffer. I managed to implement it so that it works the way I need. The problem is that it is built from LUTs, which takes up a lot of space in my design. I would like to change the design so that the buffer is inferred as block RAM. Using ram_style "block" didn't help. Any ideas or suggestions on how I could achieve that?

Update: buf_size is declared in a package: constant buf_size : natural := 5;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity deriv_buffer is
  generic(
    NSAMPLES : natural := 16
  );
  port(   
    clk : in std_logic;
    rst : in std_logic;
    deriv_s : in t_deriv_array( NSAMPLES - 1 downto 0 );
    deriv_buf : out t_deriv_array( buf_size * NSAMPLES - 1 downto 0 )
  );
end deriv_buffer;

architecture Behavioral of deriv_buffer is

  signal deriv_buf_s : t_deriv_array( (buf_size-1) * NSAMPLES - 1 downto 0 );
  attribute ram_style : string;
  attribute ram_style of deriv_buf_s : signal is "block";

begin

  deriv_buf( buf_size * NSAMPLES - 1 downto (buf_size - 1) * NSAMPLES ) <= deriv_s;

  buffer_p : process( rst, clk )
  begin
    if rst = '1' then
      deriv_buf_s <= ( others => ( others => '0' ) );
    elsif rising_edge( clk ) then
      deriv_buf_s( (buf_size - 1) * NSAMPLES - 1 downto (buf_size - 2) * NSAMPLES ) <= deriv_s;
      deriv_buf_s( (buf_size - 2) * NSAMPLES - 1 downto (buf_size - 3) * NSAMPLES ) <= deriv_buf_s( (buf_size - 1) * NSAMPLES - 1 downto (buf_size - 2) * NSAMPLES );
      deriv_buf_s( (buf_size - 3) * NSAMPLES - 1 downto (buf_size - 4) * NSAMPLES ) <= deriv_buf_s( (buf_size - 2) * NSAMPLES - 1 downto (buf_size - 3) * NSAMPLES );
      deriv_buf_s( (buf_size - 4) * NSAMPLES - 1 downto (buf_size - 5) * NSAMPLES ) <= deriv_buf_s( (buf_size - 3) * NSAMPLES - 1 downto (buf_size - 4) * NSAMPLES );
    end if;
  end process buffer_p;

  deriv_buf( (buf_size-1)*NSAMPLES - 1 downto 0 ) <= deriv_buf_s;

end Behavioral;

Solution

If you want to use a block RAM, you need to consider that a block RAM has only 2 ports. You cannot look freely into the data in the RAM: every access has to go through one of those two ports.

Furthermore, reading and/or writing takes a clock cycle to process.

So if we look at your code, it already starts out problematically:

entity deriv_buffer is
    [...]
    port(
        [...]
        deriv_buf : out t_deriv_array( buf_size * NSAMPLES - 1 downto 0 )

You have your whole RAM connected to an output port! I don't know what you are doing with the contents in the entity using this component, but as I said: you don't have free access to the contents of a block RAM. You need to follow proper block RAM design guidelines.

For proper block RAM coding, refer for instance to the Xilinx Synthesis User Guide (Chapter 4, HDL Coding Techniques, section RAM HDL Coding Techniques).
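As a rough illustration of what such a coding style usually looks like, here is a minimal sketch of a simple dual-port RAM (one write port, one read port). The entity name sdp_ram, the generics and the port names are made up for this example and do not come from your design. Whether the tool actually maps it to block RAM also depends on the memory size and the tool settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: one write port, one read port.
entity sdp_ram is
  generic(
    DATA_WIDTH : natural := 16;
    ADDR_WIDTH : natural := 7
  );
  port(
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  unsigned( ADDR_WIDTH - 1 downto 0 );
    wdata : in  std_logic_vector( DATA_WIDTH - 1 downto 0 );
    raddr : in  unsigned( ADDR_WIDTH - 1 downto 0 );
    rdata : out std_logic_vector( DATA_WIDTH - 1 downto 0 )
  );
end sdp_ram;

architecture rtl of sdp_ram is

  type t_ram is array ( 0 to 2**ADDR_WIDTH - 1 ) of std_logic_vector( DATA_WIDTH - 1 downto 0 );
  signal ram : t_ram := ( others => ( others => '0' ) );
  attribute ram_style : string;
  attribute ram_style of ram : signal is "block";

begin

  -- No reset on the memory array, one addressed write and one
  -- addressed, registered read per clock cycle.
  ram_p : process( clk )
  begin
    if rising_edge( clk ) then
      if we = '1' then
        ram( to_integer( waddr ) ) <= wdata;
      end if;
      rdata <= ram( to_integer( raddr ) );  -- data appears one cycle after raddr
    end if;
  end process ram_p;

end rtl;
```

Note that the whole array is never exposed on a port: only one word is written and one word is read per clock.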

Next problem: reset

if rst = '1' then
    deriv_buf_s <= ( others => ( others => '0' ) );

Resetting a RAM like this is not possible: a block RAM has no reset that clears its contents. If you really want to clear the RAM, you need to write (others=>'0') to each address location separately, which requires control logic. As written, this reset code prevents a block RAM from being inferred.
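Such control logic could look roughly like the sketch below: a counter walks through all addresses and takes over the write port while clearing is in progress. All names (ram_with_clear, clear, busy, and so on) are assumptions for illustration, and a clear takes 2**ADDR_WIDTH clock cycles, during which normal writes are ignored.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: RAM whose contents are cleared address by address.
entity ram_with_clear is
  generic(
    DATA_WIDTH : natural := 16;
    ADDR_WIDTH : natural := 7
  );
  port(
    clk   : in  std_logic;
    clear : in  std_logic;  -- synchronous request to start clearing
    busy  : out std_logic;  -- '1' while the clear is running
    we    : in  std_logic;
    waddr : in  unsigned( ADDR_WIDTH - 1 downto 0 );
    wdata : in  std_logic_vector( DATA_WIDTH - 1 downto 0 );
    raddr : in  unsigned( ADDR_WIDTH - 1 downto 0 );
    rdata : out std_logic_vector( DATA_WIDTH - 1 downto 0 )
  );
end ram_with_clear;

architecture rtl of ram_with_clear is

  type t_ram is array ( 0 to 2**ADDR_WIDTH - 1 ) of std_logic_vector( DATA_WIDTH - 1 downto 0 );
  signal ram : t_ram;

  signal clearing : std_logic := '0';
  signal clr_addr : unsigned( ADDR_WIDTH - 1 downto 0 ) := ( others => '0' );

  signal wr_en   : std_logic;
  signal wr_addr : unsigned( ADDR_WIDTH - 1 downto 0 );
  signal wr_data : std_logic_vector( DATA_WIDTH - 1 downto 0 );

begin

  busy <= clearing;

  -- While clearing, the counter owns the write port and writes zeros.
  wr_en   <= '1'               when clearing = '1' else we;
  wr_addr <= clr_addr          when clearing = '1' else waddr;
  wr_data <= ( others => '0' ) when clearing = '1' else wdata;

  -- Clear sequencer: steps through every address once.
  clear_p : process( clk )
  begin
    if rising_edge( clk ) then
      if clear = '1' then
        clearing <= '1';
        clr_addr <= ( others => '0' );
      elsif clearing = '1' then
        if clr_addr = 2**ADDR_WIDTH - 1 then
          clearing <= '0';  -- last address is being written this cycle
        end if;
        clr_addr <= clr_addr + 1;
      end if;
    end if;
  end process clear_p;

  -- The memory itself has no reset.
  ram_p : process( clk )
  begin
    if rising_edge( clk ) then
      if wr_en = '1' then
        ram( to_integer( wr_addr ) ) <= wr_data;
      end if;
      rdata <= ram( to_integer( raddr ) );
    end if;
  end process ram_p;

end rtl;
```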

Then in your code you have the part

deriv_buf_s( (buf_size - 1) * NSAMPLES - 1 downto (buf_size - 2) * NSAMPLES ) <= deriv_s;
deriv_buf_s( (buf_size - 2) * NSAMPLES - 1 downto (buf_size - 3) * NSAMPLES ) <= deriv_buf_s( (buf_size - 1) * NSAMPLES - 1 downto (buf_size - 2) * NSAMPLES );
deriv_buf_s( (buf_size - 3) * NSAMPLES - 1 downto (buf_size - 4) * NSAMPLES ) <= deriv_buf_s( (buf_size - 2) * NSAMPLES - 1 downto (buf_size - 3) * NSAMPLES );
deriv_buf_s( (buf_size - 4) * NSAMPLES - 1 downto (buf_size - 5) * NSAMPLES ) <= deriv_buf_s( (buf_size - 3) * NSAMPLES - 1 downto (buf_size - 4) * NSAMPLES );

This code has two big issues:

1. You try to read a value and write it back within one clock cycle. As said above, it takes one clock cycle to read the block RAM and a second clock cycle to write.
2. This code requires 4 write ports and 3 read ports on the same memory. As said above, a block RAM has only 2 ports.

You could split the buffer across 4 block RAM instances. But then all the ports of these block RAMs would still be occupied by the shifting, so no port would be left to provide the random access to the data that you want.

In conclusion: I think you should reconsider your requirement. What you want is not possible in block RAM. If you want to use block RAM, you should change your algorithm.
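One possible way to change the algorithm, sketched here under my own assumptions rather than taken from your design, is a circular buffer: instead of moving the data, a write pointer moves, and one read port fetches a single entry at a chosen distance behind that pointer. The names tap_buffer, din_valid and tap are hypothetical, and the depth is assumed to be a power of two so that the address arithmetic wraps around by itself. You only get one entry per clock cycle, one cycle after requesting it, not the whole buffer at once.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: circular buffer with one selectable read tap.
entity tap_buffer is
  generic(
    DATA_WIDTH : natural := 16;
    ADDR_WIDTH : natural := 7    -- depth is 2**ADDR_WIDTH entries
  );
  port(
    clk       : in  std_logic;
    din_valid : in  std_logic;
    din       : in  std_logic_vector( DATA_WIDTH - 1 downto 0 );
    tap       : in  unsigned( ADDR_WIDTH - 1 downto 0 );  -- 0 = newest stored entry
    dout      : out std_logic_vector( DATA_WIDTH - 1 downto 0 )
  );
end tap_buffer;

architecture rtl of tap_buffer is

  type t_ram is array ( 0 to 2**ADDR_WIDTH - 1 ) of std_logic_vector( DATA_WIDTH - 1 downto 0 );
  signal ram : t_ram;
  attribute ram_style : string;
  attribute ram_style of ram : signal is "block";

  -- Address of the next entry to be written.
  signal wr_ptr : unsigned( ADDR_WIDTH - 1 downto 0 ) := ( others => '0' );

begin

  buffer_p : process( clk )
  begin
    if rising_edge( clk ) then
      -- Write port: store the new entry and advance the pointer.
      if din_valid = '1' then
        ram( to_integer( wr_ptr ) ) <= din;
        wr_ptr <= wr_ptr + 1;
      end if;
      -- Read port: entry stored 'tap' writes before the newest one
      -- already in the RAM. The subtraction wraps modulo the depth.
      dout <= ram( to_integer( wr_ptr - 1 - tap ) );
    end if;
  end process buffer_p;

end rtl;
```

To buffer a whole set of NSAMPLES values per clock, as in your entity, DATA_WIDTH could be set to NSAMPLES times the sample width, so that one RAM word holds one complete input vector.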