VHDL Place and route path analysis

my problem is that when I implement my design using Xilinx ISE 14.7 + XPS I often obtain a very different number of analyzed paths in the static timing analysis, also having very few differences in the .vhd files. In particular, the only file that I change (or that I think to change...) is something like:

entity my_entity is(
    ...
    data_in : in std_logic_vector(N*B-1 downto 0);
    ...
);
end entity my_entity;

architecture bhv of my_entity is
    signal data : std_logic_vector(B-1 downto 0);
    signal idx_vect : std_logic_vector(log2(N)-1 downto 0);
    signal idx : integer range 0 to N-1;
    ...
begin
    process(clk)
    begin
        if(rising_edge(clk))then
            idx_vect <= idx_vect + 1;
        end if;
    end process;

    idx <= to_integer(unsigned(idx_vect));

    data <= data_in((idx+1)*B-1 downto idx*B);

end architecture bhv;

I'm not sure the problem comes from here, but I'm not finding any other possible cause to a decrease of five times in the number of analyzed paths. Are there some syntax that one must avoid in order to obtain a correct implementation? Is it possible that indexing an array using an integer (as in the example codec) breaks up in some way the paths, making them not analyzed?

The code change is something like:

process(shift_reg, data_in)
    for i in range 0 to N-1 loop
        if(shift_reg(i) = '1')then
            data <= data_in((i+1)*B-1 downto i*B);
        end if;
    end loop;
end process;

in which instead of increment idx_vect I have a circular one-hot shift register of N bits. Thanks in advance.

Solution

The coding style of the multiplexer at this line

data <= data_in((idx+1)*B-1 downto idx*B);

can heavily influence the logic synthesis. This results in very different number of paths to analyze for timing.

The original multiplexer

I first checked the synthesis of the above line using this small example:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity mux1 is
    generic (
        B : positive := 32;
        M : positive := 7); -- M := ceil(log_2 N)
    port (
        d : in  STD_LOGIC_VECTOR ((2**M)*B-1 downto 0); -- input data
        s : in  STD_LOGIC_VECTOR (M-1 downto 0);        -- selector
        y : out  STD_LOGIC_VECTOR(B-1 downto 0));       -- result
end mux1;

architecture Behavioral of mux1 is
    constant N : positive := 2**M;
    signal idx : integer range 0 to N-1;
begin
    idx <= to_integer(unsigned(s));
    y <= d((idx+1)*B-1 downto idx*B);
end Behavioral;

If one synthesizes this for a Spartan-6, XST reports this (excerpt):

Macro Statistics
# Adders/Subtractors                                   : 2
 13-bit subtractor                                     : 1
 8-bit adder                                           : 1
...
 Number of Slice LUTs:                 1516  out of   5720    26%  
...
Timing constraint: Default path analysis
  Total number of paths / destination ports: 139264 / 32

Thus, no multiplexer was detected and the timing analyzer has to analyze a huge number of paths. The logic utilization is ok.

Optimized implementation

The same multiplexing can be achieved with: (EDIT: bugfix and simplification)

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity mux2 is
    generic (
        B : positive := 32;
        M : positive := 7); -- M := ceil(log_2 N)
    port (
        d : in  STD_LOGIC_VECTOR ((2**M)*B-1 downto 0);
        s : in  STD_LOGIC_VECTOR (M-1 downto 0);
        y : out  STD_LOGIC_VECTOR(B-1 downto 0));
end mux2;

-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-- !! The entire architecture has been FIXED and simplified. !!
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
architecture Behavioral of mux2 is
    constant N : positive := 2**M;
    type matrix is array (N-1 downto 0) of std_logic_vector(B-1 downto 0);
    signal dd : matrix;
begin
    -- reinterpret 1D vector 'd' as 2D matrix, i.e.
    -- row 0 holds d(B-1 downto 0) which is selected in case s = 0
    row_loop: for row in 0 to N-1 generate
        dd(row) <= d((row+1)*B-1 downto row*B);
    end generate;

    -- select the requested row
    y <= dd(to_integer(unsigned(s)));
end Behavioral;

Now, the XST report looks much better:

Macro Statistics
# Multiplexers                                         : 1
 32-bit 128-to-1 multiplexer                           : 1
...
 Number of Slice LUTs:                 1344  out of   5720    23%  
...
Timing constraint: Default path analysis
  Total number of paths / destination ports: 6816 / 32

It detects that for each output-bit a 128-to-1 multiplexer is required. The optimized synthesis of such a wide multiplexer is built-in to the synthesis tool. The number of LUTs is only reduced slightly. But, the number of paths to be processed by the timing analyzer is reduced dramatically by a factor of 20!

Implementation using one-hot selector

The above examples use a binary-encoded selector signal. I checked also the variant with the one-hot encoded one:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity mux3 is
    generic (
        B : positive := 32;
        N : positive := 128);
    port ( d : in  STD_LOGIC_VECTOR (N*B-1 downto 0);
           s : in  STD_LOGIC_VECTOR (N-1 downto 0);
           y : out  STD_LOGIC_VECTOR(B-1 downto 0));
end mux3;

architecture Behavioral of mux3 is

begin
    process(d, s)
    begin
        y <= (others => '0'); -- avoid latch!
        for i in 0 to N-1 loop
            if s(i) = '1' then
                y <= d((i+1)*B-1 downto i*B);
            end if;
        end loop;
    end process;

end Behavioral;

Now, the XST report is different again:

Macro Statistics
# Multiplexers                                         : 128
 32-bit 2-to-1 multiplexer                             : 128
...
Number of Slice LUTs:                 2070  out of   5720    36%  
...
Timing constraint: Default path analysis
  Total number of paths / destination ports: 13376 / 32

2-to-1 multiplexer are detected, because a priority mux analog to this scheme was described:

if s(127) = '1' then
  y <= d(128*B-1 downto 127*B);
else
  if s(126) = '1' then
    y <= d(127*B-1 downto 126*B);
  else
    ...
                             if s(0) = '1' then
                               y <= d(B-1 downto 0);
                             else
                               y <= (others => '0');
                             end if;
  end if; -- s(126)
end if; -- s(127)

I have not used elsif here for didactical reasons. Each if-else stage is a 32-bit wide 2-to-1 mutiplexer. The problem here is, that the synthesis does not know, that s is a one-hot encoded signal. Thus, a little more logic is required as in my optimized implementation.

The number of paths to analyze for timing changes again significantly. The number is 10 times lower than in the original implementation, but 2 times higher than in my optimized one.