Access four elements of an array at the same time in VHDL

How can I access four elements of a 2D array (or an array of arrays) in one process at the same time? In this sample I am trying to read several elements of intg1 in the same clock cycle, and synthesis is taking forever.

type img_whole is array (78 downto 0, 130 downto 0) of std_logic_vector(7 downto 0);
signal img1 : img_whole;

signal i1_1 : integer range 0 to 79 := 0;
signal j1_1 : integer range 0 to 131 := 0;

type intg is array (78 downto 0, 130 downto 0) of integer range 0 to 1751998; --no double??
signal intg1 : intg;

integral : process (clka, finished, finished1)
    variable tempo : integer range 0 to 1751998;
begin
    if clka'event and clka = '1' then
        if finished = "1" and finished1 = "0" then
            if i1_1 < 78 and j1_1 < 130 then
                j1_1 <= j1_1 + 1;
            elsif j1_1 = 130 and i1_1 < 78 then
                j1_1 <= 0;
                i1_1 <= i1_1 + 1;
            elsif j1_1 < 130 and i1_1 = 78 then
                j1_1 <= j1_1 + 1;
            elsif j1_1 = 130 and i1_1 = 78 then
                finished1 <= "1";
            end if;

            tempo := to_integer(unsigned('0' & img1(i1_1, j1_1)));

            if i1_1 - 1 >= 0 then
                tempo := intg1(i1_1 - 1, j1_1) + tempo;
            end if;
            if j1_1 - 1 >= 0 then
                tempo := intg1(i1_1, j1_1 - 1) + tempo;
            end if;
            if i1_1 - 1 >= 0 and j1_1 - 1 >= 0 then
                tempo := tempo - intg1(i1_1 - 1, j1_1 - 1);
            end if;

            intg1(i1_1, j1_1) <= tempo;
        end if;
    end if;
end process;

The code computes an integral image from a 2D array.

Solution

There are both functional and synthesis issues in the code.

Functional issues:

finished1 is only driven to '1' in the process and never back to '0'. If its initial value is '0', the operation in the process can therefore run only once after power-up, because finished1 = '1' then blocks further updates through the process enable condition.
i1_1 and j1_1 are signals that are assigned at the start of the process and then read later in the same process. Because they are signals, the value assigned with <= does not become visible until the process has suspended, so the later reads still see the old values. Is that intentional?
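
The difference between signal and variable assignment can be seen in a minimal sketch. The entity sig_vs_var and all names in it are hypothetical and only serve as an illustration; they are not part of the design in the question.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical demo entity, not part of the original design.
entity sig_vs_var is
  port (
    clk     : in  std_logic;
    sig_out : out integer range 0 to 255;
    var_out : out integer range 0 to 255
  );
end entity;

architecture rtl of sig_vs_var is
  signal cnt_s : integer range 0 to 255 := 0;
begin
  process (clk)
    variable cnt_v : integer range 0 to 255 := 0;
  begin
    if rising_edge(clk) then
      cnt_s <= (cnt_s + 1) mod 256;  -- scheduled; visible only after the process suspends
      cnt_v := (cnt_v + 1) mod 256;  -- takes effect immediately

      sig_out <= cnt_s;  -- reads the old value of cnt_s
      var_out <= cnt_v;  -- reads the already incremented value of cnt_v
    end if;
  end process;
end architecture;
```

In simulation, sig_out lags var_out by one count, which is the same effect that i1_1 and j1_1 have on the table indexing in the question.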

Use a simulator to ensure correct functionality, which can be done before synthesis.

Synthesis issues:

intg1 is a table with 79 * 131 = 10349 entries (more than 10 K), each needing ceil(log2(1751999)) = 21 bits, so it is a fairly large table of roughly 217 Kbit. The design requires an asynchronous lookup in the table, because there is no extra cycle (clock edge) between a new index value, e.g. i1_1, and the point where the process output is generated from the table lookup. An asynchronous lookup in a large table requires a huge mux network, which is probably the reason for the long synthesis time. The lookup is even done several times in the same cycle with different index values.
Minor: finished and finished1 are not needed in the sensitivity list of the process, since the process is clocked by clka.
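
The size of the table can be checked in a simulator with a small sketch like the one below. The entity intg_size_check and the function bits_for are hypothetical helpers introduced here for illustration; the dimensions and the maximum value are taken from the declarations in the question.

```vhdl
-- Hypothetical simulation-only helper, not part of the original design.
entity intg_size_check is
end entity;

architecture sim of intg_size_check is
  -- Number of bits needed to represent values 0 .. max_value.
  function bits_for(max_value : natural) return positive is
    variable v : natural  := max_value;
    variable n : positive := 1;
  begin
    while v > 1 loop
      v := v / 2;
      n := n + 1;
    end loop;
    return n;
  end function;

  constant ROWS       : positive := 79;       -- 78 downto 0
  constant COLS       : positive := 131;      -- 130 downto 0
  constant MAX_SUM    : natural  := 1751998;  -- upper bound of the intg element range
  constant ENTRIES    : positive := ROWS * COLS;        -- 10349
  constant ENTRY_BITS : positive := bits_for(MAX_SUM);  -- 21
begin
  process
  begin
    report "entries = " & integer'image(ENTRIES) &
           ", bits per entry = " & integer'image(ENTRY_BITS) &
           ", total bits = " & integer'image(ENTRIES * ENTRY_BITS);
    wait;
  end process;
end architecture;
```

Every one of those bits has to be reachable through a multiplexer for each asynchronous read in the original process, which gives a feeling for the amount of logic the synthesis tool is asked to build.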

The above list of issues may not be complete.

To fix the table lookup problem (the first synthesis issue), make a pipelined design with cycles such as:

1. Index values i1_1 etc. are generated.
2. The intg1 table is looked up synchronously.
3. The intermediate tempo is generated, and intg1 is updated.

The current design does steps 2 and 3 in a single cycle, so it is not possible to make a synchronous lookup in the table: there is only one clock edge in the cycle, and it is used for writing back to the intg1 table. By splitting the lookup and the write-back into two cycles, there is a clock edge both for reading the table (synchronous read) and for writing it. A synchronous read on a clock edge maps much better onto the hardware resources of typical FPGAs, since these contain large synchronous RAMs suited to tables like intg1, so the implementation will be smaller and faster.

The synchronous intg1 lookup is made by simply adding a clocked process in which signals are driven directly by the intg1 output for the required index values. All the required reads must be made; the subsequent process can then determine which of the read values are actually used.
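
A minimal sketch of the read/write split could look as follows. It is an assumption-laden illustration, not a finished design: the entity integral_sketch, the ports start and done (standing in for finished and finished1), the phase signal, the registers pix_r, up_r, left_r and diag_r, and the placeholder image contents are all hypothetical. For brevity the index update is folded into the write cycle, and the border checks are done at read time instead of after the reads.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch entity; start/done stand in for finished/finished1.
entity integral_sketch is
  port (
    clka  : in  std_logic;
    start : in  std_logic;
    done  : out std_logic
  );
end entity;

architecture rtl of integral_sketch is
  type img_whole is array (78 downto 0, 130 downto 0) of std_logic_vector(7 downto 0);
  type intg is array (78 downto 0, 130 downto 0) of integer range 0 to 1751998;

  -- Placeholder contents so the sketch simulates; the real image comes from elsewhere.
  signal img1  : img_whole := (others => (others => x"01"));
  signal intg1 : intg      := (others => (others => 0));

  signal i1_1 : integer range 0 to 78  := 0;
  signal j1_1 : integer range 0 to 130 := 0;

  type phase_t is (RD, WR, FIN);
  signal phase : phase_t := RD;

  -- Registered read results (the synchronous lookups).
  signal pix_r                  : integer range 0 to 255     := 0;
  signal up_r, left_r, diag_r   : integer range 0 to 1751998 := 0;
begin
  done <= '1' when phase = FIN else '0';

  process (clka)
  begin
    if rising_edge(clka) then
      case phase is
        when RD =>
          if start = '1' then
            -- Read cycle: register the pixel and the three neighbours.
            pix_r <= to_integer(unsigned(img1(i1_1, j1_1)));
            if i1_1 > 0 then
              up_r <= intg1(i1_1 - 1, j1_1);
            else
              up_r <= 0;
            end if;
            if j1_1 > 0 then
              left_r <= intg1(i1_1, j1_1 - 1);
            else
              left_r <= 0;
            end if;
            if i1_1 > 0 and j1_1 > 0 then
              diag_r <= intg1(i1_1 - 1, j1_1 - 1);
            else
              diag_r <= 0;
            end if;
            phase <= WR;
          end if;

        when WR =>
          -- Write cycle: combine the registered values and write back.
          intg1(i1_1, j1_1) <= pix_r + up_r + left_r - diag_r;
          -- Advance the indices for the next read cycle.
          if j1_1 < 130 then
            j1_1  <= j1_1 + 1;
            phase <= RD;
          elsif i1_1 < 78 then
            j1_1  <= 0;
            i1_1  <= i1_1 + 1;
            phase <= RD;
          else
            phase <= FIN;
          end if;

        when FIN =>
          null;  -- stay here until reset logic (not shown) restarts the scan
      end case;
    end if;
  end process;
end architecture;
```

This takes two clock cycles per pixel, and each read happens after the previous write has completed, so no stale value is read. Whether a synthesis tool maps a two-dimensional array with several read ports onto block RAM depends on the tool; flattening the table to a one-dimensional array or duplicating it per read port may be needed.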

The specific pipeline implementation must be adapted to the design requirements.