Cache miss rate of array

I'm trying to figure out how to calculate the miss rate of an array. I have the answer, but I'm not understanding how the answer was arrived at. I have the following code:

int C[N1][N2];
int A[N1][N3];
int B[N3][N2];

initialize_arrays(A, B, C, N1, N2, N3);

for(i=0; i<N1; ++i)
   for(j=0; j<N2; ++j)
      for(k=0; k<N3, ++k)
         C[i][j] += A[i][k] * B[k][j];

I also have the following info: N1=N2=N3=2048 (what does this mean??). The processor has an L1 data cache of 32kB with line size of 64B (no L2 cache). (what is line size?)

I know the miss rate of array C is (N^2/16)/N^3. I know the formula is (total misses)/(total accesses). I see that there are N^3 total accesses, but how did they get the total misses?

I also know the miss rate of array B: (N^3)/(N^3) and A: (N^2/16)/N^3. Could someone please explain to me how they got the total misses here too?

Solution

Cache is always accessed / managed with the granularity of a line. Here the line size is 64B. That means that in one cache line there will be 16 elements in one cache line (64B / 4B = 16). Next thing that is important here is every 17th element is in a new line, and once a line is brought to cache, it can possibly give 15 hits.

Having got that out of the way, let us look at the access pattern. A[0][0], B[0][0], C[0][0], A[0][1], B[1][0], C[0][0], A[0][2], B[2][0], C[0][0], ...... A[0][16], B[16][0], C[0][0], A[0][17], B[17][0], C[0][0], ...... and so on

Now Since a line has 16 elements and it is safe to assume a row-major placement of elements in the memory A[0][0] - A[0][15] will belong to one the first cache line (let us call this AL1) and similarly let B[0][0] - B[0][15] belong to line BL1. From this information we can write the access pattern in terms of cache lines. AL1, BL1, CL1, AL1, BL129, CL1, AL1, BL257, CL1, ..... AL1, BL2049, CL1, AL2, BL2176, CL1,.... and so on.

Now the cache size is 32kB that means the cache can hold 512 lines. Since this is L1, let us assume it is a two-way associative cache. Therefore there are 256 sets. This means that after every 256 lines of an array, the 257th line maps to the same set as the 1st line.

The cache contents upon access to lines mapping to set 1 looks like this

1st Iteration - inner

Access to - AL1 - Miss
 AL1 
Access to - BL1 - Miss
 AL1
 BL1
Access to - CL1 - Miss
 CL1
 BL1

2nd Iteration - inner

Access to - AL1 - Miss
 CL1
 AL1
Access to - CL1 - Hit
 CL1
 AL1

3rd Iteration - inner

Access to - AL1 - Hit
 CL1
 AL1
Access to - BL257 - Miss
 BL257
 AL1
Access to - CL1 - Miss
 BL257
 CL1

4th Iteration - Inner

Access to - AL1 - Miss
 AL1
 CL1
Access to - CL1 - Hit
 AL1
 CL1

5th Iteration - Inner

Access to - AL1 - Hit
 CL1
 AL1
Access to - BL513 - Miss
 BL513
 AL1
Access to - CL1 - Miss
 BL513
 CL1

. . .

16th Iteration - Inner

Access to - AL1 - Miss
 AL1
 CL1
Access to - CL1 - Hit
 AL1
 CL1

17th Iteration - Inner

Access to - BL2049 - Miss
 BL2049
 CL1
Access to - CL1 - Hit
 BL2049
 CL1

18th Iteration - Inner

Access to - CL1 - Hit
 BL2049
 CL1

For the 1st iteration of the middle loop. We have for set 1,

A -> M, M, H, M, H, M, H, M, H, M, H, M, H , M, H, M, ~ , ~ , ~ , ~ , ....... 
=> after 2048 iterations - 7 hits 9 misses, 16 accesses

B -> M, ~ , M, ~ ,  M, ~ , ........
=> after 2048 iterations - 0 Hits, 1024 misses, 1024 accesses

C -> M, H, M, H, M, H, M, H, M, H, M, H, M, H, M, H, H , H, H , H, ......
=> after 2048 iterations - 2040 hits, 8 misses, 2048 accesses

for 2nd - 16th iteration of the middle loop,

Exact hit / miss / access pattern as before.

for 17th - 2048th iteration of the middle loop,

A -> same as before
B -> No accesses
C -> No accesses

To summarize - for 1 iteration of the outer loop, we have for set 1,

A -> N*9 misses , N * 16 accesses
B -> N/2 * 16 Misses , N/2 * 16 accesses
C -> 8 * 16 Misses , N * 16 accesses

In the outer loop, we can see that every alternate iteration will have the same hit / miss / access pattern as the 1st iteration (summarized above) for arrays C and A. Every iteration of the outer loop will have the same hit / miss / access pattern a the 1st iteration for array B.

Hence, (Finally!)

For Set 1, at the end of the program, we have

A -> N^2 * 9 / 2 misses, N^2 * 8 accesses
B -> N^2 * 8 misses, N^2 * 8 accesses
C -> N * 64 misses, N^2 * 8 accesses

The access pattern will be similar across all sets. Therefore by the end of the program we have, across all sets

A -> N^2 * 1152 misses, N^3 accesses
B -> N^3 misses, N^3 accesses
C -> N^2 * 8 misses, N^3 accesses

I know this is different from what is indicated in the question. I could not figure out how. I will be glad to hear other responses.