Does accessing arrays slow down my function?

I have 2 functions:

unsigned long long getLineAsRow(unsigned long long board, int col) {
    unsigned long long column = (board >> (7-(col - 1))) & col_mask_right;
    column *= magicLineToRow;
    return (column >> 56) & row_mask_bottom;
}

unsigned long long getDiagBLTR_asRow(unsigned long long board, int line, int row) {
    unsigned long long result = board & diagBottomLeftToTopRightPatterns[line][row];
    result = result << diagBLTR_shiftUp[line][row];
    result = (result * col_mask_right) >> 56;
    return result;
}

The only big difference I see is the access to a 2-dim-array. Defined like

int diagBRTL_shiftUp[9][9] = {};

I call both functions 10.000.000 times:

getLineAsRow ... time used: 1.14237s
getDiagBLTR_asRow ... time used: 2.18997s

I tested it with cl (vc++) and g++. Nearly no difference. It is a really huge difference, do you have any advice?

Solution

The question what creates the difference between the execution times of your two functions really cannot be answered without knowing either the resulting assembler code or which of the globals you are accessing are actually constants that can be compiled right into the code. Anyway, analyzing your functions, we see that

function 1
1. reads two arguments from stack, returns a single value
2. reads three globals, which may or may not be constants
3. performs six arithmetic operations (the two minuses in 7-(col-1) can be collapsed into a single subtraction)
function 2
1. reads three arguments from stack, returns a single value
2. reads one global, which may or may not be a constant
3. dereferences two pointers (not four, see below)
4. does five arithmetic operations (three which you see, two which produce the array indices)

Note that accesses to 2D arrays actually boil down to a single memory access. When you write diagBottomLeftToTopRightPatterns[line][row], your compiler transforms it to something like diagBottomLeftToTopRightPatterns[line*9 + row]. That's two extra arithmetic instructions, but only a single memory access. What's more, the result of the calculation line*9 + row can be recycled for the second 2D array access.

Arithmetic operations are fast (on the order of a single CPU cycle), reads from memory may take four to twenty CPU cycles. So I guess that the three globals you access in function 1 are all constants which your compiler built right into the assembler code. This leaves function 2 with more memory accesses, making it slower.

However, one thing bothers me: If I assume you have a normal CPU with at least 2 GHz clock frequency, your times suggest that your functions consume more than 200 or 400 cycles, respectively. This is significantly more than expected. Even if your CPU has no values in cache, your functions shouldn't take more than roughly 100 cycles. So I would suggest to take a second look at how you are timing your code, I assume that you have some more code in your measuring loop which spoils your results.