I am trying to implement this CUDA example:
http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/
Because I have 0x4000 bytes of shared memory available, I tried to use TILE_DIM = 128, so that
__shared__ unsigned char tile[TILE_DIM][TILE_DIM];
would be of size 0x4000 bytes = 16384 bytes = 128*128 bytes.
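For reference, the kernel presumably follows the transposeCoalesced example from that post, with float swapped for unsigned char. A minimal sketch, assuming the blog's BLOCK_ROWS = 8 and its two-pointer signature (the mangled name in the error below suggests my actual version takes a single pointer):

#define TILE_DIM   128
#define BLOCK_ROWS 8   // assumed, as in the blog post

// Coalesced transpose as in the linked post, with unsigned char elements;
// the static tile alone is 128*128 = 16384 = 0x4000 bytes of shared memory.
__global__ void transposeCoalesced(unsigned char *odata, const unsigned char *idata)
{
    __shared__ unsigned char tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;   // swap block offsets for the transpose
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}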
However, this gives me the following error:
CUDACOMPILE : ptxas error : Entry function '_Z18transposeCoalescedPh' uses too much shared data (0x4018 bytes, 0x4000 max)
So I have 0x18 (24) extra bytes in shared memory. Where do they come from, and is it possible to remove them?
I could compile for compute capability 2.0 or higher to remove the error (my hardware is compute capability 3.0), but that would use memory from the L1 cache, which is supposedly slower.
So I have 0x18 (24) extra bytes in shared memory. Where do they come from, and is it possible to remove them?
Referring to the programming guide:
The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory, the amount of dynamically allocated shared memory, and for devices of compute capability 1.x, the amount of shared memory used to pass the kernel's arguments (see __noinline__ and __forceinline__).
As long as you compile for a cc1.x architecture, you won't be able to eliminate the use of shared memory to carry kernel parameters.
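The arithmetic is consistent with that: 0x4018 - 0x4000 = 0x18 = 24 bytes, which could be, for example, 16 bytes of implicit launch bookkeeping plus one 8-byte pointer argument (that exact breakdown is a guess on my part). You can inspect the per-architecture accounting by making ptxas verbose (file name assumed):

nvcc -arch=sm_13 -Xptxas -v transpose.cu

The "bytes smem" figure it reports for an sm_1x target should reflect the parameter overhead; repeat with -arch=sm_30 and it should drop back to 0x4000 (16384) bytes.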
I think the solution, as you've already indicated, is to compile for a cc2.0 or cc3.0 architecture; on those devices, kernel arguments are passed in constant memory rather than shared memory, so the overhead disappears. It's not clear why you wouldn't want to do this.
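For example (file name assumed):

nvcc -arch=sm_30 -Xptxas -v transpose.cu -o transpose

As for the L1 concern: on cc2.x/3.x devices, shared memory and L1 are carved out of the same 64 KB of on-chip storage, and you can hint the runtime to favor shared memory for this kernel before launching it. A sketch, using the transposeCoalesced declaration assumed above:

// Host-side hint (cc2.0+): request the 48 KB shared / 16 KB L1 split.
cudaFuncSetCacheConfig(transposeCoalesced, cudaFuncCachePreferShared);

Either way, the 24 parameter bytes no longer count against shared memory, and your tile fits in the 0x4000 bytes you budgeted.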