
Consecutive 1-byte writes to global memory result in multiple transactions


I'm doing a project where each thread writes a 1-byte value to global memory, and I'm trying to minimize global memory write latency.

In section 5.3.2 of the CUDA C Programming Guide (link) it says:

Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

So I thought consecutive 1-byte writes to global memory should be handled in one transaction, given that they are properly aligned.

But when I profile the following minimal example with Nsight in Visual Studio, although it takes 1 transaction (as expected) for 4 threads writing to consecutive 1-byte addresses, in the case of 5 threads it takes 2 transactions.

__global__ void copyKernel(const unsigned char* a, unsigned char* b)
{
    int i = threadIdx.x;
    b[i] = a[i]; // each thread copies one byte from a to b
}

int main()
{
    unsigned char *d_a;
    unsigned char *d_b;

    // ... (cudaMalloc, etc.)

    // to check that the addresses are aligned
    printf("%p\n", d_a); // aligned to a 512-byte boundary
    printf("%p\n", d_b); // aligned to a 512-byte boundary

    // copy 4 elements
    copyKernel<<<1, 4>>>(d_a, d_b);

    // copy 5 elements
    copyKernel<<<1, 5>>>(d_a, d_b);

    // ...
}

The profiling results are as follows. (left - 4 threads / right - 5 threads)

[Nsight screenshot]

What am I missing here? How should I write my code to make it perform writes in one transaction?

Environment: Windows 10, Visual Studio 2015, GeForce GTX 1080 (cc 6.1)


Solution

  • It seems that I was looking at the results from the wrong experiment. Nsight provides a number of experiments to "Profile CUDA Application", and the images posted in the question were from the "Memory Statistics - Global" experiment. The Nsight User Guide says the "Global" experiment reports the following data:

    The Transactions Per Request chart shows the average number of L1 transactions required per executed global memory instruction, separately for load and store operations.

    So the number of write transactions shown in the "Global" experiment was actually the number of transactions to the L1 cache, not to L2 (even though the Nsight UI labels it as L2).

    On the other hand, "Memory Statistics - Caches" seems to show the number of L2 transactions, which is the data more relevant to what I was looking for. The numbers there were the same as those given in Robert Crovella's comment.

    Test result for 1M threads:

    [Nsight screenshot]
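
    For reference, below is a minimal sketch of how a run of that size could be set up. It is an assumption, not the exact test code (which is linked at the end of this answer): once more than one block is launched, the kernel has to use a global index, and the block size of 256 is arbitrary.

    __global__ void copyKernel1M(const unsigned char* a, unsigned char* b)
    {
        // Global index so every thread in a multi-block launch copies a distinct byte.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        b[i] = a[i];
    }

    int main()
    {
        const int N = 1 << 20; // 1,048,576 one-byte elements
        unsigned char *d_a;
        unsigned char *d_b;
        cudaMalloc(&d_a, N);
        cudaMalloc(&d_b, N);

        // 4096 blocks x 256 threads = 1M threads, each storing one byte
        copyKernel1M<<<N / 256, 256>>>(d_a, d_b);
        cudaDeviceSynchronize();

        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }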


    Update

    It seems that L2 transactions have a 32-byte granularity. Looking at the profiling result for 4-byte consecutive stores, the number of L2 store transactions reported for 1M threads is 131,072, which equals 1M (number of threads) times 4 (bytes per element) divided by 32.
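
    A sketch of the kind of 4-byte variant this measurement implies (again an assumption about the linked test code, with hypothetical names; relative to the sketch above only the element type changes):

    // Each thread stores one 4-byte element instead of one byte.
    __global__ void copyKernel4B(const unsigned int* a, unsigned int* b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        b[i] = a[i];
    }

    // Launched as copyKernel4B<<<(1 << 20) / 256, 256>>>(d_a4, d_b4), the 1M threads
    // write 4 MiB in total; at 32 bytes per L2 transaction that is
    // 1,048,576 * 4 / 32 = 131,072 store transactions, matching the profiler.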

    So I came to the conclusion that the statement quoted in my question, saying that device memory can be accessed via 128-byte transactions, cannot be verified with Nsight, because Nsight does not seem to count transactions between L2 and device memory. (test code)